Engineering at Earendil

Engineering for Breathing Room

Armin Ronacher

Background Reading

RFC 20

https://rfc.earendil.com/0020/

What We Optimize For

  • The goal is correct code that behaves predictably in production
  • Predictability determines whether incidents are manageable or chaotic
  • At 2AM, we should patch safely, go back to bed, and fix deeply in daylight
  • Great engineering feels calm under pressure, not heroic all the time

Core Thesis

Build systems so that we have breathing room.

When Incidents Start

  • If we do nothing for 10 minutes, the blast radius should stay bounded
  • Better: there should be a clear “stabilize” action that stops things from getting worse
  • Examples: pause consumers, disable mutating writes, open circuit breakers
  • This requires deliberate design: kill switches, pausable queues, degraded modes
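
A kill switch that pauses a queue consumer could look roughly like the sketch below. It is a minimal in-memory illustration, not a prescribed design: `KillSwitch`, `PausableConsumer`, and the flag name `pause-consumers` are hypothetical, and a real system would keep the flags in shared configuration so that every process sees them.

```typescript
// Hypothetical in-memory kill switch. In production the flags would live in
// shared configuration rather than in process memory.
class KillSwitch {
  private flags = new Set<string>();
  engage(name: string): void { this.flags.add(name); }
  release(name: string): void { this.flags.delete(name); }
  isEngaged(name: string): boolean { return this.flags.has(name); }
}

// A consumer that leaves work in the queue while its switch is engaged.
class PausableConsumer<T> {
  private queue: T[];
  private killSwitch: KillSwitch;
  private flag: string;

  constructor(queue: T[], killSwitch: KillSwitch, flag: string) {
    this.queue = queue;
    this.killSwitch = killSwitch;
    this.flag = flag;
  }

  // Handles items until the queue is empty or the switch is engaged.
  // The switch is checked before every item, so a pause takes effect
  // between items. Returns the number of items handled in this call.
  drain(handle: (item: T) => void): number {
    let handled = 0;
    while (this.queue.length > 0 && !this.killSwitch.isEngaged(this.flag)) {
      const item = this.queue.shift() as T;
      handle(item);
      handled++;
    }
    return handled;
  }
}

const killSwitch = new KillSwitch();
const jobs = ["a", "b", "c"];
const consumer = new PausableConsumer(jobs, killSwitch, "pause-consumers");

// Stabilize: engage the switch, and the backlog stays untouched.
killSwitch.engage("pause-consumers");
const handledWhilePaused = consumer.drain(() => {});

// Recover: release the switch, and the backlog drains.
killSwitch.release("pause-consumers");
const handledAfterResume = consumer.drain(() => {});
```

The point of the sketch is that pausing is a designed-in state of the consumer, so the 2AM action is flipping a flag rather than deploying code.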

The Basics

  • SaaS systems maintain state
  • They accept inputs that mutate that state, and they emit outputs
  • So correctness of state transitions is the center of system design
  • Real conditions: delays, retries, concurrency, partial rollouts, failures

1. Start With The Dumbest Solution First

  • Complexity is debt with compounding interest
  • One boring queue before a distributed orchestration graph
  • Add complexity only when reality gives a concrete failure signal
  • Prefer your own shitty code over other people's complex dependencies

Example: Boring Redis Rate Limiter

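// Fixed-window limiter: one counter per user per minute (assumes an ioredis-style client).
const key = `hits:${userId}:${minuteBucket(now)}`;
// INCR and EXPIRE run in one MULTI; "NX" (Redis 7.0+) sets the TTL only if the key has none.
const [[_, hits]] = await redis.multi().incr(key).expire(key, 60, "NX").exec();
if (hits > 100) throw new TooManyRequests();
return { ok: true };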

2. Learn To Design Your Data

  • Data model design is system design
  • Understand access patterns, growth, and hotspot risk upfront
  • Be explicit about partition/sharding boundaries early
  • You can postpone scaling work; you cannot postpone data shape forever

Data Shape Beats Smart Data Layers

  • No amount of data-layer cleverness can erase bad partition keys or fanout
  • Nature still applies: hotspots, skew, and unbounded joins will surface
  • Design for partitioning and access paths first; abstractions second
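
To make the hotspot risk concrete, the sketch below compares two partition key choices for the same workload. Everything in it is an assumption made for illustration: the `OrderEvent` shape, the 8 partitions, the FNV-1a hash, and a workload in which one tenant produces 900 of 1,000 events.

```typescript
// FNV-1a string hash (32-bit), used here only to assign keys to partitions.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

type OrderEvent = { tenantId: string; orderId: string };

// Returns the number of events on the busiest partition for a given key choice.
function busiestPartition(
  events: OrderEvent[],
  keyOf: (e: OrderEvent) => string,
  partitions: number,
): number {
  const load = new Array<number>(partitions).fill(0);
  for (const e of events) {
    const p = fnv1a(keyOf(e)) % partitions;
    load[p] = (load[p] ?? 0) + 1;
  }
  return Math.max(...load);
}

// Skewed workload: one large tenant and 100 small ones.
const events: OrderEvent[] = [];
for (let i = 0; i < 900; i++) events.push({ tenantId: "big-tenant", orderId: `order-${i}` });
for (let i = 0; i < 100; i++) events.push({ tenantId: `tenant-${i}`, orderId: `order-${i}` });

// Keyed by tenant, every event of the large tenant lands on one partition.
const busiestByTenant = busiestPartition(events, (e) => e.tenantId, 8);

// Keyed by tenant and order, the large tenant is spread across partitions.
const busiestByTenantAndOrder = busiestPartition(events, (e) => `${e.tenantId}:${e.orderId}`, 8);
```

The composite key is not free: reading everything for one tenant now touches every partition. That trade-off is part of the data shape and has to be decided on purpose; no abstraction layer decides it for you.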

3. Prefer Event-Based State Transitions

  • Represent state changes as explicit events
  • Replay and recovery stay possible
  • Delayed processing remains conceptually valid
  • Not universal, but usually the first model to consider

Example: Events Updating State (Server + Client)

const apply = {
  "cart.add": (s, e) => ({ ...s, qty: s.qty + e.qty }),
  "cart.remove": (s, e) => ({ ...s, qty: Math.max(0, s.qty - e.qty) }),
  "cart.checkout": (s) => ({ ...s, checkedOut: true })
};
const next = events.reduce((s, e) => apply[e.type](s, e), prev);

4. Mind The Clock

  • “This always runs at 02:00” is not a safe invariant
  • Queues stall, jobs delay, incidents intervene
  • Wall-clock assumptions break correctness under backlog
  • Event data + transition logic survives delayed execution

Example: Freeze Time in the Event

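// Producer: capture the business time once, when the event is created.
const event = { type: "invoice.generate", occurredAt: clock.now(), accountId };

// Consumer (possibly hours later): decide from the event's time, not the wall clock.
const refTime = ctx.referenceTime ?? event.occurredAt; // never Date.now()
if (isMonthBoundary(refTime, account.tz)) {
  await billing.createInvoice(accountId, refTime);
}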

5. Design Writes for Concurrency, Not Hope

  • “Probably non-conflicting” is not a strategy
  • Prefer exclusive writers where possible
  • Use safe primitives: locks, atomic updates, idempotency keys
  • Concurrent execution should resolve naturally, not accidentally

Example: SQL Lock + Idempotency Key

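// Row lock: FOR UPDATE only holds the lock inside a transaction, so `db` must be
// a single connection (not a pool) between BEGIN and COMMIT.
await db.query("BEGIN");
await db.query("SELECT value FROM counters WHERE id=$1 FOR UPDATE", [id]);
await db.query("UPDATE counters SET value = value + 1 WHERE id=$1", [id]);
await db.query("COMMIT");

// Idempotency key: return the stored result instead of running the operation twice.
// Note: GET followed by SET is not atomic, so concurrent duplicates can both run `op`.
const idemKey = `idem:${key}`;
const cached = await redis.get(idemKey);
if (cached) return JSON.parse(cached);

const rv = await op(...args);
await redis.set(idemKey, JSON.stringify(rv), "EX", 86400); // keep for 24 hours
return rv;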

6. Assume Rollouts Are Non-Atomic

  • Schema, services, workers, and clients never update at once
  • Old and new code coexist during rollout windows
  • Sequence: migrate schema → deploy tolerant code → rely on new shape
  • Low-scale shortcuts eventually break with painful incidents
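
The "deploy tolerant code" step can be sketched as a reader that accepts both the old and the new shape. The column rename from `name` to `display_name` is a made-up example, and `UserRow`, `User`, and `toUser` are hypothetical names.

```typescript
// Hypothetical row shape during a rollout: the new `display_name` column
// exists, but old writers may still populate only `name`.
type UserRow = { id: string; name?: string | null; display_name?: string | null };
type User = { id: string; displayName: string };

// Tolerant reader: prefers the new column and falls back to the old one.
// It fails loudly if neither is present instead of inventing a value.
function toUser(row: UserRow): User {
  const displayName = row.display_name ?? row.name;
  if (displayName == null) {
    throw new Error(`user ${row.id} has no name in either column`);
  }
  return { id: row.id, displayName };
}

const fromOldWriter = toUser({ id: "u1", name: "Ada" });
const fromNewWriter = toUser({ id: "u2", display_name: "Grace" });
```

Once every writer populates the new column and old rows are backfilled, the fallback to `name` can be deleted and the code can rely on the new shape alone.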

7. Transform Data Once at the Boundary

  • Scattered defaulting creates divergent behavior
  • Centralize DB → app and app → DB transformations
  • Boundary code handles nullable/legacy fields once
  • Business logic should work on true, required application shapes

Example: Default Nullable Fields at Load

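// Boundary loader: the only place where nullable or legacy columns get defaults.
// Assumes db.query resolves to an array of rows.
async function loadUser(id: string): Promise<User> {
  const [row] = await db.query("SELECT * FROM users WHERE id = $1", [id]);
  if (!row) throw new Error(`user ${id} not found`);
  return {
    id: row.id,
    plan: row.plan ?? "free",
    featureFlags: row.feature_flags ?? []
  };
}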

8. Make Observability First-Class

  • If you cannot see the system, you cannot operate it
  • At 2AM, answer quickly: what broke, who is impacted, since when
  • Structured logs, correlation IDs, latency metrics, retries/dead-letter counters
  • Every incident should improve instrumentation for next time
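
Structured logs with a correlation ID can be as small as the sketch below. `createLogger`, the `Logger` interface, and the field names are hypothetical; an existing logging library with child loggers would serve the same purpose.

```typescript
type LogFields = Record<string, string | number | boolean>;

interface Logger {
  info(msg: string, fields?: LogFields): void;
  child(extra: LogFields): Logger;
}

// Emits one JSON object per line. Fields bound via child() are attached to
// every later entry, so a correlation ID only has to be set once per request.
function createLogger(base: LogFields, sink: (line: string) => void): Logger {
  return {
    info(msg: string, fields: LogFields = {}): void {
      const entry = { ts: new Date().toISOString(), level: "info", msg, ...base, ...fields };
      sink(JSON.stringify(entry));
    },
    child(extra: LogFields): Logger {
      return createLogger({ ...base, ...extra }, sink);
    },
  };
}

// Usage: collect lines in memory here; a real sink would write to stdout.
const lines: string[] = [];
const log = createLogger({ service: "billing" }, (line) => { lines.push(line); });
const requestLog = log.child({ correlationId: "req-123" });
requestLog.info("invoice.retry", { attempt: 2, accountId: "acct-42" });
```

Because every line carries `correlationId`, `service`, and a timestamp, "what broke, who is impacted, since when" becomes a filter over logs instead of a guessing game.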

9. Learn To Crash

  • Fail loudly on broken invariants
  • Restart from durable state
  • Make retries idempotent
  • Silent corruption is worse than a visible crash

Example: Crash on Invalid Startup Config

const config = loadConfigFile("./config.json");
if (!validateConfig(config)) {
  throw new Error("Fatal: invalid config, refusing to start");
}
startServer(config);

10. Finish The Work

  • Half-done refactors are worse than no refactor
  • Always define end state, owner, and deadline
  • “Done” includes deleting the old path
  • Celebrate removal of complexity, not addition of optionality

Concrete Stabilize Controls

Data / write safety

  • Pause queue consumers
  • Disable mutating writes
  • Route traffic to safe read paths

Flow control

  • Open circuit breakers
  • Queue for replay later (DLQ)
  • Stop scheduler triggers safely
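
As an illustration of one of these controls, here is a minimal circuit breaker that can trip on its own or be opened by hand. The class name, the threshold and cooldown parameters, and the injectable clock are assumptions made for this sketch, not an existing library API.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True while the breaker is open and the cooldown has not yet elapsed.
  get isOpen(): boolean {
    return this.openedAt !== null && this.now() - this.openedAt < this.cooldownMs;
  }

  // Manual stabilize action: stop calling the dependency right now.
  forceOpen(): void {
    this.openedAt = this.now();
  }

  // Runs fn unless the breaker is open; uses fallback when open or on failure.
  // After the cooldown, the next call is a trial: success closes the breaker,
  // failure opens it again.
  call<T>(fn: () => T, fallback: () => T): T {
    if (this.isOpen) return fallback();
    try {
      const result = fn();
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = this.now();
      return fallback();
    }
  }
}

// Usage with a fake clock so the behavior is deterministic.
let fakeNow = 0;
const breaker = new CircuitBreaker(2, 1000, () => fakeNow);
let dependencyCalls = 0;
const flaky = (): string => {
  dependencyCalls++;
  throw new Error("dependency down");
};

breaker.call(flaky, () => "fallback"); // first failure
breaker.call(flaky, () => "fallback"); // second failure opens the breaker
const whileOpen = breaker.call(flaky, () => "fallback"); // dependency is not called
```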

QoS and Queues

During quality disruption

  • Backlogs accumulate faster than humans can wait
  • Old queue items often become irrelevant before they are processed
  • Queue depth alone is not customer value

Operational response

  • Prefer reverse queue processing (newest-first) when catching up
  • Apply load shedding / TTLs for stale work you should stop doing
  • Protect current user experience, then drain or drop the tail intentionally
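
A catch-up plan along these lines can be sketched as a pure function: drop what is older than a TTL, then process the rest newest-first. The `Job` shape, the `planCatchUp` name, and the 30-second TTL in the usage are assumptions for illustration.

```typescript
type Job = { id: string; enqueuedAt: number };

// Splits a backlog into fresh work (sorted newest-first) and stale work that
// has outlived its TTL and should be dropped or parked for later review.
function planCatchUp(
  backlog: Job[],
  now: number,
  ttlMs: number,
): { fresh: Job[]; stale: Job[] } {
  const fresh: Job[] = [];
  const stale: Job[] = [];
  for (const job of backlog) {
    if (now - job.enqueuedAt > ttlMs) {
      stale.push(job);
    } else {
      fresh.push(job);
    }
  }
  fresh.sort((a, b) => b.enqueuedAt - a.enqueuedAt); // newest first
  return { fresh, stale };
}

// Usage: three jobs in the backlog, one of them a minute old.
const backlog: Job[] = [
  { id: "old", enqueuedAt: 0 },
  { id: "recent", enqueuedAt: 50_000 },
  { id: "newest", enqueuedAt: 59_000 },
];
const plan = planCatchUp(backlog, 60_000, 30_000);
```

Whether stale work is dropped outright or parked for replay depends on the job: a stale notification can usually go, a stale payment cannot. The sketch only makes that decision explicit instead of leaving it to queue order.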

The point is not perfect uptime.

The point is controllable failure.

Predictability creates confidence.

Confidence creates breathing room.