Engineering at Earendil

Engineering for Breathing Room

Armin Ronacher

Background Reading

RFC 20

https://rfc.earendil.com/0020/

What We Optimize For

The goal is correct code that behaves predictably in production
Predictability determines whether incidents are manageable or chaotic
At 2AM, we should patch safely, go back to bed, and fix deeply in daylight
Great engineering feels calm under pressure, not heroic all the time

Core Thesis

Build systems so that we have breathing room.

When Incidents Start

If we do nothing for 10 minutes, the blast radius should stay bounded
Better: there should be a clear “stabilize” action that keeps things from getting worse
Examples: pause consumers, disable mutating writes, open circuit breakers
This requires deliberate design: kill switches, pausable queues, degraded modes
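
A minimal sketch of one such stabilize control: a kill switch that rejects mutating writes while reads keep working. The names `KillSwitch`, `guardWrite`, and `WriteResult` are hypothetical, and the flag is held in memory here; a real deployment would presumably read it from shared configuration so that every instance sees the flip.

```typescript
// Hypothetical kill switch; names and in-memory storage are assumptions for illustration.
type WriteResult<T> =
  | { ok: true; value: T }
  | { ok: false; reason: string };

class KillSwitch {
  private disabledReason: string | null = null;

  // Flip during an incident; the reason is kept so callers and logs can see why.
  disableWrites(reason: string): void {
    this.disabledReason = reason;
  }

  enableWrites(): void {
    this.disabledReason = null;
  }

  // Run a mutating operation only while writes are enabled.
  guardWrite<T>(op: () => T): WriteResult<T> {
    if (this.disabledReason !== null) {
      return { ok: false, reason: this.disabledReason };
    }
    return { ok: true, value: op() };
  }
}

// Usage: every mutating path goes through the switch; reads bypass it.
const writes = new KillSwitch();
const account = { balance: 10 };

const before = writes.guardWrite(() => (account.balance += 5));
writes.disableWrites("incident: duplicate charges");
const during = writes.guardWrite(() => (account.balance += 5));
```

The second write is rejected and the balance stays at 15, so state stops changing while the incident is investigated.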

The Basics

SaaS systems maintain state
They accept inputs and emit outputs that mutate that state
So correctness of state transitions is the center of system design
Real conditions: delays, retries, concurrency, partial rollouts, failures

1. Start With The Dumbest Solution First

Complexity is debt with compounding interest
One boring queue before a distributed orchestration graph
Add complexity only when reality gives a concrete failure signal
Prefer your own shitty code over others' complex dependencies

Example: Boring Redis Rate Limiter

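// Fixed-window limiter: one counter per user per minute (assumes an ioredis-style client).
const key = `hits:${userId}:${minuteBucket(now)}`;
// INCR and EXPIRE run in one MULTI; "NX" sets the TTL only if none exists (Redis 7.0+).
const [[_, hits]] = await redis.multi().incr(key).expire(key, 60, "NX").exec();
if (hits > 100) throw new TooManyRequests();
return { ok: true };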

2. Learn To Design Your Data

Data model design is system design
Understand access patterns, growth, and hotspot risk upfront
Be explicit about partition/sharding boundaries early
You can postpone scaling work; you cannot postpone data shape forever

Data Shape Beats Smart Data Layers

No amount of data-layer cleverness can erase bad partition keys or fanout
Nature still applies: hotspots, skew, and unbounded joins will surface
Design for partitioning and access paths first; abstractions second
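
To make the hotspot point concrete, here is a small sketch that counts how rows spread across partitions for two candidate keys. Everything in it is an assumption for illustration: the `hashKey` function (a djb2-style hash), the four partitions, and the bucketed key, which is one common mitigation rather than something this talk prescribes.

```typescript
type Row = { tenantId: string; seq: number };

// Deterministic string hash (djb2 variant), kept as an unsigned 32-bit integer.
function hashKey(key: string): number {
  let h = 5381;
  for (let i = 0; i < key.length; i++) {
    h = (h * 33 + key.charCodeAt(i)) >>> 0;
  }
  return h;
}

// Count how many rows each partition receives for a given key function.
function partitionLoad(
  rows: Row[],
  keyOf: (row: Row) => string,
  partitions: number
): number[] {
  const load: number[] = [];
  for (let p = 0; p < partitions; p++) load.push(0);
  for (const row of rows) {
    load[hashKey(keyOf(row)) % partitions] += 1;
  }
  return load;
}

// One large tenant writing 1000 rows.
const rows: Row[] = [];
for (let seq = 0; seq < 1000; seq++) rows.push({ tenantId: "big", seq });

// Key on tenant only: every row hashes to the same partition.
const byTenant = partitionLoad(rows, (r) => r.tenantId, 4);

// Key on tenant plus one of 8 buckets: the same rows spread out.
const byTenantAndBucket = partitionLoad(rows, (r) => `${r.tenantId}:${r.seq % 8}`, 4);
```

With the tenant-only key, one partition takes all 1000 rows. With the bucketed key, each of the four partitions takes 250. The cost is that reading one tenant now fans out across 8 buckets, which is exactly the kind of trade-off to decide on up front.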

3. Prefer Event-Based State Transitions

Represent state changes as explicit events
Replay and recovery stay possible
Delayed processing remains conceptually valid
Not universal, but usually the first model to consider

Example: Events Updating State (Server + Client)

const apply = {
  "cart.add": (s, e) => ({ ...s, qty: s.qty + e.qty }),
  "cart.remove": (s, e) => ({ ...s, qty: Math.max(0, s.qty - e.qty) }),
  "cart.checkout": (s) => ({ ...s, checkedOut: true })
};
const next = events.reduce((s, e) => apply[e.type](s, e), prev);

4. Mind The Clock

“This always runs at 02:00” is not a safe invariant
Queues stall, jobs delay, incidents intervene
Wall-clock assumptions break correctness under backlog
Event data + transition logic survives delayed execution

Example: Freeze Time in the Event

const event = { type: "invoice.generate", occurredAt: clock.now(), accountId };
const refTime = ctx.referenceTime ?? event.occurredAt; // never Date.now()
if (isMonthBoundary(refTime, account.tz)) {
  await billing.createInvoice(accountId, refTime);
}

5. Design Writes for Concurrency, Not Hope

“Probably non-conflicting” is not a strategy
Prefer exclusive writers where possible
Use safe primitives: locks, atomic updates, idempotency keys
Concurrent execution should resolve correctly by design, not by accident

Example: SQL Lock + Idempotency Key

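// FOR UPDATE only holds its lock inside a transaction (assumes `db` is a
// single connection, not a pool).
await db.query("BEGIN");
try {
  await db.query("SELECT value FROM counters WHERE id=$1 FOR UPDATE", [id]);
  await db.query("UPDATE counters SET value = value + 1 WHERE id=$1", [id]);
  await db.query("COMMIT");
} catch (err) {
  await db.query("ROLLBACK");
  throw err;
}

// Idempotency key: a repeated call returns the stored result. Concurrent
// first calls can still both run `op`; claim the key with SET NX to prevent that.
const idemKey = `idem:${key}`;
const cached = await redis.get(idemKey);
if (cached) return JSON.parse(cached);
const rv = await op(...args);
await redis.set(idemKey, JSON.stringify(rv), "EX", 86400);
return rv;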

6. Assume Rollouts Are Non-Atomic

Schema, services, workers, and clients never update at once
Old and new code coexist during rollout windows
Sequence: migrate schema → deploy tolerant code → rely on new shape
Low-scale shortcuts eventually break with painful incidents
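
A sketch of the "deploy tolerant code" step, assuming a hypothetical migration from a single `email` column to an `emails` array. The row shape and the helpers `readEmails` and `writeEmails` are invented for illustration.

```typescript
// Hypothetical row shape mid-migration: `email` is the legacy column, `emails` the new one.
type UserRow = {
  id: string;
  email?: string | null;    // written by old code
  emails?: string[] | null; // written by new code
};

// Tolerant read: accept whichever shape the row was written in.
function readEmails(row: UserRow): string[] {
  if (row.emails != null) return row.emails;
  if (row.email != null) return [row.email];
  return [];
}

// While old readers are still deployed, write both columns so either version can read the row.
function writeEmails(id: string, emails: string[]): UserRow {
  return { id, email: emails.length > 0 ? emails[0] : null, emails };
}
```

Only once every reader understands `emails` does it become safe to stop writing `email` and, later, to drop the column.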

7. Transform Data Once at the Boundary

Scattered defaulting creates divergent behavior
Centralize DB → app and app → DB transformations
Boundary code handles nullable/legacy fields once
Business logic should work on true, required application shapes

Example: Default Nullable Fields at Load

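async function loadUser(id: string): Promise<User> {
  const [row] = await db.query("SELECT * FROM users WHERE id = $1", [id]);
  if (!row) throw new Error(`user ${id} not found`);
  return {
    id: row.id,
    // Defaults for nullable/legacy columns live here and nowhere else.
    plan: row.plan ?? "free",
    featureFlags: row.feature_flags ?? []
  };
}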

8. Make Observability First-Class

If you cannot see the system, you cannot operate it
At 2AM, answer quickly: what broke, who is impacted, since when
Structured logs, correlation IDs, latency metrics, retries/dead-letter counters
Every incident should improve instrumentation for next time
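
A minimal sketch of a structured log line carrying a correlation ID. The `logLine` helper and its field names are assumptions, not a specific logging library. The timestamp is passed in rather than read from the wall clock, which keeps the function testable.

```typescript
type LogFields = Record<string, string | number | boolean>;

// Build one JSON log line; `correlationId` ties together every line for a request.
function logLine(
  level: "info" | "warn" | "error",
  message: string,
  correlationId: string,
  fields: LogFields,
  now: Date
): string {
  return JSON.stringify({
    // Caller fields go first so they cannot overwrite the core fields below.
    ...fields,
    ts: now.toISOString(),
    level,
    message,
    correlationId,
  });
}

// Usage: the same correlationId appears on every line emitted for this request.
const line = logLine(
  "error",
  "charge failed",
  "req-7f3a",
  { accountId: "acct_42", attempt: 3 },
  new Date(0)
);
```

Filtering logs by `correlationId` answers "what broke" for one request, aggregating by `accountId` answers "who is impacted", and `ts` answers "since when".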

9. Learn To Crash

Fail loudly on broken invariants
Restart from durable state
Make retries idempotent
Silent corruption is worse than a visible crash

Example: Crash on Invalid Startup Config

const config = loadConfigFile("./config.json");
if (!validateConfig(config)) {
  throw new Error("Fatal: invalid config, refusing to start");
}
startServer(config);

10. Finish The Work

Half-done refactors are worse than no refactor
Always define end state, owner, and deadline
“Done” includes deleting the old path
Celebrate removal of complexity, not addition of optionality

Concrete Stabilize Controls

Data / write safety

Pause queue consumers
Disable mutating writes
Route traffic to safe read paths

Flow control

Open circuit breakers
Queue for replay later (DLQ)
Stop scheduler triggers safely
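
A sketch of the circuit breaker control, under stated assumptions: the `CircuitBreaker` class, its thresholds, and the `forceOpen` method are hypothetical. It opens automatically after repeated failures and can also be opened by hand as a stabilize action. Time is passed in explicitly so that behavior does not depend on the wall clock.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAtMs: number | null = null;
  private readonly maxFailures: number;
  private readonly coolDownMs: number;

  constructor(maxFailures: number, coolDownMs: number) {
    this.maxFailures = maxFailures;
    this.coolDownMs = coolDownMs;
  }

  // Manual stabilize action: stop calling the dependency right now.
  forceOpen(nowMs: number): void {
    this.openedAtMs = nowMs;
  }

  // Calls are allowed while closed, or as a probe once the cool-down has elapsed.
  allows(nowMs: number): boolean {
    return this.openedAtMs === null || nowMs - this.openedAtMs >= this.coolDownMs;
  }

  // A successful call closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAtMs = null;
  }

  // Consecutive failures at or above the threshold open the breaker.
  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAtMs = nowMs;
  }
}
```

Callers check `allows(now)` before each call and report the outcome afterwards. While the breaker is open, the caller serves a degraded response instead of piling more load onto a failing dependency.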

QoS and Queues

During a quality-of-service disruption

Backlogs accumulate faster than humans can wait
Old queue items often become irrelevant before they are processed
Queue depth alone is not customer value

Operational response

Prefer reverse queue processing (newest-first) when catching up
Apply load shedding / TTLs for stale work you should stop doing
Protect current user experience, then drain or drop the tail intentionally
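
The catch-up policy above can be sketched as a pure planning function. The `Job` shape, the `planCatchUp` name, and the TTL value are assumptions for illustration; a real queue would apply the same split while consuming.

```typescript
type Job = { id: string; enqueuedAtMs: number };

// Split a backlog into work to do now (newest first) and stale work to shed.
function planCatchUp(
  backlog: Job[],
  nowMs: number,
  ttlMs: number
): { process: Job[]; shed: Job[] } {
  const fresh: Job[] = [];
  const shed: Job[] = [];
  for (const job of backlog) {
    // Anything older than the TTL is dropped on purpose, not by accident.
    if (nowMs - job.enqueuedAtMs > ttlMs) shed.push(job);
    else fresh.push(job);
  }
  // Newest first: the most recent requests are the ones users are still waiting on.
  fresh.sort((a, b) => b.enqueuedAtMs - a.enqueuedAtMs);
  return { process: fresh, shed };
}

// Usage: at t = 10s with a 6s TTL, job "a" is stale and "c" is handled before "b".
const plan = planCatchUp(
  [
    { id: "a", enqueuedAtMs: 0 },
    { id: "b", enqueuedAtMs: 5000 },
    { id: "c", enqueuedAtMs: 9000 },
  ],
  10000,
  6000
);
```

Returning the shed list, rather than silently discarding it, keeps the drop decision visible: it can be counted, logged, or parked for later replay.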

The point is not perfect uptime.

The point is controllable failure.

Predictability creates confidence.

Confidence creates breathing room.