PyAI · March 10, 2026 · San Francisco

Leaning In to Find Out

Armin Ronacher · Co-Founder, Earendil

Who Am I?

  • Armin Ronacher, co-founder at Earendil
  • Spent 10 years making Sentry a great product
  • Created a lot of Open Source software: Flask, Jinja2, and others
  • I write at lucumr.pocoo.org about agentic engineering

I Am Going to Speculate a Little Here.

But Hopefully in a Useful Direction.

We Are All Somewhat Beholden to Foundation Model Providers.

That Means We Have to Learn From Behavior, Not From Internals.

The Weird Part

  • As users, we do not control the training process
  • We rarely know what the models were optimized for
  • What if the models change and our use cases start performing badly?

So the Job Is Not to Guess.

It Is to Find Out.

To Find Out

With a Way to Extrapolate.

Basics

Understand Reinforcement Learning

  • A pretrained model gives you a broad prior
  • RL then samples behavior and attaches a reward signal
  • Good trajectories get reinforced, bad ones get suppressed
  • Over time, the model becomes more like the behavior that keeps scoring well
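The loop above can be sketched as a toy reward-weighted sampling process. This is illustrative only: the strategy names and reward values are invented, and real RL pipelines for language models are vastly more involved, but the shape — sample, score, reinforce, suppress — is the same.

```python
import random

# Toy "policy": sampling weights over three behaviors a model might exhibit.
policy = {"edit-and-test": 1.0, "guess-blindly": 1.0, "refuse": 1.0}

# Hypothetical reward signal: behavior that passes the evaluator scores well.
REWARD = {"edit-and-test": 1.0, "guess-blindly": -0.5, "refuse": -0.2}

def sample(policy):
    """Draw a behavior proportionally to its current weight."""
    total = sum(policy.values())
    r = random.uniform(0, total)
    for action, weight in policy.items():
        r -= weight
        if r <= 0:
            return action
    return action

random.seed(0)
for _ in range(200):
    action = sample(policy)                         # sample a trajectory
    policy[action] *= 1.0 + 0.1 * REWARD[action]    # reinforce or suppress
    policy[action] = max(policy[action], 0.01)      # keep weights positive

# Over time, the policy concentrates on whatever keeps scoring well.
best = max(policy, key=policy.get)
```

Since the rewarded behavior only ever gains weight while the others only lose it, the policy drifts toward the behavior the evaluator likes — which is exactly why "what gets measured" matters so much.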

What Do the Model Providers Train On?

In short: entire sessions they get access to.

  • prompts, responses, tool calls, edits, retries, tests
  • ranking signals: task success, acceptance, abandonment
  • particularly things like Claude Code sessions where the loop is rich
  • As we vibe, we reinforce

Why That Matters

Models improve fastest where people use them a lot and the outcome is easy to judge.

  • High-volume successful workflows create pressure to improve there
  • Tasks with clear evaluators are easier to reinforce aggressively
  • Tool-using coding loops are unusually measurable

So How Do We Extrapolate?

  • Do not just ask what is popular
  • Ask where behavior is abundant and legible
  • Ask where outcomes can be scored cheaply
  • Ask which environments dominate those traces: shells, files, test loops

Gains from Coding Agents

Overrepresented in coding traces

  • Unix and shell workflows
  • files, paths, diffs, permissions
  • popular languages and build tools
  • tests, logs, stack traces
  • file system navigation and incremental edits

Underrepresented

  • GUI-heavy interaction
  • rare proprietary systems
  • domains with little trace volume
  • tasks that resist textual decomposition

Bonus Hunches

  • Coding agents increasingly bring screenshots into the loop
  • That pulls visual understanding into the context of code and code-driven agent turns
  • And coding-adjacent tooling is becoming the base layer for workspace automation

Layers Above the Coding Substrate

  • The same agents now manipulate spreadsheets, Word documents, calendars, and similar workspace surfaces
  • They are also remote-controlling tmux sessions and web browsers via CDP
  • So the coding-agent substrate is starting to absorb more and more of ordinary computer work

Surprises

  • Curious failure modes show up when important state changes are not visible in the session transcript
  • Agents often get very confused when state changes outside the turn and the harness does not inject those changes into the context
  • If you tell them about a change and ask them to validate it, they often hallucinate and claim they called a tool when they did not

What This Means for Us

  • Do not just optimize for what the model can do today
  • Build around workflows with dense textual feedback
  • Expect the substrate of coding to keep improving: writing code, running it, files
  • That gives you a strategy that survives model drift for a while

Narrowing In on Coding Agents

Coding Is a Mode of Execution

  • If the file system works, and agents know how to program, you can route a surprising amount of work through that loop
  • Many "non-coding" tasks become tractable once they look like files, programs, tools, and outputs
  • That gives you leverage from the model's strongest priors instead of fighting them

Build Tiny Programmable Worlds

  • Use simple languages with basic sandboxing
  • Prefer text files, explicit inputs, deterministic outputs, and good error messages
  • If the agent can iterate by editing, running, and reading failures, you inherit the coding-agent gradient
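A minimal sketch of such a world, in Python: one explicit input file, one agent-authored script, deterministic stdout, and the full error text handed back on failure so the agent can iterate. (A temp directory is not a real sandbox — this only illustrates the edit/run/read-failures shape, not isolation.)

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

def run_world(script: str, input_text: str) -> tuple[bool, str]:
    """Run an agent-authored script in a scratch dir with an explicit
    input file. Return (ok, output): stdout on success, stderr on
    failure, so the error message itself becomes the feedback signal."""
    with tempfile.TemporaryDirectory() as d:
        workdir = Path(d)
        (workdir / "input.txt").write_text(input_text)
        (workdir / "task.py").write_text(textwrap.dedent(script))
        proc = subprocess.run(
            [sys.executable, "task.py"],
            cwd=workdir, capture_output=True, text=True, timeout=10,
        )
        if proc.returncode != 0:
            return False, proc.stderr   # legible failure to feed back
        return True, proc.stdout

# Example turn: the "agent" wrote a script that sums the input lines.
ok, out = run_world(
    "print(sum(int(x) for x in open('input.txt')))",
    "1\n2\n3\n",
)
```

Everything the agent touches is a text file, every run is deterministic, and every failure is a readable traceback — which is precisely the loop the coding-agent gradient was trained on.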

Alignment Beats Novelty

  • SQL beats a custom query DSL
  • A lightweight JavaScript subset beats a complicated expression language
  • Even a bespoke DSL can work if it aligns with a known language and produces excellent errors
  • The more it feels like programming, the more capability you get for free
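As a concrete instance of "SQL beats a custom DSL": expose an in-memory SQLite database as the query surface and return the engine's own error text verbatim. The table and data here are made up; the point is that the agent gets a language it has massive priors for, plus excellent errors.

```python
import sqlite3

# In-memory database as the query surface exposed to the agent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 40.0), (2, "US", 60.0), (3, "EU", 25.0)],
)

def query(sql: str):
    """Run agent-written SQL; on failure, return the engine's real error
    text instead of swallowing it -- good errors are part of the interface."""
    try:
        return list(conn.execute(sql)), None
    except sqlite3.Error as exc:
        return None, str(exc)

rows, err = query("SELECT region, SUM(total) FROM orders GROUP BY region")
```

Compare the failure mode of a bespoke DSL ("parse error near token 7") with what the agent gets back from a typo'd query here: SQLite's message names the problem in a form the model has seen millions of times.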

Finding Out

So Now That We Have a Hypothesis, How Do We Find Out If It's Right?

Initial Pass, Then Iteration

  • Ask the model a few times how it would do the thing
  • Models have some weak introspectability about likely strategies and failure modes
  • Then feed agentic turns back into models so they can critique the interaction and suggest harness, prompt, and tool improvements

One Practical Loop

  • Run the same basic task twenty times with a coding agent
  • Use subagents or small variations if that helps you explore the space
  • Then let another LLM score the outcomes for quality, failure modes, and consistency
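That loop is easy to sketch. Here `run_agent` and `judge` are hypothetical stand-ins (stubbed below so the sketch runs): in practice the first would invoke your coding agent on the task and the second would ask another LLM to score the transcript.

```python
import statistics

def run_agent(task: str, seed: int) -> str:
    """Stub: would run the coding agent on the task with some variation."""
    return f"transcript for {task!r}, variation {seed}"

def judge(transcript: str) -> float:
    """Stub: would ask another LLM to score quality in [0, 1]."""
    return 1.0 if "variation" in transcript else 0.0

def measure(task: str, runs: int = 20) -> dict:
    """Run the same basic task many times, then aggregate the scores.
    Spread matters as much as the mean: a flaky 0.9 is worse than a
    steady 0.8 for deciding what to build on."""
    scores = [judge(run_agent(task, seed)) for seed in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "failures": sum(1 for s in scores if s < 0.5),
    }

report = measure("rename the config module")
```

Twenty runs is cheap, and the distribution tells you whether a workflow sits on a strong prior (tight, high scores) or a weak one (wide, bimodal scores).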

Use Your Own Traces

  • Your own coding sessions are a valuable input for judgement
  • Things like .{pi,claude,codex}/sessions accumulate a surprising amount of useful data
  • At the beginning, reduce variables: fewer tools, simpler harnesses, clearer prompts
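A starting point for mining those traces, with one loud caveat: session log formats differ per harness and change over time, so the JSONL layout and the `"tool"` field assumed here are illustrative, not any harness's documented schema. The demo writes a synthetic session file so the sketch is self-contained.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

def tool_usage(sessions_dir: Path) -> Counter:
    """Count tool calls across session logs. Assumes one JSON event per
    line with an optional 'tool' key -- adapt the field names to whatever
    your harness actually writes."""
    counts = Counter()
    for path in sessions_dir.glob("*.jsonl"):
        for line in path.read_text().splitlines():
            event = json.loads(line)
            if "tool" in event:
                counts[event["tool"]] += 1
    return counts

# Demo against a synthetic session file.
sessions = Path(tempfile.mkdtemp())
(sessions / "one.jsonl").write_text(
    '{"tool": "bash"}\n'
    '{"tool": "edit"}\n'
    '{"tool": "bash"}\n'
    '{"role": "user"}\n'
)
usage = tool_usage(sessions)
```

Even a crude count like this surfaces which tools dominate your real sessions — useful evidence for deciding which tools to keep when you start reducing variables.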

The Trick Is Not Just to Find Out.

It Is to Find Out in a Way That Extrapolates.