Evolving the Levers
My AI agents read an Anthropic engineering post, gap-analysed their own orchestration pipeline, and shipped three improvements before lunch. The tools I built this year had improved themselves. It was a Wednesday.

Six months ago, every morning brought something that broke the way I worked the day before. New model, new capability, new thing to rebuild around. That's slowed down. The models are smart enough. What's evolving now is how I use them: the harnesses, the orchestration, the engineering workflows wrapping the intelligence.
Everything feels incremental. Right up until it isn't.
Wednesday last week I read a post from Anthropic's engineering team about harness design for long-running agents. As I went through it I kept nodding. Yep, seen that. Hit that failure mode. Built that pattern. And then there were gaps I hadn't addressed, with fixes so obvious I wondered why I hadn't thought of them.
So I opened a group workshop channel with two of my agents, Kit the Agentic Engineer and Robbo the Architect. Handed them the article, outlined the failure modes I'd been hitting, and asked them to review it against the orchestration pipeline and start workshopping improvements.
Kit came back with a gap analysis mapping five patterns from the article against the existing orchestration. Four of the five were already in place, among them the separation between builder and evaluator, fresh context for each agent, and structured handoffs between them. Months of iteration, validated by a benchmark I hadn't written.
But the gaps were real. The biggest: the evaluator had never exercised the running application. I'd lived that failure mode firsthand. 190 unit tests passing and the app itself broken. Code that looked right, tests that confirmed it, and a user experience that didn't work.
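That failure mode is easy to reproduce. A minimal sketch, with hypothetical names (`export_report`, `ROUTES`): the unit under test is correct in isolation and its test passes, but it was never wired into the app, so every real request 404s.

```python
def export_report(data):
    """Builder-written handler; its unit test passes in isolation."""
    return {"rows": len(data), "status": "ok"}

# The app's route table. The builder forgot to register the handler,
# so the feature is unreachable even though its test is green.
ROUTES = {
    "/": lambda req: "home",
    # "/export" was supposed to map to export_report, but never did.
}

def handle(path, req):
    """Dispatch a request the way the running app would."""
    handler = ROUTES.get(path)
    if handler is None:
        return 404
    return handler(req)
```

The unit test exercises `export_report` directly and confirms it; only going through `handle`, the way a user would, exposes the break.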
Robbo spec'd three changes. Kit built them:
- A pre-build contract step where the builder writes down what "done" means before writing a line of code.
- A new smoke test agent that starts the server, opens a browser, clicks through every feature the build was supposed to deliver, and screenshots the evidence.
- Updated evaluation calibration with concrete patterns for catching stubs and unreachable code.
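The smoke-test step can be sketched in miniature. Assumptions labeled loudly: the real agent drives a browser and screenshots evidence; here a stub server and plain HTTP checks stand in, and `FEATURES`, `AppHandler`, and the port are all hypothetical. The shape is the same: start the app, hit every feature the build was supposed to deliver, and record what actually came back.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Features the build claimed to deliver (hypothetical paths).
FEATURES = ["/", "/export"]

class AppHandler(BaseHTTPRequestHandler):
    """Stub app: only the home page got wired up."""
    def do_GET(self):
        if self.path == "/":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"home")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the smoke run quiet

def smoke_test(port=8765):
    """Start the server, request every feature, return path -> status."""
    server = HTTPServer(("127.0.0.1", port), AppHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    results = {}
    for path in FEATURES:
        try:
            with urllib.request.urlopen(f"http://127.0.0.1:{port}{path}") as r:
                results[path] = r.status
        except urllib.error.HTTPError as e:
            results[path] = e.code
    server.shutdown()
    return results
```

A run of `smoke_test()` here reports 200 for `/` and 404 for `/export`, which is exactly the evidence the unit-test-only evaluator never saw.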
The pipeline that ran the afternoon queue was different from the pipeline that ran the morning queue.
The tools I'd spent this year building and iterating on had improved themselves. That afternoon wasn't special. It was a Wednesday.
Every day something amazes me, or terrifies me. This is 2026.