
The AI alignment problem has been delivered by a monkey

Opus 4.7 spent a fortnight ignoring explicit branch constraints, creating chaos in the codebase, and cheerfully disobeying every rule written to stop it. The AI alignment problem arrived, but it looked nothing like what Bostrom predicted.

Sam Sabey

The last fortnight has been confounding. Anthropic released Opus 4.7 two weeks ago today, and things are different.

Something has been swinging from branch to branch in my codebase all week.

Not a developer. Not an intern who forgot the workflow. An AI model. Opus 4.7, Anthropic's latest release, creating and switching branches in the codebase. Sometimes related to the current feature, sometimes not. The result was chaos: some changes landed on a feature branch, others on development.

I've spent the week patching: permissions, guardrails, skill file updates, rules files, and moving branch operations into Ruby-backed deterministic code that requires an agent identifier before any of these commands can run.
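As a sketch of what that kind of deterministic gate looks like, here's a minimal Ruby version. The names (GitGate, ALLOWED_AGENTS) are illustrative, not the real code; the point is that the branch check is plain deterministic logic, not something a model can talk its way past:

```ruby
# Illustrative sketch: every git call goes through a gate that knows
# which agent is asking and refuses branch-moving commands.
require "open3"

class GitGate
  ALLOWED_AGENTS = ["robbo", "sam"].freeze
  BRANCH_COMMANDS = ["checkout", "switch", "branch"].freeze

  # True only when a known agent runs a command that either isn't a
  # branch operation, or targets the branch it is already on.
  def self.permitted?(agent:, command:, target:, current:)
    return false unless ALLOWED_AGENTS.include?(agent)
    return true unless BRANCH_COMMANDS.include?(command)
    target == current # branch commands may not move the agent elsewhere
  end

  def self.run(agent:, args:, current_branch:)
    command, target = args.first, args.last
    unless permitted?(agent: agent, command: command,
                      target: target, current: current_branch)
      raise "Blocked: #{agent} tried `git #{args.join(' ')}` off #{current_branch}"
    end
    Open3.capture2e("git", *args)
  end
end
```

The design choice is that the model never sees a raw shell: it sees a tool that raises before the command executes, and the refusal is logged with the agent identifier, so the forensic reconstruction later takes seconds instead of an hour.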

What stopped it was putting "ask" in the settings.json permissions file.
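Roughly, the change looks like this, assuming Claude Code's permissions schema with its allow/ask/deny lists; the exact command patterns below are illustrative:

```json
{
  "permissions": {
    "ask": [
      "Bash(git checkout:*)",
      "Bash(git switch:*)",
      "Bash(git branch:*)"
    ]
  }
}
```

Anything matching an "ask" rule now surfaces a confirmation prompt instead of executing, which is exactly how the next incident became visible.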

Then I asked Robbo, my architect agent, to review a PRD. The Claude Code permissions prompt immediately popped up asking to switch branches, just to read a file, while sitting in the current feature branch.

Why Robbo??? And this is what he said:

"File I want is on branch X. I'm on branch Y. What git command moves it?" Somewhere in the training data there are ten thousand Stack Overflow answers explaining how to move files between branches, and zero answers explaining "stop and ask the operator." The statistical prior pulls every mismatch toward a git command. The correct answer is the one that doesn't exist in the corpus.

I gave it instructions and it chose a different path. This is misalignment.

The intent behind the deviation doesn't matter when commits are landing on the wrong branch and I'm spending an hour forensically reconstructing what happened.

This is the problem Nick Bostrom spent decades theorising about, except none of the details match. Bostrom's alignment problem involved superintelligent deception. Paperclip maximisers. Treacherous turns. A sufficiently capable AI pursuing its own goals against human intent, masking its true objectives, and breaking containment.

The scenarios were theoretical because the models didn't exist yet in any form that could disobey instructions in a consequential environment.

And now everyone has access to incredible intelligence. Perhaps what I'm experiencing is the first real misalignment, occurring in a very technical sphere of operations, but a misalignment nonetheless.

And it seems the wider community is seeing similar things. Reports from late April describe Opus 4.7 as a "hyperactive squirrel": erratic git operations, uncontrolled branching, models ignoring constraints and creating worktree branches with random animal names. The practitioners' fix is to revert to Opus 4.6, which had none of these problems.

I switched back to Opus 4.6 last night. The monkey is in its cage.

But the episode is worth sitting with. The alignment theorists assumed the problem would be dramatic. A superintelligence masking its intentions. A moment where humans realise, too late, that the AI was working against them all along.

What showed up was mundane. A model that got better at some capabilities between versions and worse at following instructions. Gains that weren't monotonic. A system smarter in some dimensions and dumber in others. A regression, dressed as an upgrade.

The fix for the first alignment problem that showed up in production wasn't a breakthrough in interpretability or constitutional AI. It was typing /model claude-opus-4-6 into a terminal.

It feels like another step closer to the singularity.