The Discipline Layer: Harnesses as the Missing Piece in Autonomous Coding

Introduction

If you've been working with AI agents on longer tasks, you've probably developed your own tricks for dealing with context window limits. Maybe you hit /summarize in Cursor when things get bloated, or you ask the agent to write a summary.md file at the end of each session documenting what it did and what's left to do, then feed that file to a fresh agent later. These workarounds... kind of work. They get you through the immediate problem.

But here's the thing: these are exactly that - workarounds. They help manage context, sure, but they don't solve the deeper issues. Agents still declare victory too early. They still skip proper testing. They still leave half-finished implementations when they run out of context. You can summarize all you want, but if the agent doesn't have structured ways to verify its work or track actual completion, you're just compressing the chaos.

Anthropic recently published an account of how they built production-grade harnesses for long-running autonomous agents, and it's interesting because it shows the gap between "getting it to work once" and "building reliable, repeatable systems." This isn't about research or theory - it's engineering lessons from actually shipping autonomous coding agents. Let's dig into what they learned.

The Real Problem Isn't Just Memory

When you summarize context or write .md files tracking progress, you're solving one problem: memory loss between sessions. That's valid. But research into agent failures (a study analyzing 1,600+ agent traces) shows that's not the only problem, or even the biggest one:

  • System design issues (41.8%) - How roles, memory, and prompts are structured
  • Inter-agent problems (36.9%) - Agents failing to communicate or getting stuck in loops
  • Task verification failures (21.3%) - Agents assuming work is done without actually testing it

That last one is critical. An agent reading your summary.md file might understand what was done, but there's nothing forcing it to verify that work actually functions before moving on. It'll just trust the previous summary and keep building on potentially broken foundations.

Anthropic tested this with Claude Opus 4.5 building a clone of claude.ai. Even with their best model, agents consistently failed in two specific ways:

The one-shot problem: the agent tries to build everything at once, runs out of context halfway through, and leaves garbage everywhere. The next session has to waste time cleaning up instead of making progress.

The premature victory problem: the agent looks around, sees some things working, and declares the project done - even though half the features are missing.

Manual summarization doesn't prevent either of these. You need structure that forces different behavior.

The Production Solution: Force Structure on the Agent

If you're pair programming in an IDE, your summary.md workarounds are probably fine. You're there to catch issues and guide the agent. But this is about autonomous scenarios: building full applications overnight, running multi-day research workflows, handling complex tasks without constant human oversight. That's where you need systematic architecture, not ad-hoc tricks. Whether you're building with the Anthropic API, creating custom agent workflows, or designing any system that needs reliability without a human babysitting it, these patterns matter.

Anthropic's fix isn't a smarter model. It's better structure around the model. They built what they call a "harness" - basically forcing strict project management on an AI that would otherwise just wing it.

The core idea: instead of one big goal ("build this app"), split the work between two types of agents with very specific jobs.
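Concretely, the control flow looks something like the sketch below. This is a minimal illustration, not Anthropic's code: run_initializer and run_coding_session are hypothetical hooks for however you invoke the agent (API calls, Claude Code sessions, etc.), and features.json is an assumed name for the feature list described in the next section.

```python
# Minimal sketch of the two-agent split: one setup session, then repeated
# coding sessions until every feature in the list passes.
import json
from pathlib import Path
from typing import Callable

FEATURES = Path("features.json")  # assumed name for the JSON feature list

def all_features_pass() -> bool:
    return all(f["status"] == "pass" for f in json.loads(FEATURES.read_text()))

def run_harness(run_initializer: Callable[[], None],
                run_coding_session: Callable[[], None],
                max_sessions: int = 50) -> None:
    if not FEATURES.exists():
        run_initializer()            # one-time setup (see "The Initializer Agent")
    for _ in range(max_sessions):
        if all_features_pass():
            break                    # explicit completion state, not a guess
        run_coding_session()         # each session finishes exactly ONE feature
```

The point of the outer loop is that "done" is a property of the feature list, not something the agent gets to declare for itself.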

The Initializer Agent

The first session doesn't write any real code. It just sets everything up:

  1. Creates a feature list - For the claude.ai clone, this was 200+ specific features in JSON, all marked as failing. This stops the agent from randomly deciding it's done.
  2. Sets up git and a progress log - The git history and a simple progress.txt file become the agent's memory across sessions.
  3. Writes a startup script - So future sessions don't waste time figuring out how to run the dev server.

This is just task decomposition - how humans normally write software. But it turns out agents need it spelled out explicitly.
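To make that concrete, here's a minimal sketch of the artifacts an initializer session might set up. The file names (features.json, init.sh) and the JSON shape are assumptions for illustration; the article only specifies a JSON feature list, a progress.txt file, git, and a startup script.

```python
# Sketch: one-time project setup. File names and formats are illustrative.
import json
import subprocess
from pathlib import Path

def initialize_project(feature_descriptions: list[str]) -> None:
    # 1. Feature list: everything starts as failing, so "done" has a precise,
    #    checkable meaning instead of being left to the agent's judgment.
    features = [{"id": i, "description": d, "status": "fail"}
                for i, d in enumerate(feature_descriptions)]
    Path("features.json").write_text(json.dumps(features, indent=2))

    # 2. Git + progress log: the agent's memory across sessions.
    Path("progress.txt").write_text("Session 0: project initialized\n")
    subprocess.run(["git", "init"], check=True)
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", "Initialize harness artifacts"], check=True)

    # 3. Startup script: future sessions shouldn't re-derive how to run the app.
    #    (Assumes a Node dev server; adapt to your stack.)
    Path("init.sh").write_text("#!/bin/sh\nnpm install\nnpm run dev\n")
    Path("init.sh").chmod(0o755)
```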

The Coding Agent

Every session after that follows the same workflow (a rough sketch in code follows the list):

  1. Read git logs and the progress file to understand what's been done
  2. Start the dev server and run a basic test to make sure nothing is broken
  3. Pick exactly ONE feature from the list
  4. Implement it completely
  5. Test it properly (with browser automation, not just unit tests)
  6. Commit to git, update the progress file, mark the feature as done
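Here's that sketch, reusing the features.json / progress.txt / init.sh names assumed earlier. implement_feature and verify_in_browser are placeholders for the agent's actual work and the browser-level check, not real functions from Anthropic's harness.

```python
# Sketch of one coding session. The two callables stand in for the agent's
# implementation work and for browser-automation verification.
import json
import subprocess
from pathlib import Path
from typing import Callable

def run_coding_session(implement_feature: Callable[[dict, str], None],
                       verify_in_browser: Callable[[dict], bool]) -> None:
    # 1. Recover context from the harness, not from the model's memory.
    git_log = subprocess.run(["git", "log", "--oneline", "-20"],
                             capture_output=True, text=True).stdout
    progress = Path("progress.txt").read_text()

    # 2. Start the dev server and make sure nothing is already broken.
    server = subprocess.Popen(["sh", "init.sh"])  # assumed startup script

    try:
        # 3. Pick exactly ONE failing feature.
        features = json.loads(Path("features.json").read_text())
        feature = next(f for f in features if f["status"] == "fail")

        # 4-5. Implement it completely, then verify it externally.
        implement_feature(feature, git_log + progress)
        if not verify_in_browser(feature):
            return  # stays marked as failing; a later session retries it

        # 6. Persist state so the next session can pick up cleanly.
        feature["status"] = "pass"
        Path("features.json").write_text(json.dumps(features, indent=2))
        with Path("progress.txt").open("a") as log:
            log.write(f"Completed feature {feature['id']}: {feature['description']}\n")
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run(["git", "commit", "-m", f"Implement feature {feature['id']}"],
                       check=True)
    finally:
        server.terminate()
```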

The key difference from summarization: external verification is mandatory. The agent can't just say "I think this is done, so it must work" and move on. It has to use browser automation to actually click buttons, type into forms, and see results - to test the way a human user would.

When researchers compared approaches, they found that asking agents to "double-check your work" is nearly useless. Self-assessment is unreliable. Actually running the code and verifying behavior? That's what matters.
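To show what that kind of external verification looks like, here's a sketch of a browser-level check for a hypothetical "send a chat message" feature, written with Playwright's Python API. The article describes Puppeteer-driven testing inside the agent loop; the specific tool, URL, and selectors here are illustrative, not Anthropic's.

```python
# Sketch: verify a "send a chat message" feature by driving the real UI.
# Requires Playwright (pip install playwright && playwright install chromium).
from playwright.sync_api import sync_playwright

def verify_chat_send(base_url: str = "http://localhost:3000") -> bool:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(base_url)
        page.fill("textarea", "Hello from the harness")   # type like a user
        page.click("button[type=submit]")                  # click like a user
        page.wait_for_selector(".message", timeout=5000)   # wait for the result
        ok = "Hello from the harness" in page.inner_text(".message")
        browser.close()
        return ok
```

The check exercises the running application end to end, so a feature only counts as done when the UI actually behaves.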

Why This Matters: The Jira Realization

When Anthropic published this, a bunch of developers pointed out something funny: "You just reinvented Jira." And they're right. The breakthrough here is literally breaking tasks into small pieces, tracking progress, iterating one thing at a time, requiring verification before marking complete.

That's standard software development methodology. We've been doing this for decades. So why is it a "breakthrough" for AI agents?

Because left to their own devices, agents work like undisciplined juniors - starting too many things, not finishing properly, not documenting anything, assuming code works without testing. Your summary.md hack helps them remember what happened, but it doesn't enforce discipline around how they work.

The harness forces discipline. It's project management for an employee with infinite energy but zero short-term memory.

Does It Actually Work?

Fair question. Is this just theory or does it solve the problem?

Anthropic ran the full test. Their agent built a working claude.ai clone across multiple sessions. The git history shows steady progress - one feature at a time, properly tested, properly documented. No spiral into half-finished mess. No declaring victory too early.

The testing piece matters a lot here. When Claude uses browser automation to test features - actually clicking buttons, typing in forms, seeing the results - it catches bugs that aren't obvious from just reading the code. The model saying "I think this works" is unreliable; running the app and watching it work is what counts.

There are still issues. Claude's vision can't see browser alert modals through Puppeteer, so features using those were buggier. And if an agent makes a really bad decision about what to remember, there's no way to recover that.

But the core approach works because it forces the agent to act like a professional: work incrementally, test properly, document everything, leave things in a state where the next session can continue.

From Workarounds to Architecture

Here's what separates a hack from a production system:

Your summary.md approach:

  • Solves: Memory loss between sessions
  • Doesn't solve: Premature completion, lack of verification, unclear completion criteria, no enforcement of incremental progress

Anthropic's harness:

  • Solves: Memory loss + enforces structure + requires verification + tracks explicit completion state + forces incremental work

Both use similar primitives (text files, git, summaries). But one is "make it work for my project" and the other is "build a reliable system that works consistently."

What This Means for Developers

This changes what humans do. If agents handle implementation, your role shifts to something like a product manager - define the specs, review the output, maintain the overall vision.

You can't just prompt your way to production-ready agents. You need to build structure that forces the AI to do four things (one way to encode them is sketched after the list):

  • Document its work
  • Verify output with real tests, not self-assessment
  • Work in small chunks
  • Track progress explicitly
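One way to enforce those constraints is to bake them into the instructions every session receives instead of hoping the model remembers them. Here's a minimal sketch using the Anthropic Python SDK; the rule text just paraphrases the harness described above, the model id is a placeholder, and none of this is Anthropic's actual harness code.

```python
# Sketch: per-session instructions that enforce the harness rules.
# Requires the anthropic package and an ANTHROPIC_API_KEY in the environment.
import anthropic

HARNESS_RULES = """You are one session in a long-running project.
Rules for this session:
1. Read progress.txt and the recent git log before doing anything else.
2. Run the startup script and confirm the app still works.
3. Pick exactly ONE failing feature from features.json.
4. Implement it completely, then verify it through the browser, not by inspection.
5. Commit your work, update progress.txt, and mark the feature as passing
   only if the browser check succeeded."""

client = anthropic.Anthropic()

def start_session(task_context: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",   # placeholder; use your actual model id
        max_tokens=4096,
        system=HARNESS_RULES,      # the discipline lives in the harness, not the prompt history
        messages=[{"role": "user", "content": task_context}],
    )
    return response.content[0].text
```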

The Real Lesson

The best agents won't have the longest context windows or the most parameters. They'll have the best context-engineering scaffolding - the right structure, verification systems, and memory management to actually complete work across sessions.

It's not exciting or revolutionary. But it works. And maybe that's the lesson: AI engineering is landing on the same conclusions software engineering reached decades ago. Structure matters. Verification matters. Small incremental progress with clear docs beats big plans done sloppily.

The future of AI agents is not only about making them smarter. It's about making them disciplined too.


Code examples: Claude Quickstarts - Autonomous Coding