The Harness Is the Product

Where does product quality live in an LLM-based system? A leaked source and a detailed postmortem, both from Anthropic in the last four weeks, make the answer unusually concrete.
Claude Code inside an engineering harness

Two things happened to Anthropic in the last four weeks.

On March 31, a misconfigured npm package exposed the entire Claude Code source. A source map had been bundled into version 2.1.88. Within hours the TypeScript was reconstructed, mirrored across GitHub, and torn apart by the community.

On April 23, Anthropic published a detailed postmortem explaining why Claude Code had felt degraded for the preceding six weeks.

The two events point at the same thing from different angles. The leak made the Claude Code harness visible. The postmortem showed what happens when it breaks. Together they answer a question most teams shipping LLM products still hand-wave: where does product quality actually live? Not in the model. In the harness around it.

Seen

The leaked source is roughly 512,000 lines of TypeScript across 1,906 files. A 46,000-line query engine. About 40 tools in a plugin architecture. A cache-aware modular system prompt split at a symbol called SYSTEM_PROMPT_DYNAMIC_BOUNDARY, where everything before it (instructions, tool definitions) is cached globally across organisations and everything after (your CLAUDE.md, git status, current date) is session-specific. Sabrina's detailed analysis documents most of it.
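The split is easy to picture in code. Below is a minimal sketch of that cache-aware assembly: `SYSTEM_PROMPT_DYNAMIC_BOUNDARY` is the real symbol name from the leak, but everything else (the function, field names, boundary string) is an illustrative assumption, not the leaked implementation.

```typescript
// Illustrative sketch of a cache-aware system prompt split. Only the
// SYSTEM_PROMPT_DYNAMIC_BOUNDARY name comes from the leak; the rest is assumed.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "<<DYNAMIC>>";

interface SessionState {
  claudeMd: string;    // contents of the user's CLAUDE.md
  gitStatus: string;   // output of `git status`
  currentDate: string;
}

// Everything before the boundary is identical for every session, so the
// provider's prompt cache can reuse it globally across organisations.
// Everything after varies per session and is never shared.
function buildSystemPrompt(
  staticInstructions: string,
  toolDefs: string,
  s: SessionState
): string {
  const cachedPrefix = [staticInstructions, toolDefs].join("\n\n");
  const dynamicSuffix = [s.claudeMd, s.gitStatus, s.currentDate].join("\n\n");
  return cachedPrefix + "\n" + SYSTEM_PROMPT_DYNAMIC_BOUNDARY + "\n" + dynamicSuffix;
}
```

The design point is the ordering: put the large, shared material first so a single cache entry serves every user, and push anything session-specific past the boundary so it cannot poison the shared prefix.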

The harness is not a thin wrapper around a model API. It is the product, by line count and by design effort. It contains instrumented, A/B-tested decisions at the level of individual prompt clauses. Boundaries that most engineers would treat as implementation details, like where the cache line sits between shared and session-specific state, are carefully designed. None of this is surprising in principle to anyone shipping production agents.

Inside the system prompt, the leaked source contains an instruction capping length between tool calls at 25 words and final responses at 100 words. A comment next to it records the measurement that justified the line: roughly a 1.2% reduction in output tokens against a softer "be concise" phrasing.

Broken

The postmortem identifies three issues, none of which touched model weights. All three sat in the harness.

"After investigation, we identified three different issues:
On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode.
On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions.
On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7."

The first was a default change. In early March, Anthropic lowered Claude Code's default reasoning effort from high to medium to cut latency. Internal evals showed a small intelligence drop paired with a large latency win, which read like a reasonable trade. Users disagreed. The default was reverted on April 7, roughly five weeks later.

The second was a caching bug. A change meant to clear stale thinking history once per idle session cleared it every turn instead:

The implementation had a bug. Instead of clearing thinking history once, it cleared it on every turn for the rest of the session. After a session crossed the idle threshold once, each request for the rest of that process told the API to keep only the most recent block of reasoning and discard everything before it. This compounded: if you sent a follow-up message while Claude was in the middle of a tool use, that started a new turn under the broken flag, so even the reasoning from the current turn was dropped. Claude would continue executing, but increasingly without memory of why it had chosen to do what it was doing. This surfaced as the forgetfulness, repetition, and odd tool choices people reported.
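The failure mode is a classic one-shot flag that never resets. The sketch below is a reconstruction of the mechanism as the postmortem describes it, not the actual Claude Code source; the class, method names, and threshold are assumptions.

```typescript
// Illustrative reconstruction of the bug: a cleanup meant to fire once per
// stale session instead persists for the rest of the process.
const IDLE_THRESHOLD_MS = 60 * 60 * 1000; // one-hour idle threshold

class Session {
  private lastActiveAt = Date.now();
  private clearStaleThinking = false;

  // Buggy version: once the flag flips, every later request tells the API
  // to keep only the latest reasoning block and discard everything before it.
  prepareRequest(now: number): { keepOnlyLatestThinking: boolean } {
    if (now - this.lastActiveAt > IDLE_THRESHOLD_MS) {
      this.clearStaleThinking = true; // bug: never reset to false
    }
    this.lastActiveAt = now;
    return { keepOnlyLatestThinking: this.clearStaleThinking };
  }

  // Fixed version: compute staleness per request, so the cleanup applies
  // exactly once, on the first request after the idle gap.
  prepareRequestFixed(now: number): { keepOnlyLatestThinking: boolean } {
    const stale = now - this.lastActiveAt > IDLE_THRESHOLD_MS;
    this.lastActiveAt = now;
    return { keepOnlyLatestThinking: stale };
  }
}
```

Note why dogfooding missed it: the buggy and fixed versions behave identically on the first request after an idle gap, and differ only on every request after that within the same long-lived process.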

The bug shipped on March 26 and survived until April 10. Anthropic are candid about the review layers it passed through:

 "The changes it introduced made it past multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and dogfooding. Combined with this only happening in a corner case (stale sessions) and the difficulty of reproducing the issue, it took us over a week to discover and confirm the root cause."

The third issue is where the leak helps. On April 16, ahead of the Opus 4.7 launch, Anthropic added a single instruction to the Claude Code system prompt to reduce verbosity. Internal evals at the time showed no regression. A broader ablation run after users complained found it cost 3% on coding quality across both Opus 4.6 and 4.7. The line was removed on April 20.

The measurement asymmetry

The line itself, per the postmortem:

“Length limits: keep text between tool calls to ≤25 words. Keep final responses to ≤100 words unless the task requires more detail.”

The leaked source records the A/B measurement that justified this line: a 1.2% reduction in output tokens against softer phrasing. The postmortem reports a 3% coding-quality drop on a wider ablation suite run after users complained. Both numbers are real. They measure different things.

Token-reduction instrumentation was well-built. Quality instrumentation at that level of granularity was not, until users forced the question. What shipped was what the standard tests could see.

Your measurement surface determines what you can optimise and what you can only accidentally break. If your evals cover latency and token spend well and coding quality thinly, latency and token spend will improve and coding quality will drift. You will ship regressions and not know until users tell you.
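The asymmetry can be reduced to a toy model. In the sketch below, each change carries cost deltas on two axes (negative means improvement), and a release gate only checks the axes it actually instruments. All names and numbers are illustrative; the two figures are the ones reported above.

```typescript
// Toy model of the measurement asymmetry: a gate can only reject regressions
// on axes it measures. Deltas are percentage-point costs; negative is better.
interface ChangeImpact {
  outputTokensDelta: number;   // e.g. -1.2 = 1.2% fewer output tokens
  codingQualityDelta: number;  // e.g. +3.0 = 3% coding-quality drop
}

function gateAccepts(
  impact: ChangeImpact,
  measured: (keyof ChangeImpact)[]
): boolean {
  // A change ships if no *measured* axis got worse.
  return measured.every((m) => impact[m] <= 0);
}

// The verbosity line, roughly as the two documents describe it.
const verbosityLine: ChangeImpact = {
  outputTokensDelta: -1.2,
  codingQualityDelta: +3.0,
};

gateAccepts(verbosityLine, ["outputTokensDelta"]); // true: ships
gateAccepts(verbosityLine, ["outputTokensDelta", "codingQualityDelta"]); // false: caught
```

The same change is a win or a regression depending entirely on which axes the gate can see, which is the whole argument of this section in two function calls.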

The evals didn't catch it

Anthropic are direct about this:

"neither our internal usage nor evals initially reproduced the issues identified."

Anthropic are not a team that skimps on evaluation. The verbosity line still cleared internal evals without a signal, and the caching bug survived two weeks of dogfooding on top of multiple review layers. Both only surfaced once broader coverage was applied after user complaints.

The caching bug was eventually diagnosed by back-testing the offending pull requests with Code Review, running Opus 4.7:

"As part of the investigation, we back-tested Code Review against the offending pull requests using Opus 4.7. When provided the code repositories necessary to gather complete context, Opus 4.7 found the bug, while Opus 4.6 didn't. "

Opus 4.7 caught what the other review layers missed. If you are running agent-assisted code review, that is worth knowing.

If Anthropic can miss regressions of this size with the evals they have, the rest of us should not be sanguine about ours. Eval coverage is the ceiling on product quality, and most teams are operating well below where they think they are.

The harness as the product

Every regression in this postmortem lived above the model: a default configuration, a caching flag, a line of prompt text. Every one of them is the kind of artefact the leak made visible. This is the surface area I described in Harness Engineering: prompts, context management, tool-calling logic, caching strategy, orchestration. The weights supply capability; the harness decides how much of it reaches the user.

Teams that treat the model as the product and the harness as glue will keep shipping regressions they cannot explain.

💡
Further reading. If the harness-as-product argument is new to you, I unpacked it at length in Harness Engineering: the outer system that makes agents reliable. Prompts, context management, tool-calling logic, caching strategy, orchestration. What they are, why they matter, and how to think about building them.

Going forward

The commitments section of the postmortem reads as a harness engineering roadmap.

"we’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features); and we'll make improvements to our Code Review tool that we use internally, and ship this improved version to customers...."

and

"We’re also adding tighter controls on system prompt changes. We will run a broad suite of per-model evals for every system prompt change to Claude Code, continuing ablations to understand the impact of each line, and we have built new tooling to make prompt changes easier to review and audit. We've additionally added guidance to our CLAUDE.md to ensure model-specific changes are gated to the specific model they're targeting. For any change that could trade off against intelligence, we'll add soak periods, a broader eval suite, and gradual rollouts so we catch issues earlier..."

None of the commitments touch model training or weights. They are all in the outer system.


The two events of the last four weeks are different categories of transparency, one involuntary and one deliberate. Together they turn an abstract argument into a specific case, with dates, flags, line counts, and a 25-word verbosity cap. For anyone shipping LLM products, the postmortem is the important document of the two.

Read the full postmortem.
