Harness Engineering: The Outer System That Makes Agents Reliable
A lot of what gets blamed on the model is really a problem with the harness around it. I think more teams need to start looking there first.
For the last few years, most of the conversation has been about model quality, prompting, and then context engineering. All of that matters. But if you are trying to build an agent that does real work over time, those are no longer the whole story. The bigger story is the system around the model.
That is what I mean by harness engineering.
A model can reason, classify, plan, and generate. That is impressive, but it is still only raw capability. On its own, it does not persist state across sessions, manage permissions, decide what context matters, verify whether its work is correct, or leave behind a clean execution trail that an operator can inspect. Those are not side concerns. They are the difference between a neat demo and a dependable agent.
So my thesis is simple:
Building a good harness is what separates a good agentic implementation from a great one.
Put more bluntly, better models alone will not solve agent reliability. In practice, I increasingly think harness engineering subsumes both prompt engineering and context engineering. Prompting is about how we ask. Context engineering is about what we show. Harness engineering is the broader discipline that decides the full operating environment in which the model works.
The result is a more useful mental model:
The model provides intelligence. The harness makes that intelligence usable.
What a harness is
I define an agent harness broadly.
A harness is the complete system around the model that makes it able to do useful work reliably. That includes prompts, memory, retrieval, tools, execution environments, orchestration logic, verification loops, guardrails, and observability.
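To make that definition concrete, here is a minimal sketch of a harness skeleton. All of the names here (`Harness`, `Memory`, `run_step`, and so on) are hypothetical illustrations, not from any specific framework; the point is that the outer loop, not the model call, owns state, permissions, verification, and logging.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Durable state the model itself does not keep between calls."""
    events: list = field(default_factory=list)

    def recall(self, task: str) -> list:
        # The retrieval policy lives in the harness, not the model.
        # (Naive recency here; real systems use smarter retrieval.)
        return self.events[-5:]

    def record(self, event: dict) -> None:
        self.events.append(event)

class Harness:
    def __init__(self, model, tools, memory, verifier, logger):
        self.model = model          # raw capability
        self.tools = tools          # allowed actions: a permission boundary
        self.memory = memory        # persistence across sessions
        self.verifier = verifier    # checks correctness, not plausibility
        self.logger = logger        # execution trail an operator can inspect

    def run_step(self, task: str) -> dict:
        context = self.memory.recall(task)      # what the model sees
        action = self.model(task, context)      # what the model proposes
        if action["tool"] not in self.tools:    # what it is allowed to do
            result = {"error": f"tool {action['tool']!r} not permitted"}
        else:
            result = self.tools[action["tool"]](action["args"])
        ok = self.verifier(task, result)        # did it actually work?
        self.memory.record({"task": task, "action": action, "ok": ok})
        self.logger({"task": task, "action": action,
                     "result": result, "ok": ok})
        return {"result": result, "verified": ok}
```

Notice that the model appears on exactly one line. Everything else, context selection, the permission check, verification, memory, and the log entry, is harness behaviour, which is why so much of what "the agent" does is really this outer system.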
This broader definition matters because a lot of agent discourse still underestimates how much of agent behaviour comes from the outer system rather than the model itself. If your team says “the agent did X”, it is usually worth asking a sharper question: did the model do that, or did the harness make that possible?
That distinction is not academic. It changes where you look when things fail, and it changes what you invest in when you want the system to improve.
A poor harness leaves model capability on the table. A good one makes far better use of it.
That is why I think the line “the model is commodity; the harness is moat” lands so well, even if it is slightly provocative. I would phrase it a bit more carefully like this: model capability sets the ceiling, but harness quality determines how much of that ceiling you actually realise in production.
Why models need harnesses
The clearest way to understand a harness is to work backwards from what we want an agent to do and ask what the model cannot do cleanly on its own.
We want agents to remember what happened before, access relevant information at the right moment, take actions in external systems, continue across long tasks without losing coherence, verify whether they have actually solved the problem, and leave behind enough traceability for humans to inspect and improve the system.
A raw model does not naturally give us that.
Yes, a model can produce plans and propose actions. But it does not, by itself, maintain durable state. It does not decide what to retrieve from a memory store. It does not create its own safe runtime boundary. It does not reliably determine whether an output is merely plausible or actually correct. And it does not produce the kind of observability operators need.
That is the role of the harness.
This is also why I do not find “just use a better model” persuasive as a default answer to agent reliability problems. Stronger models help. Of course they do. But a more capable model inside a sloppy harness is still a sloppy system. In many real deployments, reliability is bottlenecked less by raw intelligence and more by missing structure around that intelligence.
The question is no longer only, “How smart is the model?”
It is also:
- What can it see?
- What can it do?
- What can it remember?
- How does it verify itself?
- How do we inspect failure and recover?
Those are harness questions.
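The verification question in particular deserves a sketch, because it is where "plausible" and "correct" come apart. Below is a hedged illustration of a harness-side verify-and-retry loop; `model` and `check` are hypothetical stand-ins for a model call and a task-specific verifier (a test suite, schema validation, a diff against a spec), and the attempt trail doubles as the inspectable record the last question asks for.

```python
def verified_call(model, task, check, max_attempts=3):
    """Call the model until an answer passes an external check.

    The harness, not the model, decides what counts as correct,
    and it keeps a trail of every attempt for later inspection.
    """
    attempts = []
    for _ in range(max_attempts):
        # Prior failures are fed back so the model can self-correct.
        answer = model(task, feedback=attempts)
        ok, reason = check(answer)          # correctness, not plausibility
        attempts.append({"answer": answer, "ok": ok, "reason": reason})
        if ok:
            return answer, attempts
    raise RuntimeError(
        f"unverified after {max_attempts} attempts: {attempts}"
    )
```

Even this toy loop changes the failure mode: instead of a confident wrong answer, the operator gets either a verified answer or an explicit, logged failure with the reasons attached.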