Harness Engineering: The Outer System That Makes Agents Reliable
A lot of what gets blamed on the model is really a problem with the harness around it. I think more teams need to start looking there first.
For the last few years, most of the conversation has been about model quality, prompting, and then context engineering. All of that matters. But if you are trying to build an agent that does real work over time, those are no longer the whole story. The bigger story is the system around the model.
That is what I mean by harness engineering.
A model can reason, classify, plan, and generate. That is impressive, but it is still only raw capability. On its own, it does not persist state across sessions, manage permissions, decide what context matters, verify whether its work is correct, or leave behind a clean execution trail that an operator can inspect. Those are not side concerns. They are the difference between a neat demo and a dependable agent.
So my thesis is simple:
Building a good harness is what separates a good agentic implementation from a great one.
Put more bluntly, better models alone will not solve agent reliability. In practice, I increasingly think harness engineering subsumes both prompt engineering and context engineering. Prompting is about how we ask. Context engineering is about what we show. Harness engineering is the broader discipline that decides the full operating environment in which the model works.
The result is a more useful mental model:
The model provides intelligence. The harness makes that intelligence usable.
What a harness is
I define an agent harness broadly.
A harness is the complete system around the model that makes it able to do useful work reliably. That includes prompts, memory, retrieval, tools, execution environments, orchestration logic, verification loops, guardrails, and observability.
This broader definition matters because a lot of agent discourse still underestimates how much of agent behaviour comes from the outer system rather than the model itself. If your team says “the agent did X”, it is usually worth asking a sharper question: did the model do that, or did the harness make that possible?
That distinction is not academic. It changes where you look when things fail, and it changes what you invest in when you want the system to improve.
A poor harness leaves model capability on the table. A good one makes far better use of it.
That is why I think the line “the model is commodity; the harness is moat” lands so well, even if it is slightly provocative. I would phrase it a bit more carefully like this: model capability sets the ceiling, but harness quality determines how much of that ceiling you actually realise in production.
Why models need harnesses
The clearest way to understand a harness is to work backwards from what we want an agent to do and ask what the model cannot do cleanly on its own.
We want agents to remember what happened before, access relevant information at the right moment, take actions in external systems, continue across long tasks without losing coherence, verify whether they have actually solved the problem, and leave behind enough traceability for humans to inspect and improve the system.
A raw model does not naturally give us that.
Yes, a model can produce plans and propose actions. But it does not, by itself, maintain durable state. It does not decide what to retrieve from a memory store. It does not create its own safe runtime boundary. It does not reliably determine whether an output is merely plausible or actually correct. It does not produce observability in the way operators need it.
That is the role of the harness.
This is also why I do not find “just use a better model” persuasive as a default answer to agent reliability problems. Stronger models help. Of course they do. But a more capable model inside a sloppy harness is still a sloppy system. In many real deployments, reliability is bottlenecked less by raw intelligence and more by missing structure around that intelligence.
The question is no longer only, “How smart is the model?”
It is also:
- What can it see?
- What can it do?
- What can it remember?
- How does it verify itself?
- How do we inspect failure and recover?
Those are harness questions.
Harness primitives
I find it useful to separate primitives from patterns.
Primitives are the core capabilities the harness gives the model. Patterns are the recurrent ways we combine those capabilities to make work actually happen. If you blur the two together, the discussion becomes mushy very quickly.
The six primitives I find most useful are these.
1. Memory and context management
This primitive governs what the model knows at any moment.
That includes system instructions, retrieved documents, memory stores, working state, summaries of prior steps, and context compaction. It is the part of the harness that decides what enters the context window, what stays out of it, what gets compressed, and what gets persisted elsewhere.
This matters because context is scarce. More context is not automatically better context. In fact, one of the most common ways to make an agent worse is to stuff it with too much undifferentiated material and call that sophistication.
A good harness treats context as allocation, not accumulation.
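To make "allocation, not accumulation" concrete, here is a minimal sketch of a context assembler that ranks candidate items by relevance and admits them against a fixed token budget instead of concatenating everything. The field names and scoring scheme are illustrative assumptions, not a prescription.

```python
# Sketch: treat the context window as a budget to allocate.
# Each candidate item carries an (assumed) relevance score and token count;
# we admit the highest-scoring items that fit, then restore document order.

def build_context(items, budget_tokens):
    """items: list of dicts with 'text', 'tokens', 'score' (higher = more relevant)."""
    chosen = []
    used = 0
    for item in sorted(items, key=lambda i: i["score"], reverse=True):
        if used + item["tokens"] <= budget_tokens:
            chosen.append(item)
            used += item["tokens"]
    # Preserve the items' original order so the assembled context reads coherently.
    chosen.sort(key=lambda i: items.index(i))
    return "\n\n".join(i["text"] for i in chosen), used
```

The point is the shape of the decision: something in the harness explicitly chooses what enters the window, rather than letting the prompt grow by default.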
2. Tools and runtime
This primitive governs what the model can do.
Tools expose actions. Runtime provides the environment in which those actions happen. That might include APIs, databases, browsers, sandboxes, queues, or enterprise systems. The model chooses from an action surface, but the harness defines that surface and its boundaries.
This is why tool access alone is not enough. Giving a model a dozen tools does not magically create a dependable agent. The harness still has to define the shape of action, the constraints around execution, and the way results flow back into context.
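One way to make the action surface explicit is a small registry where every tool declares its required arguments and the harness validates calls before executing them. This is a sketch under assumed conventions; the tool names and validation rules are illustrative.

```python
# Sketch: the harness, not the model, defines the shape of action.
# Tools are registered with a declared argument schema, and calls are
# validated against that schema before anything executes.

class ToolError(Exception):
    pass

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, fn, required_args):
        self._tools[name] = (fn, set(required_args))

    def call(self, name, args):
        if name not in self._tools:
            raise ToolError(f"unknown tool: {name}")
        fn, required = self._tools[name]
        missing = required - set(args)
        if missing:
            raise ToolError(f"missing args for {name}: {sorted(missing)}")
        return fn(**args)
```

A real harness would also shape the results flowing back into context; this only shows the boundary on the way in.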
3. Orchestration and coordination
This primitive governs how work is sequenced.
Some tasks are one-shot. Many are not. Real agents often need planning steps, retries, branching logic, hand-offs, escalation points, and occasionally multiple cooperating agents. Even when people say “multi-agent system”, much of what they really mean is orchestration logic inside the harness.
The model may generate a plan, but the harness decides how work moves.
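As a sketch of "the harness decides how work moves": a driver that owns step ordering and retries, so the model proposes steps but the harness controls execution. The step interface here is an assumption for illustration.

```python
# Sketch: harness-side sequencing. The harness owns retries and ordering;
# each step is a function returning (ok, result).

def run_plan(steps, max_retries=2):
    """steps: list of (name, fn) pairs; fn returns (ok, result)."""
    results = {}
    for name, fn in steps:
        for _attempt in range(max_retries + 1):
            ok, result = fn()
            if ok:
                results[name] = result
                break
        else:
            # Exhausted retries without success: surface the failure explicitly.
            raise RuntimeError(f"step {name} failed after {max_retries + 1} attempts")
    return results
```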
4. Runtime guardrails
I separate guardrails from tools because permissioning, isolation, and recoverability deserve explicit attention.
A runtime is not just a place where actions happen. It is also the boundary that decides what the agent is allowed to touch, when approvals are required, how failures are contained, and what happens when execution goes wrong.
This is not only about safety in the abstract. It is about practical reliability. An agent that cannot be constrained is not autonomous. It is brittle.
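A permission boundary can be as simple as a policy table consulted before every action, with a default-deny stance and an approval hook for risky operations. The policy levels and action names below are illustrative assumptions.

```python
# Sketch: an explicit runtime boundary. Actions are classified by policy;
# unknown actions are denied by default, and risky ones require approval.

SAFE = "safe"
NEEDS_APPROVAL = "needs_approval"
FORBIDDEN = "forbidden"

def guarded_execute(action, policy, execute, request_approval):
    level = policy.get(action, FORBIDDEN)  # default-deny for unlisted actions
    if level == FORBIDDEN:
        return ("blocked", None)
    if level == NEEDS_APPROVAL and not request_approval(action):
        return ("denied", None)
    return ("executed", execute(action))
```

The design choice worth noting is the default: anything not explicitly permitted is blocked, which is what makes the agent constrainable rather than brittle.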
5. Evaluation and verification
This primitive governs how the system checks whether work is correct.
Without it, the agent is just producing fluent guesses. With it, the system can compare output against tests, rules, expected state transitions, downstream feedback, or human approvals.
This is one of the deepest shifts in agent design. The first answer should not be trusted just because it sounds good. The harness needs a way to ground the model against reality.
6. Tracing and observability
This primitive governs whether the system is inspectable.
Operators need execution traces, logs, tool trajectories, latency, costs, and failure records. Without observability, teams cannot debug, govern, or improve the harness. They can only react to symptoms.
A lot of supposedly intelligent behaviour becomes much less mysterious once you can inspect the trace. And many supposed model failures turn out to be poor retrieval, weak tool schemas, missing verification, or bad orchestration.
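A minimal version of harness-side tracing is a wrapper that records every step's name, outcome, and duration so a failure can be reconstructed afterwards. The event fields are illustrative, not a standard schema.

```python
# Sketch: wrap each tool or step so the harness records an execution trace
# (step name, success/failure, duration) whether the call succeeds or raises.

import time

class Tracer:
    def __init__(self):
        self.events = []

    def traced(self, name, fn):
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                self.events.append({"step": name, "ok": True,
                                    "secs": time.monotonic() - start})
                return result
            except Exception as exc:
                self.events.append({"step": name, "ok": False, "error": str(exc),
                                    "secs": time.monotonic() - start})
                raise
        return wrapper
```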
Harness patterns
If primitives are the building blocks, patterns are the recurring designs that make them useful.
A few patterns show up again and again.
The first is plan, act, verify. The model does not just generate an answer. It decomposes the task, takes action, checks the result, and then continues. This is one of the simplest ways to turn a raw model interaction into a more dependable work loop.
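The plan, act, verify loop can be sketched as a small driver that feeds verification failures back into the next attempt. The `act` and `verify` interfaces here are assumptions for illustration; in practice `act` would invoke the model and `verify` would run tests or rules.

```python
# Sketch: a plan-act-verify loop. The harness keeps retrying until the
# output passes explicit checks, feeding failures back as context.

def plan_act_verify(task, act, verify, max_rounds=3):
    """act: (task, feedback) -> candidate output; verify: output -> list of failures."""
    feedback = []
    for _ in range(max_rounds):
        output = act(task, feedback)
        failures = verify(output)
        if not failures:
            return output
        feedback = failures  # inform the next attempt
    raise RuntimeError(f"unverified after {max_rounds} rounds: {failures}")
```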
The second is progressive disclosure. Do not load everything into context up front. Start with minimal instructions and fetch more specific context, skills, or documentation only when the task requires it. This reduces clutter and helps resist context rot.
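Progressive disclosure can be sketched as lazy context loading: start from minimal instructions and fetch extra material only when the task actually calls for it. The keyword-based trigger below is a deliberately crude illustration; real systems would use retrieval or tool calls.

```python
# Sketch: progressive disclosure. Extra context is fetched on demand
# via loaders keyed by topic, instead of being loaded up front.

def assemble_prompt(task, base_instructions, loaders):
    """loaders: dict mapping keyword -> zero-arg function returning extra context."""
    parts = [base_instructions]
    for keyword, load in loaders.items():
        if keyword in task.lower():
            parts.append(load())  # fetched only when the task mentions the topic
    return "\n\n".join(parts)
```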
The third is state outside the context window. Long-running work cannot depend entirely on transient conversation state. The harness needs durable state, whether that is a memory file, a workflow record, a task ledger, or a structured store the agent can revisit.
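A task ledger persisted outside the context window can be as simple as a JSON file the agent reloads at the start of each session. The file format and field names are illustrative assumptions.

```python
# Sketch: durable state outside the context window. A task ledger is
# persisted to disk so progress survives across sessions.

import json
from pathlib import Path

def save_state(path, state):
    Path(path).write_text(json.dumps(state, indent=2))

def load_state(path):
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    return {"done_steps": [], "notes": []}  # fresh ledger on first run
```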
The fourth is clean-context continuation. When a task grows long, sometimes the right move is not to keep stuffing the same context window. It is to persist the current state, re-enter with a cleaner context, and resume from a more legible checkpoint.
The fifth is approval checkpoints. Not every action should be fully autonomous. Some transitions should require a human decision, especially where there is irreversible business impact. Good harnesses do not maximise autonomy at all costs. They place autonomy where it is useful and checkpoints where they are prudent.
These are not implementation details. They are recurring design patterns that make the difference between an agent that merely acts and an agent that can be operated.
Common failure modes
This is where harness thinking becomes concrete.
When agents fail in production, they usually fail in recognisable ways.
Context rot
The agent starts strong, then degrades as the task continues. Relevant facts blur together. Noise accumulates. Reasoning quality drops. This is usually not solved by bigger prompts. It is solved by better context management.
Brittle tool use
The agent technically has tools, but uses them poorly. It calls the wrong one, misreads the results, or loops through actions without real progress. This often points to weak tool design, bad result shaping, or poor orchestration.
Early stopping
The agent stops at the first plausible answer rather than the completed task. It gives something fluent enough to sound done, but the underlying work is partial. This is a classic verification and orchestration problem.
No durable state
The agent forgets what mattered between sessions or across long-running work. It repeats steps, loses progress, or fails to build continuity. This is a memory and workflow-state problem.
Verification theatre
The system appears to be checking itself, but the checks are shallow, circular, or disconnected from reality. It “verifies” outputs in ways that do not actually test correctness. This is worse than having no verification, because it creates false confidence.
Un-debuggable execution
Something went wrong, but nobody can reconstruct why. There is no usable trace, no meaningful log, and no clean view of the decision path. At that point, improvement becomes guesswork.
Notice what ties these failure modes together. They are not primarily about intelligence in the abstract. They are about system design.
Trade-offs
A better harness is not a free lunch.
Every primitive introduces trade-offs, and mature teams take those seriously.
- More tools create a richer action surface, but they also increase complexity and the chance of misuse.
- More memory improves continuity, but it can also introduce stale or conflicting state.
- More orchestration can improve reliability, but it can also make systems slower, harder to reason about, and more brittle if the control flow becomes too elaborate.
- More verification improves quality, but it costs time, compute, and latency.
- More guardrails reduce risk, but they can also constrain useful autonomy if they are designed too bluntly.
- More observability helps operators, but it also increases instrumentation overhead and demands better discipline in how traces are analysed.
This is one reason I do not like simplistic agent conversations that reduce everything to “give the model more tools” or “use a bigger context window”. Harness engineering is about making structured trade-offs in pursuit of dependable work.
A simple maturity model
Most teams do not move from prompt wrapper to robust agent system in one jump. They progress through stages.
Level 1. Prompt wrapper
This is a model with instructions and maybe a little conversational history. Useful for assisted interactions. Weak for real agency.
Level 2. Tool-using agent
The model can call tools and trigger actions. This is where many teams stop and declare victory too early. Tool access creates capability, but not reliability.
Level 3. Stateful harness
The system persists working state, retrieves relevant context, and can continue across longer tasks with less drift. This is where the agent starts to feel less like a demo and more like a system.
Level 4. Verified harness
The harness grounds outputs against checks, tests, rules, approvals, or downstream state. The system no longer assumes fluent output is sufficient.
Level 5. Operable system
The system is observable, governable, and improvable. Operators can inspect traces, understand failure, and refine the harness over time.
The important point is not the labels. It is the progression. Most agent teams overestimate how far they have advanced because they confuse action with reliability.
Why this remains durable even as models improve
Some of today’s harness features will absolutely be absorbed into future models.
Models will get better at planning, better at tool use, better at self-checking, and better at staying coherent over longer horizons. I expect that. The boundary between model and harness will keep shifting.
But I do not think that makes harness engineering temporary.
Even if models internalise more of what we currently scaffold from the outside, the outer system still matters. Agents still need access to real environments, explicit permission boundaries, durable state, verification against external reality, and operational visibility. Those are not temporary patches. They are part of what it means to turn intelligence into dependable work.
That is why I think harness engineering is durable as a discipline. And it also changes the role of the engineer.
Increasingly, the leverage is not only in writing business logic by hand. It is in designing the environment in which model-driven work happens: the constraints, the memory architecture, the execution model, the verification loop, and the observability needed to improve the whole thing over time.
That is a different kind of engineering. More systems-oriented. More operational. More architectural. In many cases, more important.
The operator's lens
If I were explaining this to a team, I would end with five questions.
- What can the model see?
- What can it do?
- What can it remember?
- How does it know whether it is right?
- How do we inspect failure and improve?
Those five questions are more useful to me than asking whether a system is “agentic”. They force attention onto the actual operating conditions that determine reliability.
That, to me, is the heart of harness engineering. The model still matters. But the model is no longer the whole game. The outer system is where a lot of the real engineering now lives. And if there is one practical takeaway I would push teams toward, it is this:
Treat context as a systems problem, not a prompt problem.
Because once you do that, you stop treating agent quality as a matter of clever wording and start designing the machinery that makes intelligent behaviour dependable.