Welcome to Middle Loop Engineering

Where engineering rigour goes now that AI writes the code
A new supervisory layer is forming between coding and delivery

Coding agents are getting faster and more capable, but our ability to evaluate what they produce hasn't kept up. I'm seeing this pattern across defence, financial services, insurance, and every other regulated domain I work in. Coding agents burn through backlogs and open PRs faster than anyone in the approval chain can say yes or no, and the bottleneck has moved from engineering capacity to decision-making capacity.

Thoughtworks recently published findings from a senior engineering retreat that worked through this problem.

Their core thesis:

Engineering rigour doesn't disappear when AI writes code. It migrates upstream into specs, test suites, type systems, and risk mapping. A new category of supervisory work is forming between inner-loop coding and outer-loop delivery, which they call the "middle loop."

That thesis is largely right, but the retreat has blind spots and a few of its conclusions are too optimistic for anyone running these systems in production.

The middle loop is real

You probably already know this from your own work, even without a name for it. You're breaking problems into agent-sized chunks, deciding how much to trust agent output, catching plausible-looking nonsense, and trying to keep architectural coherence when six agents are generating code in parallel.

No career ladder develops these skills, but the people who are good at it tend to be experienced engineers who think in terms of delegation rather than implementation. They've built strong mental models of how their systems actually behave under pressure, not just how the architecture diagram says they should.

TDD is now prompt engineering

The retreat is firm on one practical point: TDD changes how coding agents perform, not because TDD is generally good practice but because it prevents a specific failure mode.

Without TDD, an agent can write incorrect code and a test that confirms the incorrect code as correct. Both pieces ship together, giving you false confidence that the implementation works.

When the tests exist first, the agent can't cheat that way, because the tests become deterministic validation for non-deterministic generation. That makes TDD a form of prompt engineering, and the artefact you should be reviewing is the test suite, not the implementation.
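
Here's a minimal sketch of what that looks like in practice, using pytest. The test suite is the reviewed artefact; the implementation beneath it stands in for agent output and earns trust only by passing. `parse_duration` is a hypothetical example function, not something from the retreat write-up.

```python
import re
import pytest

# --- The spec: written and reviewed by a human BEFORE the agent runs ------

def test_hours_and_minutes():
    assert parse_duration("1h30m") == 5400

def test_minutes_only():
    assert parse_duration("45m") == 2700

def test_rejects_garbage():
    with pytest.raises(ValueError):
        parse_duration("soon")

# --- The implementation: stands in for agent output, validated only by ----
# --- running the suite above, not by reading it line by line --------------

def parse_duration(text: str) -> int:
    """Convert strings like '1h30m' or '45m' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unparseable duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60
```

Because the tests exist before generation, the agent can't quietly redefine "correct" to match whatever it produced.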

Security can't wait

Most engineers still treat security as something to handle later, after the technology works. With agents, that sequencing is dangerous, and most risk registers haven't caught up.

Grant an agent email access and you've handed it password resets and account takeover. Give it full machine access for development tooling and you've given it full machine access for everything else it decides to do. The retreat's session on security had low attendance, which is its own signal.

In regulated domains this is already a compliance problem, because agent boundary controls don't map neatly to ISO 27001 or GDPR Article 32, and most organisations haven't started the gap analysis. Platform engineering needs to drive secure defaults so safe behaviour is the path of least resistance, because asking individual developers to make security-conscious choices when configuring agent access won't hold.
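
What a secure default might look like, as a hedged sketch: agent credentials start from default-deny, and anything broader is granted explicitly by the platform, not ad hoc by individual developers. The `AgentScope` shape and tool names are my illustration, not any real framework's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentScope:
    """Capabilities an agent credential starts with: nothing."""
    name: str
    allowed_tools: frozenset = frozenset()   # default deny
    allowed_paths: frozenset = frozenset()
    network_egress: bool = False             # no outbound calls by default

def can_use(scope: AgentScope, tool: str) -> bool:
    # Default-deny check: a tool is usable only if explicitly granted.
    return tool in scope.allowed_tools

# Platform-shipped preset: developers opt *up* from here, with review.
CODING_AGENT = AgentScope(
    name="coding-agent",
    allowed_tools=frozenset({"read_file", "write_file", "run_tests"}),
    allowed_paths=frozenset({"./src", "./tests"}),
)

assert not can_use(CODING_AGENT, "send_email")  # email stays out of reach
```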

You can't mass-produce senior judgment

The retreat says junior developers are "more valuable than ever": AI tools get them past the unproductive phase faster, they lack legacy habits that slow adoption, and they're a call option on future productivity.

Same document, different section: the middle loop requires "strong mental models of system architecture" and the ability to "rapidly assess output quality without reading every line."

Pick one.

Strong mental models come from building systems, breaking them, debugging them at 3am, and learning what the abstractions hide. The retreat does name this risk for mid-level engineers who came up during the hiring boom without developing fundamentals, but it stops short of the obvious extension: junior engineers using AI assistance from day one face the same risk, and we don't have a retraining answer for either group.

The retreat's own data points elsewhere. Staff engineers use AI tools less often than juniors but save more time per session, because they have the context to supervise effectively. The constraint isn't access to AI, it's the depth of mental model the supervisor brings to the output, and you can't shortcut that.

Transient source code is a fantasy

Some retreat participants see source code disappearing within a decade. Generated on demand, never stored.

No.

You need something to test against, diff, and audit, and if it's testable and storable, it's source code by another name.

The retreat's own example sinks the idea. It documents an agent with a linter that enforced a 500-line file limit, so the agent responded by making individual lines longer. The rule was technically satisfied but the principle behind it was completely violated.

Now imagine debugging that when source code is transient and three other agents are making concurrent changes against the same surface area.
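
The linter loophole itself is at least fixable if you enforce the principle rather than a single proxy. A sketch, with illustrative thresholds: bound file length, line length, and a rough count of code lines together, so satisfying the letter of one rule can't hide violating its spirit.

```python
from pathlib import Path

MAX_LINES = 500
MAX_LINE_LENGTH = 120
MAX_CODE_LINES = 400  # crude ceiling on non-blank, non-comment lines

def check_file(path: Path) -> list[str]:
    """Flag files that blow the complexity budget in ANY dimension."""
    problems = []
    lines = path.read_text().splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"{path}: {len(lines)} lines (max {MAX_LINES})")
    for i, line in enumerate(lines, start=1):
        if len(line) > MAX_LINE_LENGTH:
            problems.append(f"{path}:{i}: {len(line)} chars (max {MAX_LINE_LENGTH})")
    code = [l for l in lines if l.strip() and not l.lstrip().startswith("#")]
    if len(code) > MAX_CODE_LINES:
        problems.append(f"{path}: {len(code)} code lines (max {MAX_CODE_LINES})")
    return problems
```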

There's a DORA regression hiding here too. AI tools make it easy to produce large changesets, which pushes teams toward bigger batches and less frequent releases, a direct reversal of a decade of research showing that smaller batches correlate with higher stability. The retreat flags this as an active regression, but the implication is bigger than they let on: most of the productivity wins from AI tooling are being offset by stability losses, and nobody is putting that on a slide.
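
One way to push back is in CI. A sketch: measure the changeset against the base branch and fail when it gets too big, nudging agents (and humans) back toward small batches. The threshold and base ref are illustrative and would need tuning per repository.

```python
import subprocess
import sys

MAX_CHANGED_LINES = 400  # illustrative; tune per repo

def changed_lines(base: str = "origin/main") -> int:
    """Total added + deleted lines versus the base branch."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for row in out.splitlines():
        added, deleted, _path = row.split("\t")
        if added != "-":  # "-" marks binary files; skip them
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    n = changed_lines()
    if n > MAX_CHANGED_LINES:
        sys.exit(f"changeset is {n} lines; split it (limit {MAX_CHANGED_LINES})")
```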

Self-healing is a decade away, not half a decade

The retreat puts self-healing systems on a 2 to 5 year horizon. For complex legacy enterprise environments with accumulated technical debt, that's wildly optimistic.

💡
We already have narrow automated remediation, but reliable self-healing systems are still probably years, maybe decades, away. The missing piece is not just better agents, but agents with action-oriented world models: systems that can reason about state, causality, interventions, side effects, and rollback in a live environment.

The retreat itself lists prerequisites that don't exist yet: a change ledger, an agent identity and permissions system, fitness functions defining "healthy" in terms agents can evaluate, and strong rollback and feature flag capabilities. Most enterprises don't have half of these.
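
To make the fitness-function prerequisite concrete, here's a hedged sketch: "healthy" expressed as a predicate an agent can evaluate before and after intervening. The metric names and thresholds are invented; real ones would come from your observability stack.

```python
from dataclasses import dataclass

@dataclass
class ServiceMetrics:
    error_rate: float      # fraction of 5xx responses over the window
    p99_latency_ms: float
    queue_depth: int

def is_healthy(m: ServiceMetrics) -> bool:
    """Machine-evaluable definition of 'healthy' for one service."""
    return (
        m.error_rate < 0.01
        and m.p99_latency_ms < 500
        and m.queue_depth < 1_000
    )

# A remediation only counts as "healing" if this predicate flips to True,
# and a rollback path exists for when it doesn't.
```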

There's a harder problem the retreat calls the "latent knowledge problem." Your most experienced on-call engineers know that high CPU on a specific service means checking the database connection pool, and they know a particular error code is a symptom of something three layers down. This knowledge lives in people's heads, almost never documented.

Replicating it requires what the retreat calls an "agent subconscious": a knowledge graph built from years of post-mortems and incident data, but that graph doesn't exist at most companies, and building it is a multi-year effort before you start on the orchestration layer.

There's also a behavioural mismatch. Experienced engineers are sceptical by nature when debugging production, but LLMs tend toward agreement. The retreat suggests "angry agents" designed to challenge the dominant hypothesis, which is a nice idea, although reliable adversarial agent behaviour in high-stakes production is a research problem, not a sprint item.
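
For what it's worth, a sketch of the "angry agent" shape: a second model instance prompted to attack the dominant hypothesis rather than extend it. `llm` here is a stand-in for whatever completion call you use, and getting a model to be reliably adversarial rather than politely contrarian is exactly the unsolved part.

```python
def challenge(hypothesis: str, evidence: list[str], llm) -> str:
    """Ask a second model to argue AGAINST the current diagnosis."""
    prompt = (
        "You are reviewing an incident diagnosis. Your only job is to argue "
        "against the hypothesis below. List observations it fails to explain "
        "and propose at least one alternative cause consistent with the "
        "evidence.\n\n"
        f"Hypothesis: {hypothesis}\n"
        "Evidence:\n" + "\n".join(f"- {e}" for e in evidence)
    )
    return llm(prompt)
```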

What enterprise agent work actually looks like

The retreat has another framing that gets less attention than it should. Most enterprise agent deployment will look less like swarming and more like "patrol workers on loops": agents running ETL transforms, data quality checks, and business process monitors on continuous cycles. Always-on quality and monitoring rather than autonomous code generation.

This is the unglamorous version of agentic AI, but it's also the version that ships first in regulated domains, because the verification surface is small and the blast radius is contained. Organisations with strong, well-designed APIs are positioned for this; organisations without them are not.
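
A patrol worker can be embarrassingly simple, which is the point. A sketch with placeholder data access and alerting (`fetch_todays_rows` and `alert` are hypothetical stand-ins for your own query and pager integration):

```python
import time

def fetch_todays_rows() -> list[dict]:
    """Placeholder: replace with a real warehouse or API query."""
    return []

def alert(message: str) -> None:
    """Placeholder: replace with your pager or chat integration."""
    print(message)

def patrol(interval_seconds: int = 300) -> None:
    # Continuous loop with a contained blast radius: it only raises
    # alerts, it never mutates the data it checks.
    while True:
        rows = fetch_todays_rows()
        nulls = sum(1 for r in rows if r.get("customer_id") is None)
        if rows and nulls / len(rows) > 0.02:
            alert(f"customer_id null rate {nulls}/{len(rows)} exceeds 2%")
        time.sleep(interval_seconds)
```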

Decision-making bottleneck

All of this comes back to the same tension: coding agents are producing output faster than humans can evaluate it. I call it the decision-making bottleneck.

The retreat documents this already happening, with agents generating code fixes, feature implementations, even job specifications faster than anyone can approve them. It treats this as one concern among many, but I think it's the concern, and everything else is downstream.

The organisations that capture the productivity gains will be the ones that figure out which decisions need human judgment and which can go to deterministic verification: test suites, type systems, risk tiers, automated compliance checks. That's the real work of middle loop engineering, which is less about supervising agents and more about deciding what doesn't need supervision.
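
Concretely, that triage might look something like this sketch. The tiers, path prefixes, and rules are illustrative; the point is that the human queue only receives what deterministic checks can't clear.

```python
from enum import Enum

class Verdict(Enum):
    AUTO_MERGE = "auto-merge once checks pass"
    HUMAN_REVIEW = "queue for human review"

HIGH_RISK_PATHS = ("payments/", "auth/", "migrations/")  # illustrative

def route(change_paths: list[str], checks_passed: bool, risk_tier: int) -> Verdict:
    """Deterministic gates first; human judgment only where it's needed."""
    if not checks_passed:
        return Verdict.HUMAN_REVIEW
    if risk_tier >= 3 or any(p.startswith(HIGH_RISK_PATHS) for p in change_paths):
        return Verdict.HUMAN_REVIEW
    return Verdict.AUTO_MERGE
```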

The rest of us will just have faster-moving traffic jams.
