| Before AI agents | After AI agents |
|---|---|
| Architecture decisions — Human | Taste invariants & constraint design — Human (where judgment now lives) |
| System design — Human | Knowledge architecture — Human (where judgment now lives) |
| Writing code — Human (where judgment lived day-to-day) | Acceptance criteria as system input — Human (specifying intent precisely) |
| Code review & taste enforcement — Human | Writing code, enforcing taste — Agent |
What is harness engineering?
Harness engineering is the practice of designing the constraints, documentation structures, linters, feedback loops, and architectural rules that keep AI-generated code coherent and maintainable over time. The harness is everything the agent operates inside — not the code it writes, but the environment that governs how it writes.
The term gained traction after OpenAI published a detailed account of building a real product entirely with AI agents, writing zero lines of application code by hand. Thoughtworks Distinguished Engineer Birgitta Böckeler then analysed that experiment on Martin Fowler's blog, asking the question that is now spreading through engineering leadership circles: what's your harness today?
The quick definition: a harness is what replaces the senior engineer's intuition when that intuition has to scale across a system that runs without human hands on the keyboard.
Why this matters for engineering leaders in 2026
The AI-replaces-engineers narrative has dominated for two years. Harness engineering is evidence that the story was wrong, or at least that it asked the wrong question.
What the OpenAI experiment showed: a team of three engineers shipped a million lines of code in five months. Throughput increased as the team grew to seven — not because more engineers wrote more code, but because more human judgment was encoded into the system. Each addition to the harness compounded.
Human engineers never typed application code. They typed rules, constraints, architectural invariants, and documentation structures. When the agent struggled, the team's response was never "try harder." It was always a question: what's missing from the environment?
The implication: the skill that compounds in an agent-first world isn't coding speed. It's the ability to make engineering judgment explicit, legible, and mechanically enforceable.
The three layers where engineering judgment now lives
1. Taste Invariants
Human input: rules that encode what good code looks like, such as structured logging, naming conventions, file size limits, and boundary enforcement. Written into linters. Run on every line the agent produces.
2. Knowledge Architecture
Human input: deciding what the system is allowed to know, how it's organised, and how stale information gets caught and corrected. A structured docs directory, not a single large instruction file.
3. Acceptance Criteria as System Input
Human input: specifying intent precisely enough that an agent can reproduce a bug, validate a fix, open a pull request, and merge, all without a human in the loop. Harder than it sounds.
Layer 1: Taste invariants
In a human-led team, taste invariants live in senior engineers' heads and surface in code reviews. In a harness, they are written into custom linters that run on every line the agent produces. The judgment call is still human. The enforcement is automated.
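To make the idea concrete, here is a minimal sketch of a custom linter that enforces three of the invariants named above: a file size limit, structured logging, and a naming convention. The function name `check_taste`, the 300-line limit, and the specific rules are illustrative assumptions, not details from the OpenAI writeup.

```python
import ast

MAX_LINES = 300  # hypothetical file size limit; pick whatever your team believes


def check_taste(source: str, filename: str = "<agent output>") -> list[str]:
    """Return one message per taste rule that `source` violates."""
    violations = []

    # Rule 1: file size limit.
    line_count = len(source.splitlines())
    if line_count > MAX_LINES:
        violations.append(f"{filename}: {line_count} lines exceeds limit of {MAX_LINES}")

    for node in ast.walk(ast.parse(source)):
        # Rule 2: structured logging only, so bare print() calls are rejected.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ):
            violations.append(
                f"{filename}:{node.lineno}: use the structured logger, not print()"
            )
        # Rule 3: naming convention, function names must be snake_case.
        if isinstance(node, ast.FunctionDef) and node.name != node.name.lower():
            violations.append(
                f"{filename}:{node.lineno}: function '{node.name}' is not snake_case"
            )

    return violations
```

Each message says what to do instead, so the agent that triggered the rule can act on it without asking a human. Run in CI, a non-empty result would fail the build.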
Layer 2: Knowledge architecture
The OpenAI team learned early that a single large AGENTS.md file fails. Context is a scarce resource — too much guidance becomes no guidance. Instead, they built a structured docs directory treated as the system of record, with the AGENTS.md acting only as a table of contents pointing deeper. Deciding what the system is allowed to know, how it's organised, and how stale information gets caught is an engineering judgment call at a level most engineers have never been asked to operate at.
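One way to keep a table-of-contents file honest is to check it mechanically. The sketch below assumes the index uses markdown links and that each doc has a recorded "last reviewed" date. The `audit_index` function, the 90-day budget, and the `last_reviewed` mapping are all hypothetical and not part of the OpenAI setup.

```python
import re
from datetime import date

# Captures the target of a markdown link, stopping at ")", "#", or whitespace.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)#\s]+)")
MAX_AGE_DAYS = 90  # hypothetical freshness budget


def audit_index(index_text: str, last_reviewed: dict[str, date], today: date) -> list[str]:
    """Flag index links that point nowhere or at docs past their review date.

    `last_reviewed` maps each doc path in the repository to the date someone
    (human or agent) last confirmed it was accurate.
    """
    problems = []
    for target in LINK_PATTERN.findall(index_text):
        if target.startswith(("http://", "https://")):
            continue  # external links are out of scope for this sketch
        if target not in last_reviewed:
            problems.append(f"missing: {target}")
        elif (today - last_reviewed[target]).days > MAX_AGE_DAYS:
            problems.append(f"stale: {target}")
    return problems
```

The index stays short because it only points to other files. The audit is what stops those pointers from rotting unnoticed.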
Layer 3: Acceptance criteria as system input
In the OpenAI setup, a human engineer describes a task. The agent reproduces a bug, validates a fix, records a video, opens a pull request, responds to review feedback, and merges the change — without human intervention. That's only possible because the acceptance criteria were specified precisely enough to be machine-actionable. Writing criteria that an agent can act on is a harder skill than writing criteria for a human. It forces clarity that most teams have always been able to avoid.
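A rough illustration of what "machine-actionable" can mean: each expectation becomes a predicate over facts the agent's run can observe, rather than a sentence a human has to interpret. The `Criterion` type, the export-bug scenario, and the observation keys are invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable

Observation = dict[str, object]  # hypothetical facts gathered during the agent's run


@dataclass(frozen=True)
class Criterion:
    description: str
    check: Callable[[Observation], bool]


# Human-oriented version: "the export bug should be fixed".
# Machine-actionable version: every expectation is an explicit, checkable predicate.
EXPORT_BUG_CRITERIA = [
    Criterion("bug reproduced before the fix",
              lambda o: o.get("repro_failed_before") is True),
    Criterion("regression test passes after the fix",
              lambda o: o.get("repro_passes_after") is True),
    Criterion("no other tests broken",
              lambda o: o.get("failing_tests") == 0),
    Criterion("export completes in under 2 seconds",
              lambda o: isinstance(o.get("export_seconds"), (int, float))
              and o["export_seconds"] < 2.0),
]


def unmet(criteria: list[Criterion], observed: Observation) -> list[str]:
    """Return the descriptions of criteria the observed run does not satisfy."""
    return [c.description for c in criteria if not c.check(observed)]
```

A missing observation counts as unmet, so the agent cannot pass by leaving a fact unrecorded. Under this sketch, a merge would be allowed only when `unmet(...)` returns an empty list.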
How the shift from execution to environment actually works
The harness flywheel
1. Human encodes a judgment rule (scales instantly)
2. Agent applies it everywhere (catches drift automatically)
3. Garbage collection runs (compounds over time)
In a traditional engineering team, judgment is embedded in every keystroke. When you write code yourself, your intuitions about what's brittle, what's maintainable, what's likely to break, are expressed directly in what you build. You don't have to articulate them. You just use them.
When an agent writes the code, that embedded judgment disappears from the output unless you put it somewhere first.
The OpenAI team describes this shift clearly: their most difficult challenges are now about designing environments, feedback loops, and control systems — not features, not performance, but systems that govern how the agent operates over time.
Birgitta Böckeler frames it as making explicit what developer experience used to make implicit. The harness externalises the senior engineer's knowledge — the part that used to walk out the door when they resigned.
Harness engineering vs traditional software development
| Dimension | Traditional development | Harness engineering |
|---|---|---|
| Where judgment lives | Embedded in the code itself | In the constraints on code |
| How taste is enforced | Code review by humans | Custom linters and structural tests |
| Knowledge storage | People's heads, Slack, Docs | Versioned repository artifacts |
| Entropy management | Sprint-based tech debt sessions | Continuous automated garbage collection |
| Human bottleneck | Writing and reviewing code | Specifying intent precisely enough |
| Cost of bad patterns | Caught in review, addressed in sprints | Replicated at agent speed if harness misses them |
What OpenAI's experiment actually proved
Note: OpenAI has a commercial interest in demonstrating that AI can write production code at scale. The experiment used a greenfield codebase, which is far easier to harness than an inherited one. And as Birgitta Böckeler notes, the writeup says nothing about verification of functional behaviour — only structural quality. What follows is my read of what the evidence does and doesn't support. The caveats matter.
With that said, here is what the experiment does demonstrate:
It took five months to build the harness before trusting the output. This isn't a shortcut. It's front-loaded discipline that replaces the ongoing discipline of human code review. Teams treating AI coding as an immediate speed gain are skipping this investment and will pay for it later.
Throughput increased as the team grew, not because more engineers wrote more code, but because each new human judgment encoded into the harness compounded across every subsequent agent run. The returns were cumulative.
Human taste was captured once and enforced continuously. The Friday cleanup sessions that were consuming 20 percent of working time — what the team called addressing "AI slop" — were replaced by background agents running on a regular cadence, catching deviations and opening targeted refactoring pull requests. Most could be reviewed and merged in under a minute.
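The writeup does not describe how those background agents are implemented. As a sketch of the planning step only, the code below groups linter violations into small, single-rule batches so each proposed pull request stays quick to review. `Violation`, `plan_cleanup_prs`, and the three-file cap are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Violation(NamedTuple):
    rule: str
    path: str
    line: int


def plan_cleanup_prs(violations: list[Violation], max_files_per_pr: int = 3) -> list[dict]:
    """Group violations into small, single-rule batches, one per proposed PR."""
    files_by_rule: dict[str, list[str]] = defaultdict(list)
    # Sorting gives a deterministic plan: by rule, then path, then line.
    for v in sorted(violations):
        if v.path not in files_by_rule[v.rule]:
            files_by_rule[v.rule].append(v.path)

    plans = []
    for rule, paths in files_by_rule.items():
        # Cap the files per PR so each one stays reviewable at a glance.
        for start in range(0, len(paths), max_files_per_pr):
            plans.append({
                "title": f"cleanup: {rule}",
                "files": paths[start:start + max_files_per_pr],
            })
    return plans
```

Keeping each batch to one rule and a few files is what makes a sub-minute review plausible: the reviewer checks a single kind of change, repeated.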
The work didn't go away. It moved earlier and became more leveraged.
The mistake most teams are making
Path A: No harness (AI as a faster typist)
Code generated at speed → Bad patterns replicate at scale → Entropy compounds
Path B: With harness (AI with an environment it operates inside)
Judgment encoded into constraints → Taste enforced on every line produced → Quality compounds
Most teams experimenting with AI coding tools are treating the agent as a faster typist. They generate code, review it, and merge it — with the same human-review bottleneck they've always had, now running at a higher throughput than humans can comfortably handle.
That's not harness engineering. That's acceleration without infrastructure.
The result is what the OpenAI team called entropy: the agent replicating existing patterns in the codebase, including the bad ones. Without a harness, AI-generated code doesn't improve a codebase's quality over time. It amplifies whatever quality was already there, good or bad.
The subtler mistake is treating harness engineering as a technical problem. It isn't. It's a judgment problem. The question isn't which linter to use. It's: what does your team actually believe good software looks like, and can you say it precisely enough that a system can enforce it?
Most teams have never had to answer that question. They've relied on institutional knowledge that moves from person to person in code reviews and never gets written down.
How to start building your own harness today
You don't need a million-line codebase to start. Birgitta's question is the right starting point: what's your harness today?
Step 1: Audit what your senior engineers catch in code review. The patterns they flag repeatedly are your taste invariants. Write three of them down as explicit rules, not preferences. If a senior engineer would flag it in every review, it should be a linter rule.
Step 2: Look at your documentation structure. Is it built for a human reader navigating it once, or for a system that needs to find the right answer every time? Start with an index — a single stable entry point that maps to deeper sources of truth — rather than a single sprawling document that rots.
Step 3: Pick one architectural constraint and make it mechanical. Not a convention. An enforced rule. A linter, a structural test, a CI check. Something that runs without a human deciding whether to apply it on any given day.
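As one possible shape for Step 3, the sketch below turns a layering convention into a structural check that can run in CI. The three layers and the `ALLOWED` map are hypothetical; substitute the boundaries your own architecture defines.

```python
import ast

# Hypothetical layering: each layer may import only from the layers listed for it.
ALLOWED = {
    "ui": {"ui", "service"},
    "service": {"service", "domain"},
    "domain": {"domain"},
}


def boundary_violations(layer: str, source: str) -> list[str]:
    """List imports in `source` that reach into a layer that `layer` must not touch."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]  # absolute "from x import y" only
        else:
            continue
        for module in modules:
            top = module.split(".")[0]
            # Third-party and stdlib packages are not layers, so they are ignored.
            if top in ALLOWED and top not in ALLOWED[layer]:
                found.append(f"line {node.lineno}: {layer} must not import {module}")
    return found
```

Wrapped in a test that walks every file in a layer and asserts the result is empty, this stops being a convention and becomes a rule that applies on every run.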
That's the beginning of a harness. The discipline shows up in the scaffolding, not the code.
Frequently asked questions
Does harness engineering only apply to teams already using AI coding agents?
No, but it's most urgent there. The principles apply to any team trying to maintain architectural coherence at scale. The harness makes explicit what large human teams have always enforced through culture and peer review — it just does so in a way that can run automatically.
How long does it take to build a working harness?
The OpenAI team spent five months before the harness was reliable enough to trust for end-to-end feature development. It's not a weekend project. It's an infrastructure investment that front-loads discipline in exchange for compounding leverage.
What happens with existing legacy codebases?
Birgitta raises this directly. Retrofitting a harness onto a non-standardised legacy codebase may not be worth the effort — similar to running static analysis on a codebase that has never had it and drowning in alerts. The approach is most practical for new systems or significantly refactored domains.
Is harness engineering the same as context engineering?
Overlapping, not identical. Context engineering focuses on what information the agent can access at runtime. Harness engineering is broader: it includes architectural constraints, taste invariants, feedback loops, and garbage collection processes, not just context management.
What skills matter most for engineers in a harness-first world?
Systems thinking, precision in expressing requirements, and the ability to make implicit judgment explicit and enforceable. The ability to write code quickly matters less than the ability to design environments that produce good code reliably.
Key takeaways
- AI agents didn't remove engineering judgment. They relocated it from writing code to designing the systems that govern how code is written.
- A harness is the set of constraints, documentation structures, linters, and feedback loops that keep AI-generated code coherent. It externalises what used to live in senior engineers' heads.
- The OpenAI team's experiment required five months of harness-building before it worked reliably. This isn't a shortcut — it's front-loaded discipline.
- Engineers who struggle in agent-first environments aren't the ones who can't code. They're the ones who can't articulate why something is wrong precisely enough for a system to act on it.
- Without a harness, AI amplifies existing code quality at scale — good or bad. The harness is what makes the amplification safe.
- The discipline in software is moving upstream: from the code to the scaffolding, from intuition to enforceable rules.
The question worth asking now: can you state your engineering taste as a rule a linter could check?
Sources
Ryan Lopopolo, OpenAI — Harness engineering: leveraging Codex in an agent-first world (February 2026)
Birgitta Böckeler, martinfowler.com — Harness Engineering — first thoughts (February 2026)
Birgitta Böckeler, martinfowler.com — Harness Engineering (follow-up considered article)