What harness engineering is
What is harness engineering?
Harness engineering is the practice of designing the constraints, documentation structures, linters, feedback loops, and architectural rules that keep AI-generated code coherent and maintainable over time. The harness is everything the agent operates inside — not the code it writes, but the environment that governs how it writes.
Where did the term 'harness engineering' come from?
The term gained widespread traction in February 2026 when Ryan Lopopolo at OpenAI published an account of building a production product entirely with AI agents — zero lines of application code written by hand. Thoughtworks Distinguished Engineer Birgitta Böckeler then analysed the experiment on Martin Fowler's blog, popularising the term 'harness engineering' for this discipline.
What is the difference between harness engineering, prompt engineering, and context engineering?
Prompt engineering shapes the instructions sent to the model for a specific request (per-call). Context engineering decides what information fills the context window for that request (per-turn). Harness engineering builds the persistent environment — tool loops, linters, docs, approval gates, audit trails — that governs how the agent operates across all calls. They are complementary layers, not substitutes.
Is harness engineering the same as DevOps for AI?
They overlap but aren't identical. DevOps for AI covers model deployment, inference infrastructure, monitoring, and MLOps pipelines. Harness engineering is specifically about the environment AI coding agents operate inside: the conventions, docs, linters, and feedback loops that govern what code they produce. Harness engineering is a software engineering discipline; DevOps for AI is an infrastructure discipline.
Who needs harness engineering?
Any team using AI agents that autonomously write, review, or merge code at scale. If engineers are reviewing and merging more AI-generated PRs than they can comfortably assess for convention compliance, the missing or incomplete harness is already the bottleneck. Teams using AI only for autocomplete (where a human accepts each suggestion) need it less urgently, but benefit from it as adoption grows.
Building the harness
How long does it take to build a working harness?
The OpenAI team spent five months building a harness before trusting it for end-to-end autonomous feature development on a greenfield codebase. For teams starting on an existing codebase, the timeline depends heavily on convention consistency. A minimum viable harness — AGENTS.md, three convention docs, three linter rules — can be built in a week. A production-ready harness that covers most of the codebase takes months.
What is the first thing to build in a harness?
An AGENTS.md file at the repository root that covers: what the codebase is, the tech stack, how to run it, and three explicit coding conventions with examples. Keep it under 150 lines. This is the entry point every agent reads — getting it right before adding complexity to the rest of the harness is essential.
What goes in AGENTS.md?
Three categories: (1) Orientation — what the codebase is, the tech stack, how to run it; (2) Taste invariants — explicit coding conventions with examples and counter-examples; (3) Pointers — links to deeper docs for architecture decisions, API contracts, test strategy. The AGENTS.md should be an index that points to a structured docs directory, not a monolithic document.
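The three categories and the line budget can be checked mechanically. The following is a minimal sketch, not a standard: it assumes AGENTS.md uses Markdown-style headings named Orientation, Conventions, and Pointers. Those heading names and the check_agents_md helper are hypothetical; the 150-line limit is the figure recommended in this FAQ.

```python
REQUIRED_SECTIONS = ("orientation", "conventions", "pointers")  # hypothetical heading names
MAX_LINES = 150  # line budget suggested in this FAQ


def check_agents_md(text: str) -> list[str]:
    """Return a list of problems found in an AGENTS.md body (empty means OK)."""
    problems = []
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"too long: {len(lines)} lines (limit {MAX_LINES})")
    # Collect heading titles, assuming Markdown-style '#' headings.
    headings = {
        line.lstrip("#").strip().lower()
        for line in lines
        if line.startswith("#")
    }
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")
    return problems


sample = "# Orientation\nA billing service.\n\n# Conventions\nSee docs/conventions/.\n"
print(check_agents_md(sample))  # prints ['missing section: pointers']
```

A check like this could run in CI so that the entry point stays short and complete as the docs directory grows.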
How many linter rules do I need before the harness is useful?
Three well-chosen rules are enough to start seeing consistent improvement. To select them, pick the three conventions your senior engineers flag most often in code review. If a pattern comes up in every third PR review, it should be a linter rule. Quality matters more than quantity: three specific, mechanically enforced rules beat twenty vague ones.
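To make "mechanically enforced" concrete, here is a small sketch of a custom rule written with Python's standard ast module. The convention it checks (no bare except clauses) is only a stand-in for whatever your reviewers flag most often, and the find_bare_excepts name is hypothetical.

```python
import ast


def find_bare_excepts(source: str, filename: str = "<string>") -> list[str]:
    """Flag `except:` clauses that catch everything (an example convention)."""
    tree = ast.parse(source, filename=filename)
    findings = []
    for node in ast.walk(tree):
        # ast.ExceptHandler.type is None only for a bare `except:`.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(
                f"{filename}:{node.lineno}: bare except; catch a specific exception"
            )
    return findings


code = """
try:
    risky()
except:
    pass
"""
for finding in find_bare_excepts(code, "example.py"):
    print(finding)
```

The message states both the violation and the fix, which gives an agent something actionable to correct on its next attempt.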
Does the harness need to be built all at once?
No, and it shouldn't be. Build incrementally in four phases: (1) Foundation: AGENTS.md, docs directory, convention files; (2) Enforcement: linters, structural tests, CI integration; (3) Task design: acceptance criteria templates, reproduction steps; (4) Maintenance: garbage collection, docs review process, quality metrics. Each phase builds on the one before it.
What is garbage collection in a harness context?
A scheduled background agent that scans the codebase for violations of documented conventions and opens targeted pull requests to fix them. Each PR should address one instance of one violation and be reviewable in under a minute. Garbage collection replaces periodic cleanup sprints with continuous automated remediation.
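The planning step of such an agent might look like the sketch below. It assumes violations have already been detected (for example by linters) and only shows how to turn them into one-violation pull request plans. The Violation type, the branch naming scheme, and the max_prs cap are illustrative assumptions, and no real Git hosting API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    path: str
    line: int
    rule: str  # identifier of the documented convention that was violated


def plan_pull_requests(violations: list[Violation], max_prs: int = 5) -> list[dict]:
    """Turn violations into small PR plans: one violation instance per PR."""
    # Sort for a deterministic order so repeated runs propose the same PRs.
    ordered = sorted(violations, key=lambda v: (v.rule, v.path, v.line))
    plans = []
    for v in ordered[:max_prs]:  # cap the batch so reviewers are not flooded
        plans.append({
            "title": f"gc: fix {v.rule} in {v.path}:{v.line}",
            "branch": f"gc/{v.rule}/{v.path.replace('/', '-')}-{v.line}",
            "files": [v.path],
        })
    return plans


found = [
    Violation("src/api/users.py", 42, "no-bare-except"),
    Violation("src/api/auth.py", 7, "no-bare-except"),
]
for plan in plan_pull_requests(found):
    print(plan["title"])
```

Keeping each plan to a single file and a single rule is what makes the resulting PR reviewable in under a minute.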
AGENTS.md specifics
How long should AGENTS.md be?
Under 150 lines. If it's longer, content that belongs in the docs directory is being put in the index. The AGENTS.md should be an entry point the agent can read in seconds — not a comprehensive guide to every convention.
What is the difference between AGENTS.md, CLAUDE.md, and .cursorrules?
AGENTS.md is the cross-tool convention used by OpenAI Codex and many other tools. CLAUDE.md is used by Anthropic's Claude Code CLI and supports hierarchical (project + user) configuration. .cursorrules is used by Cursor IDE and is being superseded by .cursor/rules/ for file-level granularity. For teams using multiple tools, the best approach is one canonical docs/ directory with each tool-specific file pointing to it.
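One way to keep the tool-specific files thin is to generate them from a single template. This is a sketch under assumptions: the docs/README.md entry point and the render_pointer_files helper are hypothetical, and the function only builds file contents in memory rather than writing them.

```python
CANONICAL_DOCS = "docs/"  # single source of truth for conventions
TOOL_FILES = ("AGENTS.md", "CLAUDE.md", ".cursorrules")  # file names from the answer above


def render_pointer_files(project_summary: str) -> dict[str, str]:
    """Build identical, short pointer files for each supported tool."""
    body = (
        f"{project_summary}\n\n"
        f"All conventions live in {CANONICAL_DOCS}. "
        f"Read {CANONICAL_DOCS}README.md first.\n"  # hypothetical entry point
    )
    return {name: body for name in TOOL_FILES}


for name, content in render_pointer_files("Billing service (Python).").items():
    print(f"--- {name} ---")
    print(content)
```

Because every file carries the same pointer, a convention change only needs to be made once, in the docs directory.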
What should NOT be in AGENTS.md?
Business context the agent doesn't need to write code; full inline examples (those belong in the docs directory); aspirational conventions not yet enforced in the codebase; instructions that should be CI checks; and vague preferences like 'write clean code' that give the agent nothing actionable to check against.
How do I keep AGENTS.md from becoming stale?
Three mechanisms: (1) Treat docs as code — any PR that changes a convention updates the relevant doc file; (2) Run a garbage collection agent that flags divergence between documented conventions and actual code patterns; (3) Add a 'Last reviewed' date and owner to each docs section, with a review interval, so staleness becomes visible.
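The third mechanism is easy to automate. The sketch below assumes each doc contains a line of the form "Last reviewed: YYYY-MM-DD"; that format, the 90-day default interval, and the is_stale helper are assumptions made for illustration.

```python
import re
from datetime import date, timedelta

# Hypothetical metadata line, e.g. "Last reviewed: 2025-01-10".
LAST_REVIEWED = re.compile(r"^Last reviewed:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def is_stale(doc_text: str, today: date, interval_days: int = 90) -> bool:
    """True if the doc has no review date or the date is older than the interval."""
    match = LAST_REVIEWED.search(doc_text)
    if match is None:
        return True  # an undated doc is treated as stale
    reviewed = date.fromisoformat(match.group(1))
    return today - reviewed > timedelta(days=interval_days)


doc = "# Error handling\nLast reviewed: 2025-01-10\nOwner: platform-team\n"
print(is_stale(doc, today=date(2025, 3, 1)))  # prints False (50 days old)
print(is_stale(doc, today=date(2025, 6, 1)))  # prints True (142 days old)
```

Passing today in explicitly keeps the check deterministic, so it can be unit-tested and run on a schedule.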
Should every repository have its own AGENTS.md?
Yes. Every repository where AI agents operate should have one. In a monorepo, a root AGENTS.md sets the baseline; package-level files can extend or override it for packages with different conventions.
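The root-plus-package layering can be sketched as a lookup that walks from a file's directory up to the repository root. How a given tool actually resolves nested files varies, so treat the precedence below (root first, nearest file last) as an assumption; applicable_agents_files is a hypothetical helper.

```python
from pathlib import PurePosixPath


def applicable_agents_files(target: str, existing: set[str]) -> list[str]:
    """List AGENTS.md files that apply to `target`, root first, nearest last."""
    directory = PurePosixPath(target).parent
    # Walk from the file's directory up to the repository root.
    candidates = [directory, *directory.parents]
    found = [
        str(d / "AGENTS.md")
        for d in candidates
        if str(d / "AGENTS.md") in existing
    ]
    return list(reversed(found))  # baseline first, overrides last


repo_files = {"AGENTS.md", "packages/billing/AGENTS.md"}
print(applicable_agents_files("packages/billing/src/invoice.py", repo_files))
print(applicable_agents_files("packages/search/index.py", repo_files))
```

For the billing file both the root and the package-level file apply; for the search package only the root baseline does.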
Quality and entropy
What is AI coding entropy?
The tendency of AI-generated codebases to accumulate inconsistent patterns at scale. AI agents replicate existing code patterns — including bad ones — proportionally to how often those patterns appear. Without explicit counter-pressure (linters, garbage collection, maintained conventions), entropy compounds faster than in human-written codebases because there is no natural judgment filter.
What is 'AI slop'?
Code that passes review (it compiles, passes tests, and behaves correctly) but introduces subtle quality degradation: unnecessary abstractions, verbose patterns, duplicate logic, inconsistent conventions. AI slop doesn't fail review because each change looks intentional and internally consistent, but it accumulates as a maintenance burden that slows future development.
How do I know if my codebase has AI coding entropy?
Warning signs: multiple implementations of the same concern with no clear canonical version; code reviewers leaving the same comments repeatedly on AI-generated PRs; PRs that are 'correct but something feels off'; convention docs that haven't been updated in months despite codebase evolution; growing tech debt labelled 'AI-generated.'
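The first warning sign can be approximated with a crude scan. The sketch below flags top-level function names defined in more than one file, using Python's ast module. Matching by name is only a heuristic (it misses duplicates with different names and flags legitimate overloads), and duplicated_function_names is a hypothetical helper.

```python
import ast
from collections import defaultdict


def duplicated_function_names(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each top-level function name defined in 2+ files to those files."""
    locations: dict[str, set[str]] = defaultdict(set)
    for path, code in sources.items():
        for node in ast.parse(code, filename=path).body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                locations[node.name].add(path)
    return {
        name: sorted(paths)
        for name, paths in locations.items()
        if len(paths) > 1
    }


sources = {
    "billing/utils.py": "def format_date(d):\n    return d.isoformat()\n",
    "reports/helpers.py": (
        "def format_date(value):\n    return str(value)\n\n"
        "def total(xs):\n    return sum(xs)\n"
    ),
}
print(duplicated_function_names(sources))
```

Each hit is a prompt to decide which implementation is canonical and to document that choice.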
Does a harness guarantee high-quality AI output?
No — the harness is not a quality guarantee. It is an amplifier: it ensures that whatever quality is encoded in the harness is applied consistently across every agent run. A well-designed harness amplifies good patterns at scale. A poorly designed or incomplete harness amplifies whatever the codebase already contained — good and bad.
Legacy codebases and team adoption
Does harness engineering work on legacy codebases?
It is harder on legacy codebases than on greenfield ones. Birgitta Böckeler notes directly that retrofitting a harness onto a non-standardised legacy codebase may not be worth the effort — similar to running static analysis on a codebase that has never had it and drowning in alerts. The practical approach: start with new modules or significantly refactored domains, then expand coverage incrementally.
How do I introduce harness engineering to a team that already uses AI agents?
Start with an audit: what are the three most common reviewer comments on AI-generated PRs? Those are your first taste invariants. Write convention docs for them and build linters. Run the existing AI-generated code through the new linters and open garbage collection PRs for the violations. Make the improvements visible so the team buys in before you expand scope.
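The audit itself can be a simple tally. This sketch assumes review comments have already been tagged with a short theme label, by hand or by a classifier; the labels and the top_review_themes helper are made up for illustration.

```python
from collections import Counter


def top_review_themes(labels: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Count review comments by theme label and return the n most frequent."""
    normalised = (label.strip().lower() for label in labels)
    return Counter(normalised).most_common(n)


# Hypothetical labels exported from recent reviews of AI-generated PRs.
labels = [
    "naming", "error handling", "naming", "duplicate helper",
    "Naming", "error handling", "import order", "naming",
    "error handling", "duplicate helper",
]
for theme, count in top_review_themes(labels):
    print(f"{theme}: {count}")
```

The three themes at the top of the tally are the candidates for the first convention docs and linter rules.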
What skills does harness engineering require from engineers?
Systems thinking, the ability to make implicit judgment explicit, and precision in expressing requirements. The ability to write code quickly matters less than the ability to design environments that produce good code reliably. Engineers who struggle are typically those who can't articulate why something is wrong precisely enough for a system to act on it.
How does harness engineering change code review?
Code review shifts from convention enforcement (style, patterns, standards) to logic review (does this solve the right problem, are the trade-offs correct, what could go wrong). Convention enforcement is handled by linters and the harness. Reviewers who no longer need to check for naming conventions have more attention for the reasoning and architecture of the change.
Relation to other disciplines
How does harness engineering relate to platform engineering?
Platform engineering builds internal developer platforms for provisioning, deployment, and observability. Harness engineering builds the operating environment for AI coding agents. They are complementary — a mature platform engineering practice often becomes the substrate for a harness (shared conventions, standardised tooling, policy enforcement). AI agents are becoming first-class users of internal developer platforms.
Is harness engineering relevant for operations and SRE, not just development?
Yes — and increasingly so. AI agents operating in production (for incident response, Day 2 ops, infrastructure changes) need a harness that encodes operational judgment: what actions require approval, what policies must hold, what the audit trail requires. The harness for operational agents is built from policy, runbooks, and guardrails rather than coding conventions and linters.
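For operational agents, the approval and audit requirements can be sketched as a small policy gate. Everything here is illustrative: the action names, the policy table, the in-memory audit log, and the choice to require approval for unknown actions are all assumptions, not a reference design.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"


# Hypothetical policy table: action name -> decision.
POLICY = {
    "read_logs": Decision.ALLOW,
    "restart_service": Decision.REQUIRE_APPROVAL,
    "drop_database": Decision.DENY,
}

audit_log: list[dict] = []  # a real system would use durable, append-only storage


def gate(agent: str, action: str) -> Decision:
    """Decide whether an agent may run an action, and record the decision."""
    # Unknown actions default to requiring approval rather than running.
    decision = POLICY.get(action, Decision.REQUIRE_APPROVAL)
    audit_log.append({"agent": agent, "action": action, "decision": decision.value})
    return decision


print(gate("incident-bot", "restart_service").value)  # prints require_approval
print(audit_log[-1])
```

Recording every decision, including allowed ones, is what lets the audit trail answer "what did the agent do, and who approved it" after an incident.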