
The Harness Engineering Checklist

Before you trust AI-generated code in production, the critical parts of the harness need to be in place. This checklist covers the 15 things teams miss most often, organised by the order to tackle them.

How to use this checklist

Work through the phases in order. Each phase builds on the last. Don't skip to phase 3 before phase 1 is solid — a harness built out of order produces gaps that are hard to diagnose later.

Phase 1

Foundation — before the agent writes anything

1. Create AGENTS.md as an index, not a document

The file should be under 150 lines. It covers: what the codebase is, the tech stack, how to run it, a summary of conventions, and pointers to deeper docs. It should not contain full rule explanations — those live in the docs/ directory it references.

See: AGENTS.md field guide
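A check like this can keep the index honest. It is a minimal sketch, assuming deeper docs are referenced as relative paths such as docs/naming.md; the function name check_agents_index and the path pattern are illustrative, not part of any standard tool.

```python
import re
from pathlib import Path

MAX_LINES = 150  # limit stated in the checklist
# Assumption: deeper docs are referenced as relative paths like docs/naming.md
DOC_REF = re.compile(r"\bdocs/[\w./-]+\.md\b")


def check_agents_index(text: str, repo_root: Path) -> list[str]:
    """Return a list of problems found in the AGENTS.md text."""
    problems = []
    line_count = len(text.splitlines())
    if line_count >= MAX_LINES:
        problems.append(f"AGENTS.md has {line_count} lines; keep it under {MAX_LINES}")
    refs = sorted(set(DOC_REF.findall(text)))
    if not refs:
        problems.append("AGENTS.md does not point to any docs/ file")
    for ref in refs:
        # Every pointer in the index should resolve to a real file.
        if not (repo_root / ref).is_file():
            problems.append(f"AGENTS.md references missing file: {ref}")
    return problems
```

Calling check_agents_index(Path("AGENTS.md").read_text(), Path(".")) from the repository root returns an empty list when the index is short enough and every pointer resolves.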

2. Build a docs/ directory with at least three convention files

Start with the three conventions most commonly violated in code review: naming, error handling, and logging. Each file: rule stated first, one positive example, one counter-example. Under two pages. Reviewed by a tech lead.

See: knowledge architecture guide

3. Assign an owner to every docs/ file

Add a header block with owner, last-reviewed date, and review-interval. No owner means no one is responsible for catching staleness. A stale docs file is worse than no docs file — the agent follows it confidently and produces consistently wrong output.
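Staleness is easy to check mechanically once the header exists. The sketch below assumes a hypothetical header format: plain "key: value" lines at the top of the file, ending at the first blank line, with review-interval given as a number of days. Adjust the parsing to whatever format your team actually picks.

```python
from datetime import date, timedelta

REQUIRED = ("owner", "last-reviewed", "review-interval")


def parse_header(text: str) -> dict[str, str]:
    """Read 'key: value' lines from the top of the file until the first blank line."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def header_problems(text: str, today: date) -> list[str]:
    """Report missing header fields or an overdue review."""
    fields = parse_header(text)
    missing = [name for name in REQUIRED if not fields.get(name)]
    if missing:
        return [f"missing header field: {name}" for name in missing]
    try:
        reviewed = date.fromisoformat(fields["last-reviewed"])
        # Assumption: review-interval is a whole number of days.
        interval = timedelta(days=int(fields["review-interval"]))
    except ValueError:
        return ["last-reviewed must be YYYY-MM-DD and review-interval a number of days"]
    if today > reviewed + interval:
        return [f"review overdue since {reviewed + interval} (owner: {fields['owner']})"]
    return []
```

Run over every file in docs/ as part of CI, this turns "no one noticed the doc was stale" into a failing check with an owner's name attached.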

4. Audit the codebase for inconsistent patterns

Before the agent starts, identify which patterns it will replicate at scale. If the codebase has three ways of handling errors, the agent will tend to reproduce all three, roughly in proportion to how often each appears. Decide which one is canonical and document it. Deprecate the others explicitly.
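A rough count is usually enough to start the audit. This sketch assumes a Python codebase and three hypothetical error-handling styles; the regexes are deliberately crude and only illustrate the idea of measuring how often each competing pattern appears.

```python
import re
from collections import Counter

# Hypothetical patterns for three competing error-handling styles.
PATTERNS = {
    "raise-custom": re.compile(r"^\s*raise \w+Error\("),
    "return-none": re.compile(r"^\s*return None\b"),
    "bare-except": re.compile(r"^\s*except\s*:"),
}


def count_patterns(sources: dict[str, str]) -> Counter:
    """Count lines matching each pattern across a {filename: source text} mapping."""
    counts = Counter()
    for text in sources.values():
        for line in text.splitlines():
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    counts[name] += 1
    return counts
```

Feeding it the contents of every source file gives a tally per style, which makes the "which one is canonical" conversation concrete.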

Phase 2

Enforcement — make rules mechanical

5. Convert your top three conventions into linter rules

If a convention matters enough to document, it matters enough to enforce mechanically. Pick the three that are most commonly violated and write linter rules for them. They should run on every CI build and on every agent-generated PR. Unlike documentation, which the agent can misread or ignore, a linter that gates the build cannot be bypassed.
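A custom rule does not need a framework to get started. As an illustration, here is a minimal sketch of one rule built on Python's standard ast module, assuming a hypothetical logging convention of "use the logging module, never print()". In practice you would likely express this as a plugin for whichever linter your team already runs.

```python
import ast


def find_print_calls(source: str, filename: str = "<string>") -> list[str]:
    """Flag direct print() calls; the assumed convention is 'use the logging module'."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ):
            findings.append(f"{filename}:{node.lineno}: use logging instead of print()")
    return findings
```

A CI step can run this over changed files and fail the build when the returned list is non-empty.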

6. Add a structural test for architectural boundaries

Module boundaries, dependency directions, import restrictions. If your architecture says the database layer should never be imported from the presentation layer, there should be a test that fails if it is. Not a convention — a hard check.
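The database/presentation rule above can be sketched as a small import check. This assumes a Python codebase where layers are top-level packages named presentation and database (hypothetical names), and it only inspects absolute imports.

```python
import ast

# Hypothetical layer names: nothing in 'presentation' may import from 'database'.
FORBIDDEN = {"presentation": {"database"}}


def imported_modules(source: str) -> set[str]:
    """Collect the module names of all absolute imports in the source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)
    return modules


def boundary_violations(layer: str, source: str) -> list[str]:
    """Return imports in this layer's source that cross a forbidden boundary."""
    banned = FORBIDDEN.get(layer, set())
    return sorted(
        module for module in imported_modules(source)
        if module.split(".")[0] in banned
    )
```

Wrapped in a test that walks every file under the presentation package and asserts the result is empty, this fails the build the moment a forbidden import appears.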

7. Verify linters run on agent-generated PRs, not just developer PRs

Some teams have CI configs that skip linter runs for specific PR authors or branch patterns. Check that agent PRs go through the same checks as human PRs. The harness only works if enforcement is consistent.

8. Set up a code review template that asks: did the agent follow conventions?

Add a single checkbox to the PR template: 'I verified this PR follows the conventions in docs/.' This creates a forcing function for reviewers to check agent output against the harness, not just for functional correctness.

Phase 3

Task design — make agent input machine-actionable

9. Write acceptance criteria in the format: given / when / then

Vague task descriptions produce vague output. The agent needs to know what state the world is in before the task, what action triggers the expected outcome, and what the outcome looks like. If you cannot express a task in given/when/then, the specification is not complete enough for an agent.
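One way to make the format hard to skip is to give it a structure that can be validated before the task reaches the agent. This is a minimal sketch; the class name and the example criterion are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriterion:
    given: str  # state of the world before the task
    when: str   # action that triggers the outcome
    then: str   # what the outcome looks like

    def missing_parts(self) -> list[str]:
        """Names of any parts left empty; a non-empty result means the spec is incomplete."""
        return [name for name in ("given", "when", "then") if not getattr(self, name).strip()]

    def render(self) -> str:
        """Format the criterion as text for the task description."""
        return f"Given {self.given}\nWhen {self.when}\nThen {self.then}"


# Hypothetical example of a complete criterion.
example = AcceptanceCriterion(
    given="a user with an expired session",
    when="they request /account",
    then="the API responds with 401",
)
```

A task template can refuse to hand work to the agent while missing_parts() returns anything.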

10. Include a reproduction step for every bug fix task

The OpenAI harness team required that the agent could reproduce a bug before writing a fix. An agent that cannot reproduce the bug is guessing at the fix. If the task description doesn't include a reproduction path, the agent will write a fix that may not address the actual issue.
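In practice a reproduction path can be as small as one failing assertion. The sketch below uses a made-up bug (a price parser that drops everything after a thousands separator) to show the shape: the reproduction must fail against the current code and pass against the fix. All function names here are hypothetical.

```python
from typing import Callable


def parse_price_buggy(text: str) -> int:
    # Hypothetical bug: everything after the first comma is dropped.
    return int(text.split(",")[0])


def parse_price_fixed(text: str) -> int:
    # Hypothetical fix: remove thousands separators before converting.
    return int(text.replace(",", ""))


def repro(parse: Callable[[str], int]) -> None:
    """The reproduction step from the task description, written as an assertion."""
    assert parse("1,000") == 1000


def reproduces(step: Callable[[], None]) -> bool:
    """Return True if the reproduction step fails, meaning the bug is present."""
    try:
        step()
    except AssertionError:
        return True
    return False
```

Here reproduces(lambda: repro(parse_price_buggy)) is True and the same call with parse_price_fixed is False, which is the before-and-after evidence a bug-fix PR should carry.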

11. Test that the agent can merge without human intervention on a simple task

Pick a low-risk, well-specified task and let the agent take it from description through merged PR without human intervention. If it fails, identify what was missing from the harness — not from the agent. The failure mode tells you what to add to the docs, linters, or task template.

Phase 4

Maintenance — keep the harness from decaying

12. Schedule a weekly garbage collection agent run

One agent, one convention per run. The agent scans the codebase for violations of a specific documented convention and opens small, targeted PRs to fix them. Each PR should be reviewable in under a minute. This prevents entropy from accumulating silently.

See: AI coding entropy guide
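Keeping each PR small is mostly a batching problem. As a sketch, assuming the scan produces a flat list of violation locations, the helper below splits them into groups of at most five; the limit is an arbitrary stand-in for "reviewable in under a minute", not a recommendation from any source.

```python
from itertools import islice
from typing import Iterable, Iterator


def batch_violations(violations: Iterable[str], max_per_pr: int = 5) -> Iterator[list[str]]:
    """Yield successive batches of violations, one batch per PR."""
    iterator = iter(violations)
    # islice returns an empty list once the iterator is exhausted, ending the loop.
    while batch := list(islice(iterator, max_per_pr)):
        yield batch
```

Each yielded batch becomes the scope of one PR, so a run that finds twelve violations opens three PRs rather than one large one.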

13. Add a docs/ review checklist to every PR that changes a pattern

Any PR that changes how the codebase handles a concern covered in docs/ should update the relevant doc file. Add this to your PR template as a checkbox. Without this, conventions evolve in the code while the harness docs fall behind.

14. Run a harness review every 90 days

Read through AGENTS.md and all docs/ files. Check: are the rules still accurate? Are there new patterns that need documentation? Are any documented patterns no longer relevant? A quarterly review session of 2–3 hours prevents the harness from drifting into uselessness.

15. Track: are repeated reviewer comments decreasing over time?

The signal that a harness is working is that code review catches fewer convention violations over time — not because reviewers are less careful, but because the agent is following conventions better. If reviewer comments repeat, the harness is missing a rule. Add it.
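Repeats are easier to spot with a tally than by memory. This sketch assumes you can export review comments as plain strings (how you get them out of your review tool is left open) and groups near-identical ones with a crude normalisation step.

```python
import re
from collections import Counter


def normalize(comment: str) -> str:
    """Lowercase and strip punctuation so near-identical comments group together."""
    return re.sub(r"[^a-z0-9 ]+", "", comment.lower()).strip()


def repeated_comments(comments: list[str], min_count: int = 2) -> list[tuple[str, int]]:
    """Return (normalized comment, count) pairs seen at least min_count times."""
    counts = Counter(normalize(c) for c in comments)
    return [(text, n) for text, n in counts.most_common() if n >= min_count]
```

Run weekly, the output is a shortlist of candidate rules: anything that shows up repeatedly is a convention the harness has not captured yet.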

Where most teams are in practice

Stage | Typical checklist coverage | Common result
Just getting started | 1–3 (AGENTS.md exists, no docs/) | Agent follows some conventions, misses others inconsistently
Active adoption | 1–8 (foundation + some linters) | Quality improving but entropy accumulating in unlinted areas
Mature harness | 1–12 (enforcement + GC running) | Agent output increasingly reliable; humans reviewing logic not style
Compounding returns | 1–15 (full harness + metrics) | Convention violations decrease over time; harness self-improves

The one thing to do today

If you're not sure where to start: open a new file called AGENTS.md in your repository root and write down the three things a senior engineer checks for in every code review. Those are your first taste invariants. Make them explicit and concrete, and include one counter-example for each. That's the beginning of a harness. Everything else on this checklist builds from there.

Related reading: AGENTS.md: the complete field guide, AI coding entropy, and AI didn't remove engineering judgment — it moved it upstream.