How to use this checklist
Work through the phases in order. Each phase builds on the last. Don't skip to phase 3 before phase 1 is solid — a harness built out of order produces gaps that are hard to diagnose later.
Foundation — before the agent writes anything
1. Create AGENTS.md as an index, not a document
The file should be under 150 lines. It covers: what the codebase is, the tech stack, how to run it, a summary of conventions, and pointers to deeper docs. It should not contain full rule explanations — those live in the docs/ directory it references.
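One way to keep the index honest is a small check that runs in CI. The sketch below assumes docs/ pointers look like `docs/<name>.md`; the function name `check_agents_md` and the link pattern are illustrative, not part of any standard.

```python
import re

MAX_LINES = 150  # limit taken from the checklist item above

# Assumption: pointers to deeper docs are written as docs/<name>.md
DOC_LINK = re.compile(r"docs/[\w./-]+\.md")


def check_agents_md(text: str, max_lines: int = MAX_LINES) -> list[str]:
    """Return a list of problems found in the AGENTS.md text (empty if none)."""
    problems = []
    line_count = len(text.splitlines())
    if line_count >= max_lines:
        problems.append(f"AGENTS.md has {line_count} lines; keep it under {max_lines}")
    if not DOC_LINK.search(text):
        problems.append("AGENTS.md does not point to any docs/ file")
    return problems
```

Pass it the contents of AGENTS.md and fail the build if the returned list is non-empty.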
2. Build a docs/ directory with at least three convention files
Start with the three conventions most commonly violated in code review: naming, error handling, and logging. Each file: rule stated first, one positive example, one counter-example. Under two pages. Reviewed by a tech lead.
3. Assign an owner to every docs/ file
Add a header block with owner, last-reviewed date, and review-interval. No owner means no one is responsible for catching staleness. A stale docs file is worse than no docs file — the agent follows it confidently and produces consistently wrong output.
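A staleness check can be mechanical too. This is a minimal sketch that assumes the header block is a set of `key: value` lines between two `---` lines and that `review-interval` is a number of days; both are assumptions for illustration, so adapt them to whatever header format you choose.

```python
from datetime import date, timedelta

REQUIRED = ("owner", "last-reviewed", "review-interval")


def parse_header(text: str) -> dict[str, str]:
    """Read 'key: value' lines between the first pair of '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def staleness_problems(text: str, today: date) -> list[str]:
    """Report missing header fields, or an overdue review if all fields exist."""
    fields = parse_header(text)
    problems = [f"missing header field: {name}" for name in REQUIRED if not fields.get(name)]
    if problems:
        return problems
    reviewed = date.fromisoformat(fields["last-reviewed"])
    due = reviewed + timedelta(days=int(fields["review-interval"]))
    if today > due:
        problems.append(f"review overdue since {due}")
    return problems
```

Run it over every file in docs/ on a schedule and route the output to the listed owner.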
4. Audit the codebase for inconsistent patterns
Before the agent starts, identify which patterns it will replicate at scale. If there are three ways of doing error handling, the agent will use all three, roughly in proportion to how often each appears in the code it reads. Decide which is canonical and document it. Deprecate the others explicitly.
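A rough count is usually enough to see which pattern dominates. The sketch below assumes a Python codebase and three hypothetical error-handling styles; the regular expressions are deliberately crude and only illustrate the idea.

```python
import re
from collections import Counter

# Hypothetical patterns for three competing error-handling styles.
STYLES = {
    "raise_custom": re.compile(r"\braise \w+Error\("),
    "return_none": re.compile(r"\breturn None\b"),
    "bare_except": re.compile(r"^\s*except\s*:", re.MULTILINE),
}


def count_styles(sources: dict[str, str]) -> Counter:
    """Count matches of each style across a mapping of path -> file contents."""
    totals = Counter()
    for text in sources.values():
        for name, pattern in STYLES.items():
            totals[name] += len(pattern.findall(text))
    return totals
```

The counts tell you what the agent is most likely to imitate today, which is a useful input when you pick the canonical pattern.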
Enforcement — make rules mechanical
5. Convert your top three conventions into linter rules
If a convention matters enough to document, it matters enough to mechanically enforce. Pick the three that are most commonly violated and write linter rules for them. They should run on every CI build and on every agent-generated PR. Unlike a docs file, a linter that gates the merge is a harness mechanism the agent cannot ignore.
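A custom rule does not have to be elaborate. As a minimal sketch, assume one of your conventions is "use the logger, never `print()`" in a Python codebase; the rule and the message text are hypothetical, and most teams would eventually package this as a plugin for the linter they already run.

```python
import ast


def find_print_calls(source: str, filename: str = "<string>") -> list[str]:
    """Return 'file:line' messages for every direct call to print()."""
    tree = ast.parse(source, filename=filename)
    violations = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ):
            violations.append(f"{filename}:{node.lineno}: use the logger instead of print()")
    return violations
```

Exit non-zero in CI when the list is non-empty, so the check blocks the merge rather than producing a warning nobody reads.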
6. Add a structural test for architectural boundaries
Module boundaries, dependency directions, import restrictions. If your architecture says the database layer should never be imported from the presentation layer, there should be a test that fails if it is. Not a convention — a hard check.
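The boundary check can be an ordinary test. This sketch assumes top-level packages named `presentation` and `database` (hypothetical names) and inspects absolute imports only; relative imports are skipped for brevity.

```python
import ast

# Hypothetical layer names; substitute the packages your architecture defines.
FORBIDDEN = {"presentation": ("database",)}


def imported_modules(source: str) -> set[str]:
    """Collect the dotted module names a source file imports (absolute imports only)."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)
    return modules


def boundary_violations(layer: str, source: str) -> list[str]:
    """List imports in `source` that the given layer is not allowed to make."""
    banned = FORBIDDEN.get(layer, ())
    return sorted(
        module
        for module in imported_modules(source)
        if any(module == b or module.startswith(b + ".") for b in banned)
    )
```

In a test suite, read each file under the presentation package and assert that `boundary_violations` returns an empty list.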
7. Verify linters run on agent-generated PRs, not just developer PRs
Some teams have CI configs that skip linter runs for specific PR authors or branch patterns. Check that agent PRs go through the same checks as human PRs. The harness only works if enforcement is consistent.
8. Set up a code review template that asks: did the agent follow conventions?
Add a single checkbox to the PR template: 'I verified this PR follows the conventions in docs/.' This creates a forcing function for reviewers to check agent output against the harness, not just for functional correctness.
Task design — make agent input machine-actionable
9. Write acceptance criteria in the format: given / when / then
Vague task descriptions produce vague output. The agent needs to know what state the world is in before the task, what action triggers the expected outcome, and what the outcome looks like. If you cannot express a task in given/when/then, the specification is not complete enough for an agent.
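Given/when/then maps directly onto a test, which is one reason it works well as agent input. The example below uses a hypothetical `Cart` class purely for illustration; the point is the shape of the criteria, not the domain.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    """Hypothetical domain object used only for this illustration."""

    items: dict[str, int] = field(default_factory=dict)

    def add(self, sku: str, quantity: int = 1) -> None:
        if quantity < 1:
            raise ValueError("quantity must be at least 1")
        self.items[sku] = self.items.get(sku, 0) + quantity


def test_adding_existing_item_increases_quantity() -> None:
    # Given a cart that already holds one unit of SKU "A1"
    cart = Cart(items={"A1": 1})
    # When the same SKU is added again with quantity 2
    cart.add("A1", quantity=2)
    # Then the cart holds three units of "A1" and no other entries
    assert cart.items == {"A1": 3}
```

If you cannot write the three comments for a task, that is a sign the task description needs more work before it goes to the agent.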
10. Include a reproduction step for every bug fix task
The OpenAI harness team required that the agent could reproduce a bug before writing a fix. An agent that cannot reproduce the bug is guessing at the fix. If the task description doesn't include a reproduction path, the agent will write a fix that may not address the actual issue.
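A reproduction step can be expressed as a check that is true while the bug exists and false once it is fixed. The sketch below invents a small price-parsing bug for illustration; `parse_price`, `parse_price_fixed`, and the input value are all hypothetical.

```python
def parse_price(text: str) -> float:
    """Hypothetical buggy function: drops everything after a thousands separator."""
    return float(text.split(",")[0])


def reproduces_bug(parse) -> bool:
    """Reproduction step from the task: '1,299.00' should parse as 1299.0."""
    return parse("1,299.00") != 1299.0


def parse_price_fixed(text: str) -> float:
    """Candidate fix: remove thousands separators before converting."""
    return float(text.replace(",", ""))
```

The agent should first confirm that `reproduces_bug(parse_price)` is true, then show that the same check is false for its fix.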
11. Test that the agent can merge without human intervention on a simple task
Pick a low-risk, well-specified task and let the agent take it from description through merged PR without human intervention. If it fails, identify what was missing from the harness — not from the agent. The failure mode tells you what to add to the docs, linters, or task template.
Maintenance — keep the harness from decaying
12. Schedule a weekly garbage collection agent run
One agent, one convention, one PR. The agent scans the codebase for violations of a specific documented convention and opens small targeted PRs to fix them. Each PR should be reviewable in under a minute. This prevents entropy from accumulating silently.
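Keeping each cleanup PR small is mostly a batching problem. This sketch assumes violations arrive as `(path, line)` pairs from whatever scanner you use, and that three files per PR is small enough to review quickly; the limit is an assumption to tune.

```python
from itertools import islice


def plan_cleanup_prs(
    violations: list[tuple[str, int]], max_files_per_pr: int = 3
) -> list[list[str]]:
    """Group violating files into small batches, one batch per proposed PR."""
    files = sorted({path for path, _line in violations})
    iterator = iter(files)
    batches = []
    # Take up to max_files_per_pr files at a time until none remain.
    while batch := list(islice(iterator, max_files_per_pr)):
        batches.append(batch)
    return batches
```

Each inner list becomes the scope of one agent task, for example "fix the logging convention in these three files".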
13. Add a docs/ review checklist to every PR that changes a pattern
Any PR that changes how the codebase handles a concern covered in docs/ should update the relevant doc file. Add this to your PR template as a checkbox. Without this, conventions evolve in the code while the harness docs fall behind.
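A script can back up the checkbox by flagging PRs that touch a documented area without touching its doc. The mapping below is hypothetical; since not every change in an area alters the pattern, this works better as a reminder comment than as a hard failure.

```python
# Hypothetical mapping from source areas to the docs/ file that governs them.
COVERED = {
    "src/errors/": "docs/error-handling.md",
    "src/logging/": "docs/logging.md",
}


def missing_doc_updates(changed_files: list[str]) -> list[str]:
    """Return docs files that cover changed code but were not changed themselves."""
    changed = set(changed_files)
    missing = {
        doc
        for prefix, doc in COVERED.items()
        if doc not in changed and any(path.startswith(prefix) for path in changed)
    }
    return sorted(missing)
```

Feed it the list of files changed in the PR and post the result as a review comment.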
14. Run a harness review every 90 days
Read through AGENTS.md and all docs/ files. Check: are the rules still accurate? Are there new patterns that need documentation? Are any documented patterns no longer relevant? A quarterly review session of 2–3 hours prevents the harness from drifting into uselessness.
15. Track: are repeated reviewer comments decreasing over time?
The signal that a harness is working is that code review catches fewer convention violations over time — not because reviewers are less careful, but because the agent is following conventions better. If reviewer comments repeat, the harness is missing a rule. Add it.
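Tracking this does not need a dashboard to start with. The sketch assumes someone tags each convention-related review comment with a week label and a category, for example `("2024-W01", "naming")`; the tagging scheme and the threshold of three are assumptions.

```python
from collections import Counter, defaultdict


def weekly_counts(comments: list[tuple[str, str]]) -> dict[str, dict[str, int]]:
    """Count convention-related review comments per week and category."""
    per_week: dict[str, Counter] = defaultdict(Counter)
    for week, category in comments:
        per_week[week][category] += 1
    return {week: dict(counts) for week, counts in sorted(per_week.items())}


def recurring_categories(comments: list[tuple[str, str]], min_count: int = 3) -> list[str]:
    """Return categories that reviewers have raised at least `min_count` times."""
    totals = Counter(category for _week, category in comments)
    return sorted(c for c, n in totals.items() if n >= min_count)
```

A category that keeps showing up in `recurring_categories` is a candidate for a new docs/ rule or linter check.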
Where most teams are in practice
| Stage | Typical checklist coverage | Common result |
|---|---|---|
| Just getting started | Part of 1–3 (AGENTS.md exists, no docs/ yet) | Agent follows some conventions, misses others inconsistently |
| Active adoption | 1–8 (foundation + some linters) | Quality improving but entropy accumulating in unlinted areas |
| Mature harness | 1–12 (enforcement + GC running) | Agent output increasingly reliable; humans reviewing logic not style |
| Compounding returns | 1–15 (full harness + metrics) | Convention violations decrease over time; harness self-improves |
The one thing to do today
If you're not sure where to start: open a new file called AGENTS.md in your repository root and write down the three things a senior engineer checks for in every code review. Those are your first taste invariants. Make them explicit and concrete, and include one counter-example for each. That's the beginning of a harness. Everything else on this checklist builds from there.
Related reading: AGENTS.md: the complete field guide, AI coding entropy, and AI didn't remove engineering judgment — it moved it upstream.