Blog · AI & platform

The Complete Guide to Cutting AI Agent Token Costs

Token costs are the hidden tax on every production AI agent. They start invisible in prototypes and become the dominant cost line in production. This guide covers eight techniques that reduce token consumption without cutting capability — and explains which ones compound over time.

Why token costs matter at production scale

A prototype agent with a 10K-token system prompt costs almost nothing per run. The same agent at 10,000 runs per day costs real money — and if each run touches multiple tools, generates multi-turn conversations, and receives large context windows of retrieved data, the cost per run multiplies.

The math that catches most teams off-guard: costs scale with both the number of runs and the tokens per run. Doubling agent usage doubles the bill. But bloated prompts mean every new user multiplies a large base. A 50% reduction in tokens per run is equivalent to halving infrastructure costs — and unlike infrastructure, token reduction often also improves quality by removing noise from the context.

Cost sensitivity: tokens × runs × model price

Baseline: 20K tokens/run × 1K runs/day20M tokens/day
After progressive disclosure (−40%)12M tokens/day
After prompt caching (−60% of repeated input)~8M billed tokens/day
Combined (compounding)>60% cost reduction

Technique 1: Progressive disclosure

The most impactful single technique. Instead of loading all instructions, all tools, and all context into every agent call, structure information into layers and load only what the current task needs.

L1 — Always loadedCore identity, the tools available (names + one-line descriptions), safety rules. Under 500 tokens.
L2 — Loaded when relevantFull instructions for the specific task or tool the agent is actually using. Triggered by task type or tool selection.
L3 — Retrieved on demandReference material, documentation, examples, historical context. Fetched only when the agent explicitly needs it.

The mental model: a whiteboard (L1) tells the agent what tools are available. The filing cabinet (L2/L3) holds the full instructions. The agent checks the whiteboard first, then fetches details from the cabinet only for what it's doing right now.

Token impact: teams implementing progressive disclosure typically see 30–60% reduction in average tokens per run, because most runs only use a subset of available capabilities.

Technique 2: Skills over mega-prompts

A mega-prompt tries to cover every possible task the agent might face. It's loaded every time, costs tokens every time, and most of it is irrelevant to any given run. Skills replace the mega-prompt with targeted, reusable operating procedures loaded only when a specific task type is triggered.

A skill is a packaged procedure: trigger conditions, required context, step-by-step process, decision checkpoints, and quality criteria. An agent with 20 skills loads one skill per task, not all 20. The token cost is proportional to the task — not to the full breadth of everything the agent could do.

The compounding benefit: skills also improve quality. A targeted procedure is more precise than a catch-all prompt, and the agent has fewer irrelevant instructions competing for attention.

Technique 3: Prompt caching

Major LLM providers (Anthropic, OpenAI, Google) offer prompt caching: if the beginning of a prompt is identical across requests, cached tokens are billed at a fraction of the standard rate (typically 10–25% of input token cost).

To maximise cache hit rate:

  • Put static content (system prompt, tool definitions, stable instructions) at the start of the prompt, before dynamic content.
  • Keep the static prefix as long as possible — longer cache prefixes mean more savings per request.
  • Avoid putting timestamps, session IDs, or any per-request content in the cached section.
  • Make cache breakpoints explicit where supported (Anthropic's cache_control markers).

For agents where the system prompt constitutes 40–70% of input tokens, prompt caching alone can reduce effective input costs by 50–80% at high request volumes.

Technique 4: Context compaction

In multi-turn agent conversations, context grows with every exchange. Without management, a session that starts at 2K tokens reaches 100K tokens after enough turns. Context compaction summarises earlier turns into a compressed representation that preserves relevant state without keeping the full conversation history.

Compaction strategies by approach:

StrategyHow it worksBest for
Rolling summarySummarise turns older than N into a compact state blockLong sessions with stable goals
Entity extractionMaintain a structured state object (entities, decisions, actions taken)Structured workflows with clear state
Sliding windowKeep only the last N turns verbatimShort-horizon tasks where recent context dominates
Checkpoint and resetCompact the full context at task boundariesMulti-step tasks with clear phases

Technique 5: Lean tool design

Tool definitions consume tokens every time they appear in the context. An agent with 30 tools, each with a verbose description and parameter schema, may spend 5–10K tokens on tool definitions alone — before any task context is added.

Token-efficient tool design principles:

  • One-line descriptions. The tool description in the context should be the minimum needed for routing. Detailed usage instructions belong in L2/L3 skill content loaded only when the tool is selected.
  • Prune unused tools per task. Don't load all tools for every call. Load only the tools relevant to the current task category.
  • Group related tools. A single "read_data" tool with a type parameter is more token-efficient than five separate read tools with individual definitions.
  • Avoid redundant parameter documentation. If the parameter name is self-documenting (user_id, start_date), the description can be minimal or omitted.

Technique 6: Model routing

Not every task needs the most capable (and most expensive) model. A production agent system that routes tasks to the appropriate model tier can reduce costs by 60–80% on the tasks that don't require frontier reasoning.

Frontier model (e.g. Opus, GPT-4o)

Complex reasoning, ambiguous instructions, novel problems, high-stakes decisions

Mid-tier model (e.g. Sonnet, GPT-4o-mini)

Standard task execution, code generation with clear specs, structured data extraction

Small/fast model (e.g. Haiku, GPT-3.5-turbo)

Routing decisions, simple classification, format conversion, confirmation steps

The routing decision itself can be made by a small model or a rule-based classifier, adding minimal overhead while enabling order-of-magnitude cost differences on the routed calls.

Technique 7: Batching and async

Many agent tasks are not time-sensitive. Background analysis, report generation, nightly summaries, and non-urgent automation can be batched and processed asynchronously at off-peak times or via batch API endpoints.

Anthropic's Message Batches API, for example, offers 50% cost reduction on requests that can tolerate up to 24-hour latency. For agents running large-scale analysis or non-realtime workflows, batching is a structural cost reduction that requires no change to the agent logic — only to the delivery mechanism.

Technique 8: Token budgets and circuit breakers

Without explicit budgets, agents can spiral: a task that should take 5K tokens ends up consuming 80K because the agent keeps trying approaches, accumulating tool results, and expanding its context. Budget controls prevent runaway spend.

Three budget mechanisms:

  • Per-task token budget. Set a maximum token budget per task type. When the budget is reached, the agent must either complete with what it has or escalate to a human rather than continuing to expand context.
  • Budget awareness in the prompt. Anthropic's extended thinking API supports explicit budget_tokens parameters. More broadly, telling the agent its token budget changes its behaviour — agents given explicit budgets tend to be more concise.
  • Circuit breakers at the harness level. Instrumentation that detects when a session has exceeded expected token consumption and pauses execution pending human review. Especially important for agentic loops where a misguided agent could consume unlimited tokens.

Which techniques compound

TechniqueTypical reductionCompounds withEffort
Progressive disclosure30–60%Skills, cachingMedium
Skills20–50%Progressive disclosureMedium
Prompt caching40–80% of cached tokensProgressive disclosureLow
Context compaction20–70% in long sessionsAll techniquesMedium
Lean tool design5–20%Progressive disclosureLow
Model routing50–80% on routed tasksAll techniquesHigh
Batching50% (batch API)Model routingLow
Token budgetsPrevents runaway (tail cost)All techniquesLow

The highest-leverage starting point for most teams: implement prompt caching (low effort, high impact on any agent with a large static system prompt) and progressive disclosure (medium effort, compounds with almost everything else). These two alone typically achieve 50–70% cost reduction.

Frequently asked questions

Does reducing tokens reduce quality?

Not usually — often the opposite. Noise in the context window degrades quality. Removing irrelevant instructions, compacting stale conversation history, and loading only task-relevant tools gives the model a cleaner signal. Quality problems from token reduction usually come from removing genuinely necessary context, not from removing bloat.

How do I measure baseline token usage before optimising?

Log token counts (input, output, cached) per request at the harness level. Segment by task type and agent. Identify the highest-spend task types first — a 40% reduction on a task that represents 60% of your volume is more valuable than a 90% reduction on a rare task.

Which model providers support prompt caching?

Anthropic (Claude) supports explicit cache_control markers for fine-grained cache control. OpenAI automatically caches the longest matching prefix for eligible models. Google Gemini supports context caching with explicit cache objects. Implementation details vary — check current provider documentation.

How do token costs relate to agent latency?

Fewer input tokens generally means lower latency (less to process before generating). Shorter outputs reduce latency proportionally. Model routing to smaller models reduces latency dramatically. Progressive disclosure and context compaction both reduce latency and cost simultaneously.

Key takeaway

Token costs are an engineering problem, not a budget problem. The solution is architectural: a harness that loads the right information at the right granularity, caches the static parts, routes to the right model, and enforces spending limits before sessions spiral. Start with caching and progressive disclosure — both have immediate payback and compound with every other technique.

Related reading: your AI agent is burning money — and the fix, agent loops, tokenomics, and the harness, and skills make judgement reusable.