Understanding token costs
Why are AI agent token costs so much higher than chatbot costs?
AI agents consume tokens on three fronts simultaneously: (1) input tokens for the system prompt, tool definitions, and conversation history; (2) output tokens for each response; (3) tool-use tokens for reading tool results back into the context. In a chatbot, the user provides context. In an agent, the harness assembles a large context programmatically on every call. A typical production agent call can consume 10–50x more tokens than a simple chat exchange.
What are input tokens vs output tokens?
Input tokens are everything sent to the model: the system prompt, tool definitions, conversation history, retrieved documents, and tool results. Output tokens are what the model generates in response. Input tokens are typically cheaper than output tokens per unit, but agents tend to have large input contexts — so input cost often dominates.
What is a token, and how does it relate to words?
A token is the unit of text that LLMs process. Roughly speaking, one token equals about 0.75 words in English, or about 4 characters. A 1,000-word document is approximately 1,300 tokens. Code is often more token-dense than prose because of symbols, indentation, and variable names.
How do I measure my current token consumption?
Log token counts at the harness level on every API call: input tokens, output tokens, and (if available) cached tokens. Segment by task type and agent to identify your highest-spend patterns. Most providers return token usage in the API response. The goal is to understand your cost distribution before optimising — a 40% reduction on your highest-spend task type is more valuable than a 90% reduction on a rare one.
What is the biggest single source of token waste in most agents?
The system prompt. Most agents load all instructions, all tool definitions, and all context on every call — regardless of whether the current task uses any of it. A 15,000-token system prompt that covers 40 possible scenarios costs 15,000 tokens even when the current task only needs 2,000 tokens of those instructions.
How do token costs scale with agent usage?
Linearly with both volume and context size. Double the number of agent runs → double the cost. Double the average tokens per run → double the cost. The danger is when both increase simultaneously: a new feature that adds 5,000 tokens to the system prompt and 10x the user base creates a 10x cost increase from the usage growth plus a 30% increase from the prompt bloat — combined effect is roughly 13x.
Reduction techniques
What is progressive disclosure, and how much does it save?
Progressive disclosure loads information in layers: L1 (always loaded — core identity, tool names, safety rules, under 500 tokens), L2 (loaded when relevant — full instructions for the specific task), L3 (retrieved on demand — reference docs, examples). Most agent runs only use a subset of capabilities, so loading only what's needed for each run reduces average tokens by 30–60%.
What is prompt caching, and how much does it save?
Prompt caching is a feature from Anthropic, OpenAI, and Google that bills repeated identical prompt prefixes at 10–25% of the standard input token rate. When the first part of your prompt (system prompt, tool definitions) is identical across requests, cached tokens cost a fraction of normal. For agents where the system prompt is 40–70% of input tokens, caching alone can reduce effective input costs by 50–80% at scale.
How do I maximise prompt cache hit rates?
Place all static content (system prompt, tool definitions, stable instructions) at the very beginning of the prompt, before any dynamic content. Keep the static prefix as long as possible — longer cached prefixes mean more savings per request. Never put per-request content (timestamps, session IDs, user-specific data) before the cacheable content. Use explicit cache_control markers where supported (Anthropic's API).
What are skills, and how do they reduce token costs?
Skills are reusable operating procedures for specific task types — trigger conditions, required context, step-by-step process, and quality criteria packaged together. An agent with 20 skills loads one skill per task instead of all 20. The token cost scales with task complexity, not with the total breadth of agent capabilities. Skills also improve quality by giving the agent precise, targeted instructions rather than a catch-all prompt.
What is context compaction?
Context compaction summarises older turns in a multi-turn agent session into a compressed representation, replacing the full conversation history with a shorter state description. Without compaction, a long session grows linearly in token cost. With compaction, earlier turns are replaced by summaries that cost a fraction of the original. For sessions with many turns, compaction can reduce context size by 50–80%.
How does model routing reduce costs?
Model routing directs different task types to different model tiers: simple routing or classification to small/fast/cheap models; standard execution to mid-tier models; complex reasoning to frontier models. Because frontier models cost 10–50x more per token than small models, routing even 50% of tasks to smaller models can reduce average cost per run by 40–60%.
What is the batch API, and when should I use it?
Major providers offer batch endpoints (Anthropic's Message Batches API, OpenAI's Batch API) that process requests with up to 24-hour latency at 50% of the standard price. Use it for any agent task that is not time-sensitive: background analysis, report generation, nightly summaries, non-urgent classification. Requires no change to agent logic — only to how requests are submitted.
How do lean tool definitions reduce costs?
Tool definitions appear in the context on every call. An agent with 30 verbose tool definitions can spend 5,000–10,000 tokens on definitions alone before any task content. Lean tool design: one-line descriptions in the context (detailed instructions load via progressive disclosure when the tool is used), prune unused tools per task type, and group related tools to reduce definition count.
Token budgets and controls
What is a token budget for an AI agent?
A maximum token limit set per task type that the agent must complete within. When the budget is reached, the agent must either complete with available context or escalate to a human — it cannot continue expanding its context window indefinitely. Token budgets prevent runaway spending where a confused agent keeps trying approaches and accumulating tool results.
Does telling the agent its token budget change its behaviour?
Yes, meaningfully. Anthropic's extended thinking API supports explicit budget_tokens parameters. More broadly, agents given explicit budget constraints in their instructions tend to be more concise in their reasoning and output. The instruction 'you have 2,000 tokens for this task' produces different (and often better) output than no constraint.
What is a circuit breaker in an agent context?
Instrumentation at the harness level that detects when a session has exceeded expected token consumption and pauses execution pending human review. Especially important for agentic loops where a misguided agent could consume unlimited tokens attempting to complete a task it cannot complete. Circuit breakers are a cost control mechanism and a safety mechanism simultaneously.
What is the token budget for Claude's extended thinking feature?
Anthropic's extended thinking feature accepts a budget_tokens parameter that sets the maximum tokens for the model's internal reasoning before producing a response. Larger budgets enable more complex reasoning; smaller budgets are faster and cheaper. The optimal budget depends on task complexity — over-budgeting thinking tokens is a common source of unnecessary cost.
Costs by provider
Which LLM providers offer prompt caching?
Anthropic (Claude models) supports prompt caching with explicit cache_control markers. OpenAI automatically caches the longest matching prefix for eligible GPT-4o and o-series models. Google (Gemini) supports context caching with explicit cache objects and a minimum token threshold. Implementation details vary by provider — check current documentation as pricing and availability change.
Do token costs differ between input and output tokens?
Yes — output tokens are typically priced higher than input tokens per unit. For example, on many Claude models, output tokens cost roughly 3–5x more than input tokens. For most agentic workflows, input tokens dominate due to large context windows, but output-heavy tasks (long code generation, detailed analysis) can shift the balance.
How do cached input tokens compare in price to uncached input tokens?
Anthropic charges approximately 10% of the standard input rate for cache hits (cache read) and 125% of standard for writing to cache (cache write, billed once). OpenAI charges approximately 50% of standard for cached input. Google's pricing varies by model tier. At high volumes, the cache write cost is quickly amortised — after a handful of requests with the same prefix, caching pays for itself.
Practical application
Where should I start if I want to cut costs immediately?
Implement prompt caching first — it is the lowest-effort, highest-impact technique for any agent with a large static system prompt. Enable it on your highest-volume agent, put static content first in the prompt, and measure the cache hit rate. For most teams, this alone produces 30–60% reduction in effective input costs within days.
How do token costs relate to agent latency?
Fewer input tokens generally means lower latency — less to process before generating. Shorter outputs reduce latency proportionally. Model routing to smaller models reduces latency dramatically. Progressive disclosure and context compaction both reduce latency and cost simultaneously. Optimising for tokens is almost always optimising for latency as well.
Can reducing tokens hurt quality?
Only if genuinely necessary context is removed. Removing bloat — irrelevant instructions, verbose tool definitions, full conversation history when a summary suffices — typically improves quality by giving the model a cleaner signal. Quality problems from token reduction come from removing context the model actually needs, which is why measuring before optimising matters.
What is the relationship between token costs and the harness?
The harness is the primary cost lever for production agents. Progressive disclosure, skills, and context scoping are harness design decisions — they determine what information is loaded per call. A well-designed harness is also a token-efficient harness, because it loads only what the current task needs. Token cost is an emergent property of harness architecture.