Cost Optimization for AI Agents: Spending Smart on LLM Calls
By Diesel
Tags: architecture, cost optimization
## The $10,000 Surprise
I know someone who shipped an AI agent to production on a Friday afternoon. By Monday morning, the LLM bill was $10,000. The agent worked perfectly. It just worked expensively. Every user interaction triggered an average of 12 LLM calls, each with a 4,000-token system prompt, processing conversations that grew unbounded.
Nobody noticed because the agent was producing correct results. That's the insidious thing about cost in AI systems. Unlike bugs, cost problems don't trigger error alerts. They trigger invoices.
Let's talk about how to not be that person.
## Understanding the Cost Structure
Before optimizing, understand what you're paying for. LLM pricing has four components:
**Input tokens**: what you send to the model. System prompts, conversation history, tool results. This is typically the largest cost bucket because agent contexts are huge.
**Output tokens**: what the model generates. Usually smaller but more expensive per token.
**Cache read tokens**: when prompt caching is available, repeated prefix tokens cost 90% less. This is free money if your prompts have stable prefixes.
**Cache write tokens**: the first time a cacheable prefix is sent. Slightly more expensive, amortized over subsequent reads.
```typescript
// Example pricing (Claude Sonnet, approximate)
const pricing = {
  inputPerMillion: 3.00,     // $3.00 per 1M input tokens
  outputPerMillion: 15.00,   // $15.00 per 1M output tokens
  cacheReadPerMillion: 0.30, // $0.30 per 1M cached tokens
  cacheWritePerMillion: 3.75, // $3.75 per 1M cache write tokens
};
```
Output tokens cost 5x input tokens. That means a chatty agent that generates long responses costs dramatically more than one that's concise. This isn't a style preference. It's a cost decision.
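To make that asymmetry concrete, here's a minimal per-request cost function using the rates above. The `Usage` shape and the function are illustrative, not part of any SDK:

```typescript
// Sketch: dollars for one request, given token counts and the
// Sonnet rates from the pricing table above.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
}

function requestCost(usage: Usage): number {
  const inputPerMillion = 3.0;
  const outputPerMillion = 15.0;
  const cacheReadPerMillion = 0.3;
  return (
    (usage.inputTokens * inputPerMillion +
      usage.outputTokens * outputPerMillion +
      usage.cacheReadTokens * cacheReadPerMillion) /
    1_000_000
  );
}
```

A request with 100K input tokens and 20K output tokens costs $0.60, and the output accounts for half of it despite being a fifth of the volume.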
## Optimization 1: Prompt Caching
If your agent uses the same system prompt across calls (and it should), prompt caching is the single biggest cost reduction available. The system prompt gets cached after the first call. Subsequent calls read from cache at 1/10th the price.
```typescript
class CacheAwareAgent {
  private systemPrompt: string;

  constructor(systemPrompt: string) {
    this.systemPrompt = systemPrompt;
  }

  async run(messages: Message[]): Promise<LLMResponse> {
    return llm.generate({
      model: "sonnet",
      system: [
        {
          type: "text",
          text: this.systemPrompt,
          cache_control: { type: "ephemeral" },
        },
      ],
      messages,
    });
  }
}
```

For a deeper look, see [deployment architecture choices](/blog/agent-deployment-patterns).
The `cache_control` hint tells the API to cache this content. First call pays full price. Every subsequent call within the cache TTL pays 10% on the cached portion. If your system prompt is 3,000 tokens and you make 100 calls, you save roughly $0.80. Scale that to thousands of users and it's material.
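The arithmetic behind that estimate, written out:

```typescript
// Worked example: 3,000-token system prompt, 100 calls, rates from the
// pricing table above (dollars per million tokens).
const promptTokens = 3_000;
const calls = 100;

// Without caching: full input price on every call
const uncached = (calls * promptTokens * 3.0) / 1_000_000; // $0.90

// With caching: one cache write, then 99 cache reads
const cached =
  (promptTokens * 3.75) / 1_000_000 +
  ((calls - 1) * promptTokens * 0.3) / 1_000_000; // ~$0.10

const savings = uncached - cached; // ~$0.80
```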
Structuring for cache efficiency means putting stable content first. System prompt, tool definitions, static instructions. These rarely change. Put dynamic content (conversation history, current context) after the cached prefix.
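A sketch of what that ordering looks like in practice. The request shape mirrors the Anthropic-style structure used in this post, and `SYSTEM_PROMPT` / `TOOL_INSTRUCTIONS` are placeholder constants:

```typescript
type Message = { role: "user" | "assistant"; content: string };
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

const SYSTEM_PROMPT = "You are a careful, concise agent...";
const TOOL_INSTRUCTIONS = "You can call these tools: ...";

function buildRequest(history: Message[], userInput: string) {
  const system: SystemBlock[] = [
    { type: "text", text: SYSTEM_PROMPT }, // stable: identical every call
    {
      type: "text",
      text: TOOL_INSTRUCTIONS, // stable: tool definitions rarely change
      cache_control: { type: "ephemeral" }, // cache everything up to here
    },
  ];
  // Dynamic content (history, current input) goes after the cached prefix
  return {
    model: "sonnet",
    system,
    messages: [...history, { role: "user" as const, content: userInput }],
  };
}
```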
## Optimization 2: Model Selection
Not every task needs your most expensive model. I covered the router pattern in a separate article, but the cost implications deserve emphasis.
```typescript
const modelCosts = {
haiku: { input: 0.25, output: 1.25 }, // per million tokens
sonnet: { input: 3.00, output: 15.00 },
opus: { input: 15.00, output: 75.00 },
};
```
Opus costs 60x more than Haiku for output tokens. That's not a rounding error. That's a fundamental architectural decision.
For a typical agent workload, the distribution might be:
- 60% of tasks: simple transforms, formatting, extraction. Haiku handles these fine.
- 30% of tasks: moderate reasoning, code generation, analysis. Sonnet is appropriate.
- 10% of tasks: complex multi-step reasoning, architecture, novel problems. Opus is justified.
If you route everything to Sonnet, call your average cost per task X. Route the simple 60% to Haiku (about 1/12th Sonnet's price) and your cost drops to roughly 0.45X, a 55% reduction just from matching the model to the task. Sending the hardest 10% to Opus claws some of that back, so reserve the upgrade for tasks where the quality difference is measurable.
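The routing itself can be as small as a tier-to-model map. How tasks get classified into tiers (heuristics or a cheap classifier call) is what the router article covers; this sketch assumes that's already done:

```typescript
type Tier = "simple" | "moderate" | "complex";

function pickModel(tier: Tier): "haiku" | "sonnet" | "opus" {
  switch (tier) {
    case "simple":
      return "haiku"; // transforms, formatting, extraction
    case "moderate":
      return "sonnet"; // moderate reasoning, code generation
    case "complex":
      return "opus"; // multi-step reasoning, novel problems
  }
}
```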
## Optimization 3: Context Window Management
Context grows with every turn. Every tool result, every assistant response, every user message adds tokens. An unbounded conversation history is an unbounded cost center.
```typescript
class ContextManager {
  private maxContextTokens: number;

  constructor(maxContextTokens: number) {
    this.maxContextTokens = maxContextTokens;
  }

  async prepare(messages: Message[]): Promise<Message[]> {
    const currentTokens = countTokens(messages);
    if (currentTokens <= this.maxContextTokens) {
      return messages;
    }

    // Strategy 1: Summarize old turns
    const summarized = await this.summarizeOldTurns(messages, {
      keepRecent: 6,
      summaryMaxTokens: 500,
    });
    if (countTokens(summarized) <= this.maxContextTokens) {
      return summarized;
    }

    // Strategy 2: Drop tool results older than N turns
    const pruned = this.dropOldToolResults(summarized, { keepRecent: 3 });
    if (countTokens(pruned) <= this.maxContextTokens) {
      return pruned;
    }

    // Strategy 3: Aggressive truncation
    return this.truncate(pruned, this.maxContextTokens);
  }
}
```
Set a hard limit on context size. Not the model's limit. YOUR limit. If the model supports 200K tokens, set your limit at 50K. You'll save money on every single call, and the model usually performs better with focused context anyway. The related post on [load balancing to control spend](/blog/load-balancing-ai-agents) goes further on this point.
## Optimization 4: Reducing LLM Roundtrips
Every roundtrip costs tokens. The system prompt gets sent again. The conversation history gets sent again. Reducing roundtrips is multiplicative savings.
```typescript
// Bad: multiple small calls
const format = await llm.generate("What format should this be in?");
const content = await llm.generate(`Generate content in ${format}`);
const review = await llm.generate(`Review this: ${content}`);
// 3 calls, system prompt sent 3 times

// Better: one comprehensive call
const result = await llm.generate(
  "Generate content for X. Choose the appropriate format. " +
    "Self-review before returning. Return the final version only."
);
// 1 call, system prompt sent once
```
This isn't always possible. Some tasks genuinely need multiple steps. But many agent loops make separate LLM calls for things that could be combined. Planning and first-step execution. Generation and self-review. Classification and routing. Combine what you can.
## Optimization 5: Caching Tool Results
Tools are often called with the same or similar inputs. A database schema doesn't change between calls. A file's contents don't change within a session. Cache aggressively.
```typescript
interface CacheEntry {
  result: unknown;
  expires: number;
}

class CachedToolRunner {
  private cache: Map<string, CacheEntry> = new Map();

  constructor(private runner: ToolRunner) {}

  async execute(tool: string, args: Record<string, unknown>): Promise<unknown> {
    const key = `${tool}:${stableHash(args)}`;
    const cached = this.cache.get(key);
    if (cached && cached.expires > Date.now()) {
      return cached.result;
    }

    const result = await this.runner.execute(tool, args);
    const ttl = this.getTTL(tool);
    if (ttl > 0) {
      this.cache.set(key, { result, expires: Date.now() + ttl });
    }
    return result;
  }

  private getTTL(tool: string): number {
    const ttls: Record<string, number> = {
      get_schema: 3600_000, // 1 hour (schemas rarely change)
      read_file: 300_000, // 5 minutes
      web_search: 60_000, // 1 minute
      get_current_time: 0, // never cache
    };
    return ttls[tool] ?? 60_000;
  }
}
```
Every cached tool result is a tool call you didn't make and a tool result you didn't send back to the LLM. Both save tokens. Both save time.
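The `stableHash` used above is left undefined; one way to implement it is to canonicalize the arguments (recursively sorting keys) before serializing, so logically equal argument objects produce the same cache key:

```typescript
// Recursively rebuild objects with sorted keys so that key order
// doesn't change the serialized form.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = canonicalize((value as Record<string, unknown>)[key]);
    }
    return sorted;
  }
  return value;
}

// Not cryptographic -- just a stable cache key.
function stableHash(args: Record<string, unknown>): string {
  return JSON.stringify(canonicalize(args));
}
```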
## Optimization 6: Output Constraints
Tell the model to be concise. Seriously. It works.
```typescript
const conciseSystemPrompt = `
You are a helpful assistant. Be direct and concise.
- Answer in 1-3 sentences when possible.
- Use bullet points for lists, not paragraphs.
- Skip preambles and qualifiers.
- If the answer is simple, give a simple answer.
`;
```
Reducing average output length from 500 tokens to 200 tokens saves 60% on the most expensive token type. It also makes your agent faster. Users prefer concise answers. Everyone wins.
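The savings are easy to sanity-check. The monthly request volume below is an assumed figure for illustration:

```typescript
// Trimming average output from 500 to 200 tokens, at $15 per million
// output tokens (the Sonnet rate from the pricing table).
const requestsPerMonth = 1_000_000; // assumed volume
const tokensSavedPerRequest = 500 - 200;
const outputPricePerMillion = 15.0;

const savedPerMonth =
  (tokensSavedPerRequest * requestsPerMonth * outputPricePerMillion) /
  1_000_000; // $4,500/month from a prompt-level style change
```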
For structured output, use JSON schemas or XML tags to constrain the format:
```typescript
const response = await llm.generate({
  messages: [{ role: "user", content: task }],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "task_result",
      schema: {
        type: "object",
        properties: {
          answer: { type: "string", maxLength: 500 },
          confidence: { type: "number" },
          sources: { type: "array", items: { type: "string" }, maxItems: 3 },
        },
        required: ["answer", "confidence"],
      },
    },
  },
});
```
## Optimization 7: Batching
If your agent processes multiple items, batch them into a single call instead of making separate calls for each. It is worth reading about [running models locally with Ollama](/blog/ollama-local-llms-agents) alongside this.
```typescript
// Bad: one call per item
for (const doc of documents) {
  const summary = await llm.generate(`Summarize: ${doc}`);
  summaries.push(summary);
}
// 10 documents = 10 calls = 10x system prompt cost

// Better: batch into one call
const prompt = documents
  .map((doc, i) => `Document ${i + 1}:\n${doc}`)
  .join("\n\n---\n\n");
const summaries = await llm.generate(
  `Summarize each of the following ${documents.length} documents separately. ` +
    `Return as JSON array.\n\n${prompt}`
);
// 10 documents = 1 call = 1x system prompt cost
```
There's a limit to how much you can batch before context quality degrades. But for straightforward tasks like classification, summarization, and extraction, batching 5-10 items per call works well.
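A small helper makes that cap explicit. `chunk` is a generic utility sketch, not part of any SDK:

```typescript
// Split work items into batches of at most `size` items per LLM call.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

With `chunk(documents, 10)`, 95 documents become 10 calls instead of 95, and each call keeps a manageable context.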
## The Cost Dashboard
You can't optimize what you don't measure. Build a cost dashboard from day one.
```typescript
const costMetrics = {
  // Per-request
  costPerRequest: histogram(),
  tokensPerRequest: histogram(),

  // Per-user
  dailyCostPerUser: gauge(),
  monthlyBudgetRemaining: gauge(),

  // Per-model
  callsByModel: counter({ labels: ["model"] }),
  costByModel: counter({ labels: ["model"] }),

  // Efficiency
  cacheHitRate: gauge(),
  averageContextUtilization: gauge(), // actual / max tokens
};
```
Set alerts on anomalies. A user suddenly costing 10x their average? An agent making 50 tool calls for a simple task? A cache hit rate that drops from 80% to 20%? These are the signals that prevent Monday morning surprises.
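The first of those alerts can be a one-liner. Where the trailing average comes from (your metrics store) is left out here, and the new-user floor is an assumed value:

```typescript
// Flag a user whose daily spend is 10x their trailing average.
function isCostAnomaly(todayCost: number, trailingAvgCost: number): boolean {
  if (trailingAvgCost === 0) return todayCost > 1.0; // new user: assumed $1 floor
  return todayCost / trailingAvgCost >= 10;
}
```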
## The Mindset Shift
Cost optimization for AI agents isn't about being cheap. It's about being intentional. Every token you spend should be earning its keep. Every LLM call should be necessary. Every model selection should be justified.
The agents that survive in production aren't the smartest ones. They're the ones that deliver results at a cost their operators can sustain. Build for that.