The average team overspends on LLM APIs by 40–60% — not because their product uses too much AI, but because they haven't applied basic cost optimization techniques. Here are seven that work in production.
1. Enable Prompt Caching (Save 50–90% on System Prompts)
Both OpenAI and Anthropic offer significant discounts on repeated input tokens.
- OpenAI: Automatically caches prompts that repeat the same prefix. Cached tokens cost $1.25/1M (50% off GPT-4o's $2.50/1M)
- Anthropic:
cache_controlparameter on system prompt blocks. Cached tokens cost $0.30/1M (90% off Sonnet 4's $3/1M)
Impact: If your system prompt is 2,000 tokens and you make 50,000 requests/month on Claude Sonnet 4:
- Without caching: $300/month on system prompt tokens
- With caching: $30/month — $270 saved
Enable caching for any static content: system prompts, tool schemas, few-shot examples.
2. Use the Batch API for Non-Real-Time Tasks (Save 50%)
OpenAI's Batch API and Anthropic's Message Batches API both offer 50% off for requests that don't need immediate responses.
Good batch candidates:
- Content moderation queues
- Document classification
- Embeddings generation
- Nightly report generation
- Offline data enrichment
Bad candidates: Anything user-facing where latency matters.
For a workload of 1M tokens/month on GPT-4o mini:
- Real-time: $0.15/1M input = $150/month
- Batch API: $0.075/1M input = $75/month
Use the Batch API Cost Calculator to model your specific workload.
3. Route by Complexity (Save 60–80% on Simple Queries)
Not every query needs GPT-4o. A simple yes/no classification doesn't need the same model as writing a legal summary.
Three-tier routing pattern:
| Tier | Model | Use for | Cost/1M tokens |
|---|---|---|---|
| Fast | GPT-4o mini / Claude Haiku 4 | Classification, extraction, simple Q&A | $0.15–0.80 |
| Standard | GPT-4o / Claude Sonnet 4 | Most production tasks | $2.50–3.00 |
| Premium | GPT-4o / Claude Opus 4 | Complex reasoning, code generation | $15–75 |
Route 70% of queries to the fast tier, 25% to standard, 5% to premium. The average cost per query drops dramatically.
Simple routing logic:
def route_query(query: str, context_length: int) -> str:
if len(query) < 100 and context_length < 500:
return "gpt-4o-mini"
elif context_length > 10000:
return "claude-sonnet-4" # better long-context handling
else:
return "gpt-4o"
4. Compress Your Prompts (Save 20–40%)
Long system prompts are expensive to repeat. Compression techniques that preserve instruction quality:
- Remove filler phrases: "Please make sure to always..." → "Always..."
- Use structured formats: Bullet lists and headers tokenize more efficiently than paragraphs
- Eliminate redundancy: Don't repeat constraints already implied by the task
- Use XML tags:
<instructions>blocks help models parse efficiently with fewer tokens
A 3,000-token system prompt compressed to 1,800 tokens saves 40% on every uncached request. Use the AI Prompt Cost Optimizer to model your savings.
5. Set Max Tokens Limits Aggressively
The most overlooked cost lever: models generate tokens up to max_tokens if not constrained. Bloated responses happen when you don't set limits.
Before: max_tokens: 4096 for a Q&A bot that typically needs 150 tokens
After: max_tokens: 300 with a truncation note in the system prompt
If your average response is 300 tokens but max_tokens is 4096, you're paying for the risk of a long response — not the actual response. In practice, models rarely generate maximum output, but the safety margin costs money.
Analyze your actual p95 output length in logs and set max_tokens to 150% of that.
6. Cache Responses for Repeated Queries (Save 40–70%)
LLM response caching is different from prompt caching. For queries where the same input generates the same output:
- Semantic similarity cache (Redis + embeddings): Cache responses for queries within cosine distance 0.95
- Exact match cache: Hash the full prompt; skip the API call entirely
For a customer support bot, 30–50% of queries are common repeats ("How do I cancel?", "Where's my order?"). Caching these saves their full cost.
Implementation: Embed the user query, search a vector store of past (query, response) pairs, return the cached response if similarity > 0.95.
7. Reduce Context Window Usage
Each token in the conversation history costs money. Common mistakes:
- Sending full chat history every request: Summarize older turns instead ("User and assistant discussed X, Y, Z in previous turns...")
- Including full retrieved documents: Send only the relevant paragraph, not the full page
- Verbose tool responses: Trim API responses before injecting into context
Cutting average context from 4,000 to 2,500 tokens is a 37.5% cost reduction — with zero impact on user experience in most cases.
Combined Impact
Apply all seven techniques to a $5,000/month API bill:
| Technique | Savings |
|---|---|
| Prompt caching | −$800 |
| Batch API (30% of workload) | −$450 |
| Model routing | −$1,200 |
| Prompt compression | −$300 |
| Max tokens limiting | −$200 |
| Response caching | −$400 |
| Context reduction | −$350 |
| Total | −$3,700 (74% reduction) |
The techniques compound. Start with prompt caching (easiest, highest ROI) and model routing (most impactful), then layer in the rest.