AI API costs don't scale linearly with usage — they scale with bad habits. The teams paying the most aren't using the API the most. They're using it the least efficiently.
Here are seven techniques that production teams use to cut costs by 60-80% without degrading output quality.
1. Model Routing (saves 60-80%)
The single biggest lever. Most requests in a production system don't need your flagship model. A tiered routing approach sends each request to the cheapest tier that can handle it:
- Tier 1 — Simple tasks (classification, extraction, yes/no): GPT-4o mini or Claude Haiku. Cost: $0.15-0.80/M tokens.
- Tier 2 — Standard tasks (drafting, summarization): GPT-4o or Claude Sonnet. Cost: $2.50-3.00/M tokens.
- Tier 3 — Complex tasks (deep reasoning, code): o3-mini or Claude Opus. Cost: $6-15/M tokens.
Teams that implement routing typically find 85-90% of requests comfortably handled by Tier 1.
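A minimal sketch of what a router can look like, assuming the OpenAI Python SDK; the `classify_task()` heuristic, tier mapping, and model names are illustrative placeholders rather than a production policy:

```python
# Hypothetical routing sketch: the tier rules and classify_task() heuristic
# are placeholders, not a drop-in implementation.
from openai import OpenAI

client = OpenAI()

MODEL_TIERS = {
    1: "gpt-4o-mini",  # classification, extraction, yes/no
    2: "gpt-4o",       # drafting, summarization
    3: "o3-mini",      # deep reasoning, code
}

def classify_task(prompt: str) -> int:
    """Crude heuristic router; real systems often use a small classifier
    model or keyword/length rules tuned on actual traffic."""
    if len(prompt) < 200 and "?" in prompt:
        return 1
    if any(kw in prompt.lower() for kw in ("prove", "debug", "refactor")):
        return 3
    return 2

def route(prompt: str) -> str:
    tier = classify_task(prompt)
    response = client.chat.completions.create(
        model=MODEL_TIERS[tier],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```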
2. Prompt Caching (saves 40-60% on cached tokens)
Both OpenAI and Anthropic offer discounts for repeated prompt prefixes:
| Provider | Cache discount | Min prefix size |
|---|---|---|
| OpenAI | 50% off | 1,024 tokens |
| Anthropic | 90% off (read), write at 25% premium | 1,024 tokens |
Structure your prompts: static content first (system instructions, examples, context docs), dynamic content last (user input). Only the static part gets cached.
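A rough sketch of that ordering using Anthropic's explicit cache markers (OpenAI caches qualifying prefixes automatically, so there only the ordering matters); the model name and placeholder strings are illustrative:

```python
# Static-first prompt ordering with an explicit cache marker on the prefix.
# The prefix must exceed the provider's minimum (1,024 tokens) to be cached.
import anthropic

client = anthropic.Anthropic()

STATIC_PREFIX = (
    "You are a support assistant...\n"   # system instructions
    "Example 1: ...\nExample 2: ...\n"   # few-shot examples
    "Reference docs: ..."                # context documents
)

def answer(user_input: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=500,
        system=[{
            "type": "text",
            "text": STATIC_PREFIX,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }],
        messages=[{"role": "user", "content": user_input}],  # dynamic part last
    )
    return response.content[0].text
```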
3. Token Budget Caps (saves 30-50% on output)
Output tokens cost 3-5x more than input tokens. Most teams never set max_tokens. They should.
Recommended budgets per task type:
- Binary classification: 5 tokens
- Short extraction: 50 tokens
- Customer support reply: 200 tokens
- Blog outline: 400 tokens
- Full document: 800 tokens
A model with no cap that "explains its reasoning" can emit 4x more output tokens than the task needs. Caps prevent runaway costs on these edge cases.
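A minimal sketch of per-task budgets based on the list above, assuming the OpenAI Python SDK; the task names and numbers are starting points to tune against real outputs:

```python
# Illustrative per-task output budgets; adjust the numbers to your own traffic.
from openai import OpenAI

client = OpenAI()

TOKEN_BUDGETS = {
    "binary_classification": 5,
    "short_extraction": 50,
    "support_reply": 200,
    "blog_outline": 400,
    "full_document": 800,
}

def complete(task_type: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=TOKEN_BUDGETS[task_type],  # hard cap on output spend
    )
    return response.choices[0].message.content
```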
4. Conversation History Compression (saves 40-60%)
In multi-turn conversations, the context window grows with every message, and you re-send (and pay for) the entire history on every API call. By turn 8, that history dwarfs the new user message.
Fix: After every 4-6 turns, inject a summary of prior turns and drop the raw history. The model gets context; you save 50-70% on token count.
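One way to wire this up, sketched with the OpenAI Python SDK; the turn threshold, summary prompt, and model choice are assumptions to adjust for your workload:

```python
# Compression loop sketch: once the history exceeds a threshold, replace the
# raw turns with a model-written summary plus the most recent exchange.
from openai import OpenAI

client = OpenAI()
SUMMARIZE_AFTER = 6  # compress once history exceeds this many messages

def compress_history(messages: list[dict]) -> list[dict]:
    if len(messages) <= SUMMARIZE_AFTER:
        return messages
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # a cheap model is fine for summarization
        messages=messages + [{
            "role": "user",
            "content": "Summarize the conversation so far in under 150 tokens, "
                       "keeping all facts, decisions, and open questions.",
        }],
        max_tokens=200,
    ).choices[0].message.content
    # Keep the summary plus the latest exchange; drop the rest of the history.
    return [{"role": "system", "content": f"Conversation summary: {summary}"}] + messages[-2:]
```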
5. Semantic Caching (saves 20-40%)
Store the embedding of each user query alongside the API response. On new queries, compute the embedding, check for similarity (cosine > 0.95), and return the cached result if found.
Works best for:
- FAQ chatbots (same questions asked repeatedly)
- Search-like queries
- Template-based generation
Libraries: GPTCache, LangChain's semantic caching, or build your own with pgvector.
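A toy in-memory version of the idea, assuming the OpenAI SDK and NumPy; a real deployment would swap the flat list for pgvector or one of the libraries above:

```python
# Toy semantic cache: embed each query, reuse a stored response when a new
# query's cosine similarity clears the threshold. Model names are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached response)
SIMILARITY_THRESHOLD = 0.95

def _embed(text: str) -> np.ndarray:
    vec = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
    v = np.array(vec)
    return v / np.linalg.norm(v)  # unit-normalize so dot product = cosine similarity

def cached_answer(query: str) -> str:
    q = _embed(query)
    for emb, response in _cache:
        if float(np.dot(q, emb)) > SIMILARITY_THRESHOLD:
            return response  # cache hit: no LLM call
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content
    _cache.append((q, response))
    return response
```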
6. Batch API for Non-Realtime Work (saves 50%)
OpenAI's Batch API charges 50% less for requests that don't need immediate responses (24h window). Anthropic offers similar async pricing.
Ideal for:
- Bulk content generation
- Nightly data enrichment
- Embedding large document sets
- Evaluation runs
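A sketch of submitting a batch job with the OpenAI Python SDK; the JSONL layout follows the Batch API request format, while the prompts and IDs here are placeholders:

```python
# Batch submission sketch: write one JSON line per request, upload the file,
# then create a batch against the 24h completion window.
import json
from openai import OpenAI

client = OpenAI()

requests = [
    {
        "custom_id": f"doc-{i}",  # used to match results back to inputs
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Summarize document {i}..."}],
            "max_tokens": 200,
        },
    }
    for i in range(1000)
]

with open("batch_input.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in requests)

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the 50% discount applies to this async window
)
print(batch.id)  # poll client.batches.retrieve(batch.id) until status == "completed"
```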
7. Structured Output + Shorter Prompts (saves 20-30%)
Verbose prompts waste input tokens. Long JSON schemas in every request add up. Refactoring prompts from paragraph instructions to concise bullet-point directives typically reduces input tokens by 20-35%.
Use structured output modes (JSON mode, function calling) to get machine-parseable results without asking models to "format your response as JSON" in the prompt.
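A sketch combining a terse bullet-style prompt with a JSON schema response format via the OpenAI SDK; the field names and schema are invented for illustration:

```python
# Short bullet directives plus a schema-constrained response: fewer input
# tokens and no "format your response as JSON" boilerplate in the prompt.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Extract from the ticket below:\n"
    "- product\n- issue_category\n- sentiment (positive|neutral|negative)\n\n"
    "Ticket: {ticket}"
)

schema = {
    "name": "ticket_fields",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "product": {"type": "string"},
            "issue_category": {"type": "string"},
            "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        },
        "required": ["product", "issue_category", "sentiment"],
        "additionalProperties": False,
    },
}

def extract(ticket: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(ticket=ticket)}],
        response_format={"type": "json_schema", "json_schema": schema},
        max_tokens=100,
    )
    return response.choices[0].message.content  # machine-parseable JSON
```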
Putting It Together
A team using all seven techniques on a $10,000/month API bill can realistically reach $2,000-4,000/month with equivalent output quality. The order of implementation by ROI:
- Model routing (biggest single impact)
- max_tokens caps (immediate; a one-parameter change)
- Prompt caching (requires prompt restructuring)
- Conversation compression (requires middleware)
- Semantic caching (requires infrastructure)
- Batch API (requires workflow separation)
- Prompt compression and structured output (requires rewriting prompts)
Use the AI Inference Cost Calculator to model your current spend before starting.