The gap between AI API bills for teams building the same product can be 3-5x, driven largely by differences in optimization practice. This guide covers the highest-impact techniques, with implementation specifics for each.
The Token Cost Baseline
Before optimizing, understand where your tokens go:
| Component | Typical % of total tokens |
|---|---|
| System prompt | 15-30% |
| Conversation history | 25-40% |
| User input | 10-20% |
| Retrieved context (RAG) | 15-30% |
| Output tokens | 10-25% |
The insight: Most optimization opportunity lies in system prompts and conversation history — not in user inputs or outputs.
Technique 1: System Prompt Compression
Verbose system prompts are the most common waste. Before:
You are a helpful customer support assistant for AcmeCorp. Your job is to
help customers with any questions they have about our products and services.
You should always be polite and professional. You should try to answer
questions accurately and helpfully. If you don't know the answer, you should
say so rather than making something up...
(~70 tokens)
After:
AcmeCorp support. Answer accurately, briefly. Unknown = say so. Escalate billing to human.
(~15 tokens)
Savings at scale: 55 tokens × 1,000,000 queries × $0.003/1K tokens = $165/month from one prompt edit.
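To check numbers like this against your own prompts, you can count tokens directly. Below is a minimal sketch using the tiktoken library; the prompt strings and pricing are placeholders taken from the example above, and the model-to-encoding mapping depends on your tiktoken version.

    import tiktoken

    # Use the tokenizer that matches your target model (requires a recent tiktoken).
    enc = tiktoken.encoding_for_model("gpt-4o")

    before = "You are a helpful customer support assistant for AcmeCorp. ..."  # full verbose prompt
    after = "AcmeCorp support. Answer accurately, briefly. Unknown = say so. Escalate billing to human."

    saved_per_query = len(enc.encode(before)) - len(enc.encode(after))
    # Monthly savings at 1M queries and $0.003 per 1K input tokens.
    monthly_savings = saved_per_query * 1_000_000 * 0.003 / 1_000
    print(saved_per_query, f"${monthly_savings:,.0f}/month")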
Technique 2: Intelligent Model Routing
Not all tasks require the most capable (expensive) model. Route by task complexity:
| Task type | Appropriate model | Cost multiplier |
|---|---|---|
| Simple Q&A, classification | GPT-4o-mini / Claude Haiku | 1x (base) |
| Summarization, extraction | GPT-4o-mini / Claude Haiku | 1x |
| Moderate reasoning | GPT-4o / Claude Sonnet | 10-15x |
| Complex analysis, code | GPT-4o / Claude Sonnet | 10-15x |
| Frontier tasks only | o3 / Claude Opus | 30-100x |
Implementation example using a classifier:
- Run cheap classification model (GPT-4o-mini) to score task complexity (1-5)
- Route 1-2 to Haiku/mini, 3-4 to Sonnet/4o, 5 only to Opus/o3
- Net result: 60-80% of requests go to cheap model, 15-30% to mid-tier, 5% to expensive
Typical savings: 50-70% on API costs with no quality degradation for simple tasks.
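A minimal routing sketch following those steps is below. The `call_model` helper and the exact model names are placeholders for whichever provider SDK and models you use; the thresholds mirror the 1-5 scoring scheme above.

    # call_model(model, prompt) is a placeholder wrapper around your provider SDK.
    CHEAP, MID, FRONTIER = "gpt-4o-mini", "gpt-4o", "o3"

    def classify_complexity(prompt: str) -> int:
        # Ask the cheap model for a single-digit 1-5 complexity score.
        score = call_model(CHEAP, "Rate the complexity of this task from 1 (trivial) "
                                  "to 5 (frontier). Reply with a single digit.\n\n" + prompt)
        try:
            return min(5, max(1, int(score.strip()[0])))
        except (ValueError, IndexError):
            return 3  # if the classifier misbehaves, default to the mid tier

    def route(prompt: str) -> str:
        score = classify_complexity(prompt)
        if score <= 2:
            return call_model(CHEAP, prompt)     # simple Q&A, classification, extraction
        if score <= 4:
            return call_model(MID, prompt)       # moderate reasoning, code
        return call_model(FRONTIER, prompt)      # frontier tasks only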
Technique 3: Response Caching
Cache semantically equivalent queries:
Exact caching: Store response by SHA-256 hash of prompt. Instant for identical queries.
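A minimal exact-cache sketch using only the standard library; the in-memory dict stands in for Redis or another shared store, and `llm.complete` is a placeholder for your provider call.

    import hashlib

    exact_cache = {}  # swap for Redis or another shared store in production

    def cached_complete(prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in exact_cache:
            exact_cache[key] = llm.complete(prompt)  # placeholder LLM client
        return exact_cache[key]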
Semantic caching: Use embedding similarity to find "close enough" prior responses.
- Cosine similarity > 0.95 = return cached response
- Tools: Redis with vector search, Pinecone, Weaviate
Implementation:

    def get_response(prompt, threshold=0.95):
        # embed(), vector_db, and llm are placeholders for your embedding model,
        # vector store client, and LLM provider SDK.
        embedding = embed(prompt)
        # Look up the single closest cached prompt by vector similarity.
        similar = vector_db.search(embedding, limit=1)
        if similar and similar[0].score > threshold:
            return similar[0].cached_response  # cache hit: skip the LLM call
        # Cache miss: call the model and store the result for future queries.
        response = llm.complete(prompt)
        vector_db.upsert(embedding, response)
        return response
Cache hit rates for production apps:
- FAQ/support systems: 30-60% hit rate
- Search features: 20-40% hit rate
- Creative/unique tasks: 5-15% hit rate
Technique 4: Conversation History Pruning
In multi-turn conversations, naive implementations send the full history each turn:

    Turn 1: 500 tokens
    Turn 2: 500 + 800 = 1,300 tokens input
    Turn 3: 1,300 + 700 = 2,000 tokens input
    ...
By turn 10: 8,000+ tokens per request just for history.
Pruning strategies:
- Rolling window: Keep only the last N messages (e.g., 6)
- Summarize + recent: Compress old messages to a summary, keep recent 4-6 in full
- Relevance filter: Only include messages semantically relevant to current query
Most chatbot applications see 40-60% token reduction from rolling window alone.
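A sketch of the first two strategies, assuming messages are the usual list of role/content dicts and `llm.complete` is a placeholder summarization call:

    def rolling_window(messages, keep_last=6):
        # Keep the system prompt, then only the most recent N turns.
        system = [m for m in messages if m["role"] == "system"]
        rest = [m for m in messages if m["role"] != "system"]
        return system + rest[-keep_last:]

    def summarize_plus_recent(messages, keep_last=4):
        system = [m for m in messages if m["role"] == "system"]
        rest = [m for m in messages if m["role"] != "system"]
        old, recent = rest[:-keep_last], rest[-keep_last:]
        if not old:
            return system + recent
        # Compress everything older than the recent window into one short turn.
        summary = llm.complete(
            "Summarize this conversation in under 100 tokens:\n"
            + "\n".join(f"{m['role']}: {m['content']}" for m in old)
        )
        return system + [{"role": "system", "content": "Conversation so far: " + summary}] + recent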
Technique 5: Structured Output Compression
When you need structured data, JSON output can be verbose:

    {
      "customer_sentiment": "positive",
      "confidence_score": 0.87,
      "key_themes": ["product_quality", "fast_delivery"],
      "recommended_action": "no_action_needed"
    }
Alternative: instruct the model to return compact format and parse it:
pos|0.87|quality,delivery|none
Then parse with regex. Input tokens often don't change, but output tokens can drop 60-70%.
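A parsing sketch for the compact format above; the field names mirror the JSON example, and the pattern assumes the pipe-delimited layout shown.

    import re

    COMPACT = re.compile(r"^(\w+)\|([\d.]+)\|([^|]*)\|(\w+)$")

    def parse_compact(line: str) -> dict:
        m = COMPACT.match(line.strip())
        if not m:
            raise ValueError(f"Unexpected format: {line!r}")
        sentiment, confidence, themes, action = m.groups()
        return {
            "customer_sentiment": sentiment,
            "confidence_score": float(confidence),
            "key_themes": themes.split(",") if themes else [],
            "recommended_action": action,
        }

    # parse_compact("pos|0.87|quality,delivery|none")
    # -> {"customer_sentiment": "pos", "confidence_score": 0.87, ...}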
Technique 6: Batch API Processing
For non-real-time workloads, use batch APIs:
| Provider | Batch discount | Min batch size |
|---|---|---|
| OpenAI Batch API | 50% off | 1 request |
| Anthropic Message Batches | 50% off | 1 request |
If your use case can tolerate 24-hour latency (document processing, analytics, classification runs), batch API cuts costs exactly in half.
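As an illustration, OpenAI's Batch API takes a JSONL file with one request object per line. The sketch below builds that file for a classification run; the model name, prompts, and file path are placeholders, and the upload/polling steps are left to the provider SDK, so check the current Batch API docs for exact field names.

    import json

    prompts = ["Classify the sentiment of review #1 ...", "Classify the sentiment of review #2 ..."]

    with open("batch_input.jsonl", "w") as f:
        for i, prompt in enumerate(prompts):
            f.write(json.dumps({
                "custom_id": f"req-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [{"role": "user", "content": prompt}],
                },
            }) + "\n")

    # Upload the file and create the batch job via the provider SDK; results come
    # back as a JSONL output file once the 24-hour completion window finishes.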
Technique 7: Prompt Caching for Repeated Context
For applications with long, repeated system prompts or large context:
| Provider | Cache pricing | TTL |
|---|---|---|
| Anthropic | 90% off cache reads (cache writes cost ~25% more) | 5 minutes, refreshed on each use |
| OpenAI | 50% off cached input tokens, applied automatically | Varies (typically minutes) |
For a system prompt of 5,000 tokens sent 100,000 times/month:
- Without caching: 500M tokens × $3/1M = $1,500
- With Anthropic caching: the first request pays full price (plus the cache-write premium); subsequent requests read the cache at 90% off = ~$160/month, assuming traffic is steady enough to keep the 5-minute cache warm
Savings: $1,340/month from one implementation.
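With Anthropic's SDK, caching is opted into per content block via `cache_control`. A minimal sketch is below; the model name and prompt text are placeholders, and the field layout follows Anthropic's prompt-caching documentation at the time of writing.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    LONG_SYSTEM_PROMPT = "..."  # the ~5,000-token support prompt from the example

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # Marks this block as cacheable; repeat requests within the TTL
                # read it back at the discounted cached-token rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "Where is my order?"}],
    )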
Combined Impact
Applying all 7 techniques to a production system spending $10,000/month:
| Technique | Cost reduction |
|---|---|
| System prompt compression | -8% |
| Model routing | -45% |
| Response caching (20% hit rate) | -15% |
| History pruning | -12% |
| Output compression | -5% |
| Batch processing (40% eligible) | -20% |
| Prompt caching | -12% |
These stack multiplicatively, not additively. Combined effect for many applications: 70-80% cost reduction.
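As a check on that claim, applying the table's reductions in sequence leaves 0.92 × 0.55 × 0.85 × 0.88 × 0.95 × 0.80 × 0.88 ≈ 0.25 of the original spend, so the $10,000/month example lands at roughly $2,500/month, about a 75% reduction.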
Use the AI Inference Cost Calculator to model your current spend and projected savings from optimization.