Part 1: Understanding Your AI Cost Structure
Where Your AI Budget Actually Goes
Most teams that have "an AI cost problem" don't know which calls are expensive. Before optimization, you need visibility:
The typical distribution:
- 20% of API calls → 80% of cost (Pareto applies to AI costs)
- A few expensive use cases (document analysis, complex reasoning) dominate the bill
- Hundreds of cheap use cases (classification, short responses) are individually trivial
Step 1: Instrument everything.
Every AI API call should log:
- Timestamp
- Model used
- Input token count
- Output token count
- Cost (calculate from token counts × pricing)
- Feature/use-case tag
- Response latency
- Whether it was cached
Without this data, optimization is guesswork. With it, you find the 20% of calls causing 80% of spend in about 30 minutes of analysis.
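If you want a concrete starting point, here is a minimal instrumentation sketch in Python. The record fields mirror the list above; the pricing values and helper names are illustrative rather than any vendor's official schema, and you would swap the final step for a write to your own analytics pipeline.

import time
from dataclasses import dataclass, asdict

# Illustrative per-1M-token prices (USD); keep in sync with current vendor price lists.
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

@dataclass
class AICallRecord:
    timestamp: float
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    feature_tag: str
    latency_ms: float
    cached: bool

def record_call(model: str, input_tokens: int, output_tokens: int,
                feature_tag: str, latency_ms: float, cached: bool) -> AICallRecord:
    prices = PRICING[model]
    cost = (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000
    record = AICallRecord(time.time(), model, input_tokens, output_tokens,
                          cost, feature_tag, latency_ms, cached)
    payload = asdict(record)
    # Ship `payload` to your analytics platform (BigQuery, Snowflake, Amplitude, ...).
    return record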
Recommended monitoring setup:
- Send all API call metadata to your analytics platform (BigQuery, Snowflake, Amplitude)
- Build a cost dashboard showing daily spend by model and use case
- Set up alerts when daily spend exceeds threshold ($X per day)
- Track cost per user / cost per business action (not just raw spend)
Cost Calculation Formula
Cost per API call = (Input tokens × Input price/1M) + (Output tokens × Output price/1M)
Example for GPT-4o Mini:
- 500 input tokens × $0.15/M = $0.000075
- 150 output tokens × $0.60/M = $0.000090
- Total: $0.000165 per call
At 1 million calls/day: $165/day, $4,950/month.
Example for GPT-4o (full model):
- 500 input tokens × $2.50/M = $0.00125
- 150 output tokens × $10.00/M = $0.0015
- Total: $0.00275 per call
At 1 million calls/day: $2,750/day, $82,500/month, roughly 17x more expensive for the same traffic.
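The same arithmetic as a quick sanity-check script; the prices are the per-million-token rates quoted above and will drift as vendors reprice, so treat them as inputs rather than constants.

def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    # Cost per API call = (input tokens x input price/1M) + (output tokens x output price/1M)
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

mini = cost_per_call(500, 150, 0.15, 0.60)    # $0.000165 per call
full = cost_per_call(500, 150, 2.50, 10.00)   # $0.00275 per call
print(f"GPT-4o Mini at 1M calls/day: ${mini * 1_000_000:,.0f}/day")   # $165/day
print(f"GPT-4o at 1M calls/day:      ${full * 1_000_000:,.0f}/day")   # $2,750/day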
Part 2: The Optimization Hierarchy
Work in this order — highest impact first:
1. Model Routing (Most Impact: 50-80% Cost Reduction)
The single highest-leverage optimization: use the right model for each task.
The framework:
Classify every task by:
- Complexity (simple/medium/complex)
- Quality requirements (adequate/good/excellent)
- Volume (low/medium/high)
Route accordingly:
- Simple + adequate + high volume → cheapest model (Gemini Flash, GPT-4o Mini)
- Medium + good quality + medium volume → mid-tier (Claude Sonnet, Gemini Pro)
- Complex + excellent quality + low volume → frontier (GPT-4o, Claude Opus)
Implementation:
def select_model(task_type: str, complexity: str) -> str:
    routing_table = {
        ("classification", "simple"): "gpt-4o-mini",
        ("classification", "complex"): "gpt-4o-mini",  # Still cheap enough
        ("extraction", "simple"): "gpt-4o-mini",
        ("summarization", "simple"): "gpt-4o-mini",
        ("summarization", "complex"): "gpt-4o",
        ("generation", "simple"): "gpt-4o-mini",
        ("generation", "complex"): "gpt-4o",
        ("reasoning", "complex"): "claude-opus-3",
        ("code_review", "complex"): "gpt-4o",
    }
    return routing_table.get((task_type, complexity), "gpt-4o-mini")
Real-world savings: A customer support AI routing classification to Mini ($0.0002/call) and complex resolution to Sonnet ($0.025/call) at a 90/10 split:
- Before (100% Sonnet): $25.00 per 1,000 calls
- After (90% Mini + 10% Sonnet): $0.18 + $2.50 = $2.68 per 1,000 calls (89% savings)
2. Prompt Caching (20-50% on Input Token Costs)
If you have a long system prompt that's the same across many requests, you're paying full price for it every time.
Anthropic Prompt Caching:
- Cache control: mark your system prompt for caching
- Cached tokens: $0.30/M input (vs. $3.00/M for uncached) — 90% savings
- Cache TTL: 5 minutes (refreshed each time the cached prefix is reused)
OpenAI Prompt Caching:
- Automatic for prompts >1024 tokens that share a prefix
- 50% discount on cached input tokens
- Cache duration: varies, typically minutes to hours
Implementation priority:
- Long system prompts (>500 tokens): highest caching ROI
- Document context that's the same across multiple queries (RAG context)
- Multi-turn conversations where early messages repeat
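A minimal sketch of marking a reusable system prompt for caching with the Anthropic Python SDK. The cache_control block is the documented mechanism; the model ID, variable names, and prompt contents here are placeholders, so confirm current parameter names against Anthropic's docs.

import anthropic

LONG_SYSTEM_PROMPT = "..."  # the reusable ~2,000-token instructions
user_query = "..."          # the per-request portion, which is not cached

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)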
Calculating cache savings:
If your system prompt is 2,000 tokens, used 1,000 times/day:
- Without caching: 2M tokens × $3.00/M = $6.00/day on just the system prompt
- With caching (90% cached hit rate): 200K uncached + 1.8M cached = $0.60 + $0.54 = $1.14/day (81% savings)
3. Batch Processing (50% on Eligible Workloads)
For async workloads (content moderation, data enrichment, nightly jobs), batch APIs offer flat 50% discounts.
OpenAI Batch API:
- 50% off standard pricing
- Results within 24 hours
- Minimum 10 requests per batch (thousands is fine)
Eligible workloads:
- Nightly sentiment analysis
- Batch document classification
- Content moderation pipelines
- Weekly data enrichment jobs
Not eligible: Anything user-facing or latency-sensitive.
Implementation: See the OpenAI Batch API guide. The engineering investment is 1-3 hours for most use cases.
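As a rough sketch of the flow with the OpenAI Python SDK: write requests to a JSONL file, upload it, and create the batch. The document list, model choice, and prompt are placeholders; consult the Batch API guide for current limits and result retrieval.

import json
from openai import OpenAI

client = OpenAI()
documents = ["First review text...", "Second review text..."]  # placeholder inputs

# One JSON object per line, each with a unique custom_id for matching results later.
with open("batch_input.jsonl", "w") as f:
    for i, doc in enumerate(documents):
        f.write(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": f"Classify the sentiment: {doc}"}],
                "max_tokens": 5,
            },
        }) + "\n")

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
# Poll client.batches.retrieve(batch.id) until it completes, then download the output file.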
4. Output Token Reduction (10-30% on Output Costs)
Output tokens cost 3-10x more than input tokens. Reducing them has high ROI.
Tactics:
Constrain output format: "Respond in JSON only. No explanation." Eliminates verbose preambles.
Specify maximum length: "In 2-3 sentences" or "Maximum 150 words" forces conciseness.
Structured outputs: OpenAI's Structured Outputs and Anthropic's tool_use mode enforce schema adherence, which eliminates verbose fallback behavior.
Few-shot examples of concise output: Show the model what a good short answer looks like in the prompt.
Before optimization: "The sentiment of this review is positive. The customer seems very satisfied with the product and would likely recommend it to others." (38 tokens)
After: "positive" (1 token) — 38x more efficient for this task.
At 1M classifications/day with 38-token vs. 1-token output:
- Before: 38M output tokens × $0.60/M = $22.80/day
- After: 1M output tokens × $0.60/M = $0.60/day
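A sketch of what the constrained call might look like with the OpenAI Python SDK; the prompt wording and the max_tokens ceiling are illustrative, and the cap is a safety net on spend rather than a substitute for clear output instructions.

from openai import OpenAI

client = OpenAI()
review_text = "Great product, arrived on time and works as described."  # placeholder input

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=3,  # hard ceiling on output tokens; the label itself fits in one
    messages=[
        {"role": "system",
         "content": "Classify the review sentiment. Respond with exactly one word: "
                    "positive, negative, or neutral. No explanation."},
        {"role": "user", "content": review_text},
    ],
)
label = resp.choices[0].message.content.strip().lower()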
5. Semantic Caching (30-70% on Repeated Queries)
Users ask similar questions. If you're calling the LLM for semantically identical queries, you're paying full price each time for an answer you've already generated.
How it works:
- Embed incoming queries using an embedding model
- Compare to cached query embeddings using cosine similarity
- If similarity > threshold (typically 0.95), return cached response
- If below threshold, call LLM and cache the result
Tools: GPTCache, Langchain's caching layer, or roll your own with Redis + pgvector.
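If you roll your own, the core loop is small. This sketch keeps embeddings in an in-memory list and does a linear scan, which is fine for prototyping; a production system would use Redis or pgvector as noted above, and the embedding model and 0.95 threshold are illustrative choices.

import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached LLM response)

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def lookup(query: str, threshold: float = 0.95) -> str | None:
    q = embed(query)
    for vec, response in _cache:
        similarity = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if similarity >= threshold:
            return response  # semantically close enough: reuse the cached answer
    return None  # cache miss: call the LLM, then store() the result

def store(query: str, response: str) -> None:
    _cache.append((embed(query), response))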
When semantic caching is most effective:
- Q&A systems over a fixed knowledge base
- Product recommendation explanations
- FAQ bots
- Customer support with common questions
When it's not effective:
- Truly unique queries (creative generation, novel analysis)
- Queries where context changes meaning (time-sensitive questions)
- High-variance user inputs
Example savings: A legal Q&A product with 10,000 daily queries found that 40% were semantically similar to previous queries within a 24-hour window. Semantic caching on that 40% = 40% fewer LLM calls = 40% lower cost.
Part 3: Infrastructure and Operational Optimization
Self-Hosting Open Source Models
For workloads with appropriate volume and data privacy requirements, self-hosting eliminates per-token API fees.
When self-hosting makes sense:
- >100M tokens/day (below this volume, the economics rarely justify dedicated hardware)
- Data that can't leave your infrastructure (HIPAA, GDPR regulated)
- Use cases where fine-tuning on proprietary data is required
- Predictable workloads that can fill GPU capacity
Economics of self-hosting Llama 70B:
- AWS p3.8xlarge (4x V100): ~$12.24/hour on-demand (~$8.90/hour reserved)
- Llama 3.3 70B: ~25-35 tokens/second on this hardware
- At a sustained 30 tokens/second: ~108K tokens/hour, roughly 2.6M tokens/day
- Cost per million tokens: ($12.24 × 24) / 2.6 ≈ $113/M (combined input + output)
- Compare to Llama via the Groq API: $0.23-0.40/M input+output
Self-hosting only beats API pricing at massive scale. At moderate volumes, API is almost always more cost-effective after accounting for engineering overhead, hardware management, and utilization rates.
Rate Limits and Retry Logic
Poorly implemented retry logic can multiply your costs.
Bad retry pattern:
import time

for attempt in range(10):
    try:
        response = call_api(prompt)
        break
    except Exception:
        time.sleep(1)   # Fixed 1-second wait, up to 10 attempts: hammers the API
        continue        # and multiplies cost when failures are systematic
Good retry pattern:
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3),
       wait=wait_exponential(multiplier=1, min=4, max=10))
def call_api_with_retry(prompt):
    return call_api(prompt)
Exponential backoff prevents thundering herd. Max 3 retries prevents runaway costs from systematic errors.
Request Deduplication
If multiple users submit exactly the same query at the same time, process it once and return the shared response to all of them.
Implementation: Hash the input prompt. Check cache before calling API. Set TTL based on staleness tolerance.
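A single-process sketch of that idea, using a SHA-256 hash of the prompt as the cache key. It collapses repeats that arrive within the TTL window; truly simultaneous in-flight requests would additionally need a lock or shared future, and the helper names here are illustrative.

import hashlib
import time

_dedup_cache: dict[str, tuple[float, str]] = {}  # prompt hash -> (expiry time, response)
TTL_SECONDS = 60  # tune to your staleness tolerance

def deduplicated_call(prompt: str, call_api) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    hit = _dedup_cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]  # identical prompt seen recently: skip the API call
    response = call_api(prompt)
    _dedup_cache[key] = (time.time() + TTL_SECONDS, response)
    return response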
In busy applications, deduplication can reduce API calls by 5-20% during peak times.
Part 4: Building Cost Culture on Your Team
Cost Budgets by Feature
Assign cost budgets to each AI-powered feature:
| Feature | Budget/Call | Budget/Day | Action if exceeded |
|---|---|---|---|
| Query classification | $0.001 | $50 | Alert + investigate |
| Document summarization | $0.05 | $500 | Alert |
| Code review | $0.20 | $2,000 | Alert + sampling |
| Report generation | $2.00 | $200 | Alert + queue |
Feature teams that own their costs optimize them. Teams that share a global AI budget don't.
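One way to make the budgets enforceable in code, as a sketch; the feature keys mirror the table above, and the alerting hook is a placeholder for whatever monitoring stack you already use.

# Per-feature budgets in USD, mirroring the table above.
BUDGETS = {
    "query_classification": {"per_call": 0.001, "per_day": 50.0},
    "document_summarization": {"per_call": 0.05, "per_day": 500.0},
    "code_review": {"per_call": 0.20, "per_day": 2000.0},
    "report_generation": {"per_call": 2.00, "per_day": 200.0},
}

def check_budget(feature: str, call_cost: float, spend_today: float) -> list[str]:
    """Return alert messages when a single call or the daily total exceeds its budget."""
    budget = BUDGETS[feature]
    alerts = []
    if call_cost > budget["per_call"]:
        alerts.append(f"{feature}: call cost ${call_cost:.4f} exceeds ${budget['per_call']}/call")
    if spend_today > budget["per_day"]:
        alerts.append(f"{feature}: daily spend ${spend_today:.2f} exceeds ${budget['per_day']}/day")
    return alerts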
The Monthly AI Cost Review
Build a monthly review into your engineering calendar:
- Cost by feature (which features drove growth?)
- Cost by model (are we using the cheapest effective model?)
- Cache hit rates (is caching working?)
- Token waste (are output tokens too long?)
- Latency vs. cost tradeoff (is the expensive model justified by latency?)
Monthly reviews catch drift before it becomes a crisis. AI costs compound — a feature that costs $200/day in January can cost $600/day by June without anyone noticing.
Use our AI Batch Processing Cost Calculator and LLM Cost Comparison Calculator to model your optimization opportunities before implementing.