AI API costs are the fastest-growing line item in tech infrastructure budgets. A startup that begins with $200/month in API costs can find itself at $20,000/month within a year of scaling — often without understanding what drove the increase.
This guide covers the complete optimization playbook: what drives costs, how to measure them accurately, and the specific techniques that reduce them by 50-80% in production systems.
Part 1: Understanding What You're Actually Paying For
The Token Economy
Every major LLM charges by tokens — the atomic units of text. Understanding token economics is the foundation of cost optimization.
What is a token?
- Roughly 3/4 of a word in English
- "Hello world" ≈ 2 tokens
- A standard 800-word article ≈ 1,000-1,100 tokens
- 1 million tokens ≈ 750,000 words
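These counts are tokenizer-specific, so measure rather than estimate when it matters. A quick check using OpenAI's tiktoken library (Anthropic models use a different tokenizer; the Anthropic API exposes a token-counting endpoint for exact numbers):

import tiktoken

# o200k_base is the tokenizer used by the GPT-4o model family
enc = tiktoken.get_encoding("o200k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

print(count_tokens("Hello world"))  # 2 tokens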
Input vs. output tokens: Providers charge differently for input (prompt) and output (completion) tokens. Output tokens typically cost 2-5x more than input tokens. This matters enormously for optimization strategy — reducing output length often saves more per token than reducing input.
| Provider/Model | Input per 1M tokens | Output per 1M tokens | Output premium |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4x |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 5x |
| Claude Haiku 4.5 | $0.80 | $4.00 | 5x |
| Gemini 1.5 Flash | $0.075 | $0.30 | 4x |
| GPT-4o mini | $0.15 | $0.60 | 4x |
The implication: A system that generates long responses costs disproportionately more than one that generates short responses, even with identical input.
Where Tokens Come From
Before optimizing, instrument your system to understand where tokens are spent:
import anthropic
from dataclasses import dataclass

# Example per-1M-token prices from the table above; update as pricing changes
PRICING = {
    "claude-sonnet-4-5": (3.00, 15.00),
    "claude-haiku-4-5-20251001": (0.80, 4.00),
}

def calculate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    # Assumes the response's model string matches a PRICING key
    input_price, output_price = PRICING[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

@dataclass
class TokenUsageRecord:
    request_id: str
    component: str  # "system_prompt", "conversation_history", "user_input", "rag_context"
    input_tokens: int
    output_tokens: int
    model: str
    cost_usd: float

def track_usage(response, component: str) -> TokenUsageRecord:
    usage = response.usage
    cost = calculate_cost(usage.input_tokens, usage.output_tokens, model=response.model)
    return TokenUsageRecord(
        request_id=response.id,
        component=component,
        input_tokens=usage.input_tokens,
        output_tokens=usage.output_tokens,
        model=response.model,
        cost_usd=cost,
    )
Most teams find:
- 20-35% of tokens are system prompt
- 25-40% are conversation history
- 15-25% are retrieved context (RAG)
- 10-20% are user input
- 10-25% are output
This distribution guides where to focus optimization effort.
Part 2: Model Selection and Routing
The single highest-impact optimization is using the right model for each task. Most applications use one model for everything — often the most capable (expensive) one.
The Model Tier Framework
| Tier | Models | Cost (relative) | Best for |
|---|---|---|---|
| Economy | GPT-4o mini, Claude Haiku, Gemini Flash | 1x | Classification, simple Q&A, summarization |
| Standard | GPT-4o, Claude Sonnet | 15-20x | Complex reasoning, code, nuanced writing |
| Premium | o3, Claude Opus | 50-100x | Research, frontier tasks, agentic reasoning |
Running everything through the Standard tier when 70% of requests could use Economy is a 10-15x cost inefficiency.
Implementing Intelligent Routing
A classifier-based router sends each request to the appropriate model:
from anthropic import Anthropic
client = Anthropic()
ROUTING_PROMPT = """Classify the complexity of this user request.
Return ONLY one of: SIMPLE, MODERATE, COMPLEX
SIMPLE: FAQ questions, yes/no, factual lookups, basic formatting
MODERATE: Multi-step reasoning, code generation, detailed explanations
COMPLEX: Novel research, complex debugging, creative synthesis
Request: {request}"""
MODEL_ROUTING = {
"SIMPLE": "claude-haiku-4-5-20251001",
"MODERATE": "claude-sonnet-4-5",
"COMPLEX": "claude-opus-4-7",
}
def route_and_respond(user_message: str) -> str:
# Step 1: Classify (using cheapest model)
classification = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=10,
messages=[{
"role": "user",
"content": ROUTING_PROMPT.format(request=user_message)
}]
)
complexity = classification.content[0].text.strip()
# Step 2: Route to appropriate model
model = MODEL_ROUTING.get(complexity, "claude-sonnet-4-5")
response = client.messages.create(
model=model,
max_tokens=1024,
messages=[{"role": "user", "content": user_message}]
)
return response.content[0].text
Expected savings from routing: 50-70% cost reduction when 60-70% of requests are simple.
Fallback Chains
Another pattern: send requests to an economy model first, and escalate to a standard-tier model only when the response fails a quality gate:
def respond_with_quality_check(user_message: str) -> str:
# Try economy model first
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=512,
messages=[{"role": "user", "content": user_message}]
)
# Quality gate — simple heuristic or another model call
if is_quality_sufficient(response.content[0].text, user_message):
return response.content[0].text
# Escalate to standard model
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=[{"role": "user", "content": user_message}]
)
return response.content[0].text
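The is_quality_sufficient gate is left abstract above. A minimal heuristic sketch; in production this is more often a cheap LLM-as-judge call or a trained classifier, and the thresholds here are illustrative assumptions, not tuned values:

def is_quality_sufficient(answer: str, question: str) -> bool:
    # Illustrative checks only: escalate very short answers and refusals.
    # `question` is unused here but kept for judge-style gates that compare both.
    if len(answer) < 20:
        return False
    refusal_prefixes = ("i'm not sure", "i can't", "i don't know")
    if answer.lower().startswith(refusal_prefixes):
        return False
    return True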
Part 3: Prompt Optimization
System Prompt Compression
System prompts are sent with every request. A bloated system prompt can add thousands of tokens per request at scale.
Before (verbose, 250 tokens):
You are a helpful customer support assistant for TechCorp. Your primary role is to assist our valued customers with any questions or concerns they may have about our products and services. You should always maintain a professional and courteous tone in all interactions. You should provide accurate and helpful information to the best of your ability. When you don't know the answer to something, you should be transparent about this rather than making something up or guessing...
After (compressed, 45 tokens):
TechCorp support. Accurate, professional, brief. Admit uncertainty rather than guess. Escalate: billing→Sarah, technical bugs→eng-support@techcorp.com.
Savings at 1M requests: 205 tokens × 1M × $0.003/1K = $615/month from one optimization.
Output Length Control
Output tokens are expensive. Control them explicitly:
In the prompt:
- "Respond in 2-3 sentences"
- "Return only the JSON object, no explanation"
- "List the top 3 items only"
Via the API:
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=150, # Hard cap on output
messages=[...]
)
Structured output compression:
Instead of verbose JSON:
{
"sentiment": "positive",
"confidence": 0.87,
"categories": ["product_quality", "delivery"]
}
Request compact format and parse:
pos|0.87|quality,delivery
Output tokens reduced by 60-70% for structured responses.
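Parsing the compact format back into a structured object is trivial. A sketch for the pos|0.87|quality,delivery layout above (the field order is an assumption your prompt must pin down):

def parse_compact(line: str) -> dict:
    # "pos|0.87|quality,delivery" -> sentiment, confidence, categories
    sentiment, confidence, categories = line.strip().split("|")
    return {
        "sentiment": sentiment,
        "confidence": float(confidence),
        "categories": categories.split(","),
    }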
Conversation History Management
In multi-turn conversations, naive implementations send the full history:
Turn 1: 500 input tokens
Turn 2: 500 (prior turns) + 800 (new exchange) = 1,300 input tokens
Turn 5: ~4,800 input tokens
Turn 10: ~9,600 input tokens
Solution: Rolling window with summarization
def manage_conversation_context(
messages: list,
max_messages: int = 6,
summarize_after: int = 10
) -> list:
if len(messages) <= max_messages:
return messages
if len(messages) >= summarize_after:
# Summarize old messages
old_messages = messages[:-4]
summary = summarize_messages(old_messages)
recent = messages[-4:]
return [
{"role": "user", "content": f"[Earlier conversation summary: {summary}]"},
{"role": "assistant", "content": "Understood."}
] + recent
# Simple rolling window
return messages[-max_messages:]
Typical savings: 40-60% reduction in conversation history tokens.
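The summarize_messages helper is assumed above rather than shown. A minimal sketch, using the economy tier (summarization is itself a simple task) and the client from Part 2:

def summarize_messages(messages: list) -> str:
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheap model; summary capped at 150 tokens
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 2-3 sentences:\n\n{transcript}"
        }]
    )
    return response.content[0].text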
Part 4: Caching Strategies
Exact Match Caching
Cache responses for identical prompts:
import hashlib
import json
import redis
cache = redis.Redis()
def cache_key(prompt: str) -> str:
    # Hash the full prompt; include model and parameters in the key if they vary
    return hashlib.sha256(prompt.encode()).hexdigest()

def get_cached_response(prompt: str) -> str | None:
    cached = cache.get(cache_key(prompt))
    return cached.decode() if cached else None

def set_cached_response(prompt: str, response: str, ttl: int = 3600):
    cache.setex(cache_key(prompt), ttl, response)

def complete_with_cache(prompt: str) -> str:
    cached = get_cached_response(prompt)
    if cached:
        return cached
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.content[0].text
    set_cached_response(prompt, result)
    return result
Expected hit rate:
- FAQ/support: 30-60% hit rate
- Search features: 20-40%
- Creative tasks: 5-15%
Semantic Caching
For near-identical queries, use embedding similarity:
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
def get_semantic_cache(query: str, threshold: float = 0.95) -> str | None:
query_embedding = model.encode(query)
# Search vector DB for similar queries
results = vector_db.search(query_embedding, limit=1)
if results and results[0].score >= threshold:
return results[0].cached_response
return None
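The vector_db above is a stand-in for whatever vector store you run. For modest traffic an in-memory version suffices; a sketch using cosine similarity over normalized embeddings (the 0.95 threshold is a starting point to tune against false hits):

import numpy as np

semantic_cache: list[tuple[np.ndarray, str]] = []  # (embedding, response) pairs

def semantic_lookup(query: str, threshold: float = 0.95) -> str | None:
    q = model.encode(query, normalize_embeddings=True)
    for emb, cached_response in semantic_cache:
        if float(np.dot(q, emb)) >= threshold:  # cosine similarity on unit vectors
            return cached_response
    return None

def semantic_store(query: str, response: str):
    semantic_cache.append((model.encode(query, normalize_embeddings=True), response))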
Provider-Side Prompt Caching
Both Anthropic and OpenAI offer server-side caching for repeated long prompts. OpenAI applies it automatically to long prompts (cached input tokens are billed at a 50% discount); Anthropic's is opt-in with a steeper discount:
Anthropic (cached reads billed at ~10% of the base input price, i.e. a 90% discount, 5-minute default TTL; cache writes carry a 25% premium):
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=[{
"type": "text",
"text": long_system_prompt,
"cache_control": {"type": "ephemeral"} # Enable caching
}],
messages=[{"role": "user", "content": user_message}]
)
For a 5,000-token system prompt sent 100K times/month:
- Without caching: 500M tokens × $3/1M = $1,500/month
- With caching (90% discount after first): ~$160/month
- Savings: $1,340/month
Part 5: Batch Processing
For non-real-time workloads, batch APIs offer 50% discounts:
OpenAI Batch API:
import json
import openai

client = openai.OpenAI()

# Build the batch input: one JSON object per line (JSONL), not a JSON array.
# `tasks` is a list of prompt strings to process.
batch_requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": task}]
        }
    }
    for i, task in enumerate(tasks)
]
jsonl_bytes = "\n".join(json.dumps(r) for r in batch_requests).encode()

# Upload the JSONL file, then submit the batch against it
batch_file = client.files.create(
    file=("batch.jsonl", jsonl_bytes),
    purpose="batch"
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
# Poll client.batches.retrieve(batch.id) until status is "completed",
# then download results via the batch's output_file_id
Batch API eligibility:
- Document processing pipelines
- Overnight classification runs
- Scheduled content generation
- Data extraction from large datasets
Batched requests cost 50% less; if 40% of your workload is eligible, that is a 20% reduction in your total bill.
Part 6: Architecture-Level Optimizations
Avoid Redundant Processing
Common inefficiency: multiple sequential LLM calls for tasks that could be combined:
Wasteful:
- LLM call → classify sentiment
- LLM call → extract entities
- LLM call → generate summary
Optimized:
COMBINED_PROMPT = """Analyze this text and return JSON:
{
"sentiment": "positive|negative|neutral",
"entities": ["entity1", "entity2"],
"summary": "one sentence summary"
}
Text: {text}"""
One call instead of three. Saves ~67% on this workflow.
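A sketch of the combined call, reusing the Anthropic client from Part 2. The template is filled with str.replace because str.format would trip on the literal JSON braces:

import json

def analyze(text: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # simple extraction suits the economy tier
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": COMBINED_PROMPT.replace("{text}", text)
        }]
    )
    # Assumes the model returns bare JSON, as the prompt instructs
    return json.loads(response.content[0].text)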
RAG Context Optimization
Retrieval-Augmented Generation adds context to prompts — but often adds more context than necessary:
Problem: Retrieving 5 chunks at 500 tokens each = 2,500 tokens of context per query.
Solutions:
- Reduce chunk size: 200-300 tokens vs. 500 often maintains accuracy
- Reduce chunk count: Top 2-3 vs. top 5 with better retrieval
- Rerank before including: Use a cheap reranker (BM25 or a small cross-encoder) to confirm relevance (see the sketch below)
- Compress context: Summarize retrieved content before including in prompt
Typical RAG context reduction: 40-60% with no quality loss.
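A sketch of the rerank-then-truncate step using sentence-transformers' CrossEncoder; the checkpoint name is a common public reranker, swap in whichever you already run:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_chunks(query: str, chunks: list[str], keep: int = 3) -> list[str]:
    # Score each (query, chunk) pair, keep only the top `keep` chunks
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:keep]]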
Infrastructure: Self-Hosting for Scale
At high enough volume, self-hosted open-source models become cost-competitive:
| Monthly requests | API cost (GPT-4o-mini) | Self-hosted (Llama 3.1 8B on GPU) |
|---|---|---|
| 1M | $300 | $150-200 (GPU compute) |
| 10M | $3,000 | $200-500 |
| 100M | $30,000 | $500-1,500 |
The breakeven for self-hosting is typically 5-15M requests/month, depending on model size and cloud GPU pricing.
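A back-of-envelope breakeven check; both inputs are assumptions to replace with your own numbers:

def self_hosting_breakeven(api_cost_per_1k_requests: float,
                           gpu_cost_per_month: float) -> float:
    """Monthly request volume at which a fixed-cost GPU beats per-request API pricing."""
    return gpu_cost_per_month / api_cost_per_1k_requests * 1_000

# e.g. $0.30 per 1K API requests vs. a $1,500/month GPU node
print(self_hosting_breakeven(0.30, 1_500))  # 5,000,000 requests/month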
Part 7: Measurement and Continuous Optimization
Cost Tracking Dashboard
Track these metrics by request type, user segment, and time:
from prometheus_client import Counter, Histogram
llm_cost_total = Counter('llm_cost_usd_total', 'Total LLM cost', ['model', 'component'])
llm_tokens_total = Counter('llm_tokens_total', 'Total tokens', ['model', 'type'])
llm_cache_hits = Counter('llm_cache_hits_total', 'Cache hits', ['cache_type'])
llm_latency = Histogram('llm_latency_seconds', 'LLM latency')
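Wiring these into the TokenUsageRecord tracking from Part 1 (a sketch; the label values are assumptions):

def record_usage(record: TokenUsageRecord, cache_hit: bool = False):
    llm_cost_total.labels(model=record.model, component=record.component).inc(record.cost_usd)
    llm_tokens_total.labels(model=record.model, type="input").inc(record.input_tokens)
    llm_tokens_total.labels(model=record.model, type="output").inc(record.output_tokens)
    if cache_hit:
        llm_cache_hits.labels(cache_type="exact").inc()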
Dashboard alerts to set:
- Cost per 1,000 requests exceeds baseline by 20%
- Cache hit rate drops below 15%
- Average output tokens increases by 30%+
- Model routing decisions change significantly
A/B Testing Cost Changes
When you implement an optimization, A/B test it:
import random
def complete_optimized_or_baseline(prompt: str) -> tuple[str, str]:
    # baseline_complete / optimized_complete wrap the two code paths under test
    if random.random() < 0.5:
        response = baseline_complete(prompt)
        variant = "baseline"
    else:
        response = optimized_complete(prompt)
        variant = "optimized"
    return response, variant
Track: cost per request, output quality score, user satisfaction. Only deploy optimizations where quality metrics hold.
Summary: The Implementation Sequence
Week 1-2 (Quick wins, minimal risk):
- Implement exact-match caching
- Add max_tokens limits to all API calls
- Compress system prompts
Month 1 (Infrastructure):
- Model routing (economy/standard/premium)
- Conversation history pruning
- Provider-side prompt caching
Month 2-3 (Advanced):
- Semantic caching
- Batch API for eligible workloads
- RAG context optimization
Month 4+ (Scale):
- Cost per feature tracking
- Continuous A/B testing of optimizations
- Evaluate self-hosting threshold
Teams that implement this full sequence typically achieve 70-80% cost reduction from their baseline, with no measurable quality degradation on standard tasks.
Use the AI Inference Cost Calculator to model your potential savings at each optimization stage.