Every AI startup has a version of this story: they estimate API costs, multiply by expected usage, add a buffer, and budget accordingly. Three months in, the bill is four times the estimate.
The math they did wasn't wrong. But three billing and rate-limit mechanics that nobody explains clearly combine to reliably blow up AI cost projections.
Trap 1: Input Tokens ≠ Your Prompt
Newcomers think they're charged for the text they send. The reality: you're charged for everything in the context window.
Context window = system prompt + conversation history + your new message + (sometimes) retrieved documents from RAG.
A user sends a 20-word message. But the actual input token count:
| Component | Tokens |
|---|---|
| System prompt | 800 |
| Conversation history (5 turns) | 1,200 |
| User message | 25 |
| RAG retrieval (3 docs) | 2,000 |
| Total charged | 4,025 |
You're paying for 4,025 tokens to handle a 25-token message, a 160x multiple. This isn't fraud; it's how stateless LLM APIs work, since the full context gets resent and billed on every call. But it means cost-per-message estimates based on "message length" can be off by one to two orders of magnitude.
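To see where you actually stand, count every component, not just the user message. Here's a minimal sketch using the tiktoken library; the component texts, the GPT-4o encoding, and the price constant are assumptions to swap for your own stack.

```python
# Sketch: count what you're actually billed for, not just the user message.
# Assumes a tiktoken version that knows the gpt-4o encoding; all texts are placeholders.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def count(text: str) -> int:
    return len(enc.encode(text))

system_prompt = "..."             # the 800-token prompt you send every time
history = ["...", "..."]          # prior turns re-sent with each request
user_message = "..."              # the 25-token message
rag_docs = ["...", "...", "..."]  # retrieved documents injected into context

input_tokens = (
    count(system_prompt)
    + sum(count(turn) for turn in history)
    + count(user_message)
    + sum(count(doc) for doc in rag_docs)
)

# GPT-4o input price ($2.50 per 1M tokens, see the table in Trap 2)
print(input_tokens, f"${input_tokens * 2.50 / 1_000_000:.6f}")
```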
Trap 2: Output Tokens Cost 3-5x More
OpenAI and Anthropic both price output tokens significantly higher than input tokens:
| Model | Input price (per 1M tokens) | Output price (per 1M tokens) | Ratio |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4x |
| GPT-4o mini | $0.15 | $0.60 | 4x |
| Claude Sonnet 4 | $3.00 | $15.00 | 5x |
| Claude 3.5 Haiku | $0.80 | $4.00 | 5x |
Models that "think out loud" (extended reasoning, chain-of-thought) generate more output tokens. A GPT-4o request with verbose reasoning might output 800 tokens vs. a concise response at 200 tokens — a 4x cost multiplier you didn't budget for.
Fix: Add max_tokens limits to all requests. For most tasks, 300-500 tokens is enough, and capping the 800-token verbose response above at 200 tokens cuts that output cost by 75%.
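As a sketch of that fix with the OpenAI Python SDK (the model name and the 300-token ceiling are illustrative, not prescriptive):

```python
# Sketch: hard-cap billed output tokens per request.
# Model name and the 300-token ceiling are illustrative; tune per task.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=300,  # ceiling on output tokens, the expensive side of the bill
    messages=[
        {"role": "system", "content": "Answer concisely."},
        {"role": "user", "content": "Summarize the attached report."},
    ],
)
print(response.choices[0].message.content)
```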
Trap 3: Rate Limits Force Inefficient Architecture
When you hit OpenAI's rate limits (tokens per minute or requests per minute), your options are:
- Slow down and miss user SLAs
- Queue and batch — adding latency
- Upgrade to a higher tier — adding cost
- Cache aggressively — adding infrastructure
Most teams underestimate rate limits during planning. 10,000 requests/day on GPT-4o averages roughly 7 requests per minute, and peak-hour traffic runs several times that; at the default tier (3 RPM / 40K TPM), you'd need to spread requests about 10x more slowly than naive calculations suggest, or architect around the limits from day one.
The architectural cost of rate limit mitigation (caching layer, queue system, fallback models) is real engineering time. Budget 2-4 weeks of senior engineer time to do it right. The alternative is a fragile system that breaks under load.
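One small piece of that mitigation, sketched below: retrying rate-limited calls with exponential backoff and jitter. This is an illustrative pattern rather than a specific library's retry API; the model, output cap, and retry budget are placeholders.

```python
# Sketch: exponential backoff with jitter for rate-limited (429) calls.
# The model, output cap, and retry budget are placeholders.
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def call_with_backoff(messages, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o", max_tokens=300, messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Wait, then double the delay and add jitter so clients don't retry in lockstep.
            time.sleep(delay + random.uniform(0, delay))
            delay *= 2
```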
What Actually Works
1. Route aggressively. Implement a simple classifier that sends routine queries to mini/haiku models (90% of requests) and complex queries to the flagship (10%); a minimal router is sketched after this list. Most teams that do this cut costs by 70% with less than 2% quality degradation.
2. Cache the system prompt. Both OpenAI and Anthropic offer prompt caching, with 50-90% discounts on repeated prefixes. Every request with the same system prompt can reuse the cached version (see the caching sketch below).
3. Measure actual context size. Log tokens per request for a week; the usage-logging sketch below shows the mechanics. You'll find 20% of requests use 80% of tokens. Those outliers are the optimization target.
4. Set token budgets per task type. Simple classification: 100 tokens max. Summarization: 300 tokens. Creative writing: 500 tokens. Hard caps prevent runaway costs from edge cases (budget table sketched below).
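A minimal router sketch for point 1. The length-and-keyword heuristic and both model names are assumptions for illustration; a small trained classifier or a cheap LLM triage call slots into pick_model the same way.

```python
# Sketch: send routine queries to the mini model, hard ones to the flagship.
# The length/keyword heuristic and thresholds are naive assumptions for illustration.
from openai import OpenAI

client = OpenAI()

HARD_MARKERS = ("analyze", "compare", "multi-step", "architecture", "legal")

def pick_model(user_message: str) -> str:
    looks_hard = len(user_message) > 800 or any(
        marker in user_message.lower() for marker in HARD_MARKERS
    )
    return "gpt-4o" if looks_hard else "gpt-4o-mini"

def answer(user_message: str):
    return client.chat.completions.create(
        model=pick_model(user_message),
        max_tokens=300,
        messages=[{"role": "user", "content": user_message}],
    )
```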
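For point 2, OpenAI applies prompt caching automatically to long repeated prefixes, while Anthropic asks you to mark the cacheable prefix explicitly. A sketch with the Anthropic SDK follows; the model ID and system prompt are placeholders.

```python
# Sketch: Anthropic prompt caching; mark the stable prefix as cacheable.
# Model ID and prompt are placeholders; cached reads are billed at a steep discount.
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # the same 800-token prompt sent on every request

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # reuse this prefix from cache
        }
    ],
    messages=[{"role": "user", "content": "Summarize today's open tickets."}],
)
```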
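For point 3, every chat completion response already carries a usage block, so measurement is mostly a logging problem. A sketch, where the task_type label and plain stdlib logging are assumptions; any structured log store works.

```python
# Sketch: log per-request token usage to find the heavy tail of expensive requests.
import json
import logging

from openai import OpenAI

client = OpenAI()
logger = logging.getLogger("token_usage")

def tracked_completion(task_type: str, messages: list):
    response = client.chat.completions.create(
        model="gpt-4o-mini", max_tokens=300, messages=messages
    )
    usage = response.usage  # prompt_tokens, completion_tokens, total_tokens
    logger.info(json.dumps({
        "task_type": task_type,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
    }))
    return response
```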
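And for point 4, the budgets can live in one small lookup table; the task names and the conservative default below are illustrative.

```python
# Sketch: per-task output budgets as one small lookup table.
# Task names and the conservative default are illustrative.
TOKEN_BUDGETS = {
    "classification": 100,
    "summarization": 300,
    "creative_writing": 500,
}

def max_tokens_for(task_type: str) -> int:
    return TOKEN_BUDGETS.get(task_type, 300)

# Usage: client.chat.completions.create(..., max_tokens=max_tokens_for("summarization"))
```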
The companies that get AI costs under control aren't the ones running the cheapest models. They're the ones who measure first, then optimize where it actually matters.