aicalcus.com
AI Cost5 min read

7 Proven Ways to Cut Your LLM API Bill by 50% or More

Practical techniques to reduce OpenAI and Anthropic API costs without degrading quality. Includes prompt caching, model routing, batching, and compression strategies with real numbers.

AMAlex Morgan·
7 Proven Ways to Cut Your LLM API Bill by 50% or More
TL;DR

Enable prompt caching first (50–90% off repeated system prompts). Then route 80–90% of requests to cheaper models like GPT-4o mini or Haiku. Use Batch API for non-real-time work (50% discount). Apply all 7 techniques and most teams cut their LLM bill by 60% with no quality loss.

The average team overspends on LLM APIs by 40–60% — not because their product uses too much AI, but because they haven't applied basic cost optimization techniques. Here are seven that work in production.

1. Enable Prompt Caching (Save 50–90% on System Prompts)

Both OpenAI and Anthropic offer significant discounts on repeated input tokens.

  • OpenAI: Automatically caches prompts that repeat the same prefix. Cached tokens cost $1.25/1M (50% off GPT-4o's $2.50/1M)
  • Anthropic: cache_control parameter on system prompt blocks. Cached tokens cost $0.30/1M (90% off Sonnet 4's $3/1M)

Impact: If your system prompt is 2,000 tokens and you make 50,000 requests/month on Claude Sonnet 4:

  • Without caching: $300/month on system prompt tokens
  • With caching: $30/month — $270 saved

Enable caching for any static content: system prompts, tool schemas, few-shot examples.

2. Use the Batch API for Non-Real-Time Tasks (Save 50%)

OpenAI's Batch API and Anthropic's Message Batches API both offer 50% off for requests that don't need immediate responses.

Good batch candidates:

  • Content moderation queues
  • Document classification
  • Embeddings generation
  • Nightly report generation
  • Offline data enrichment

Bad candidates: Anything user-facing where latency matters.

For a workload of 1M tokens/month on GPT-4o mini:

  • Real-time: $0.15/1M input = $150/month
  • Batch API: $0.075/1M input = $75/month

Use the Batch API Cost Calculator to model your specific workload.

3. Route by Complexity (Save 60–80% on Simple Queries)

Not every query needs GPT-4o. A simple yes/no classification doesn't need the same model as writing a legal summary.

Three-tier routing pattern:

TierModelUse forCost/1M tokens
FastGPT-4o mini / Claude Haiku 4Classification, extraction, simple Q&A$0.15–0.80
StandardGPT-4o / Claude Sonnet 4Most production tasks$2.50–3.00
PremiumGPT-4o / Claude Opus 4Complex reasoning, code generation$15–75

Route 70% of queries to the fast tier, 25% to standard, 5% to premium. The average cost per query drops dramatically.

Simple routing logic:

def route_query(query: str, context_length: int) -> str:
    if len(query) < 100 and context_length < 500:
        return "gpt-4o-mini"
    elif context_length > 10000:
        return "claude-sonnet-4"  # better long-context handling
    else:
        return "gpt-4o"

4. Compress Your Prompts (Save 20–40%)

Long system prompts are expensive to repeat. Compression techniques that preserve instruction quality:

  • Remove filler phrases: "Please make sure to always..." → "Always..."
  • Use structured formats: Bullet lists and headers tokenize more efficiently than paragraphs
  • Eliminate redundancy: Don't repeat constraints already implied by the task
  • Use XML tags: <instructions> blocks help models parse efficiently with fewer tokens

A 3,000-token system prompt compressed to 1,800 tokens saves 40% on every uncached request. Use the AI Prompt Cost Optimizer to model your savings.

5. Set Max Tokens Limits Aggressively

The most overlooked cost lever: models generate tokens up to max_tokens if not constrained. Bloated responses happen when you don't set limits.

Before: max_tokens: 4096 for a Q&A bot that typically needs 150 tokens After: max_tokens: 300 with a truncation note in the system prompt

If your average response is 300 tokens but max_tokens is 4096, you're paying for the risk of a long response — not the actual response. In practice, models rarely generate maximum output, but the safety margin costs money.

Analyze your actual p95 output length in logs and set max_tokens to 150% of that.

6. Cache Responses for Repeated Queries (Save 40–70%)

LLM response caching is different from prompt caching. For queries where the same input generates the same output:

  • Semantic similarity cache (Redis + embeddings): Cache responses for queries within cosine distance 0.95
  • Exact match cache: Hash the full prompt; skip the API call entirely

For a customer support bot, 30–50% of queries are common repeats ("How do I cancel?", "Where's my order?"). Caching these saves their full cost.

Implementation: Embed the user query, search a vector store of past (query, response) pairs, return the cached response if similarity > 0.95.

7. Reduce Context Window Usage

Each token in the conversation history costs money. Common mistakes:

  • Sending full chat history every request: Summarize older turns instead ("User and assistant discussed X, Y, Z in previous turns...")
  • Including full retrieved documents: Send only the relevant paragraph, not the full page
  • Verbose tool responses: Trim API responses before injecting into context

Cutting average context from 4,000 to 2,500 tokens is a 37.5% cost reduction — with zero impact on user experience in most cases.

Combined Impact

Apply all seven techniques to a $5,000/month API bill:

TechniqueSavings
Prompt caching−$800
Batch API (30% of workload)−$450
Model routing−$1,200
Prompt compression−$300
Max tokens limiting−$200
Response caching−$400
Context reduction−$350
Total−$3,700 (74% reduction)

The techniques compound. Start with prompt caching (easiest, highest ROI) and model routing (most impactful), then layer in the rest.

Get weekly AI cost benchmarks & productivity data

For founders, developers, and creators. No spam, unsubscribe anytime.

#cost-reduction#prompt-caching#model-routing#batch-api#openai#anthropic#llm