What Embeddings Actually Cost
When building RAG (Retrieval-Augmented Generation) pipelines, most developers fixate on LLM inference costs and overlook embedding costs — until they scale.
Embedding pricing (as of 2025):
| Provider | Model | Price per 1M tokens |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.02 |
| OpenAI | text-embedding-3-large | $0.13 |
| Anthropic | via Voyage AI | $0.06-0.12 |
| Cohere | embed-english-v3 | $0.10 |
| Google | text-embedding-004 | $0.025 (first 250M tokens) |
| Self-hosted | all-MiniLM-L6-v2 | ~$0.001 (compute only) |
For context: 1,000 tokens ≈ 750 words, so a 500-word document chunk ≈ 670 tokens.
Calculating Your Embedding Costs
Initial corpus embedding (one-time):
If you need to embed 1 million documents at 400 tokens average:
- Total tokens: 400M
- OpenAI small: 400M × $0.02/1M = $8.00 (one-time)
- OpenAI large: 400M × $0.13/1M = $52.00 (one-time)
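The arithmetic above is worth wrapping in a helper you can reuse for every scenario in this post (the function name is illustrative; prices are the 2025 list prices from the table):

```python
def embedding_cost(total_tokens: int, price_per_million: float) -> float:
    """Cost of embedding a given number of tokens at a per-1M-token price."""
    return total_tokens / 1_000_000 * price_per_million

# 1M documents at 400 tokens average = 400M tokens
corpus_tokens = 1_000_000 * 400

print(embedding_cost(corpus_tokens, 0.02))  # text-embedding-3-small: $8.00
print(embedding_cost(corpus_tokens, 0.13))  # text-embedding-3-large: $52.00
```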
The initial embedding of even large document corpora is surprisingly cheap. The costs that add up are:
Query embedding (ongoing):
Each user search query requires one embedding call. At 100,000 queries/day:
- 100K queries × 20 tokens avg = 2M tokens/day
- Monthly: 60M tokens
- OpenAI small: 60M × $0.02/1M = $1.20/month
Query embedding is almost always negligible.
Re-embedding updates:
When your documents change, you need to re-embed the changed chunks. The real cost comes from update frequency and corpus size. At 10% document churn monthly on a 1M document corpus:
- 100K documents re-embedded/month × 400 tokens = 40M tokens
- OpenAI large: 40M × $0.13/1M = $5.20/month
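The churn math generalizes to a one-line estimate (an illustrative helper; parameters mirror the example above):

```python
def monthly_reembed_cost(corpus_docs: int, churn_rate: float,
                         tokens_per_doc: int, price_per_million: float) -> float:
    """Monthly cost of re-embedding the fraction of documents that changed."""
    changed_tokens = corpus_docs * churn_rate * tokens_per_doc
    return changed_tokens / 1_000_000 * price_per_million

# 10% monthly churn on a 1M-document corpus, 400 tokens/doc, OpenAI large
cost = monthly_reembed_cost(1_000_000, 0.10, 400, 0.13)  # $5.20/month
```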
Where Costs Actually Spike
Chunking strategy matters enormously. A naive approach that creates many small chunks instead of semantic chunks increases your token count by 2-3x while reducing retrieval quality. Optimal chunk size for most documents: 256-512 tokens with 50-token overlap.
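The token overhead of a chunking scheme is easy to estimate directly. This sketch (names are illustrative) compares a 512-token and a 100-token chunker, both with 50-token overlap, on a 4,000-token document:

```python
import math

def tokens_embedded(doc_tokens: int, chunk_size: int, overlap: int) -> int:
    """Total tokens sent to the embedding API for one document."""
    stride = chunk_size - overlap  # new tokens consumed per chunk
    n_chunks = math.ceil((doc_tokens - overlap) / stride)
    return n_chunks * chunk_size

print(tokens_embedded(4_000, 512, 50))  # 4608 tokens, ~1.15x inflation
print(tokens_embedded(4_000, 100, 50))  # 7900 tokens, ~2x inflation
```

Smaller chunks make the fixed overlap a larger share of every chunk, which is where the 2-3x blowup comes from.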
Re-embedding every document on model update: When OpenAI releases a better embedding model, you need to re-embed your entire corpus. For 10M documents: 4B tokens × $0.13/1M = $520 in one shot. Plan for this in your budget.
Vector storage costs: Embeddings live in vector databases. At 1M 1536-dimension vectors (the dimensionality of text-embedding-3-small; text-embedding-3-large defaults to 3072):
- Pinecone: ~$70/month (serverless tier)
- Weaviate Cloud: ~$60/month
- Qdrant Cloud: ~$40/month
- pgvector (self-hosted): infrastructure cost only
For smaller corpora (<100K vectors), all three managed services charge <$10/month.
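Storage pricing tracks raw vector size, which you can estimate yourself (float32 assumed; real indexes add metadata and graph-structure overhead on top):

```python
def raw_index_bytes(n_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw float32 storage for a vector index, before any index overhead."""
    return n_vectors * dims * bytes_per_value

size = raw_index_bytes(1_000_000, 1536)
print(f"{size / 1e9:.2f} GB")  # ~6.14 GB raw; managed DBs bill on more than this
```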
The Self-Hosting Threshold
Self-hosting embedding models becomes cost-effective at high volume:
GPU cost for self-hosted embedding:
- A100 40GB GPU: ~$2.50/hr on Lambda Labs
- Throughput: ~500K tokens/minute
- Monthly capacity: 500K × 60 × 24 × 30 = 21.6B tokens
- Monthly cost: $2.50 × 720 = $1,800
At $0.02/M tokens (OpenAI small), you'd need 90B tokens/month before self-hosting a single A100 breaks even: roughly 90 million 1,000-token documents per month. But 90B is more than four times the GPU's ~21.6B-token monthly capacity, so against the small model a single on-demand A100 never pays off. The math only works against pricier models (at $0.13/M, breakeven is ~13.8B tokens/month, within one GPU's throughput) or with cheaper spot GPUs.
Breakeven rule: Self-host when your monthly embedding bill exceeds $500-1,000. At that scale, a single GPU instance (or a spot instance for batch workloads) pays for itself in 2-3 months.
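The breakeven point is just the GPU bill divided by the API price (a sketch using the Lambda Labs and OpenAI prices quoted above; the function name is illustrative):

```python
def breakeven_tokens_millions(monthly_gpu_cost: float,
                              api_price_per_million: float) -> float:
    """Token volume (in millions) where self-hosting matches the API bill."""
    return monthly_gpu_cost / api_price_per_million

gpu_monthly = 2.50 * 24 * 30  # one A100 at $2.50/hr = $1,800/month

print(breakeven_tokens_millions(gpu_monthly, 0.02))  # 90,000M = 90B (small)
print(breakeven_tokens_millions(gpu_monthly, 0.13))  # ~13,846M = ~13.8B (large)
```

Compare the result against the GPU's ~21.6B-token monthly capacity before committing: a breakeven volume the hardware can't actually process is not a breakeven.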
Optimizing Your RAG Cost Stack
1. Embed once, index smart. Build metadata-filtered retrieval so you search only the relevant slice of your corpus. A customer support bot should only search support docs, not your entire wiki.
2. Cache hot queries. If 20% of queries are repeated (common in FAQ-style applications), cache their embedding + retrieval results. This cuts query-related costs 20-40% in practice.
3. Tiered model strategy. Use the smaller, cheaper embedding model (text-embedding-3-small) for first-pass retrieval across your full corpus, then re-rank top-20 results with a cross-encoder (BERT-based, run locally for free). This achieves large-model quality at small-model cost.
4. Smart chunking. Tools like LlamaIndex and LangChain offer semantic chunking — splitting by meaning rather than token count. Semantic chunks typically improve retrieval relevance by 15-30%, meaning you retrieve in fewer chunks and make fewer LLM calls.
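Strategies 2 and 3 compose naturally: cache on top of a two-stage pipeline. A minimal in-process sketch with stub scorers (a real system would call the small embedding model in stage 1, a local cross-encoder in stage 2, and cache in Redis rather than with `lru_cache`; all names here are illustrative):

```python
from functools import lru_cache

CORPUS = ["reset your password in settings",
          "billing and invoices overview",
          "password reset email not arriving"]

def cheap_retrieve(query: str, corpus: list, k: int = 20) -> list:
    """Stage 1 stub: cheap first-pass retrieval over the full corpus."""
    scored = [(len(set(query.split()) & set(d.split())), d) for d in corpus]
    return [d for _, d in sorted(scored, reverse=True)[:k]]

def cross_encoder_score(query: str, doc: str) -> float:
    """Stage 2 stub: an accurate but slower query-document scorer."""
    return len(set(query.split()) & set(doc.split())) / (len(doc.split()) + 1)

@lru_cache(maxsize=10_000)
def tiered_search(query: str, final_k: int = 2) -> tuple:
    """Wide cheap retrieval, narrow accurate re-rank; repeat queries are free."""
    candidates = cheap_retrieve(query, CORPUS, k=20)
    reranked = sorted(candidates,
                      key=lambda d: cross_encoder_score(query, d), reverse=True)
    return tuple(reranked[:final_k])

# Normalizing queries (lowercase, stripped) raises the cache hit rate.
results = tiered_search("how to reset password".strip().lower())
```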
The Full RAG Stack Cost Estimate
For a production RAG application with 500K documents, 10K queries/day:
| Component | Monthly Cost |
|---|---|
| Initial embedding (one-time, amortized 12mo) | $0.35 |
| Monthly re-embedding (5% document update) | $0.65 |
| Query embedding | $0.18 |
| Vector database (Pinecone serverless) | $45 |
| LLM inference (GPT-4o Mini, 1K output/query) | $90 |
| Total | ~$136/month |
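Summing the table (values copied from above) reproduces the total and makes the dominance of inference obvious:

```python
costs = {
    "initial embedding (amortized)": 0.35,
    "monthly re-embedding": 0.65,
    "query embedding": 0.18,
    "vector database": 45.00,
    "llm inference": 90.00,
}
total = sum(costs.values())
llm_share = costs["llm inference"] / total
print(f"total: ${total:.2f}/month, LLM share: {llm_share:.0%}")
# total: $136.18/month, LLM share: 66%
```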
RAG is cheap. The LLM inference — not the embeddings — dominates cost. This flips the optimization priority: spend more time reducing LLM output length and call frequency than optimizing embedding model choice.
Calculate your embedding costs with our AI Embedding Cost Calculator.