
AI Embeddings Cost Guide: How Much Does RAG Really Cost at Scale?

Vector embeddings power search, RAG pipelines, and semantic similarity. At 10M documents, embedding costs range from $50 to $5,000+ per month. Here's exactly how to calculate and optimize your spend.

AI Calcus Editorial Team

What Embeddings Actually Cost

When building RAG (Retrieval-Augmented Generation) pipelines, most developers fixate on LLM inference costs and overlook embedding costs — until they scale.

Embedding pricing (as of 2025):

| Provider | Model | Price per 1M tokens |
| --- | --- | --- |
| OpenAI | text-embedding-3-small | $0.02 |
| OpenAI | text-embedding-3-large | $0.13 |
| Anthropic | via Voyage AI | $0.06–$0.12 |
| Cohere | embed-english-v3 | $0.10 |
| Google | text-embedding-004 | $0.025 (first 250M) |
| Self-hosted | all-MiniLM-L6-v2 | ~$0.001 (compute only) |

For context: a token is roughly three-quarters of an English word, so 1,000 words ≈ 1,330 tokens and a 300-word document chunk ≈ 400 tokens.

Calculating Your Embedding Costs

Initial corpus embedding (one-time):

If you need to embed 1 million documents at 400 tokens average:

  • Total tokens: 400M
  • OpenAI small: 400M × $0.02/1M = $8.00 (one-time)
  • OpenAI large: 400M × $0.13/1M = $52.00 (one-time)

The initial embedding of even large document corpora is surprisingly cheap. The costs that add up are:

Query embedding (ongoing):

Each user search query requires one embedding call. At 100,000 queries/day:

  • 100K queries × 20 tokens avg = 2M tokens/day
  • Monthly: 60M tokens
  • OpenAI small: 60M × $0.02/1M = $1.20/month

Query embedding is almost always negligible.

Re-embedding updates:

When your documents change, you need to re-embed the changed chunks. The real cost comes from update frequency and corpus size. At 10% document churn monthly on a 1M document corpus:

  • 100K documents re-embedded/month × 400 tokens = 40M tokens
  • OpenAI large: 40M × $0.13/1M = $5.20/month
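The three cost components above reduce to one formula. Here is a minimal back-of-the-envelope calculator, with prices hard-coded from the pricing table earlier (adjust for your provider):

```python
# Per-1M-token prices from the pricing table above.
PRICE_PER_M = {"openai-small": 0.02, "openai-large": 0.13}

def embed_cost(tokens: int, model: str) -> float:
    """USD cost to embed `tokens` tokens with `model`."""
    return tokens / 1_000_000 * PRICE_PER_M[model]

# Initial corpus: 1M documents x 400 tokens (one-time)
initial = embed_cost(1_000_000 * 400, "openai-small")    # $8.00

# Queries: 100K/day x 20 tokens x 30 days
queries = embed_cost(100_000 * 20 * 30, "openai-small")  # $1.20/month

# Re-embedding: 10% monthly churn on 1M docs, large model
churn = embed_cost(100_000 * 400, "openai-large")        # $5.20/month

print(f"initial ${initial:.2f}, queries ${queries:.2f}/mo, churn ${churn:.2f}/mo")
```

Swapping in your own corpus size, chunk length, and query volume gives a first-order budget before touching any infrastructure.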

Where Costs Actually Spike

Chunking strategy matters enormously. A naive approach that creates many small chunks instead of semantic chunks increases your token count by 2-3x while reducing retrieval quality. Optimal chunk size for most documents: 256-512 tokens with 50-token overlap.
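A fixed-window chunker with overlap is the baseline that semantic chunkers improve on. A minimal sketch, assuming the document is already tokenized into a list (real pipelines tokenize with a library such as tiktoken first):

```python
def chunk_tokens(tokens, size=384, overlap=50):
    """Split a token list into fixed-size windows with `overlap` shared tokens.
    A naive baseline; semantic chunkers split on meaning boundaries instead."""
    step = size - overlap
    # Stop once the remaining tail is fully covered by the previous window.
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = list(range(1000))          # a 1,000-token document
chunks = chunk_tokens(doc)
print([len(c) for c in chunks])  # 3 chunks: 384, 384, 332 tokens
```

Each window re-embeds 50 tokens from its neighbor, which is exactly where the 2-3x token inflation comes from when chunks are made too small relative to the overlap.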

Re-embedding every document on model update: When OpenAI releases a better embedding model, you need to re-embed your entire corpus. For 10M documents: 4B tokens × $0.13/1M = $520 in one shot. Plan for this in your budget.

Vector storage costs: Embeddings live in vector databases. At 1M vectors of 1,536 dimensions (the text-embedding-3-small default; -large doubles this to 3,072):

  • Pinecone: ~$70/month (serverless tier)
  • Weaviate Cloud: ~$60/month
  • Qdrant Cloud: ~$40/month
  • pgvector (self-hosted): infrastructure cost only

For smaller corpora (<100K vectors), all three managed options charge under $10/month.
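Storage pricing tracks raw vector size, which you can estimate directly. A sketch (the 50-100% index-overhead figure is a rough assumption; HNSW graphs and metadata vary by engine):

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage in GB, assuming float32 (4 bytes per dimension).
    Index structures and metadata typically add another 50-100% on top."""
    return n_vectors * dims * bytes_per_dim / 1024**3

print(f"{index_size_gb(1_000_000, 1536):.1f} GB raw")  # ~5.7 GB raw
```

This is also why dimension choice matters: moving from 1,536 to 3,072 dimensions doubles both storage and the memory footprint of every similarity search.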

The Self-Hosting Threshold

Self-hosting embedding models becomes cost-effective at high volume:

GPU cost for self-hosted embedding:

  • A100 40GB GPU: ~$2.50/hr on Lambda Labs
  • Throughput: ~500K tokens/minute
  • Monthly capacity: 500K × 60 × 24 × 30 = 21.6B tokens
  • Monthly cost: $2.50 × 720 = $1,800

At $0.02/M tokens (OpenAI small), you'd need 90B tokens/month before self-hosting a single A100 breaks even. That's roughly 90 million 1,000-token documents per month.

Breakeven rule: at the always-on rental price above, self-hosting only wins once your monthly API bill exceeds the ~$1,800 GPU cost. For batch workloads, spot instances that run only during embedding jobs pay for GPU hours actually used, which pulls the practical threshold down into the $500-1,000/month range.
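The always-on breakeven above follows from one line of arithmetic, sketched here with the Lambda Labs and OpenAI figures as defaults:

```python
def breakeven_tokens_per_month(gpu_hourly: float = 2.50,
                               api_price_per_m: float = 0.02,
                               hours: float = 720) -> float:
    """Monthly token volume at which an always-on GPU matches API cost."""
    return gpu_hourly * hours / api_price_per_m * 1_000_000

print(f"{breakeven_tokens_per_month() / 1e9:.0f}B tokens/month")  # 90B
```

Lowering `hours` to the hours a spot instance actually runs (rather than 720) is what makes batch self-hosting viable at much smaller volumes.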

Optimizing Your RAG Cost Stack

1. Embed once, index smart. Build metadata-filtered retrieval so you search only the relevant slice of your corpus. A customer support bot should only search support docs, not your entire wiki.
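The idea can be shown with an in-memory toy: filter on metadata first, then score only the surviving slice. (Real vector databases such as Pinecone or Qdrant take the same filter as a query parameter; the two-dimensional vectors here are illustrative only.)

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy corpus: each entry carries an embedding plus metadata.
corpus = [
    {"id": 1, "vec": [1.0, 0.0], "source": "support"},
    {"id": 2, "vec": [0.9, 0.1], "source": "wiki"},
    {"id": 3, "vec": [0.0, 1.0], "source": "support"},
]

def search(query_vec, source, k=1):
    """Filter by metadata first; rank only the relevant slice of the corpus."""
    candidates = [d for d in corpus if d["source"] == source]
    return sorted(candidates, key=lambda d: -cosine(query_vec, d["vec"]))[:k]

best = search([1.0, 0.0], "support")[0]
print(best["id"])  # 1 -- the wiki doc (id 2) is never even scored
```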

2. Cache hot queries. If 20% of queries are repeated (common in FAQ-style applications), cache their embedding + retrieval results. This cuts query-related costs 20-40% in practice.
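For exact-repeat queries, the standard library already provides the cache. A sketch in which `embed_and_search` is a hypothetical stand-in for your real embed-plus-retrieve call:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_retrieve(query: str) -> tuple:
    """Embed + retrieve once per distinct query string; repeats are free."""
    return embed_and_search(query)

calls = 0
def embed_and_search(query):
    """Hypothetical stand-in for the expensive embedding + vector search."""
    global calls
    calls += 1
    return (f"results for {query}",)

for q in ["refund policy", "refund policy", "reset password"]:
    cached_retrieve(q)
print(calls)  # 2 paid calls for 3 queries
```

In production you would typically use a shared cache (e.g. Redis) keyed on a normalized query string, but the cost mechanics are identical.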

3. Tiered model strategy. Use the smaller, cheaper embedding model (text-embedding-3-small) for first-pass retrieval across your full corpus, then re-rank top-20 results with a cross-encoder (BERT-based, run locally for free). This achieves large-model quality at small-model cost.
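The pipeline shape is retrieve-cheap, then re-rank-expensive. A minimal sketch in which both scorers are toy stand-ins (word overlap for the bi-encoder, a length-penalized variant for the cross-encoder); in a real stack, stage 1 is a vector search and stage 2 a local cross-encoder model:

```python
def cheap_score(query: str, doc: str) -> float:
    """Toy stand-in for fast, approximate bi-encoder similarity."""
    return len(set(query.split()) & set(doc.split()))

def cross_encoder_score(query: str, doc: str) -> float:
    """Toy stand-in for a slower, more accurate cross-encoder."""
    return cheap_score(query, doc) / (1 + abs(len(doc) - len(query)))

def two_stage(query: str, corpus: list, first_k: int = 20, final_k: int = 3):
    # Stage 1: cheap retrieval across the full corpus.
    shortlist = sorted(corpus, key=lambda d: -cheap_score(query, d))[:first_k]
    # Stage 2: expensive re-ranking on the shortlist only.
    return sorted(shortlist, key=lambda d: -cross_encoder_score(query, d))[:final_k]

docs = ["refund policy details",
        "refund policy for enterprise customers and partners",
        "shipping times"]
print(two_stage("refund policy", docs)[0])  # "refund policy details"
```

The cost win comes from the shape alone: the expensive scorer only ever sees `first_k` candidates, regardless of corpus size.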

4. Smart chunking. Tools like LlamaIndex and LangChain offer semantic chunking — splitting by meaning rather than token count. Semantic chunks typically improve retrieval relevance by 15-30%, meaning you retrieve in fewer chunks and make fewer LLM calls.

The Full RAG Stack Cost Estimate

For a production RAG application with 500K documents, 10K queries/day:

| Component | Monthly cost |
| --- | --- |
| Initial embedding (one-time, amortized over 12 months) | $0.35 |
| Monthly re-embedding (5% document churn) | $0.65 |
| Query embedding | $0.18 |
| Vector database (Pinecone serverless) | $45 |
| LLM inference (GPT-4o mini, ~1K output tokens/query) | $90 |
| **Total** | **~$136/month** |

RAG is cheap. The LLM inference — not the embeddings — dominates cost. This flips the optimization priority: spend more time reducing LLM output length and call frequency than optimizing embedding model choice.


Calculate your embedding costs with our AI Embedding Cost Calculator.
