
AI Embeddings Cost Guide: How Much Does RAG Really Cost at Scale?

Vector embeddings power search, RAG pipelines, and semantic similarity. At 10M documents, embedding costs range from $50 to $5,000+ per month. Here's exactly how to calculate and optimize your spend.

AI Calcus Editorial Team

What Embeddings Actually Cost

When building RAG (Retrieval-Augmented Generation) pipelines, most developers fixate on LLM inference costs and overlook embedding costs — until they scale.

Embedding pricing (as of 2025):

| Provider | Model | Price per 1M tokens |
| --- | --- | --- |
| OpenAI | text-embedding-3-small | $0.02 |
| OpenAI | text-embedding-3-large | $0.13 |
| Anthropic | via Voyage AI | $0.06–$0.12 |
| Cohere | embed-english-v3 | $0.10 |
| Google | text-embedding-004 | $0.025 (first 250M) |
| Self-hosted | all-MiniLM-L6-v2 | ~$0.001 (compute only) |

For context: a token is roughly three-quarters of an English word, so 1,000 words ≈ 1,330 tokens and a 300-word document chunk ≈ 400 tokens.

Calculating Your Embedding Costs

Initial corpus embedding (one-time):

If you need to embed 1 million documents at 400 tokens average:

  • Total tokens: 400M
  • OpenAI small: 400M × $0.02/1M = $8.00 (one-time)
  • OpenAI large: 400M × $0.13/1M = $52.00 (one-time)

The initial embedding of even large document corpora is surprisingly cheap. The costs that add up are:

Query embedding (ongoing):

Each user search query requires one embedding call. At 100,000 queries/day:

  • 100K queries × 20 tokens avg = 2M tokens/day
  • Monthly: 60M tokens
  • OpenAI small: 60M × $0.02/1M = $1.20/month

Query embedding is almost always negligible.

Re-embedding updates:

When your documents change, you need to re-embed the changed chunks. The real cost comes from update frequency and corpus size. At 10% document churn monthly on a 1M document corpus:

  • 100K documents re-embedded/month × 400 tokens = 40M tokens
  • OpenAI large: 40M × $0.13/1M = $5.20/month
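The three cost components above reduce to one formula. Here is a minimal back-of-the-envelope calculator, with prices hard-coded from the pricing table earlier (adjust for your provider):

```python
# Per-1M-token prices from the pricing table above.
PRICE_PER_M = {"openai-small": 0.02, "openai-large": 0.13}

def embed_cost(tokens: int, model: str) -> float:
    """USD cost to embed `tokens` tokens with `model`."""
    return tokens / 1_000_000 * PRICE_PER_M[model]

# Initial corpus: 1M documents x 400 tokens (one-time)
initial = embed_cost(1_000_000 * 400, "openai-small")    # $8.00

# Queries: 100K/day x 20 tokens x 30 days
queries = embed_cost(100_000 * 20 * 30, "openai-small")  # $1.20/month

# Re-embedding: 10% monthly churn on 1M docs, large model
churn = embed_cost(100_000 * 400, "openai-large")        # $5.20/month

print(f"initial ${initial:.2f}, queries ${queries:.2f}/mo, churn ${churn:.2f}/mo")
```

Swapping in your own corpus size, chunk length, and query volume gives a first-order budget before touching any infrastructure.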

Where Costs Actually Spike

Chunking strategy matters enormously. A naive approach that creates many small chunks instead of semantic chunks increases your token count by 2-3x while reducing retrieval quality. Optimal chunk size for most documents: 256-512 tokens with 50-token overlap.
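A fixed-window chunker with overlap is the baseline that semantic chunkers improve on. A minimal sketch, assuming the document is already tokenized into a list (real pipelines tokenize with a library such as tiktoken first):

```python
def chunk_tokens(tokens, size=384, overlap=50):
    """Split a token list into fixed-size windows with `overlap` shared tokens.
    A naive baseline; semantic chunkers split on meaning boundaries instead."""
    step = size - overlap
    # Stop once the remaining tail is fully covered by the previous window.
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = list(range(1000))          # a 1,000-token document
chunks = chunk_tokens(doc)
print([len(c) for c in chunks])  # 3 chunks: 384, 384, 332 tokens
```

Each window re-embeds 50 tokens from its neighbor, which is exactly where the 2-3x token inflation comes from when chunks are made too small relative to the overlap.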

Re-embedding every document on model update: When OpenAI releases a better embedding model, you need to re-embed your entire corpus. For 10M documents: 4B tokens × $0.13/1M = $520 in one shot. Plan for this in your budget.

Vector storage costs: Embeddings live in vector databases. At 1M vectors of 1,536 dimensions (the text-embedding-3-small default; -large doubles this to 3,072):

  • Pinecone: ~$70/month (serverless tier)
  • Weaviate Cloud: ~$60/month
  • Qdrant Cloud: ~$40/month
  • pgvector (self-hosted): infrastructure cost only

For smaller corpora (<100K vectors), all three managed options charge under $10/month.
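Storage pricing tracks raw vector size, which you can estimate directly. A sketch (the 50-100% index-overhead figure is a rough assumption; HNSW graphs and metadata vary by engine):

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage in GB, assuming float32 (4 bytes per dimension).
    Index structures and metadata typically add another 50-100% on top."""
    return n_vectors * dims * bytes_per_dim / 1024**3

print(f"{index_size_gb(1_000_000, 1536):.1f} GB raw")  # ~5.7 GB raw
```

This is also why dimension choice matters: moving from 1,536 to 3,072 dimensions doubles both storage and the memory footprint of every similarity search.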

The Self-Hosting Threshold

Self-hosting embedding models becomes cost-effective at high volume:

GPU cost for self-hosted embedding:

  • A100 40GB GPU: ~$2.50/hr on Lambda Labs
  • Throughput: ~500K tokens/minute
  • Monthly capacity: 500K × 60 × 24 × 30 = 21.6B tokens
  • Monthly cost: $2.50 × 720 = $1,800

At $0.02/M tokens (OpenAI small), you'd need 90B tokens/month before self-hosting a single A100 breaks even. That's roughly 90 million 1,000-token documents per month.

Breakeven rule: at the always-on rental price above, self-hosting only wins once your monthly API bill exceeds the ~$1,800 GPU cost. For batch workloads, spot instances that run only during embedding jobs pay for GPU hours actually used, which pulls the practical threshold down into the $500-1,000/month range.
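The always-on breakeven above follows from one line of arithmetic, sketched here with the Lambda Labs and OpenAI figures as defaults:

```python
def breakeven_tokens_per_month(gpu_hourly: float = 2.50,
                               api_price_per_m: float = 0.02,
                               hours: float = 720) -> float:
    """Monthly token volume at which an always-on GPU matches API cost."""
    return gpu_hourly * hours / api_price_per_m * 1_000_000

print(f"{breakeven_tokens_per_month() / 1e9:.0f}B tokens/month")  # 90B
```

Lowering `hours` to the hours a spot instance actually runs (rather than 720) is what makes batch self-hosting viable at much smaller volumes.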

Optimizing Your RAG Cost Stack

1. Embed once, index smart. Build metadata-filtered retrieval so you search only the relevant slice of your corpus. A customer support bot should only search support docs, not your entire wiki.
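The idea can be shown with an in-memory toy: filter on metadata first, then score only the surviving slice. (Real vector databases such as Pinecone or Qdrant take the same filter as a query parameter; the two-dimensional vectors here are illustrative only.)

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy corpus: each entry carries an embedding plus metadata.
corpus = [
    {"id": 1, "vec": [1.0, 0.0], "source": "support"},
    {"id": 2, "vec": [0.9, 0.1], "source": "wiki"},
    {"id": 3, "vec": [0.0, 1.0], "source": "support"},
]

def search(query_vec, source, k=1):
    """Filter by metadata first; rank only the relevant slice of the corpus."""
    candidates = [d for d in corpus if d["source"] == source]
    return sorted(candidates, key=lambda d: -cosine(query_vec, d["vec"]))[:k]

best = search([1.0, 0.0], "support")[0]
print(best["id"])  # 1 -- the wiki doc (id 2) is never even scored
```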

2. Cache hot queries. If 20% of queries are repeated (common in FAQ-style applications), cache their embedding + retrieval results. This cuts query-related costs 20-40% in practice.
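For exact-repeat queries, the standard library already provides the cache. A sketch in which `embed_and_search` is a hypothetical stand-in for your real embed-plus-retrieve call:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_retrieve(query: str) -> tuple:
    """Embed + retrieve once per distinct query string; repeats are free."""
    return embed_and_search(query)

calls = 0
def embed_and_search(query):
    """Hypothetical stand-in for the expensive embedding + vector search."""
    global calls
    calls += 1
    return (f"results for {query}",)

for q in ["refund policy", "refund policy", "reset password"]:
    cached_retrieve(q)
print(calls)  # 2 paid calls for 3 queries
```

In production you would typically use a shared cache (e.g. Redis) keyed on a normalized query string, but the cost mechanics are identical.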

3. Tiered model strategy. Use the smaller, cheaper embedding model (text-embedding-3-small) for first-pass retrieval across your full corpus, then re-rank top-20 results with a cross-encoder (BERT-based, run locally for free). This achieves large-model quality at small-model cost.
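The pipeline shape is retrieve-cheap, then re-rank-expensive. A minimal sketch in which both scorers are toy stand-ins (word overlap for the bi-encoder, a length-penalized variant for the cross-encoder); in a real stack, stage 1 is a vector search and stage 2 a local cross-encoder model:

```python
def cheap_score(query: str, doc: str) -> float:
    """Toy stand-in for fast, approximate bi-encoder similarity."""
    return len(set(query.split()) & set(doc.split()))

def cross_encoder_score(query: str, doc: str) -> float:
    """Toy stand-in for a slower, more accurate cross-encoder."""
    return cheap_score(query, doc) / (1 + abs(len(doc) - len(query)))

def two_stage(query: str, corpus: list, first_k: int = 20, final_k: int = 3):
    # Stage 1: cheap retrieval across the full corpus.
    shortlist = sorted(corpus, key=lambda d: -cheap_score(query, d))[:first_k]
    # Stage 2: expensive re-ranking on the shortlist only.
    return sorted(shortlist, key=lambda d: -cross_encoder_score(query, d))[:final_k]

docs = ["refund policy details",
        "refund policy for enterprise customers and partners",
        "shipping times"]
print(two_stage("refund policy", docs)[0])  # "refund policy details"
```

The cost win comes from the shape alone: the expensive scorer only ever sees `first_k` candidates, regardless of corpus size.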

4. Smart chunking. Tools like LlamaIndex and LangChain offer semantic chunking — splitting by meaning rather than token count. Semantic chunks typically improve retrieval relevance by 15-30%, meaning you retrieve in fewer chunks and make fewer LLM calls.

The Full RAG Stack Cost Estimate

For a production RAG application with 500K documents, 10K queries/day:

| Component | Monthly cost |
| --- | --- |
| Initial embedding (one-time, amortized over 12 months) | $0.35 |
| Monthly re-embedding (5% document churn) | $0.65 |
| Query embedding | $0.18 |
| Vector database (Pinecone serverless) | $45 |
| LLM inference (GPT-4o mini, ~1K output tokens/query) | $90 |
| **Total** | **~$136/month** |

RAG is cheap. The LLM inference — not the embeddings — dominates cost. This flips the optimization priority: spend more time reducing LLM output length and call frequency than optimizing embedding model choice.


Calculate your embedding costs with our AI Embedding Cost Calculator.
