aicalcus.com
Data · Updated May 2026

LLM Benchmarks 2026

Performance vs cost for major language models. MMLU (general knowledge), HumanEval (coding), and MATH (competition math) scores compared against API input pricing, so you can pick the right model for your budget.

| Model | Provider | MMLU | HumanEval | MATH | Input price (USD per 1M tokens) | Speed | Best for |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | Anthropic | 90.4 | 92 | 78.3 | $3.000 | Medium | Coding, long context, analysis |
| GPT-4o | OpenAI | 88.7 | 90.2 | 76.6 | $2.500 | Medium | General purpose, vision tasks |
| Llama 3.3 70B | Meta (via Groq) | 86 | 88.4 | 77 | $0.590 | Very Fast (Groq) | Open-source, latency-critical |
| Gemini 1.5 Pro | Google | 85.9 | 84 | 67.7 | $3.500 | Medium | 2M context, multimodal RAG |
| Mistral Large 2 | Mistral | 84 | 92 | 69 | $2.000 | Medium | Multilingual, EU compliance |
| Claude Haiku 3.5 | Anthropic | 83 | 88 | 69 | $0.800 | Very Fast | Classification, extraction, routing |
| GPT-4o mini | OpenAI | 82 | 87.2 | 70.2 | $0.150 | Fast | Cost-sensitive applications |
| Gemini 1.5 Flash | Google | 78.9 | 74.3 | 54.9 | $0.075 | Very Fast | Cheap high-volume with long context |

≥88 excellent  80–87 good  <80 below average
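
One rough way to combine the table's two axes is average benchmark points per dollar of input price. The sketch below copies the scores and prices from the table. The unweighted mean, the `points_per_dollar` metric, and the handling of scores between 87 and 88 (treated as "good") are assumptions made for illustration, not part of the published methodology.

```python
# Rows copied from the table: (model, MMLU, HumanEval, MATH, input USD per 1M tokens)
ROWS = [
    ("Claude Sonnet 4", 90.4, 92.0, 78.3, 3.000),
    ("GPT-4o", 88.7, 90.2, 76.6, 2.500),
    ("Llama 3.3 70B", 86.0, 88.4, 77.0, 0.590),
    ("Gemini 1.5 Pro", 85.9, 84.0, 67.7, 3.500),
    ("Mistral Large 2", 84.0, 92.0, 69.0, 2.000),
    ("Claude Haiku 3.5", 83.0, 88.0, 69.0, 0.800),
    ("GPT-4o mini", 82.0, 87.2, 70.2, 0.150),
    ("Gemini 1.5 Flash", 78.9, 74.3, 54.9, 0.075),
]


def band(score: float) -> str:
    """Map a score to the legend's bands.

    Assumption: the legend leaves 87 < score < 88 undefined; it is treated as "good".
    """
    if score >= 88:
        return "excellent"
    if score >= 80:
        return "good"
    return "below average"


def mean_score(row: tuple) -> float:
    """Unweighted mean of the three benchmark scores (an illustrative choice)."""
    _, mmlu, humaneval, math_score, _ = row
    return (mmlu + humaneval + math_score) / 3


def points_per_dollar(row: tuple) -> float:
    """Mean benchmark score divided by input price per 1M tokens."""
    return mean_score(row) / row[4]


# Best value first under this (hypothetical) metric.
ranked = sorted(ROWS, key=points_per_dollar, reverse=True)

for row in ranked:
    print(f"{row[0]:<18} mean={mean_score(row):5.1f}  points/$={points_per_dollar(row):7.1f}")
```

Because the prices span roughly a factor of 45 while the scores span far less, this metric is dominated by price. It is useful for budget screening but says little about whether a model clears the quality bar for a given task.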

Best value picks

Best overall

Claude Sonnet 4

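Highest MMLU and MATH scores in the table, and tied with Mistral Large 2 for the top HumanEval score. Prompt caching, with discounts of up to 90% on cached input tokens, can make it cost-effective at scale for long-context apps.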
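
To see what a 90% caching discount does to the input bill, here is a minimal sketch. It assumes cached input tokens are billed at 10% of the listed input price and ignores cache-write surcharges and output tokens. The function name and the 80% cache-hit share are hypothetical.

```python
def effective_input_price(base_price: float, cached_fraction: float,
                          cache_discount: float = 0.90) -> float:
    """Blended input price per 1M tokens when part of the prompt is served from cache.

    Assumptions (not from the source): cached tokens cost
    base_price * (1 - cache_discount); cache writes and output tokens are ignored.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    uncached_share = 1.0 - cached_fraction
    cached_share = cached_fraction * (1.0 - cache_discount)
    return base_price * (uncached_share + cached_share)


# Claude Sonnet 4 at the table's $3.000 per 1M input tokens,
# with a hypothetical 80% of prompt tokens hitting the cache.
blended = effective_input_price(3.00, 0.80)
print(f"Blended input price: ${blended:.2f} per 1M tokens")
```

Under these assumptions the blended price works out to about $0.84 per 1M input tokens, and it cannot fall below $0.30 even when every input token is cached.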

Best budget

Gemini 1.5 Flash

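$0.075 per 1M input tokens, half the price of the next-cheapest model in the table (GPT-4o mini at $0.150). It has the lowest benchmark scores listed, but is adequate for many high-volume tasks, with 1M context and very fast responses.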
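
The price gap is easier to judge as a monthly bill. The sketch below uses the input prices from the table and a hypothetical volume of 1 billion input tokens per month. Output-token pricing is not listed on this page, so it is left out.

```python
# Input prices in USD per 1M tokens, copied from the table.
INPUT_PRICE_PER_1M = {
    "Claude Sonnet 4": 3.000,
    "GPT-4o": 2.500,
    "Llama 3.3 70B": 0.590,
    "Gemini 1.5 Pro": 3.500,
    "Mistral Large 2": 2.000,
    "Claude Haiku 3.5": 0.800,
    "GPT-4o mini": 0.150,
    "Gemini 1.5 Flash": 0.075,
}


def monthly_input_cost(tokens_per_month: int, price_per_1m: float) -> float:
    """Input-side cost in USD for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * price_per_1m


# Hypothetical workload: 1 billion input tokens per month.
VOLUME = 1_000_000_000

# Print models from cheapest to most expensive on input cost.
for model, price in sorted(INPUT_PRICE_PER_1M.items(), key=lambda kv: kv[1]):
    print(f"{model:<18} ${monthly_input_cost(VOLUME, price):>9,.2f}")
```

At that volume the input bill ranges from $75 for Gemini 1.5 Flash to $3,500 for Gemini 1.5 Pro.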

Best speed

Groq Llama 3.3 70B

300+ tokens/sec on Groq. Strong MMLU and HumanEval scores (86 and 88.4), open weights, and no speed premium: $0.590 per 1M input tokens is in the same range as other hosted APIs.

Benchmark scores from public leaderboards and provider documentation. Last updated 2026-05-19.