Data · Updated May 2026

LLM Benchmarks 2026

Performance versus cost for major language models. MMLU (general knowledge), HumanEval (coding), and MATH (competition math) scores are compared against API input pricing, so you can pick the right model for your budget.

Model	Provider	MMLU	HumanEval	MATH	Input $/1M tokens	Speed	Best for
Claude Sonnet 4	Anthropic	90.4	92	78.3	$3.000	Medium	Coding, long context, analysis
GPT-4o	OpenAI	88.7	90.2	76.6	$2.500	Medium	General purpose, vision tasks
Llama 3.3 70B	Meta (via Groq)	86	88.4	77	$0.590	Very Fast (Groq)	Open-source, latency-critical
Gemini 1.5 Pro	Google	85.9	84	67.7	$3.500	Medium	2M context, multimodal RAG
Mistral Large 2	Mistral	84	92	69	$2.000	Medium	Multilingual, EU compliance
Claude Haiku 3.5	Anthropic	83	88	69	$0.800	Very Fast	Classification, extraction, routing
GPT-4o mini	OpenAI	82	87.2	70.2	$0.150	Fast	Cost-sensitive applications
Gemini 1.5 Flash	Google	78.9	74.3	54.9	$0.075	Very Fast	Cheap high-volume with long context

Score key: ≥88 excellent · 80 to <88 good · <80 below average
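The score key can be applied programmatically. The snippet below is a minimal sketch, not part of the original page: the scores are copied from the table, while the `tier` and `mean_score` helpers are hypothetical names, and the unweighted mean across the three benchmarks is an assumption about how one might summarize a row. It also assumes that scores from 80 up to (but not including) 88, such as 87.2, count as "good".

```python
# Scores copied from the table above: (MMLU, HumanEval, MATH).
SCORES = {
    "Claude Sonnet 4": (90.4, 92.0, 78.3),
    "GPT-4o": (88.7, 90.2, 76.6),
    "Llama 3.3 70B": (86.0, 88.4, 77.0),
    "Gemini 1.5 Pro": (85.9, 84.0, 67.7),
    "Mistral Large 2": (84.0, 92.0, 69.0),
    "Claude Haiku 3.5": (83.0, 88.0, 69.0),
    "GPT-4o mini": (82.0, 87.2, 70.2),
    "Gemini 1.5 Flash": (78.9, 74.3, 54.9),
}


def tier(score: float) -> str:
    """Map one benchmark score to a tier from the score key.

    Assumption: anything from 80 up to (not including) 88 is "good".
    """
    if score >= 88:
        return "excellent"
    if score >= 80:
        return "good"
    return "below average"


def mean_score(model: str) -> float:
    """Unweighted mean of the three benchmarks (an illustrative summary only)."""
    scores = SCORES[model]
    return round(sum(scores) / len(scores), 1)


# Print each model with its mean score and per-benchmark tiers.
for model, scores in SCORES.items():
    tiers = ", ".join(tier(s) for s in scores)
    print(f"{model:18s} mean={mean_score(model):5.1f}  tiers: {tiers}")
```

An unweighted mean treats all three benchmarks as equally important; a coding-heavy workload would reasonably weight HumanEval more.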

Best value picks

Best overall

Claude Sonnet 4

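Highest MMLU and MATH scores in the table, and tied with Mistral Large 2 for the top HumanEval score (92). Cached prompt reads are discounted 90%, which can make it the cheapest option at scale for long-context apps that reuse most of their prompt.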
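To see how much the caching discount matters, here is a rough sketch of the blended input price. It assumes the "90%" figure means cached input tokens are billed at 10% of the base input price, and it ignores any cache-write surcharge and all output-token costs. The function name and the 80% cache-hit share are hypothetical, chosen only for illustration.

```python
def effective_input_price(base_price: float,
                          cached_fraction: float,
                          cache_discount: float = 0.90) -> float:
    """Blended price per 1M input tokens when part of the prompt is read from cache.

    base_price      -- list price per 1M input tokens (e.g. 3.00 from the table)
    cached_fraction -- share of input tokens served from cache, between 0 and 1
    cache_discount  -- assumed discount on cached tokens (0.90 = 90% off)
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached_part = cached_fraction * base_price * (1.0 - cache_discount)
    uncached_part = (1.0 - cached_fraction) * base_price
    return cached_part + uncached_part


# Hypothetical workload: 80% of input tokens hit the cache.
blended = effective_input_price(3.00, cached_fraction=0.80)
print(f"Blended input price: ${blended:.2f} per 1M tokens")
```

Under these assumptions the $3.00 list price drops to about $0.84 per 1M input tokens, which is still above the list prices of the cheapest models in the table, so the advantage depends on how much of the prompt is actually reused.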

Best budget

Gemini 1.5 Flash

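$0.075 per 1M input tokens, half the price of the next-cheapest model (GPT-4o mini at $0.150). Lowest benchmark scores in the table, but adequate for simple high-volume tasks, with a 1M-token context window and very fast responses.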
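The price gap is easier to judge at a concrete volume. The sketch below uses the input prices from the table; the 500M-token monthly workload and the `input_cost` helper are hypothetical, and output-token costs are left out because the table lists input pricing only.

```python
# Input prices in USD per 1M tokens, copied from the table above.
INPUT_PRICE_PER_1M = {
    "Claude Sonnet 4": 3.000,
    "GPT-4o": 2.500,
    "Llama 3.3 70B": 0.590,
    "Gemini 1.5 Pro": 3.500,
    "Mistral Large 2": 2.000,
    "Claude Haiku 3.5": 0.800,
    "GPT-4o mini": 0.150,
    "Gemini 1.5 Flash": 0.075,
}


def input_cost(model: str, tokens: int) -> float:
    """Input-only cost in USD for the given number of tokens."""
    return INPUT_PRICE_PER_1M[model] * tokens / 1_000_000


# Hypothetical workload: 500M input tokens per month.
monthly_tokens = 500_000_000

# List models from cheapest to most expensive at that volume.
for model in sorted(INPUT_PRICE_PER_1M, key=INPUT_PRICE_PER_1M.get):
    print(f"{model:18s} ${input_cost(model, monthly_tokens):>9,.2f}")
```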

Best speed

Groq Llama 3.3 70B

300+ tokens/sec on Groq. Strong benchmark scores, open-source, and no price premium over other hosted APIs.


Benchmark scores from public leaderboards and provider documentation. Last updated 2026-05-19.