Data · Updated May 2026
LLM Benchmarks 2026
Performance versus cost for major language models: MMLU (general knowledge), HumanEval (coding), and MATH (competition-style math problems) scores compared against API input pricing, so you can pick the right model for your budget.
| Model | Provider | MMLU | HumanEval | MATH | Input price (per 1M tokens) | Speed | Best for |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | Anthropic | 90.4 | 92 | 78.3 | $3.000 | Medium | Coding, long context, analysis |
| GPT-4o | OpenAI | 88.7 | 90.2 | 76.6 | $2.500 | Medium | General purpose, vision tasks |
| Llama 3.3 70B | Meta (via Groq) | 86 | 88.4 | 77 | $0.590 | Very Fast (Groq) | Open-source, latency-critical |
| Gemini 1.5 Pro | Google | 85.9 | 84 | 67.7 | $3.500 | Medium | 2M context, multimodal RAG |
| Mistral Large 2 | Mistral | 84 | 92 | 69 | $2.000 | Medium | Multilingual, EU compliance |
| Claude Haiku 3.5 | Anthropic | 83 | 88 | 69 | $0.800 | Very Fast | Classification, extraction, routing |
| GPT-4o mini | OpenAI | 82 | 87.2 | 70.2 | $0.150 | Fast | Cost-sensitive applications |
| Gemini 1.5 Flash | Google | 78.9 | 74.3 | 54.9 | $0.075 | Very Fast | Cheap high-volume with long context |
Score key: ■ ≥88 excellent · ■ 80–87 good · ■ <80 below average
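The score bands in the key above can be applied programmatically. The following is a minimal sketch, not the page's own logic: the function name `tier` is hypothetical, and it assumes that fractional scores between 87 and 88 (such as 87.2) count as "good", which the key does not state explicitly.

```python
def tier(score: float) -> str:
    """Map a benchmark score to the band named in the score key.

    Assumption: anything below 88 but at least 80 is "good",
    including fractional values such as 87.2.
    """
    if score >= 88:
        return "excellent"
    if score >= 80:
        return "good"
    return "below average"


# MMLU scores copied from the table above.
MMLU = {
    "Claude Sonnet 4": 90.4,
    "GPT-4o": 88.7,
    "Llama 3.3 70B": 86,
    "Gemini 1.5 Pro": 85.9,
    "Mistral Large 2": 84,
    "Claude Haiku 3.5": 83,
    "GPT-4o mini": 82,
    "Gemini 1.5 Flash": 78.9,
}

for model, score in MMLU.items():
    print(f"{model}: {score} ({tier(score)})")
```

The same function works for the HumanEval and MATH columns, since the key uses one set of thresholds for all three benchmarks.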
Best value picks
Best overall
Claude Sonnet 4
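Top score on MMLU and MATH, and tied for first on HumanEval. A 90% discount on cached input tokens (prompt caching) can make it the cheapest option at scale for long-context apps that resend the same prompt prefix.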
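To see how prompt caching changes the math, here is a rough sketch of the blended input price. It assumes the 90% figure means cached input tokens are billed at 10% of the base rate, and it ignores cache-write surcharges and output-token costs, so treat the result as an estimate only. The function name and parameters are hypothetical.

```python
def effective_input_price(
    base_price: float,
    cached_fraction: float,
    cache_discount: float = 0.90,
) -> float:
    """Blended input price in USD per 1M tokens.

    base_price:      list price per 1M input tokens
    cached_fraction: share of input tokens served from cache (0.0 to 1.0)
    cache_discount:  assumed discount on cached tokens (0.90 = 90% off)
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached_part = cached_fraction * base_price * (1.0 - cache_discount)
    uncached_part = (1.0 - cached_fraction) * base_price
    return cached_part + uncached_part


# Claude Sonnet 4 list price from the table: $3.00 per 1M input tokens.
for fraction in (0.0, 0.5, 0.9):
    price = effective_input_price(3.00, fraction)
    print(f"{fraction:.0%} cached: ${price:.2f} per 1M input tokens")
```

Under these assumptions, a workload where 90% of input tokens hit the cache pays roughly $0.57 per 1M input tokens, which is below the uncached list price of every model in the table except GPT-4o mini and Gemini 1.5 Flash. Other providers may offer their own caching discounts, which this sketch does not model.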
Best budget
Gemini 1.5 Flash
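$0.075 per 1M input tokens, half the price of the next-cheapest model in the table (GPT-4o mini at $0.150). Strong enough for most tasks despite the lowest scores listed, with a 1M-token context window and very fast responses.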
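As a quick illustration of what the price gap means in practice, the sketch below estimates input cost for a given token volume using the table's prices. The monthly volume is a hypothetical workload, and output-token pricing is not included because the table lists input prices only.

```python
# Input prices in USD per 1M tokens, copied from the table above.
INPUT_PRICE_PER_M = {
    "Gemini 1.5 Flash": 0.075,
    "GPT-4o mini": 0.150,
    "Llama 3.3 70B (Groq)": 0.590,
    "Claude Haiku 3.5": 0.800,
}


def input_cost(model: str, tokens: int) -> float:
    """Return the input-token cost in USD for `tokens` input tokens."""
    return INPUT_PRICE_PER_M[model] * tokens / 1_000_000


# Hypothetical workload: 500M input tokens per month.
monthly_tokens = 500_000_000

# Print models from cheapest to most expensive.
for model in sorted(INPUT_PRICE_PER_M, key=INPUT_PRICE_PER_M.get):
    print(f"{model}: ${input_cost(model, monthly_tokens):,.2f} per month")
```

At this assumed volume, Gemini 1.5 Flash comes to about $37.50 per month in input tokens versus about $75.00 for GPT-4o mini.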
Best speed
Groq Llama 3.3 70B
300+ tokens/sec. Strong benchmark scores, open-source, same price as hosted APIs.
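Throughput translates to response time with simple division. The sketch below is a back-of-the-envelope estimate: it takes the 300 tokens/sec figure at face value, while the 60 tokens/sec comparison rate and the time-to-first-token parameter are hypothetical and not taken from the table.

```python
def generation_seconds(
    output_tokens: int,
    tokens_per_second: float,
    time_to_first_token: float = 0.0,
) -> float:
    """Estimate wall-clock seconds to stream a full response.

    time_to_first_token is an optional fixed startup delay in seconds;
    the default of 0.0 is a simplification, not a measured value.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return time_to_first_token + output_tokens / tokens_per_second


response_tokens = 600  # hypothetical response length

fast = generation_seconds(response_tokens, 300)  # rate quoted above
slow = generation_seconds(response_tokens, 60)   # hypothetical slower rate
print(f"At 300 tokens/sec: {fast:.1f} s")
print(f"At 60 tokens/sec:  {slow:.1f} s")
```

For a 600-token response, that is about 2 seconds at 300 tokens/sec compared with about 10 seconds at the assumed slower rate, which is the kind of gap that matters for latency-critical applications.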
Benchmark scores from public leaderboards and provider documentation. Last updated 2026-05-19.