A harder version of MMLU with reasoning-focused questions and 10 answer choices instead of 4. It contains more than 12,000 questions across 14 domains.
Why it matters: scores run 16–33% lower than on standard MMLU, leaving more headroom to differentiate top models, and the questions test reasoning in addition to knowledge.
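As a minimal sketch of how this benchmark is scored: each question offers 10 choices (A–J rather than A–D), and the reported metric is plain accuracy over graded answers. The predictions and answer key below are made-up placeholders, not real MMLU-Pro data.

```python
# MMLU-Pro-style scoring sketch: 10 answer choices per question,
# metric is simple accuracy. All data here is illustrative.
CHOICES = "ABCDEFGHIJ"  # 10 options, vs. 4 (A-D) in standard MMLU

def accuracy(predictions, gold):
    """Fraction of questions where the predicted letter matches the key."""
    assert len(predictions) == len(gold), "one prediction per question"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["C", "J", "A", "F"]   # hypothetical model outputs
gold  = ["C", "J", "B", "F"]   # hypothetical answer key
print(f"accuracy = {accuracy(preds, gold):.1%}")  # 3 of 4 correct -> 75.0%
```

Note that with 10 choices, random guessing yields about 10% accuracy, versus 25% on standard 4-choice MMLU, which is part of why MMLU-Pro scores sit lower.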
Top Model: Gemini 3 Pro (88%)
Average Score: 70.9% (across 82 models)
Models Tested: 82 (metric: accuracy)
Human Baseline: - (score range: 0%–100%)
MMLU-Pro Scores - Top 25 Models
Ranked by MMLU-Pro score (%)
All models with a reported MMLU-Pro score, ranked by highest accuracy.
MMLU-Pro is a standardized evaluation that measures AI model performance on specific tasks. It provides comparable scores across different models, helping developers choose the right model for their needs.
Gemini 3 Pro currently holds the top score on the MMLU-Pro benchmark. See our full rankings table above for the complete leaderboard with 82 models.
We update benchmark data from multiple sources including HuggingFace open-source model leaderboards and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.
No. While MMLU-Pro is an important indicator, real-world performance depends on many factors including pricing, latency, context window, and specific task requirements. We recommend using our composite score which weighs multiple benchmarks and practical factors.
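A composite score of the kind described above can be sketched as a weighted mean of normalized benchmark and practical-factor scores. The weights and factor names below are illustrative assumptions, not the site's actual formula.

```python
# Hypothetical composite ranking: weighted average of per-factor scores,
# each normalized to 0-100. Weights here are assumed for illustration.
def composite(scores, weights):
    """Weighted mean of the factors present in `scores`."""
    total_w = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_w

# Assumed weighting: benchmark accuracy matters most, then cost, then speed.
weights = {"mmlu_pro": 0.5, "price": 0.3, "latency": 0.2}
scores = {"mmlu_pro": 88.0, "price": 60.0, "latency": 70.0}
print(round(composite(scores, weights), 1))  # (44 + 18 + 14) / 1.0 = 76.0
```

Because the weighted mean divides by the sum of the weights actually present, a model missing one factor can still be scored over the factors it reports.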