Last updated: 1h ago

Knowledge benchmark

MMLU Professional Leaderboard

A harder version of MMLU with reasoning-focused questions and 10 answer choices instead of 4. It contains 12,000+ questions across 14 domains.

Why it matters: It differentiates top models better than standard MMLU because scores run 16-33% lower, which leaves more room between the strongest models. It also tests reasoning in addition to knowledge.
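
One concrete effect of moving from 4 to 10 answer choices is a lower random-guess floor. The sketch below works out that arithmetic. It assumes every question has exactly the stated number of options and a single correct answer; `chance_accuracy` is an illustrative helper, not part of any benchmark tooling.

```python
from fractions import Fraction


def chance_accuracy(num_choices: int) -> Fraction:
    """Expected accuracy of uniform random guessing on single-answer questions."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return Fraction(1, num_choices)


mmlu_floor = chance_accuracy(4)       # standard MMLU: 4 options per question
mmlu_pro_floor = chance_accuracy(10)  # MMLU-Pro: 10 options per question

print(f"MMLU random-guess floor:     {float(mmlu_floor):.0%}")      # 25%
print(f"MMLU-Pro random-guess floor: {float(mmlu_pro_floor):.0%}")  # 10%
```

Under these assumptions, a score near 25% on MMLU or near 10% on MMLU-Pro is indistinguishable from guessing.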

Top Model: Gemini 3 Pro (88%)
Average Score: 70.9% across 82 models
Models Tested: 82
Metric: accuracy
Human Baseline: Not established
Score Range: 0%–100%

[Chart: MMLU-Pro Scores, Top 25 Models, ranked by MMLU-Pro score (%). Source: LMMarketCap.com]

Model Rankings

All models with a reported MMLU-Pro score, ranked by highest accuracy.

| Rank | Model | Developer | Score |
|------|-------|-----------|-------|
| #1 | Gemini 3 Pro | Google | 88% |
| #1 | MiniMax M2.1 | MiniMax | 88% |
| #3 | Qwen3.5 397B A17B | Alibaba | 87.8% |
| #4 | GPT-5.4 | OpenAI | 87% |
| #5 | Qwen3.5-122B-A10B | Alibaba | 86.7% |
| #6 | Gemini 3.1 Flash Lite Preview | Google | 86.2% |
| #7 | Qwen3.5-27B | Alibaba | 86.1% |
| #8 | Gemini 2.5 Pro | Google | 86% |
| #8 | GPT-5.2 | OpenAI | 86% |
| #8 | GLM 5 | Zhipu AI | 86% |
| #11 | o3 | OpenAI | 85.8% |
| #12 | Qwen3 Max Thinking | Alibaba | 85.7% |
| #13 | Qwen3.5-35B-A3B | Alibaba | 85.3% |
| #14 | Gemma 4 31B | Google | 85.2% |
| #15 | GPT-5.1 | OpenAI | 85% |
| #16 | GLM 4.5 | Zhipu AI | 84.6% |
| #17 | GPT-5 | OpenAI | 84.5% |
| #18 | DeepSeek R1 | DeepSeek | 84% |
| #18 | Claude 3.7 Sonnet (thinking) | Anthropic | 84% |
| #20 | Claude Sonnet 4 | Anthropic | 83.7% |
| #21 | Grok 4 | xAI | 83.5% |
| #22 | DeepSeek R1-0528 | DeepSeek | 83.4% |
| #23 | o4-mini | OpenAI | 83% |
| #23 | Grok 3 Mini | xAI | 83% |
| #25 | o3-mini | OpenAI | 82.5% |
| #25 | Claude Opus 4.6 | Anthropic | 82.5% |
| #25 | Qwen3.5-9B | Alibaba | 82.5% |
| #28 | Gemini 3 Flash | Google | 82% |
| #28 | MiniMax M2 | MiniMax | 82% |
| #30 | GPT-4.1 | OpenAI | 81.8% |
| #31 | GLM 4.5 Air | Zhipu AI | 81.4% |
| #32 | MiniMax M1 | MiniMax | 81.1% |
| #33 | Qwen3 30B A3B Thinking 2507 | Alibaba | 80.9% |
| #34 | Llama 4 Maverick | Meta | 80.5% |
| #35 | o1 | OpenAI | 80.3% |
| #36 | MiniMax M2.5 | MiniMax | 80.1% |
| #37 | Claude Opus 4 | Anthropic | 80% |
| #38 | Claude Sonnet 4.6 | Anthropic | 79% |
| #39 | ERNIE 4.5 300B A47B | Baidu | 78.4% |
| #40 | Claude Opus 4.5 | Anthropic | 78% |
| #40 | Gemini 2.5 Flash | Google | 78% |
| #40 | Grok 3 | xAI | 78% |
| #43 | DeepSeek V3 (March 2025) | DeepSeek | 77.5% |
| #44 | Claude Sonnet 4.5 | Anthropic | 76.5% |
| #45 | DeepSeek V3 | DeepSeek | 75.9% |
| #46 | Grok 2 | xAI | 75.5% |
| #47 | Llama 4 Scout | Meta | 74.3% |
| #48 | Claude 3.7 Sonnet | Anthropic | 74% |
| #49 | Claude 3.5 Sonnet | Anthropic | 73.8% |
| #50 | o1-mini | OpenAI | 73.5% |
| #51 | Llama 3.1 405B Instruct | Meta | 73.3% |
| #52 | GPT-4o | OpenAI | 72.6% |
| #53 | Gemini 2.0 Flash Lite | Google | 71.6% |
| #54 | Qwen 2.5 72B | Alibaba | 71.1% |
| #55 | Phi-4 | Microsoft | 70.5% |
| #56 | Gemini 1.5 Pro | Google | 70.3% |
| #57 | Mistral Large 2 | Mistral AI | 69.4% |
| #58 | Llama 3.3 70B | Meta | 68.9% |
| #59 | Claude 3 Opus | Anthropic | 68.5% |
| #60 | Qwen3 235B A22B | Alibaba | 68.2% |
| #61 | Claude Haiku 4.5 | Anthropic | 68% |
| #62 | GPT-4 Turbo | OpenAI | 63.7% |
| #63 | GPT-4o mini | OpenAI | 63.1% |
| #64 | Llama 3.1 70B | Meta | 62.8% |
| #65 | Claude 3.5 Haiku | Anthropic | 62.1% |
| #66 | Gemini 2.0 Flash | Google | 62% |
| #67 | Llama 3.1 405B | Meta | 61.6% |
| #68 | Gemini 1.5 Flash | Google | 59.1% |
| #69 | Claude 3 Sonnet | Anthropic | 56.8% |
| #70 | Llama 3 70B Instruct | Meta | 56.2% |
| #71 | DeepSeek V2 Chat | DeepSeek | 54.8% |
| #72 | Claude 3 Haiku | Anthropic | 42.3% |
| #73 | WizardLM-2 8x22B | Microsoft | 39.2% |
| #74 | Qwen 2.5 Coder 32B | Alibaba | 37.9% |
| #75 | Qwen2.5 7B Instruct | Alibaba | 36.5% |
| #76 | Command R+ | Cohere | 33.2% |
| #77 | Llama 3.1 8B Instruct | Meta | 30.4% |
| #78 | Command R7B (12-2024) | Cohere | 28.6% |
| #79 | Mistral 7B Instruct v0.1 | Mistral AI | 25.8% |
| #80 | Llama 3.2 3B Instruct | Meta | 23.7% |
| #81 | Runway Gen-3 Alpha | Runway | 22.3% |
| #82 | Llama 3 8B Instruct | Meta | 17.8% |
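
In the rankings above, models with equal scores share a rank and the following rank is skipped (two models at #18 are followed by #20). The page does not state its ranking rule, but this pattern matches standard competition ranking, which can be sketched as follows. The function name `competition_ranks` is an assumption made for illustration.

```python
def competition_ranks(scores: list[float]) -> list[int]:
    """Rank scores from highest to lowest; ties share a rank and later ranks skip."""
    ordered = sorted(scores, reverse=True)
    # A score's rank is 1 + the number of strictly higher scores, which is
    # the position of its first occurrence in the descending list.
    first_position: dict[float, int] = {}
    for position, score in enumerate(ordered, start=1):
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Scores copied from the rows ranked #17 through #20.
sample = {
    "GPT-5": 84.5,
    "DeepSeek R1": 84.0,
    "Claude 3.7 Sonnet (thinking)": 84.0,
    "Claude Sonnet 4": 83.7,
}
offset = 16  # sixteen models rank above this slice
for name, rank in zip(sample, competition_ranks(list(sample.values()))):
    print(f"#{rank + offset} {name}")
```

Run on this slice, the sketch prints #17, #18, #18, #20, matching the listed ranks.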

About MMLU-Pro

Full Name: MMLU Professional
Category: Knowledge
Metric: accuracy (%)
Score Range: 0%–100%
Human Baseline: Not established
Status: Active

Frequently Asked Questions

What is MMLU-Pro?
MMLU-Pro is a standardized multiple-choice evaluation of knowledge and reasoning, with 10 answer choices per question across 14 domains. It provides comparable scores across different models, helping developers choose the right model for their needs.

Which model has the highest MMLU-Pro score?
Gemini 3 Pro currently holds the top score on the MMLU-Pro benchmark. See our full rankings table above for the complete leaderboard with 82 models.

How is the data updated?
We update benchmark data from multiple sources, including Hugging Face open-source model leaderboards and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.

Is the MMLU-Pro score enough to choose a model?
No. While MMLU-Pro is an important indicator, real-world performance depends on many factors, including pricing, latency, context window, and specific task requirements. We recommend using our composite score, which weighs multiple benchmarks and practical factors.
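
The composite score's formula is not given on this page. As a rough illustration of how a benchmark score can be weighed against practical factors, here is a minimal weighted-average sketch. The weights, factor names, and model values are all hypothetical placeholders, and each input is assumed to be pre-normalized to the 0-1 range with higher meaning better.

```python
def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of metrics normalized to 0-1 (higher is better)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[name] * metrics[name] for name in weights) / total_weight


# Hypothetical weights and inputs, for illustration only (not real data).
weights = {"mmlu_pro": 0.5, "cost": 0.3, "latency": 0.2}
model_a = {"mmlu_pro": 0.88, "cost": 0.40, "latency": 0.50}
model_b = {"mmlu_pro": 0.80, "cost": 0.90, "latency": 0.90}

print(f"model_a: {composite_score(model_a, weights):.2f}")
print(f"model_b: {composite_score(model_b, weights):.2f}")
```

With these placeholder numbers, the model with the lower benchmark score comes out ahead once cost and latency are counted, which is the point the answer above makes.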

Related Benchmarks

MMLU (Knowledge), GPQA Diamond (Reasoning), MATH-500 (Math), HumanEval (Coding), SWE-bench Verified (Coding), AIME 2024 (Math), GSM8K (Math), IFEval (Instruction), BBH (Reasoning)
