最近更新: 4h ago

Knowledge 基准测试

MMLU Professional 排行榜

Harder version of MMLU with reasoning-focused questions and 10 answer choices instead of 4. Contains 12,000+ questions across 14 domains.

为什么重要： Better at differentiating top models since scores are 16-33% lower than standard MMLU. Tests reasoning in addition to knowledge.

顶级模型

91.2%

Gemini 3.1 Pro

平均评分

71.8%

共81个模型

已测试模型

指标: accuracy

人类基准

评分范围: 0%–100%

MMLU-Pro Scores - Top 25 Models

Ranked by MMLU-Pro score (%)

LMMarketCap.com

模型排名

All models with a reported MMLU-Pro score, ranked by highest accuracy.

排名模型评分性能

Gemini 3.1 Pro Google

91.2%

91%

91.2%

Gemini 3 Pro Google

88%

MiniMax M2.1 MiniMax

88%

Qwen3.5 397B A17B Alibaba

87.8%

88%

87.8%

DeepSeek V4 Pro DeepSeek

87.5%

88%

87.5%

GPT-5.4 OpenAI

87%

Qwen3.5-122B-A10B Alibaba

86.7%

87%

86.7%

DeepSeek V4 Flash DeepSeek

86.2%

86%

86.2%

Gemini 3.1 Flash Lite Preview Google

86.2%

86%

86.2%

#10

Qwen3.5-27B Alibaba

86.1%

86%

86.1%

#11

Gemini 2.5 Pro Google

86%

#11

GPT-5.2 OpenAI

86%

#11

GLM 5 Zhipu AI

86%

#14

o3 OpenAI

85.8%

86%

85.8%

#15

Qwen3 Max Thinking Alibaba

85.7%

86%

85.7%

#16

Qwen3.5-35B-A3B Alibaba

85.3%

85%

85.3%

#17

Gemma 4 31B Google

85.2%

85%

85.2%

#18

GPT-5.1 OpenAI

85%

#18

DeepSeek V3.2 DeepSeek

85%

#20

GLM 4.5 Zhipu AI

84.6%

85%

84.6%

#21

GPT-5 OpenAI

84.5%

85%

84.5%

#22

DeepSeek R1 DeepSeek

84%

#23

Claude Sonnet 4 Anthropic

83.7%

84%

83.7%

#24

Grok 4 xAI

83.5%

84%

83.5%

#25

DeepSeek R1-0528 DeepSeek

83.4%

83%

83.4%

#26

o4-mini OpenAI

83%

#27

o3-mini OpenAI

82.5%

83%

82.5%

#27

Claude Opus 4.6 Anthropic

82.5%

83%

82.5%

#27

Qwen3.5-9B Alibaba

82.5%

83%

82.5%

#30

Gemini 3 Flash Google

82%

#30

MiniMax M2 MiniMax

82%

#32

GPT-4.1 OpenAI

81.8%

82%

81.8%

#33

GLM 4.5 Air Zhipu AI

81.4%

81%

81.4%

#34

MiniMax M1 MiniMax

81.1%

81%

81.1%

#35

Qwen3 30B A3B Thinking 2507 Alibaba

80.9%

81%

80.9%

#36

Llama 4 Maverick Meta

80.5%

81%

80.5%

#37

o1 OpenAI

80.3%

80%

80.3%

#38

MiniMax M2.5 MiniMax

80.1%

80%

80.1%

#39

Claude Opus 4 Anthropic

80%

#40

Claude Sonnet 4.6 Anthropic

79%

#41

Claude Opus 4.5 Anthropic

78%

#41

Gemini 2.5 Flash Google

78%

#41

Grok 3 xAI

78%

#44

DeepSeek V3 (March 2025)DeepSeek

77.5%

78%

77.5%

#45

Claude Sonnet 4.5 Anthropic

76.5%

77%

76.5%

#46

DeepSeek V3 DeepSeek

75.9%

76%

75.9%

#47

Grok 2 xAI

75.5%

76%

75.5%

#48

Llama 4 Scout Meta

74.3%

74%

74.3%

#49

Claude 3.7 Sonnet Anthropic

74%

#50

Claude 3.5 Sonnet Anthropic

73.8%

74%

73.8%

#51

o1-mini OpenAI

73.5%

74%

73.5%

#52

Llama 3 1 405b Instruct

73.3%

73%

73.3%

#53

GPT-4o OpenAI

72.6%

73%

72.6%

#54

Qwen 2.5 72B Alibaba

71.1%

71%

71.1%

#55

Phi-4 Microsoft

70.5%

71%

70.5%

#56

Gemini 1.5 Pro Google

70.3%

70%

70.3%

#57

Mistral Large 2 Mistral AI

69.4%

69%

69.4%

#58

Llama 3.3 70B Meta

68.9%

69%

68.9%

#59

Claude 3 Opus Anthropic

68.5%

69%

68.5%

#60

Qwen3 235B A22B Alibaba

68.2%

68%

68.2%

#61

Claude Haiku 4.5 Anthropic

68%

#62

GPT-4 Turbo OpenAI

63.7%

64%

63.7%

#63

GPT-4o mini OpenAI

63.1%

63%

63.1%

#64

Llama 3.1 70B Meta

62.8%

63%

62.8%

#65

Claude 3.5 Haiku Anthropic

62.1%

62%

62.1%

#66

Gemini 2.0 Flash Google

62%

#67

Llama 3.1 405B Meta

61.6%

62%

61.6%

#68

Gemini 1 5 Flash

59.1%

59%

59.1%

#69

Claude 3 Sonnet

56.8%

57%

56.8%

#70

Llama 3 70b Instruct

56.2%

56%

56.2%

#71

Deepseek V2 Chat

54.8%

55%

54.8%

#72

Claude 3 Haiku Anthropic

42.3%

42%

42.3%

#73

WizardLM-2 8x22B Microsoft

39.2%

39%

39.2%

#74

Qwen 2.5 Coder 32B Alibaba

37.9%

38%

37.9%

#75

Qwen2.5 7B Instruct Alibaba

36.5%

37%

36.5%

#76

Command R+Cohere

33.2%

33%

33.2%

#77

Llama 3.1 8B Instruct Meta

30.4%

30%

30.4%

#78

Command R7B (12-2024)Cohere

28.6%

29%

28.6%

#79

Llama 3.2 3B Instruct Meta

23.7%

24%

23.7%

#80

Runway Gen-3 Alpha Runway

22.3%

22%

22.3%

#81

Llama 3 8B Instruct Meta

17.8%

18%

17.8%

关于 MMLU-Pro

全名: MMLU Professional
类别: Knowledge
指标: accuracy (%)
评分范围: 0%–100%
人类基准: 尚未确定
状态: 启用

Frequently Asked Questions

MMLU-Pro is a standardized evaluation that measures AI model performance on specific tasks. It provides comparable scores across different models, helping developers choose the right model for their needs.

Gemini 3.1 Pro currently holds the top score on the MMLU-Pro benchmark. See our full rankings table above for the complete leaderboard with 81 models.

We update benchmark data from multiple sources including HuggingFace open-source model leaderboards and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.

No. While MMLU-Pro is an important indicator, real-world performance depends on many factors including pricing, latency, context window, and specific task requirements. We recommend using our composite score which weighs multiple benchmarks and practical factors.

MMLU Professional 排行榜

模型排名

关于 MMLU-Pro

相关基准测试

MMLU Professional 排行榜

模型排名

关于 MMLU-Pro

相关基准测试