Last updated: 5h ago

Coding Benchmarks

BigCodeBench (Hard) Leaderboard

Practical code generation requiring use of libraries, APIs, and complex program structures. The 'Hard' subset tests non-trivial engineering tasks.

Why it matters: BigCodeBench is more realistic than HumanEval because it tests practical programming skills, including library usage, API calls, and multi-file reasoning.

Top model: Claude Opus 4.6 (72.1%)
Average score: 38.8%
Models tested: 45
Metric: pass@1
Human baseline: not yet determined
Score range: 0%–100%

[Chart: BigCodeBench Scores, Top 25 Models, ranked by BigCodeBench score (%). Source: LMMarketCap.com]

Model Rankings

All models with a reported BigCodeBench score, ranked by highest pass@1.

| Rank | Model | Provider | Score (pass@1) |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 72.1% |
| 2 | Claude Sonnet 4.6 | Anthropic | 68.4% |
| 3 | Qwen 2.5 Coder 32B | Alibaba | 60.2% |
| 4 | GPT-4o | OpenAI | 51.1% |
| 5 | DeepSeek V3 | DeepSeek | 50% |
| 6 | Llama 4 Maverick | Meta | 49.7% |
| 7 | GPT-4.1 Mini | OpenAI | 48.9% |
| 8 | GPT-4 Turbo | OpenAI | 48.2% |
| 9 | GPT-4o (2024-11-20) | OpenAI | 48% |
| 10 | Llama 3.3 70B | Meta | 46.9% |
| 11 | Claude 3.5 Sonnet | Anthropic | 46.8% |
| 12 | Claude 3.5 Haiku | Anthropic | 46.1% |
| 12 | Llama 3.1 70B | Meta | 46.1% |
| 12 | GPT-4o-mini (2024-07-18) | OpenAI | 46.1% |
| 15 | GPT-4 | OpenAI | 46% |
| 16 | Gemini 2.0 Flash | Google | 45.9% |
| 17 | Qwen 2.5 72B | Alibaba | 45.8% |
| 18 | Claude 3 Opus | Anthropic | 45.5% |
| 18 | Phi-4 | Microsoft | 45.5% |
| 20 | Gemini 1.5 Pro | Google | 43.8% |
| 21 | Mixtral 8x22B | Mistral AI | 40.6% |
| 22 | Claude 3 Haiku | Anthropic | 39.4% |
| 23 | Qwen2.5 7B Instruct | Alibaba | 37.6% |
| 24 | Command R (08-2024) | Cohere | 37.1% |
| 25 | R1 Distill Llama 70B | DeepSeek | 35.3% |
| 26 | Command R+ | Cohere | 33.8% |
| 27 | o3-mini | OpenAI | 33.1% |
| 28 | Llama 3.1 8B Instruct | Meta | 32.8% |
| 29 | o1 | OpenAI | 32.4% |
| 29 | Claude Sonnet 4 | Anthropic | 32.4% |
| 31 | Llama 3 8B Instruct | Meta | 31.9% |
| 32 | Claude 3.7 Sonnet | Anthropic | 31.8% |
| 32 | GPT-4.1 | OpenAI | 31.8% |
| 34 | Mistral Large 2 | Mistral AI | 30% |
| 35 | Gemini 2.5 Pro | Google | 29.7% |
| 35 | DeepSeek R1 | DeepSeek | 29.7% |
| 37 | GPT-4.1 Nano | OpenAI | 28.4% |
| 38 | o1-mini | OpenAI | 27.7% |
| 38 | DeepSeek V3 (March 2025) | DeepSeek | 27.7% |
| 40 | Grok 3 | xAI | 27% |
| 41 | Grok 2 | xAI | 23.6% |
| 42 | Llama 3.2 3B Instruct | Meta | 23.4% |
| 43 | o1 Preview | OpenAI | 23% |
| 44 | Llama 4 Scout | Meta | 16.9% |
| 45 | Llama 3.2 1B Instruct | Meta | 8.2% |
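The tied positions in the rankings (for example, three models at #12 followed by #15) are consistent with standard competition ranking, where tied scores share the best available rank and the next distinct score skips ahead. The sketch below reproduces that numbering and the 38.8% average from the listed scores. It assumes the displayed scores are the exact values used for ranking; `competition_ranks` is a hypothetical helper written for illustration, not the site's code.

```python
from statistics import mean

# The 45 scores as displayed in the rankings, highest first.
SCORES = [
    72.1, 68.4, 60.2, 51.1, 50.0, 49.7, 48.9, 48.2, 48.0,
    46.9, 46.8, 46.1, 46.1, 46.1, 46.0, 45.9, 45.8, 45.5,
    45.5, 43.8, 40.6, 39.4, 37.6, 37.1, 35.3, 33.8, 33.1,
    32.8, 32.4, 32.4, 31.9, 31.8, 31.8, 30.0, 29.7, 29.7,
    28.4, 27.7, 27.7, 27.0, 23.6, 23.4, 23.0, 16.9, 8.2,
]


def competition_ranks(scores):
    """Return 1-based ranks for scores already sorted in descending order.

    Tied scores share the rank of the first tied entry, and the next
    distinct score takes its positional rank (1, 2, 2, 4 style).
    """
    ranks = []
    for position, score in enumerate(scores, start=1):
        if ranks and score == scores[position - 2]:
            ranks.append(ranks[-1])  # same score as the previous row: reuse its rank
        else:
            ranks.append(position)   # new score: rank equals position in the list
    return ranks


ranks = competition_ranks(SCORES)
print(ranks[11:15])              # [12, 12, 12, 15]
print(f"{mean(SCORES):.1f}%")    # 38.8%
```

Standard competition ranking keeps a model's rank equal to one plus the number of models that scored strictly higher, which is why the last of the 45 entries still lands at #45 despite the ties above it.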

About BigCodeBench

Full name: BigCodeBench (Hard)
Category: Coding
Metric: pass@1 (%)
Score range: 0%–100%
Human baseline: not yet determined
Status: Active
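For readers unfamiliar with the pass@1 metric, the following is a minimal sketch of how such a score is commonly computed. With one generated solution per task, pass@1 is simply the share of tasks whose solution passes all tests. With several samples per task, the widely used unbiased estimator is 1 - C(n-c, k) / C(n, k), where n samples were drawn and c passed. The page does not say how its figures were sampled, so the function names and the one-sample-per-task setup here are assumptions for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for a single task.

    n: number of samples generated for the task
    c: number of those samples that passed every test
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results):
    """pass@1 in percent, assuming one sample per task.

    results: sequence of booleans, True if the task's single sample passed.
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(results) / len(results)


# Hypothetical run: 4 tasks, 3 solved on the first attempt.
print(benchmark_pass_at_1([True, False, True, True]))  # 75.0
```

For k = 1 the estimator reduces to c / n, so averaging it over tasks gives the same number as the simple pass rate when each task has a single sample.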

Frequently Asked Questions

BigCodeBench is a standardized evaluation that measures how well AI models generate practical code that uses libraries, APIs, and complex program structures. It provides comparable scores across different models, helping developers choose the right model for their needs.

Claude Opus 4.6 currently holds the top score on the BigCodeBench benchmark. See our full rankings table above for the complete leaderboard with 45 models.

We update benchmark data from multiple sources including HuggingFace open-source model leaderboards and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.

BigCodeBench is not the only factor to consider when choosing a model. While it is an important indicator, real-world performance depends on many factors, including pricing, latency, context window, and specific task requirements. We recommend using our composite score, which weighs multiple benchmarks and practical factors.
