Last updated: 2h ago

Coding benchmark

BigCodeBench (Hard) Leaderboard

Practical code generation requiring use of libraries, APIs, and complex program structures. The 'Hard' subset tests non-trivial engineering tasks.

Why it matters: More realistic than HumanEval — tests practical programming skills including library usage, API calls, and multi-file reasoning.

Top Model

72.1%

Claude Opus 4.6

Average Score

38.1%

Across 50 models

Models Tested

Metric: pass@1

Human Baseline

Score Range: 0%–100%

BigCodeBench Scores - Top 25 Models

Ranked by BigCodeBench score (%)

LMMarketCap.com

Model Rankings

All models with a reported BigCodeBench score, ranked by highest pass@1.

RankModelScorePerformance

Claude Opus 4.6 Anthropic

72.1%

72%

72.1%

Claude Sonnet 4.6 Anthropic

68.4%

68%

68.4%

Qwen 2.5 Coder 32B Alibaba

60.2%

60%

60.2%

GPT-4o OpenAI

51.1%

51%

51.1%

DeepSeek V3 DeepSeek

50%

Llama 4 Maverick Meta

49.7%

50%

49.7%

GPT-4.1 Mini OpenAI

48.9%

49%

48.9%

GPT-4 Turbo OpenAI

48.2%

48%

48.2%

GPT-4o (2024-11-20)OpenAI

48%

#10

Llama 3.3 70B Meta

46.9%

47%

46.9%

#11

Claude 3.5 Sonnet Anthropic

46.8%

47%

46.8%

#12

Claude 3.5 Haiku Anthropic

46.1%

46%

46.1%

#12

Llama 3.1 70B Meta

46.1%

46%

46.1%

#12

GPT-4o-mini (2024-07-18)OpenAI

46.1%

46%

46.1%

#15

GPT-4 OpenAI

46%

#16

Gemini 2.0 Flash Google

45.9%

46%

45.9%

#17

Qwen 2.5 72B Alibaba

45.8%

46%

45.8%

#18

Claude 3 Opus Anthropic

45.5%

46%

45.5%

#18

Phi-4 Microsoft

45.5%

46%

45.5%

#20

R1 Distill Qwen 32B DeepSeek

43.9%

44%

43.9%

#21

Gemini 1.5 Pro Google

43.8%

44%

43.8%

#22

Llama 3 70B Instruct Meta

43.6%

44%

43.6%

#23

Mixtral 8x22B Mistral AI

40.6%

41%

40.6%

#24

Claude 3 Haiku Anthropic

39.4%

39%

39.4%

#25

Qwen2.5 7B Instruct Alibaba

37.6%

38%

37.6%

#26

Command R (08-2024)Cohere

37.1%

37%

37.1%

#27

R1 Distill Llama 70B DeepSeek

35.3%

35%

35.3%

#28

Command R+Cohere

33.8%

34%

33.8%

#29

o3-mini OpenAI

33.1%

33%

33.1%

#30

Llama 3.1 8B Instruct Meta

32.8%

33%

32.8%

#31

o1 OpenAI

32.4%

32%

32.4%

#31

Claude Sonnet 4 Anthropic

32.4%

32%

32.4%

#33

Llama 3 8B Instruct Meta

31.9%

32%

31.9%

#34

Claude 3.7 Sonnet Anthropic

31.8%

32%

31.8%

#34

GPT-4.1 OpenAI

31.8%

32%

31.8%

#36

Mistral Large 2 Mistral AI

30%

#37

Gemini 2.5 Pro Google

29.7%

30%

29.7%

#37

DeepSeek R1 DeepSeek

29.7%

30%

29.7%

#37

Maestro Reasoning arcee-ai

29.7%

30%

29.7%

#40

GPT-4.1 Nano OpenAI

28.4%

28%

28.4%

#41

o1-mini OpenAI

27.7%

28%

27.7%

#41

DeepSeek V3 (March 2025)DeepSeek

27.7%

28%

27.7%

#43

Grok 3 xAI

27%

#44

Grok 2 xAI

23.6%

24%

23.6%

#44

Grok 3 Mini xAI

23.6%

24%

23.6%

#46

Llama 3.2 3B Instruct Meta

23.4%

23%

23.4%

#47

o1 Preview OpenAI

23%

#48

Gemini 2.0 Flash Lite Google

19.6%

20%

19.6%

#49

Llama 4 Scout Meta

16.9%

17%

16.9%

#50

Llama 3.2 1B Instruct Meta

8.2%

About BigCodeBench

Full Name: BigCodeBench (Hard)
Category: Coding
Metric: pass@1 (%)
Score Range: 0%–100%
Human Baseline: Not established
Status: Active

Frequently Asked Questions

BigCodeBench is a standardized evaluation that measures AI model performance on specific tasks. It provides comparable scores across different models, helping developers choose the right model for their needs.

Claude Opus 4.6 currently holds the top score on the BigCodeBench benchmark. See our full rankings table above for the complete leaderboard with 50 models.

We update benchmark data from multiple sources including HuggingFace open-source model leaderboards and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.

No. While BigCodeBench is an important indicator, real-world performance depends on many factors including pricing, latency, context window, and specific task requirements. We recommend using our composite score which weighs multiple benchmarks and practical factors.

Related Benchmarks

MMLUKnowledge MMLU-ProKnowledge GPQA DiamondReasoning MATH-500Math HumanEvalCoding SWE-bench VerifiedCoding AIME 2024Math GSM8KMath IFEvalInstruction

All Benchmarks|Coding Benchmarks|Compare Models|LLM Leaderboard