Last updated: 4h ago

Coding benchmark

Software Engineering Benchmark (Verified) Leaderboard

Can a model resolve real GitHub issues from popular Python repositories? The benchmark uses a human-validated subset of tasks so that results can be evaluated accurately, and it tests end-to-end software engineering ability.

Why it matters: SWE-bench Verified is widely treated as the gold standard for real-world coding ability. Unlike HumanEval, it tests understanding of large codebases, debugging, and complex changes. Scores on this leaderboard currently range from 9.1% to 95%.

Top Model: Claude Fable 5 (95%)
Average Score: 62.0% across 52 models
Models Tested: 52
Metric: resolve rate
Human Baseline: Not established
Score Range: 0%–100%

[Chart: SWE-bench Verified Scores, Top 25 Models, ranked by SWE-bench Verified score (%). Source: LMMarketCap.com]

Model Rankings

All models with a reported SWE-bench Verified score, ranked by highest resolve rate.

Rank | Model | Score
#1 | Claude Fable 5 (Anthropic) | 95%
#2 | GPT-5.5 (OpenAI) | 88.7%
#2 | GPT-5.5 Pro (OpenAI) | 88.7%
#4 | Claude Opus 4.8 (Anthropic) | 88.6%
#5 | Claude Opus 4.7 (Anthropic) | 87.6%
#6 | Claude Opus 4.6 (Anthropic) | 83.7%
#7 | Claude Opus 4.5 (Anthropic) | 80.9%
#8 | Gemini 3.1 Pro (Google) | 80.6%
#8 | DeepSeek V4 Pro (DeepSeek) | 80.6%
#10 | GPT-5.4 (OpenAI) | 80%
#11 | Claude Sonnet 4.6 (Anthropic) | 79.6%
#12 | Gemini 3 Flash (Google) | 78%
#12 | GPT-5.2 (OpenAI) | 78%
#14 | Claude Sonnet 4.5 (Anthropic) | 77.2%
#15 | GPT-5.1 (OpenAI) | 76.5%
#16 | Gemini 3 Pro (Google) | 76.2%
#17 | MiniMax M2.5 (MiniMax) | 75.8%
#18 | GPT-5 (OpenAI) | 75%
#19 | GLM 5 (Zhipu AI) | 72.8%
#20 | Claude Sonnet 4 (Anthropic) | 72.7%
#21 | Claude Opus 4 (Anthropic) | 72.5%
#22 | o3 (OpenAI) | 71.7%
#23 | Kimi K2 5 | 70.8%
#24 | Claude 3.7 Sonnet (Anthropic) | 70.3%
#25 | Grok 4 (xAI) | 70%
#25 | DeepSeek V3.2 (DeepSeek) | 70%
#27 | o4-mini (OpenAI) | 68.1%
#28 | Claude Haiku 4.5 (Anthropic) | 66.6%
#29 | Gemini 2.5 Pro (Google) | 63.8%
#30 | Kimi K2 0711 (Moonshot AI) | 63.4%
#31 | MiniMax M2 (MiniMax) | 61%
#32 | Gemini 2.5 Flash (Google) | 60.4%
#33 | GPT-5 Mini (OpenAI) | 59.8%
#34 | DeepSeek R1-0528 (DeepSeek) | 57.6%
#35 | Devstral Small | 56.4%
#36 | Qwen 3 Coder | 55.4%
#36 | Glm 4 | 55.4%
#38 | GPT-4.1 (OpenAI) | 54.6%
#39 | Devstral | 53.8%
#40 | o3-mini (OpenAI) | 49.3%
#41 | DeepSeek R1 (DeepSeek) | 49.2%
#42 | Claude 3.5 Sonnet (Anthropic) | 49%
#43 | o1 (OpenAI) | 48.9%
#44 | DeepSeek V3 (DeepSeek) | 42%
#45 | GPT-5 Nano (OpenAI) | 34.8%
#46 | GPT-4o (OpenAI) | 30.8%
#47 | gpt-oss-120b (OpenAI) | 26%
#48 | GPT-4.1 Mini (OpenAI) | 23.9%
#49 | Llama 4 Maverick (Meta) | 21%
#50 | Gemini 2.0 Flash (Google) | 13.5%
#51 | Llama 4 Scout (Meta) | 9.1%
#52 | Qwen 2.5 Coder 32B (Alibaba) | not shown
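The rankings above appear to use competition-style ranking: models with the same score share a rank, and the next rank skips ahead (for example #12, #12, then #14). The following is a minimal Python sketch of that scheme, assuming scores are compared exactly as displayed. The function name `competition_rank` is hypothetical and is not taken from the site, and the ranks it returns are relative to the sample passed in, not to the full leaderboard.

```python
def competition_rank(scores):
    """Map each name to its rank; tied scores share a rank and the next rank skips."""
    # Sort by score, highest first. sorted() is stable, so ties keep input order.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    prev_score = None
    prev_rank = 0
    for position, (name, score) in enumerate(ordered, start=1):
        # A tie reuses the previous rank; otherwise the rank is the list position.
        rank = prev_rank if score == prev_score else position
        ranks[name] = rank
        prev_score, prev_rank = score, rank
    return ranks


# Four adjacent rows from the table, ranked relative to this sample only.
sample = {
    "Claude Sonnet 4.6": 79.6,
    "Gemini 3 Flash": 78.0,
    "GPT-5.2": 78.0,
    "Claude Sonnet 4.5": 77.2,
}
ranks = competition_rank(sample)
for name, rank in ranks.items():
    print(f"#{rank} {name}")
```

Within the sample, the two models at 78% share rank 2 and the following model is ranked 4, mirroring the #12, #12, #14 pattern in the table.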

About SWE-bench Verified

Full Name: Software Engineering Benchmark (Verified)
Category: Coding
Metric: resolve rate (%)
Score Range: 0%–100%
Human Baseline: Not established
Status: Active
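As a rough illustration of the resolve rate metric, the sketch below computes the percentage of benchmark tasks a model resolved. The function name `resolve_rate` and the task counts are hypothetical, chosen only for the example. They are not the counts behind any score on this page.

```python
def resolve_rate(resolved: int, total: int) -> float:
    """Return the share of resolved tasks as a percentage, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return round(100.0 * resolved / total, 1)


# Illustrative counts only: 443 resolved out of 500 attempted tasks.
rate = resolve_rate(443, 500)
print(f"{rate}%")
```

A score of 0% means no tasks were resolved and 100% means all were, which matches the 0%–100% score range listed above.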

Frequently Asked Questions

What is SWE-bench Verified?

SWE-bench Verified is a standardized evaluation that measures whether an AI model can resolve real GitHub issues from popular Python repositories. It provides comparable resolve-rate scores across different models, helping developers choose the right model for their needs.

Which model has the highest SWE-bench Verified score?

Claude Fable 5 currently holds the top score on the SWE-bench Verified benchmark. See our full rankings table above for the complete leaderboard with 52 models.

Where does the data come from, and how often is it updated?

We update benchmark data from multiple sources, including Hugging Face open-source model leaderboards and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.

Is SWE-bench Verified the only benchmark that matters when choosing a model?

No. While SWE-bench Verified is an important indicator, real-world performance depends on many factors, including pricing, latency, context window, and specific task requirements. We recommend using our composite score, which weighs multiple benchmarks and practical factors.

Related Benchmarks

MMLU (Knowledge), MMLU-Pro (Knowledge), GPQA Diamond (Reasoning), MATH-500 (Math), HumanEval (Coding), AIME 2024 (Math), GSM8K (Math), IFEval (Instruction), BBH (Reasoning)
