Last updated: 4h ago

Coding Benchmarks

Software Engineering Benchmark (Verified) Leaderboard

Can a model resolve real GitHub issues from popular Python repositories? SWE-bench Verified is a human-validated subset of SWE-bench, which makes the evaluation more reliable, and it tests end-to-end software engineering ability.

Why it matters: Widely treated as the gold standard for real-world coding ability. Unlike HumanEval, it tests understanding of large codebases, debugging, and complex changes. Reported scores on this leaderboard range from about 9% to 95%.

Top model: Claude Fable 5 (95%)
Average score: 62.0%
Models tested: 52
Metric: resolve rate
Human baseline: not yet established
Score range: 0%–100%

Chart: SWE-bench Verified Scores - Top 25 Models, ranked by SWE-bench Verified score (%). Source: LMMarketCap.com

Model Rankings

All models with a reported SWE-bench Verified score, ranked by highest resolve rate.

| Rank | Model | Provider | Score |
|------|-------|----------|-------|
| #1 | Claude Fable 5 | Anthropic | 95% |
| #2 | GPT-5.5 | OpenAI | 88.7% |
| #2 | GPT-5.5 Pro | OpenAI | 88.7% |
| #4 | Claude Opus 4.8 | Anthropic | 88.6% |
| #5 | Claude Opus 4.7 | Anthropic | 87.6% |
| #6 | Claude Opus 4.6 | Anthropic | 83.7% |
| #7 | Claude Opus 4.5 | Anthropic | 80.9% |
| #8 | Gemini 3.1 Pro | Google | 80.6% |
| #8 | DeepSeek V4 Pro | DeepSeek | 80.6% |
| #10 | GPT-5.4 | OpenAI | 80% |
| #11 | Claude Sonnet 4.6 | Anthropic | 79.6% |
| #12 | Gemini 3 Flash | Google | 78% |
| #12 | GPT-5.2 | OpenAI | 78% |
| #14 | Claude Sonnet 4.5 | Anthropic | 77.2% |
| #15 | GPT-5.1 | OpenAI | 76.5% |
| #16 | Gemini 3 Pro | Google | 76.2% |
| #17 | MiniMax M2.5 | MiniMax | 75.8% |
| #18 | GPT-5 | OpenAI | 75% |
| #19 | GLM 5 | Zhipu AI | 72.8% |
| #20 | Claude Sonnet 4 | Anthropic | 72.7% |
| #21 | Claude Opus 4 | Anthropic | 72.5% |
| #22 | o3 | OpenAI | 71.7% |
| #23 | Kimi K2 5 | | 70.8% |
| #24 | Claude 3.7 Sonnet | Anthropic | 70.3% |
| #25 | Grok 4 | xAI | 70% |
| #25 | DeepSeek V3.2 | DeepSeek | 70% |
| #27 | o4-mini | OpenAI | 68.1% |
| #28 | Claude Haiku 4.5 | Anthropic | 66.6% |
| #29 | Gemini 2.5 Pro | Google | 63.8% |
| #30 | Kimi K2 0711 | Moonshot AI | 63.4% |
| #31 | MiniMax M2 | MiniMax | 61% |
| #32 | Gemini 2.5 Flash | Google | 60.4% |
| #33 | GPT-5 Mini | OpenAI | 59.8% |
| #34 | DeepSeek R1-0528 | DeepSeek | 57.6% |
| #35 | Devstral Small | | 56.4% |
| #36 | Qwen 3 Coder | | 55.4% |
| #36 | Glm 4 | | 55.4% |
| #38 | GPT-4.1 | OpenAI | 54.6% |
| #39 | Devstral | | 53.8% |
| #40 | o3-mini | OpenAI | 49.3% |
| #41 | DeepSeek R1 | DeepSeek | 49.2% |
| #42 | Claude 3.5 Sonnet | Anthropic | 49% |
| #43 | o1 | OpenAI | 48.9% |
| #44 | DeepSeek V3 | DeepSeek | 42% |
| #45 | GPT-5 Nano | OpenAI | 34.8% |
| #46 | GPT-4o | OpenAI | 30.8% |
| #47 | gpt-oss-120b | OpenAI | 26% |
| #48 | GPT-4.1 Mini | OpenAI | 23.9% |
| #49 | Llama 4 Maverick | Meta | 21% |
| #50 | Gemini 2.0 Flash | Google | 13.5% |
| #51 | Llama 4 Scout | Meta | 9.1% |
| #52 | Qwen 2.5 Coder 32B | Alibaba | |
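The rankings above appear to give tied models the same rank and then skip the next one (for example #12, #12, #14). A minimal sketch of that tie-handling scheme, often called competition ranking, is shown below. The function name `competition_rank` is hypothetical and the sample uses four scores taken from the list; this is an illustration of the pattern, not the site's actual ranking code.

```python
def competition_rank(scores):
    """Rank names by score, highest first.

    Tied scores share a rank, and the rank after a tie is skipped
    (1, 2, 2, 4). Ranks are relative to the entries passed in.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    prev_score = None
    prev_rank = 0
    for position, (name, score) in enumerate(ordered, start=1):
        # Reuse the previous rank on a tie, otherwise use the position.
        rank = prev_rank if score == prev_score else position
        ranks[name] = rank
        prev_score, prev_rank = score, rank
    return ranks


# Four consecutive entries from the leaderboard (scores in percent).
sample = {
    "Claude Sonnet 4.6": 79.6,
    "Gemini 3 Flash": 78.0,
    "GPT-5.2": 78.0,
    "Claude Sonnet 4.5": 77.2,
}

for name, rank in competition_rank(sample).items():
    print(rank, name)
```

Within this four-model sample the two 78% entries share rank 2 and the next model gets rank 4, mirroring the #12, #12, #14 pattern in the full list.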

About SWE-bench Verified

Full name: Software Engineering Benchmark (Verified)
Category: Coding
Metric: resolve rate (%)
Score range: 0%–100%
Human baseline: not yet established
Status: active
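The resolve rate metric listed above is the share of benchmark issues a model resolves, expressed as a percentage. A minimal sketch of that arithmetic follows. It assumes each task instance has already been marked resolved or not; the function name `resolve_rate` and the instance IDs are hypothetical and do not reflect any official harness output format.

```python
def resolve_rate(results):
    """Return the percentage of instances marked resolved, to one decimal."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(1 for is_resolved in results.values() if is_resolved)
    return round(100.0 * resolved / len(results), 1)


# Hypothetical per-instance outcomes: True means the issue was resolved.
outcomes = {
    "repo_a-issue-1": True,
    "repo_a-issue-2": False,
    "repo_b-issue-7": True,
    "repo_c-issue-3": True,
}

print(f"{resolve_rate(outcomes)}%")  # prints "75.0%"
```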

Frequently Asked Questions

SWE-bench Verified is a standardized evaluation that measures how often an AI model resolves real GitHub issues from popular Python repositories. It provides comparable scores across different models, helping developers choose the right model for their needs.

Claude Fable 5 currently holds the top score on the SWE-bench Verified benchmark. See our full rankings table above for the complete leaderboard with 52 models.

We update benchmark data from multiple sources, including Hugging Face open-source model leaderboards and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.

A SWE-bench Verified score should not be the sole basis for choosing a model. While it is an important indicator, real-world performance depends on many factors, including pricing, latency, context window, and specific task requirements. We recommend using our composite score, which weighs multiple benchmarks and practical factors.
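The page does not say how the composite score is calculated. As an illustration only, one common way to combine a benchmark with practical factors is a weighted average of values already normalized to a 0-100 scale. Everything in the sketch below (the function name `composite_score`, the factor names, the weights, and the example values) is a hypothetical assumption, not the site's formula.

```python
def composite_score(metrics, weights):
    """Weighted average of normalized metrics (each on a 0-100 scale)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(metrics[name] * weight for name, weight in weights.items())
    return weighted_sum / total_weight


# Hypothetical weights: half benchmark, half practical factors.
weights = {"swe_bench_verified": 0.5, "price": 0.25, "latency": 0.25}

# Hypothetical normalized values for one model (higher is better).
model = {"swe_bench_verified": 80.0, "price": 60.0, "latency": 40.0}

print(composite_score(model, weights))  # prints 65.0
```

With these made-up inputs, a model with a strong benchmark result but weaker price and latency values lands well below its raw benchmark score, which is the kind of trade-off a composite is meant to surface.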
