AI models ranked by coding ability using SWE-bench Verified, HumanEval, and BigCodeBench scores. Models without benchmark results fall back to Arena Elo.
Claude Opus 4.6
Score: 83.9
57.5
All ranked models
112
With benchmark data
Figure: Top 15 coding models by weighted score.
Figure: Benchmark breakdown (per-benchmark scores for the top 10 models).
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 83.9 |
| 2 | Claude Sonnet 4.6 | Anthropic | 80.9 |
| 3 | GPT-5.4 | OpenAI | 78.7 |
| 4 | Claude Opus 4.5 | Anthropic | 78.3 |
| 5 | GPT-5.2 | OpenAI | 77.5 |
| 6 | GPT-5.1 | OpenAI | 76.7 |
| 7 | Claude Sonnet 4.5 | Anthropic | 76.2 |
| 8 | Claude Fable 5 | Anthropic | 76 |
| 9 | GPT-5 | OpenAI | 75.8 |
| 10 | Gemini 3 Flash Preview | Google | 75.6 |
| 11 | o3 | OpenAI | 74.3 |
| 12 | Claude Opus 4 | Anthropic | 73.9 |
| 13 | o4 Mini | OpenAI | 71.7 |
| 14 | GPT-5.5 | OpenAI | 71 |
| 15 | GPT-5.5 Pro | OpenAI | 71 |
| 16 | Claude Opus 4.8 | Anthropic | 70.9 |
| 17 | Claude Opus 4.7 | Anthropic | 70.1 |
| 18 | GPT-4o-mini | OpenAI | 69.8 |
| 19 | Claude Haiku 4.5 | Anthropic | 68.9 |
| 20 | Claude Sonnet 4 | Anthropic | 66.9 |
| 21 | Gemini 2.5 Flash | Google | 65.8 |
| 22 | Gemini 3.1 Pro Preview | Google | 64.5 |
| 23 | DeepSeek V4 Pro | DeepSeek | 64.5 |
| 24 | GPT-4 Turbo | OpenAI | 60.9 |
| 25 | Llama 3.3 70B Instruct | Meta | 60.9 |
| 26 | MiniMax M2.5 | MiniMax | 60.6 |
| 27 | GPT-4 | OpenAI | 60.5 |
| 28 | Gemini 3.5 Flash (fallback) | Google | 60 |
| 29 | GPT-5.2 Chat (fallback) | OpenAI | 60 |
| 30 | GLM 5.1 (fallback) | Zhipu AI | 60 |
| 31 | MiMo-V2.5-Pro (fallback) | Xiaomi | 60 |
| 32 | Qwen3.7 Plus (fallback) | Alibaba | 60 |
| 33 | Kimi K2.6 (fallback) | Moonshot AI | 60 |
| 34 | Qwen3.6 Max Preview (fallback) | Alibaba | 60 |
| 35 | GLM 5 (fallback) | Zhipu AI | 60 |
| 36 | Gemma 4 31B (fallback) | Google | 60 |
| 37 | Claude Opus 4.1 (fallback) | Anthropic | 60 |
| 38 | MiniMax M3 (fallback) | MiniMax | 60 |
| 39 | Qwen3.6 Plus (fallback) | Alibaba | 60 |
| 40 | Qwen3.5 397B A17B (fallback) | Alibaba | 60 |
| 41 | Grok 4.3 (fallback) | xAI | 60 |
| 42 | GLM 4.7 (fallback) | Zhipu AI | 60 |
| 43 | Gemma 4 26B A4B (fallback) | Google | 60 |
| 44 | DeepSeek V4 Flash (fallback) | DeepSeek | 60 |
| 45 | MiMo-V2.5 (fallback) | Xiaomi | 60 |
| 46 | Gemini 3.1 Flash Lite Preview (fallback) | Google | 60 |
| 47 | Mistral Medium 3.5 (fallback) | Mistral AI | 60 |
| 48 | GPT-5 Chat (fallback) | OpenAI | 60 |
| 49 | GLM 4.6 (fallback) | Zhipu AI | 60 |
| 50 | DeepSeek V3.2 Exp (fallback) | DeepSeek | 60 |
| 51 | DeepSeek V3.1 (fallback) | DeepSeek | 60 |
| 52 | Qwen3.5-122B-A10B (fallback) | Alibaba | 60 |
| 53 | MiniMax M2.7 (fallback) | MiniMax | 60 |
| 54 | DeepSeek V3.1 Terminus (fallback) | DeepSeek | 60 |
| 55 | Qwen3 VL 235B A22B Instruct (fallback) | Alibaba | 60 |
| 56 | Hy3 preview (fallback) | Tencent | 60 |
| 57 | GLM 4.5 (fallback) | Zhipu AI | 60 |
| 58 | Qwen3.5-27B (fallback) | Alibaba | 60 |
| 59 | Qwen3 Next 80B A3B Instruct (fallback) | Alibaba | 60 |
| 60 | Qwen3.5-Flash (fallback) | Alibaba | 60 |
| 61 | Qwen3.5-35B-A3B (fallback) | Alibaba | 60 |
| 62 | Qwen3 VL 235B A22B Thinking (fallback) | Alibaba | 60 |
| 63 | Step 3.5 Flash (fallback) | StepFun | 60 |
| 64 | GLM 4.6V (fallback) | Zhipu AI | 60 |
| 65 | GLM 4.5 Air (fallback) | Zhipu AI | 60 |
| 66 | Qwen3 Next 80B A3B Thinking (fallback) | Alibaba | 60 |
| 67 | Trinity Large Thinking (fallback) | arcee-ai | 60 |
| 68 | GLM 4.7 Flash (fallback) | Zhipu AI | 60 |
| 69 | MiniMax M1 (fallback) | MiniMax | 60 |
| 70 | o3 Mini High (fallback) | OpenAI | 60 |
| 71 | Command A (fallback) | Cohere | 60 |
| 72 | GLM 4.5V (fallback) | Zhipu AI | 60 |
| 73 | Qwen3 8B (fallback) | Alibaba | 60 |
| 74 | Mercury 2 (fallback) | Inception | 60 |
| 75 | Llama 3.3 Nemotron Super 49B V1.5 (fallback) | NVIDIA | 60 |
| 76 | Nova 2 Lite (fallback) | Amazon | 60 |
| 77 | gpt-oss-20b (fallback) | OpenAI | 60 |
| 78 | Mistral Large 2407 (fallback) | Mistral AI | 60 |
| 79 | Granite 4.1 8B (fallback) | IBM | 60 |
| 80 | Olmo 3 32B Think (fallback) | Allen AI | 60 |
| 81 | GPT-4.1 | OpenAI | 58.8 |
| 82 | Phi 4 | Microsoft | 57.6 |
| 83 | Llama 3.1 70B Instruct | Meta | 57 |
| 84 | o1 | OpenAI | 57 |
| 85 | DeepSeek V3 | DeepSeek | 56.6 |
| 86 | DeepSeek V3.2 | DeepSeek | 56 |
| 87 | Gemma 2 27B | Google | 55.6 |
| 88 | Mistral Large | Mistral AI | 54.9 |
| 89 | GPT-4o | OpenAI | 54.7 |
| 90 | Claude 3 Haiku | Anthropic | 52.3 |
| 91 | DeepSeek V3 0324 | DeepSeek | 50.5 |
| 92 | Llama 4 Maverick | Meta | 50.2 |
| 93 | MiniMax M2 | MiniMax | 48.8 |
| 94 | GPT-5 Mini | OpenAI | 47.8 |
| 95 | R1 0528 | DeepSeek | 46.1 |
| 96 | Llama 3.1 8B Instruct | Meta | 46 |
| 97 | Gemini 2.5 Pro | Google | 44.3 |
| 98 | Llama 3 8B Instruct | Meta | 42.1 |
| 99 | GPT-4o (2024-11-20) | OpenAI | 38.4 |
| 100 | o3 Mini | OpenAI | 38.1 |
| 101 | GPT-4o-mini (2024-07-18) | OpenAI | 36.9 |
| 102 | R1 | DeepSeek | 36.8 |
| 103 | GPT-4.1 Mini | OpenAI | 31.2 |
| 104 | Llama 4 Scout | Meta | 30.9 |
| 105 | Qwen2.5 7B Instruct | Alibaba | 30.1 |
| 106 | Command R+ (08-2024) | Cohere | 29.7 |
| 107 | R1 Distill Llama 70B | DeepSeek | 28.2 |
| 108 | GPT-5 Nano | OpenAI | 27.8 |
| 109 | GPT-4.1 Nano | OpenAI | 22.7 |
| 110 | gpt-oss-120b | OpenAI | 20.8 |
| 111 | Llama 3.2 3B Instruct | Meta | 18.7 |
| 112 | Llama 3.2 1B Instruct | Meta | 6.6 |
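Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are renormalized across the benchmarks it does have. Models with no primary benchmark data fall back to Arena Elo (normalized to 0-100) and are marked accordingly. All scores are on a 0-100 scale. Data comes from official model cards, published papers, and third-party evaluation platforms.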
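The renormalization step can be illustrated with a minimal Python sketch. The page does not publish its weights, so the values in `ASSUMED_WEIGHTS`, the benchmark keys, and the function name `weighted_score` are all hypothetical and chosen only for illustration.

```python
from typing import Mapping, Optional

# Hypothetical weights: the page does not state the real values.
ASSUMED_WEIGHTS = {
    "swe_bench_verified": 0.5,
    "humaneval": 0.25,
    "bigcodebench": 0.25,
}


def weighted_score(
    results: Mapping[str, float],
    weights: Mapping[str, float] = ASSUMED_WEIGHTS,
) -> Optional[float]:
    """Weighted mean over the benchmarks a model actually has.

    Weights of missing benchmarks are dropped and the remaining weights
    are rescaled to sum to 1. Returns None when no benchmark is available,
    which is the case where a fallback score would be needed.
    """
    available = {name: w for name, w in weights.items() if name in results}
    total_weight = sum(available.values())
    if total_weight == 0:
        return None
    return sum(results[name] * w for name, w in available.items()) / total_weight


# All three benchmarks present: plain weighted mean.
full = weighted_score(
    {"swe_bench_verified": 80.0, "humaneval": 90.0, "bigcodebench": 60.0}
)

# BigCodeBench missing: the other two weights are rescaled (0.5 and 0.25 out of 0.75).
partial = weighted_score({"swe_bench_verified": 80.0, "humaneval": 90.0})
```

Under these assumed weights, a model with only two of the three benchmarks is scored as if those two carried the whole weight, so a missing result neither counts as zero nor drags the average down.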
According to our benchmark analysis, Anthropic's Claude Opus 4.6 currently ranks first for coding, with a weighted score of 83.9/100.
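Models are ranked by a weighted average of their SWE-bench Verified, HumanEval, and BigCodeBench scores. Models without primary benchmark data fall back to Arena Elo. All scores are normalized to a 0-100 scale.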
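One way to picture the fallback and the final ordering is the sketch below. The page does not say how Arena Elo is mapped onto 0-100, so this assumes a simple clamped min-max scaling; the bounds `ELO_MIN` and `ELO_MAX`, the `Model` class, and the model names in the usage lines are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical Elo bounds used for min-max scaling (not taken from the page).
ELO_MIN = 800.0
ELO_MAX = 1600.0


@dataclass
class Model:
    name: str
    benchmark_score: Optional[float]  # weighted benchmark score, None if no data
    arena_elo: Optional[float] = None


def normalize_elo(elo: float, lo: float = ELO_MIN, hi: float = ELO_MAX) -> float:
    """Clamp an Elo rating to [lo, hi] and scale it linearly to 0-100."""
    clamped = min(max(elo, lo), hi)
    return 100.0 * (clamped - lo) / (hi - lo)


def rank(models: List[Model]) -> List[Tuple[str, float, bool]]:
    """Return (name, score, is_fallback) rows sorted by score, highest first."""
    rows = []
    for m in models:
        if m.benchmark_score is not None:
            rows.append((m.name, m.benchmark_score, False))
        elif m.arena_elo is not None:
            rows.append((m.name, normalize_elo(m.arena_elo), True))
        # Models with neither a benchmark score nor an Elo rating are left out.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


ranking = rank([
    Model("model-b", None, arena_elo=1200.0),  # no benchmarks: Elo fallback
    Model("model-a", 70.0),                    # has benchmark data
    Model("model-c", None),                    # no data at all: skipped
])
```

Keeping the `is_fallback` flag on each row makes it straightforward to label fallback entries in the rendered table, as the leaderboard does.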
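We currently rank 112 models for coding tasks, counting both models with coding benchmark data and models scored through the Arena Elo fallback.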