AI models ranked by coding ability using SWE-bench Verified, HumanEval, and BigCodeBench scores. Models without any of these benchmarks fall back to Arena Elo.
Top model: Claude Opus 4.6 (score 83.9)
Average score: 55.8 across all ranked models
Models ranked: 120 with benchmark data
[Chart: Top coding models by weighted score, showing the top 15 models]
[Chart: Benchmark breakdown, per-benchmark scores for the top 10 models]
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 83.9 |
| 2 | Claude Sonnet 4.6 | Anthropic | 80.9 |
| 3 | GPT-5.4 | OpenAI | 78.7 |
| 4 | Claude Opus 4.5 | Anthropic | 78.3 |
| 5 | GPT-5.2 | OpenAI | 77.5 |
| 6 | GPT-5.1 | OpenAI | 76.7 |
| 7 | Claude Sonnet 4.5 | Anthropic | 76.2 |
| 8 | GPT-5 | OpenAI | 75.8 |
| 9 | Gemini 3 Flash Preview | Google | 75.6 |
| 10 | o3 | OpenAI | 74.3 |
| 11 | Claude Opus 4 | Anthropic | 73.9 |
| 12 | Grok 4 | xAI | 72.8 |
| 13 | o4 Mini | OpenAI | 71.7 |
| 14 | GPT-4o-mini | OpenAI | 69.8 |
| 15 | Claude Haiku 4.5 | Anthropic | 68.9 |
| 16 | Claude Sonnet 4 | Anthropic | 66.9 |
| 17 | Claude 3.7 Sonnet | Anthropic | 65.9 |
| 18 | Gemini 2.5 Flash | Google | 65.8 |
| 19 | GPT-4 Turbo | OpenAI | 60.9 |
| 20 | Llama 3.3 70B Instruct | Meta | 60.9 |
| 21 | MiniMax M2.5 | MiniMax | 60.6 |
| 22 | GPT-4 | OpenAI | 60.5 |
| 23 | Claude 3.5 Haiku | Anthropic | 60.4 |
| 24 | Gemini 3.1 Pro Preview (fallback) | Google | 60 |
| 25 | Claude Opus 4.7 (fallback) | Anthropic | 60 |
| 26 | GPT-5.2 Chat (fallback) | OpenAI | 60 |
| 27 | GPT-5.5 (fallback) | OpenAI | 60 |
| 28 | GLM 5.1 (fallback) | Zhipu AI | 60 |
| 29 | Grok 4.1 Fast (fallback) | xAI | 60 |
| 30 | MiMo-V2.5-Pro (fallback) | Xiaomi | 60 |
| 31 | DeepSeek V4 Pro (fallback) | DeepSeek | 60 |
| 32 | Kimi K2.6 (fallback) | Moonshot AI | 60 |
| 33 | Qwen3.6 Max Preview (fallback) | Alibaba | 60 |
| 34 | GLM 5 (fallback) | Zhipu AI | 60 |
| 35 | Grok 4.3 (fallback) | xAI | 60 |
| 36 | Gemma 4 31B (fallback) | Google | 60 |
| 37 | Claude Opus 4.1 (fallback) | Anthropic | 60 |
| 38 | Qwen3.6 Plus (fallback) | Alibaba | 60 |
| 39 | MiMo-V2-Pro (fallback) | Xiaomi | 60 |
| 40 | Qwen3.5 397B A17B (fallback) | Alibaba | 60 |
| 41 | GLM 4.7 (fallback) | Zhipu AI | 60 |
| 42 | Gemini 3.1 Flash Lite Preview (fallback) | Google | 60 |
| 43 | Gemma 4 26B A4B (fallback) | Google | 60 |
| 44 | DeepSeek V4 Flash (fallback) | DeepSeek | 60 |
| 45 | GPT-5 Chat (fallback) | OpenAI | 60 |
| 46 | GLM 4.6 (fallback) | Zhipu AI | 60 |
| 47 | DeepSeek V3.2 (fallback) | DeepSeek | 60 |
| 48 | DeepSeek V3.2 Exp (fallback) | DeepSeek | 60 |
| 49 | MiMo-V2.5 (fallback) | Xiaomi | 60 |
| 50 | Grok 4 Fast (fallback) | xAI | 60 |
| 51 | Qwen3.5-122B-A10B (fallback) | Alibaba | 60 |
| 52 | Hy3 preview (fallback) | Tencent | 60 |
| 53 | DeepSeek V3.1 (fallback) | DeepSeek | 60 |
| 54 | DeepSeek V3.1 Terminus (fallback) | DeepSeek | 60 |
| 55 | Qwen3 VL 235B A22B Instruct (fallback) | Alibaba | 60 |
| 56 | GLM 4.5 (fallback) | Zhipu AI | 60 |
| 57 | MiniMax M2.7 (fallback) | MiniMax | 60 |
| 58 | Qwen3.5-27B (fallback) | Alibaba | 60 |
| 59 | Qwen3 Next 80B A3B Instruct (fallback) | Alibaba | 60 |
| 60 | Qwen3.5-Flash (fallback) | Alibaba | 60 |
| 61 | Qwen3.5-35B-A3B (fallback) | Alibaba | 60 |
| 62 | Qwen3 VL 235B A22B Thinking (fallback) | Alibaba | 60 |
| 63 | Step 3.5 Flash (fallback) | StepFun | 60 |
| 64 | Claude 3.7 Sonnet (thinking) (fallback) | Anthropic | 60 |
| 65 | Trinity Large Thinking (fallback) | arcee-ai | 60 |
| 66 | GLM 4.6V (fallback) | Zhipu AI | 60 |
| 67 | Trinity Large Preview (fallback) | arcee-ai | 60 |
| 68 | GLM 4.5 Air (fallback) | Zhipu AI | 60 |
| 69 | Qwen3 Next 80B A3B Thinking (fallback) | Alibaba | 60 |
| 70 | GLM 4.7 Flash (fallback) | Zhipu AI | 60 |
| 71 | MiniMax M1 (fallback) | MiniMax | 60 |
| 72 | o3 Mini High (fallback) | OpenAI | 60 |
| 73 | Grok 3 Mini Beta (fallback) | xAI | 60 |
| 74 | Command A (fallback) | Cohere | 60 |
| 75 | GLM 4.5V (fallback) | Zhipu AI | 60 |
| 76 | Qwen3 8B (fallback) | Alibaba | 60 |
| 77 | Mercury 2 (fallback) | Inception | 60 |
| 78 | Llama 3.3 Nemotron Super 49B V1.5 (fallback) | NVIDIA | 60 |
| 79 | Nova 2 Lite (fallback) | Amazon | 60 |
| 80 | gpt-oss-20b (fallback) | OpenAI | 60 |
| 81 | Mistral Large 2407 (fallback) | Mistral AI | 60 |
| 82 | Olmo 3 32B Think (fallback) | Allen AI | 60 |
| 83 | GPT-4.1 | OpenAI | 58.8 |
| 84 | Phi 4 | Microsoft | 57.6 |
| 85 | Llama 3.1 70B Instruct | Meta | 57 |
| 86 | o1 | OpenAI | 57 |
| 87 | DeepSeek V3 | DeepSeek | 56.6 |
| 88 | Gemma 2 27B | Google | 55.6 |
| 89 | Mistral Large | Mistral AI | 54.9 |
| 90 | GPT-4o | OpenAI | 54.7 |
| 91 | Llama 3 70B Instruct | Meta | 54.5 |
| 92 | Grok 3 | xAI | 52.9 |
| 93 | Claude 3 Haiku | Anthropic | 52.3 |
| 94 | DeepSeek V3 0324 | DeepSeek | 50.5 |
| 95 | Llama 4 Maverick | Meta | 50.2 |
| 96 | MiniMax M2 | MiniMax | 48.8 |
| 97 | GPT-5 Mini | OpenAI | 47.8 |
| 98 | R1 0528 | DeepSeek | 46.1 |
| 99 | Llama 3.1 8B Instruct | Meta | 46 |
| 100 | Gemini 2.0 Flash | Google | 46 |
| 101 | Gemini 2.5 Pro | Google | 44.3 |
| 102 | Llama 3 8B Instruct | Meta | 42.1 |
| 103 | GPT-4o (2024-11-20) | OpenAI | 38.4 |
| 104 | o3 Mini | OpenAI | 38.1 |
| 105 | GPT-4o-mini (2024-07-18) | OpenAI | 36.9 |
| 106 | R1 | DeepSeek | 36.8 |
| 107 | R1 Distill Qwen 32B | DeepSeek | 35.1 |
| 108 | GPT-4.1 Mini | OpenAI | 31.2 |
| 109 | Llama 4 Scout | Meta | 30.9 |
| 110 | Qwen2.5 7B Instruct | Alibaba | 30.1 |
| 111 | Command R (08-2024) | Cohere | 29.7 |
| 112 | R1 Distill Llama 70B | DeepSeek | 28.2 |
| 113 | GPT-5 Nano | OpenAI | 27.8 |
| 114 | Maestro Reasoning | arcee-ai | 23.8 |
| 115 | GPT-4.1 Nano | OpenAI | 22.7 |
| 116 | gpt-oss-120b | OpenAI | 20.8 |
| 117 | Grok 3 Mini | xAI | 18.9 |
| 118 | Llama 3.2 3B Instruct | Meta | 18.7 |
| 119 | Gemini 2.0 Flash Lite | Google | 15.7 |
| 120 | Llama 3.2 1B Instruct | Meta | 6.6 |
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. Models without any primary benchmark data fall back to Arena Elo (normalized to 0-100) and are marked accordingly. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
Based on our benchmark analysis, Claude Opus 4.6 by Anthropic is currently the #1 ranked model for coding, with a weighted score of 83.9/100.
Models are ranked using a weighted average of SWE-bench Verified, HumanEval, and BigCodeBench scores. Models without primary benchmark data fall back to Arena Elo. All scores are normalized to a 0-100 scale.
We currently rank 120 models that have relevant benchmark data for coding tasks.