Can a model resolve real GitHub issues from popular Python repositories? SWE-bench Verified is a human-validated subset of SWE-bench, which makes the evaluation more reliable. It tests end-to-end software engineering ability.
Why it matters: It is widely treated as the gold standard for real-world coding ability. Unlike HumanEval, it tests understanding of large codebases, debugging, and complex changes. Most reported scores fall between 20% and 80%.
Top Model: Claude Fable 5 (95%)
Average Score: 62.0% across 52 models
Models Tested: 52
Metric: resolve rate
Human Baseline: not reported
Score Range: 0%–100%
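As a rough illustration of the resolve rate metric listed above, the sketch below computes the percentage of benchmark tasks a model resolved. The `IssueResult` class, the `resolve_rate` function, and the sample task IDs are hypothetical names chosen for this example; the sketch assumes an evaluation harness has already decided, per task, whether the model's patch resolved the issue.

```python
from dataclasses import dataclass


@dataclass
class IssueResult:
    """Outcome of one benchmark task (hypothetical structure for illustration)."""
    instance_id: str
    resolved: bool  # assumed to be set by the evaluation harness


def resolve_rate(results: list[IssueResult]) -> float:
    """Return the percentage of tasks resolved, from 0.0 to 100.0."""
    if not results:
        raise ValueError("no results to score")
    resolved_count = sum(1 for r in results if r.resolved)
    return 100.0 * resolved_count / len(results)


# Made-up sample data: three of four tasks resolved.
results = [
    IssueResult("example-1", True),
    IssueResult("example-2", False),
    IssueResult("example-3", True),
    IssueResult("example-4", True),
]
rate = resolve_rate(results)
print(f"{rate:.1f}%")  # prints "75.0%"
```

A score of 95% under this definition would mean the model resolved 95 of every 100 tasks it was evaluated on.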
SWE-bench Verified Scores - Top 25 Models
Ranked by SWE-bench Verified score (%)
All models with a reported SWE-bench Verified score, ranked by highest resolve rate.
SWE-bench Verified is a standardized evaluation that measures whether an AI model can resolve real GitHub issues from Python repositories. Because every model is scored on the same resolve rate metric, the results are comparable across models, helping developers choose the right model for their needs.
Claude Fable 5 currently holds the top score on the SWE-bench Verified benchmark. See our full rankings table above for the complete leaderboard with 52 models.
We collect benchmark data from multiple sources, including HuggingFace open-source model leaderboards and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.
SWE-bench Verified should not be the only benchmark you consider. While it is an important indicator, real-world performance depends on many factors, including pricing, latency, context window, and specific task requirements. We recommend using our composite score, which weighs multiple benchmarks and practical factors.