An extension of SWE-bench to multiple programming languages beyond Python, testing real-world bug fixing across TypeScript, Java, Go, Rust, and more.
Why it matters: Most real codebases are polyglot. This benchmark tests whether coding models can handle the diversity of languages seen in production software engineering.
Top model
73.7%
Composer 2
Average score
73.7%
Across 1 model
Models tested
1
Metric: resolved rate
Human baseline
—
Score range: 0%–100%
All models with a reported SWE-bench ML score, ranked by highest resolved rate.
SWE-bench ML is a standardized evaluation built from real GitHub issues: a model is given a repository and an issue description, and its proposed patch counts as resolved only if the repository's tests pass afterward. The resulting resolved rate gives comparable scores across models, helping developers choose the right model for their needs.
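As a rough illustration of the metric, the resolved rate is simply the percentage of benchmark instances the model fixes. The sketch below is hypothetical (the function name and the pass/fail counts are illustrative, not taken from any published evaluation):

```python
# Illustrative sketch: a SWE-bench-style "resolved rate" is the share of
# benchmark instances where the model's patch resolved the issue (i.e. the
# repository's tests pass after applying it). Counts below are made up.

def resolved_rate(results: list[bool]) -> float:
    """Percentage of benchmark instances marked resolved."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# e.g. 14 resolved out of 19 attempted instances
print(f"{resolved_rate([True] * 14 + [False] * 5):.1f}%")  # → 73.7%
```

Because the metric is a plain fraction of resolved instances, scores are directly comparable across models evaluated on the same instance set.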
Composer 2 currently holds the top score on the SWE-bench ML benchmark. See our full rankings table above for the complete leaderboard with 1 model.
We update benchmark data from multiple sources including HuggingFace Open LLM Leaderboard and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.
No. While SWE-bench ML is an important indicator, real-world performance depends on many factors including pricing, latency, context window, and specific task requirements. We recommend using our composite score which weighs multiple benchmarks and practical factors.