Tests AI models on complex terminal-based tasks including shell commands, debugging, system administration, and multi-step CLI workflows.
Why it matters: Measures agentic capability in terminal environments — critical for AI coding assistants that execute commands and manage development workflows.
Top model
61.7%
Composer 2
Average score
61.7%
Across 1 model
Models tested
1
Metric: pass rate
Human baseline
—
Score range: 0%–100%
All models with a reported Terminal-Bench score, ranked by highest pass rate.
Terminal-Bench is a standardized evaluation that measures AI model performance on specific tasks. It provides comparable scores across different models, helping developers choose the right model for their needs.
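The pass-rate metric above is a simple ratio. As an illustrative sketch (not the official Terminal-Bench harness), each task is graded pass/fail and the score is the percentage of passed tasks:

```python
# Hypothetical sketch of a pass-rate metric like the one reported here:
# each task either passes or fails, and the score is the fraction of
# passed tasks expressed as a percentage (0-100).
def pass_rate(results: list[bool]) -> float:
    """Return the percentage of passed tasks, 0.0 when no tasks ran."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# Illustrative example: 37 of 60 tasks passed.
results = [True] * 37 + [False] * 23
print(round(pass_rate(results), 1))  # 61.7
```

The task count (60) and split here are assumptions for illustration only; they are not published Terminal-Bench figures.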
Composer 2 currently holds the top score on the Terminal-Bench benchmark. See our full rankings table above for the complete leaderboard with 1 model.
We update benchmark data from multiple sources, including the HuggingFace Open LLM Leaderboard and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.
No. While Terminal-Bench is an important indicator, real-world performance depends on many factors including pricing, latency, context window, and specific task requirements. We recommend using our composite score which weighs multiple benchmarks and practical factors.