Tests AI models on complex terminal-based tasks including shell commands, debugging, system administration, and multi-step CLI workflows.
Why it matters: Measures agentic capability in terminal environments — critical for AI coding assistants that execute commands and manage development workflows.
Top model
61.7%
Composer 2
Average score
61.7%
Across 1 model
Models tested
1
Metric: pass rate
Human baseline
—
Score range: 0%–100%
All models with a reported Terminal-Bench score, ranked by highest pass rate.
Terminal-Bench is a standardized evaluation that measures AI model performance on specific tasks. It provides comparable scores across different models, helping developers choose the right model for their needs.
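The pass-rate metric above is a simple ratio. As an illustrative sketch (not the official Terminal-Bench harness), each task is graded pass/fail and the score is the percentage of passed tasks:

```python
# Hypothetical sketch of a pass-rate metric like the one reported here:
# each task either passes or fails, and the score is the fraction of
# passed tasks expressed as a percentage (0-100).
def pass_rate(results: list[bool]) -> float:
    """Return the percentage of passed tasks, 0.0 when no tasks ran."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# Illustrative example: 37 of 60 tasks passed.
results = [True] * 37 + [False] * 23
print(round(pass_rate(results), 1))  # 61.7
```

The task count (60) and split here are assumptions for illustration only; they are not published Terminal-Bench figures.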
Composer 2 currently holds the top score on the Terminal-Bench benchmark. See our full rankings table above for the complete leaderboard with 1 model.
We update benchmark data from multiple sources, including the HuggingFace Open LLM Leaderboard and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.
No. While Terminal-Bench is an important indicator, real-world performance depends on many factors including pricing, latency, context window, and specific task requirements. We recommend using our composite score which weighs multiple benchmarks and practical factors.