Terminal-Bench 2.0 released: a more rigorous benchmark for large-scale evaluation and improvement of AI agents in the terminal.
Terminal-Bench 2.0 is a harder, better-verified version of the de facto standard benchmark for terminal tasks: more complex tasks, stricter quality checks (manual and LM-assisted), and flaky cases from 1.0 fixed (for example, the task that depended on YouTube's ever-changing anti-bot protection). The emphasis is on reproducibility and reliability.
To me it is interesting mainly because its tasks are closer to infrastructure work than those of other benchmarks, if only because of the interface and tooling.
The current leaderboard is here: https://www.tbench.ai/leaderboard/terminal-bench/2.0
Interesting points:
- For the same model, the choice of agent framework can swing the score by 10–16 percentage points, comparable to a “jump” up or down a model class.
- With Codex CLI, gpt-5 beats gpt-5-codex by 5 points (49 vs 44, roughly 10% relative); with the other frameworks, gpt-5-codex wins (see the arithmetic sketch after this list).
- A vendor's own agent ≠ the best agent:
  - OpenAI: yes, Codex CLI really is on top.
  - Anthropic and Google: no, their native agents (Claude Code, Gemini CLI) consistently lose to Terminus (from the benchmark's authors), OpenHands, and Mini-SWE.
- The capability frontier today is the combination of [large closed model] + [strong framework]; everything else is essentially a trade-off for budget or latency.
- GPT-OSS-20B/120B and other small models fall far short of GPT-5/Claude even with a good agent: at best 18–19% vs 40–50%.
- For the “middle class” (Haiku, Gemini Flash, Kimi Instruct, Grok, Qwen, GLM), the choice of agent is even more critical: the model itself is weaker, and a good stack pulls it into the 25–30% range instead of 15–20%.
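Since the points above switch between percentage points and relative percent, here is a minimal sketch of the arithmetic behind the 49-vs-44 comparison. The scores come from the Codex CLI bullet above; the helper functions are my own, for illustration only, not part of any benchmark tooling.

```python
# Percentage points vs relative percent, using the Codex CLI scores quoted above
# (gpt-5: 49, gpt-5-codex: 44). Helper names are illustrative, not from any tool.

def point_gap(score_a: float, score_b: float) -> float:
    """Absolute difference between two benchmark scores, in percentage points."""
    return score_a - score_b

def relative_gain(score_a: float, score_b: float) -> float:
    """Relative improvement of score_a over score_b, in percent."""
    return (score_a / score_b - 1) * 100

gpt5, gpt5_codex = 49.0, 44.0
print(f"gap:      {point_gap(gpt5, gpt5_codex):.1f} pp")    # 5.0 pp
print(f"relative: {relative_gain(gpt5, gpt5_codex):.1f} %")  # ~11.4 %
```

So the "roughly 10%" above is the relative gain; the gap in leaderboard terms is 5 percentage points, small compared with the 10–16 points a framework change can add.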