Terminal-Bench 2.0 is out: a stricter benchmark for large-scale evaluation and improvement of terminal AI agents
Terminal-Bench 2.0 is a more complex and better-verified version of the standard terminal benchmark. It brings harder tasks and stronger quality checks (both manual and LM-assisted), and flaky cases from 1.0 have been fixed (for example, tasks that depended on YouTube’s ever-changing anti-bot protection). The emphasis is on reproducibility and reliability.
For me it’s interesting primarily because the tasks are closer to infrastructure work than in most other benchmarks, if only because of the interface and the tooling.
Current leaderboard: https://www.tbench.ai/leaderboard/terminal-bench/2.0
Some interesting bits:
- For the same model, the choice of agent framework can give up to +10–16 p.p., which is comparable to a jump of one model class up or down (see the toy calculation after this list).
- In Codex CLI, GPT-5 beats GPT-5-Codex (49 vs 44, a gap of 5 p.p., or about 11% relative). With other agent frameworks it’s the other way around: GPT-5-Codex wins.
- Vendor agent ≠ best agent:
  - OpenAI: yes, Codex CLI really is on top.
  - Anthropic and Google: no, their native agents (Claude Code, Gemini CLI) systematically lose to Terminus (from the benchmark authors), OpenHands, and Mini-SWE.
- Today’s capability frontier is [a large closed model] + [a strong framework]. Everything else is basically a trade-off in budget/latency.
- GPT-OSS-20B / 120B and small models are far behind GPT-5 / Claude even with a good agent: at best 18–19% vs 40–50%.
- For the “mid-tier” (Haiku, Gemini Flash, Kimi Instruct, Grok, Qwen, GLM), agent choice is even more critical: the model is weaker, and a good stack pulls it up to 25–30% instead of 15–20%.
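To keep percentage points and relative percentages apart, here is a tiny Python sketch (not part of the Terminal-Bench tooling) that redoes the arithmetic quoted above. The GPT-5 vs GPT-5-Codex scores are the ones from the leaderboard mentioned in the list; the per-agent scores in the second half are made-up illustrative numbers, just to show what a 10–16 p.p. framework spread looks like.

```python
# Toy helper for the gaps discussed above: absolute gap in percentage
# points (p.p.) vs relative gap in percent. Not part of any benchmark code.

def gap(a: float, b: float) -> tuple[float, float]:
    """Return (absolute gap in p.p., relative gap in %) between two scores."""
    return a - b, (a - b) / b * 100

# GPT-5 vs GPT-5-Codex inside Codex CLI, using the scores quoted above.
pp, rel = gap(49, 44)
print(f"GPT-5 vs GPT-5-Codex: {pp:.0f} p.p., ~{rel:.0f}% relative")  # 5 p.p., ~11%

# Hypothetical per-agent scores for one mid-tier model (made-up numbers,
# not taken from the leaderboard), illustrating the kind of spread the
# "framework choice is worth 10-16 p.p." observation refers to.
scores = {"agent_a": 17.0, "agent_b": 24.0, "agent_c": 29.5}
spread = max(scores.values()) - min(scores.values())
print(f"agent-framework spread: {spread:.1f} p.p.")  # 12.5 p.p.
```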