Terminal-Bench 2.0 is out: a stricter benchmark for large-scale evaluation and improvement of terminal AI agents

Terminal-Bench 2.0 is a harder and better-verified version of the standard terminal benchmark. It brings more difficult tasks and stronger quality checks (manual and LM-assisted), and unstable cases from 1.0 have been fixed (for example, tasks that depended on YouTube’s ever-changing anti-bot protection). The emphasis is on reproducibility and reliability.

For me it is interesting primarily because the tasks are closer to real infrastructure work than in most other benchmarks, if only because of the interface and the tooling.

Current leaderboard: https://www.tbench.ai/leaderboard/terminal-bench/2.0

Some interesting bits:

  • For the same model, the choice of agent framework can add up to 10–16 percentage points, which is comparable to jumping one model class up or down (a rough illustration of this spread is sketched after this list).
  • In Codex CLI, GPT-5 beats GPT-5-Codex by roughly 5 percentage points (49 vs 44). In other frameworks it is the other way around: GPT-5-Codex wins.
  • Vendor agent ≠ best agent:
    • OpenAI: yes, Codex CLI really is top.
    • Anthropic and Google: no, their native agents (Claude Code, Gemini CLI) systematically lose to Terminus (from the benchmark authors), OpenHands, and Mini-SWE.
  • Today’s capability frontier is [a large closed model] + [a strong framework]. Everything else is basically a trade-off in budget/latency.
  • GPT-OSS-20B / 120B and small models are far behind GPT-5 / Claude even with a good agent: at best 18–19% vs 40–50%.
  • For the “mid-tier” (Haiku, Gemini Flash, Kimi Instruct, Grok, Qwen, GLM), agent choice matters even more: the model is weaker, and a good stack pulls it up to 25–30% instead of 15–20%.
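
A minimal sketch of the “framework spread” comparison mentioned above: for each model, take its scores under several agent frameworks and compute the gap between the best and worst one. The model and framework names and all the numbers below are hypothetical placeholders, not actual leaderboard values; substitute real scores from the leaderboard linked above.

```python
# Illustrative sketch: how much the agent framework alone can move a model's
# Terminal-Bench score. All entries are made-up placeholders.

# scores[model][framework] = resolved-task rate in percent (hypothetical)
scores = {
    "model-A": {"framework-1": 50.0, "framework-2": 44.0, "framework-3": 36.0},
    "model-B": {"framework-1": 30.0, "framework-2": 27.0, "framework-3": 18.0},
}

for model, by_framework in scores.items():
    best_fw = max(by_framework, key=by_framework.get)
    worst_fw = min(by_framework, key=by_framework.get)
    spread = by_framework[best_fw] - by_framework[worst_fw]
    print(
        f"{model}: best={best_fw} ({by_framework[best_fw]:.1f}%), "
        f"worst={worst_fw} ({by_framework[worst_fw]:.1f}%), "
        f"spread={spread:.1f} p.p."
    )
```

With placeholder numbers like these, the spread comes out in the 10–15 p.p. range, which is the kind of gap that separates adjacent model classes on the leaderboard.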