Terminal-Bench 2.0 released: a more rigorous benchmark for evaluating and improving AI agents in the terminal at scale.

Terminal-Bench 2.0 is a harder, better-verified version of the standard benchmark for terminal tasks. It adds more complex tasks and stronger quality checks (manual and LM-assisted), and it fixes flaky cases from 1.0 (for example, tasks that depended on YouTube's changing anti-bot protection). The emphasis is on reproducibility and reliability.

For me it is interesting mainly because the tasks are closer to real infrastructure work than in other benchmarks, if only due to the interface and tooling.

The current leaderboard is at https://www.tbench.ai/leaderboard/terminal-bench/2.0

Interesting points:

  • For the same model, the choice of agent framework changes the score by up to 10–16 percentage points, which is comparable to jumping up or down a model class (a hedged sketch of how to run such comparisons yourself follows the list).
  • For Codex CLI, gpt-5 beats gpt-5-codex by about 5 points (49 vs 44, roughly a 10% relative gain). With other agent frameworks, gpt-5-codex comes out ahead.
  • Vendor agent ≠ best:
    • OpenAI: yes, Codex CLI is indeed top.
    • Anthropic and Google: no, their native agents (Claude Code, Gemini CLI) consistently lose to Terminus (from the benchmark authors), OpenHands, and Mini-SWE.
  • The capability frontier today is a large closed model combined with a strong agent framework; everything else is essentially a trade-off on budget and latency.
  • GPT-OSS-20B/120B and other small models fall far short of GPT-5/Claude even with a good agent: at best 18–19% versus 40–50%.
  • For the “middle class” (Haiku, Gemini Flash, Kimi Instruct, Grok, Qwen, GLM), the choice of agent is even more critical: the model itself is weaker, and a good stack pulls it into the 25–30% range instead of 15–20%.
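
If you want to probe the "same model, different agent" effect yourself, here is a minimal sketch that launches one benchmark run per agent framework via the terminal-bench harness. The `tb run` flags, the dataset identifier, and the agent/model names below are assumptions based on the 1.0-era CLI and may have changed for 2.0, so check the project docs before running.

```python
"""Sketch: compare agent frameworks on the same underlying model.

Assumptions: `tb` is installed and on PATH; the flag names, dataset
identifier, and agent/model identifiers are taken from the 1.0-era
docs and are illustrative, not verified for 2.0.
"""
import subprocess

MODEL = "anthropic/claude-sonnet-4"                # hypothetical model id
AGENTS = ["terminus", "claude-code", "openhands"]  # hypothetical agent ids

for agent in AGENTS:
    # One full benchmark run per agent framework, same model each time.
    subprocess.run(
        [
            "tb", "run",
            "--dataset", "terminal-bench-core",  # assumption: 2.0 dataset id may differ
            "--agent", agent,
            "--model", MODEL,
        ],
        check=True,
    )
```

Comparing the resulting scores across the three runs is essentially what the leaderboard table does for you, just at much larger scale.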