Yet another batch of meaningless autonomous agent benchmarks :)

date: 2026-06-12

tags: [#ai, #agents, #benchmarks, #llm]

draft: false

---

https://github.com/korchasa/flowai-experiments/tree/main/agents-comparison

Illustration: autonomous agent benchmarks

Burned 40% of my limit running benchmarks close to my real tasks on opus, fable, and gpt-5.5. Everything was fully autonomous agent work: app generation from scratch, a project audit, and three implementation tasks of varying difficulty.

What can be said with at least some confidence:

fable beats opus-4.8 and gpt-5.5 on result quality. My working hypothesis: fable medium = opus xhigh.
opus xhigh is unexpectedly expensive because it reasons for too long. Sometimes it costs more than fable.
Looks are still a pain. Every generated UI comes out in the same dark neon style.
Proper testing will take 1-2 weekly limits on claude x20.
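
A rough sketch of how a model with a lower per-token price can still end up costing more per task once long reasoning is billed. It assumes reasoning tokens are charged as output tokens. The `Run` class, the run names, the token counts, and the prices are all made-up placeholders, not numbers from these benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent run. All fields are hypothetical placeholders."""
    name: str
    input_tokens: int
    output_tokens: int   # assumed to include reasoning tokens
    price_in: float      # USD per million input tokens (placeholder)
    price_out: float     # USD per million output tokens (placeholder)

    def cost(self) -> float:
        """Total cost of the run in USD."""
        total = (self.input_tokens * self.price_in
                 + self.output_tokens * self.price_out)
        return total / 1_000_000


# Placeholder numbers, not measurements.
cheaper_long = Run("cheaper-model-long-reasoning", 200_000, 900_000, 5.0, 25.0)
pricier_short = Run("pricier-model-short-reasoning", 200_000, 250_000, 10.0, 50.0)

for run in (cheaper_long, pricier_short):
    print(f"{run.name}: ${run.cost():.2f}")
```

With these placeholder inputs, the run at half the per-token price is the more expensive one, purely because it emits far more output tokens.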

Hypotheses:

The best model choice will depend on the project’s stage of development.
In some cases, more expensive but higher-quality models can pay for themselves on cost alone within a single task, even before counting technical debt.
Beyond some threshold, longer reasoning no longer improves quality; it only increases cost.
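
One possible way to check the threshold hypothesis once there is enough data: walk the reasoning levels in increasing order and stop at the first one that no longer adds a meaningful quality gain. This is only a sketch. `saturation_point`, the `min_gain` cutoff, and every score and cost below are hypothetical, not results from the runs above.

```python
def saturation_point(levels, min_gain=0.0):
    """Return the last level whose quality gain over the previous level
    exceeds min_gain.

    levels: list of (effort_name, quality_score, cost) tuples,
            ordered by increasing reasoning effort.
    """
    best = levels[0]
    for prev, cur in zip(levels, levels[1:]):
        if cur[1] - prev[1] > min_gain:
            best = cur
        else:
            break  # quality stopped improving; further levels only add cost
    return best


# Hypothetical scores and costs, for illustration only.
levels = [
    ("low", 6.0, 1.0),
    ("medium", 7.5, 2.0),
    ("high", 8.0, 4.0),
    ("xhigh", 8.0, 9.0),
]

name, quality, cost = saturation_point(levels, min_gain=0.25)
print(f"stop at {name}: quality={quality}, cost={cost}")
```

With these made-up numbers the walk stops at `high`: `xhigh` costs more than twice as much and scores the same.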