Response from Claude.

In short: The study by Shojaee et al. (2025) reports that large reasoning models (LRMs) suffer a complete accuracy collapse once task complexity passes a certain threshold. However, we believe this reflects limitations of the experimental design rather than a failure of reasoning. We identified three issues: (1) at the reported failure points, the Tower of Hanoi task requires outputs longer than the models' output token limits allow; (2) the authors' automatic evaluation cannot distinguish genuine reasoning failures from these practical constraints; (3) the River Crossing benchmark includes mathematically unsolvable instances, and models are scored as failures for not solving them. When we controlled for these artifacts, for example by asking models to output a compact generating procedure instead of an exhaustive move list, preliminary experiments showed high accuracy on Tower of Hanoi instances previously reported as complete failures. These results underscore how much careful experimental design matters when evaluating AI capabilities.
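The token-limit point (1) is easy to see with arithmetic: the optimal Tower of Hanoi solution for n disks takes 2^n - 1 moves, so any evaluation that requires listing every move grows exponentially. A minimal sketch, where the tokens-per-move cost and the output budget are illustrative assumptions, not figures from either paper:

```python
def hanoi_moves(n: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with n disks."""
    return 2 ** n - 1

# Assumed figures for illustration only: ~10 tokens to print one move,
# and a 64k-token output budget.
TOKENS_PER_MOVE = 10
OUTPUT_BUDGET = 64_000

for n in (10, 15, 20):
    needed = hanoi_moves(n) * TOKENS_PER_MOVE
    fits = needed <= OUTPUT_BUDGET
    print(f"n={n}: {hanoi_moves(n)} moves, ~{needed} tokens, fits budget: {fits}")
```

Under these assumptions the full move list already overflows the budget somewhere between 10 and 15 disks, so a model that truncates its output there would be scored as a reasoning failure by a checker that only accepts complete move lists.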
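Point (3) can be verified mechanically. River Crossing is a variant of the classic "jealous husbands" puzzle, and a brute-force breadth-first search over its state space shows which instances have no solution at all. This is a self-contained sketch of such a check, not the authors' evaluation code; the pair/role encoding is my own:

```python
from collections import deque
from itertools import combinations

def group_ok(group):
    # Constraint: a wife may not be in a group with another husband
    # unless her own husband is also present (holds on banks and boat).
    husbands = {i for (i, role) in group if role == "H"}
    wives = {i for (i, role) in group if role == "W"}
    return all((w in husbands) or not husbands for w in wives)

def solvable(n: int, boat: int) -> bool:
    """BFS over states (set of people on left bank, boat side)."""
    people = frozenset((i, r) for i in range(n) for r in ("H", "W"))
    start = (people, 0)          # everyone and the boat on the left
    goal = (frozenset(), 1)      # everyone and the boat on the right
    seen = {start}
    queue = deque([start])
    while queue:
        left, side = queue.popleft()
        if (left, side) == goal:
            return True
        here = left if side == 0 else people - left
        for k in range(1, boat + 1):
            for crew in combinations(here, k):
                crew = frozenset(crew)
                if not group_ok(crew):
                    continue
                new_left = left - crew if side == 0 else left | crew
                if not (group_ok(new_left) and group_ok(people - new_left)):
                    continue
                state = (new_left, 1 - side)
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False
```

With boat capacity 3 the search finds solutions up to 5 pairs but exhausts the entire state space without reaching the goal for 6 or more pairs, so penalizing a model for "failing" such an instance measures nothing about its reasoning.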

https://arxiv.org/html/2506.09250v1