Reasoning Models Reason Well, Until They Don't (arxiv.org)

🤖 AI Summary
Researchers revisit claims that modern LLMs trained for step-by-step argumentation and self-verification, so-called Large Reasoning Models (LRMs), perform robust, generalized reasoning. They show that prior benchmark wins (e.g., on NLGraph) reflect the limited complexity of those benchmarks rather than true generalization. To probe this, the team built the Deep Reasoning Dataset (DeepRD), using a generative procedure that yields unlimited, tunably complex instances of graph-connectivity and natural-language proof-planning tasks. Evaluated on these scalable problems, LRMs perform strongly up to a modest complexity threshold, then suffer abrupt, catastrophic drops, with no sign of generalization beyond their training distribution.

The implication cuts both ways: in the short term LRMs remain practically useful, because most real-world knowledge, interaction, and proof datasets fall within the models' "success regime," but the long tail of higher-complexity cases exposes substantial failure potential. Technically, the work highlights that current benchmarks underestimate task difficulty and that reasoning-tuned LLMs still lack mechanisms to systematically scale their reasoning. The authors argue for new methods and benchmarks that explicitly test and train models across scalable complexity, closing the gap between in-distribution competence and out-of-distribution robustness.
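The summary does not spell out how such a generator works, so here is a minimal, hypothetical sketch (Python, with invented names like make_connectivity_instance) of a DeepRD-style procedure: it produces a yes/no graph-connectivity question whose difficulty is set by the length of the required path, not the authors' actual code.

```python
import random

def make_connectivity_instance(path_len, n_distractors=20, connected=True, seed=0):
    """Hypothetical DeepRD-style generator (illustrative sketch, not the paper's code).

    Builds a directed graph that either contains a source-to-target path of
    roughly `path_len` hops (connected=True) or no such path at all, then
    phrases it as a natural-language connectivity question.
    """
    rng = random.Random(seed)
    nodes = [f"N{i}" for i in range(path_len + n_distractors + 2)]
    rng.shuffle(nodes)

    source, *chain, target = nodes[: path_len + 2]
    edges = set()
    path_set = set()
    if connected:
        # Ground-truth path whose length sets the instance "complexity".
        path = [source, *chain, target]
        edges.update(zip(path, path[1:]))
        path_set = set(path)

    # Pad with distractor edges. In the positive case, keep distractors from
    # pointing into the ground-truth path so the intended route stays the only
    # one; in the negative case, reject edges that would connect source to target.
    while len(edges) < path_len + 1 + n_distractors:
        u, v = rng.sample(nodes, 2)
        if connected:
            if v in path_set:
                continue
        elif reaches(edges | {(u, v)}, source, target):
            continue
        edges.add((u, v))

    question = (f"Edges: {sorted(edges)}. "
                f"Is there a directed path from {source} to {target}?")
    return question, connected


def reaches(edges, src, dst):
    """Iterative DFS over a directed edge set."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


if __name__ == "__main__":
    q, label = make_connectivity_instance(path_len=8)
    print(q, "->", label)
```

Because path_len is a free parameter, a generator like this can emit arbitrarily hard instances, which is the property the paper uses to expose the sharp drop beyond the models' success regime.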