The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs (arxiv.org)

🤖 AI Summary
The paper challenges the idea that scaling LLMs yields only diminishing returns by showing that small improvements in single-step accuracy compound into exponential gains in the length of tasks a model can complete. The authors isolate "execution" from "reasoning" by supplying models with the required knowledge and a plan, then measuring how many sequential steps a model can correctly carry out. Larger models execute many more turns even when smaller models show near-perfect single-turn accuracy, but per-step accuracy decays as task length increases.

Crucially, this degradation is not solely a context-length problem: the authors identify a self-conditioning effect in which models are more likely to err when their own previous (erroneous) outputs appear in context, and simple scaling does not eliminate this effect.

For the AI/ML community, this reframes evaluation and design priorities: long-horizon performance hinges on execution robustness, error-propagation dynamics, and how models handle their own past outputs. "Thinking models" (recent architectures and inference-time strategies that avoid self-conditioning) can execute much longer tasks in a single turn, suggesting architectural and inference-time interventions matter as much as scale. Practically, the work advocates benchmarks that measure single-turn execution length, and points to sequential test-time compute and model designs that prevent self-conditioning as high-leverage levers for real-world long-horizon AI systems.
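The compounding claim can be made concrete with a back-of-the-envelope calculation (an illustrative independence assumption, not the paper's exact derivation): if each step succeeds independently with per-step accuracy p, an n-step task succeeds with probability p^n, so the longest task completed at a target success rate s is roughly log(s)/log(p). A minimal sketch:

```python
import math

def horizon_length(step_accuracy: float, success_rate: float = 0.5) -> float:
    """Longest task length (in steps) completed with probability `success_rate`,
    assuming each step succeeds independently with probability `step_accuracy`."""
    return math.log(success_rate) / math.log(step_accuracy)

for p in (0.90, 0.99, 0.999):
    print(f"per-step accuracy {p:.3f} -> ~{horizon_length(p):.0f}-step horizon at 50% success")
# per-step accuracy 0.900 -> ~7-step horizon at 50% success
# per-step accuracy 0.990 -> ~69-step horizon at 50% success
# per-step accuracy 0.999 -> ~693-step horizon at 50% success
```

Under this toy model, moving from 99% to 99.9% per-step accuracy stretches the 50%-success horizon from about 69 steps to about 693, which is the sense in which apparently marginal accuracy gains are not diminishing returns on long-horizon tasks.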