Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning (arxiv.org)

🤖 AI Summary
Researchers propose a scalable way to teach large language models to reason across much longer horizons by bootstrapping from abundant short-horizon data. Instead of relying on inference-time scaffolding or expensive step-level labels, they synthetically compose simple problems into multi-step dependency chains of arbitrary length and train with reinforcement learning using only outcome rewards (final-answer correctness). A curriculum automatically increases chain complexity so RL can keep scaling without saturating, letting models discover multi-step solutions from sparse feedback. Empirically, curriculum RL on composed GSM8K (6th-grade math) problems yields large gains on harder, longer-horizon benchmarks — up to 2.06× accuracy improvements on GSM-Symbolic, MATH-500, and AIME — and outperforms baselines even at high pass@k, indicating the models learn new reasoning paths rather than benefiting from sampling luck. Theoretically, the authors prove that curriculum RL with outcome rewards gives an exponential sample-complexity improvement over full-horizon training, providing training signal comparable to dense supervision. The approach offers a practical path to scaling RL for long-horizon reasoning using only existing short-step datasets, lowering annotation costs and enabling better generalization to complex, multi-step tasks.
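To make the compose-then-curriculum recipe concrete, here is a minimal Python sketch under stated assumptions: the toy one-step arithmetic problems stand in for GSM8K items, `policy_solve` is a noisy stand-in for the LLM, and the promotion threshold and window are illustrative values, not the paper's hyperparameters; a real RL update (e.g. PPO/GRPO on the outcome reward) would plug in where noted.

```python
import random

def compose_chain(length, rng):
    """Compose `length` one-step problems into a dependency chain.

    Each step consumes the previous step's answer, so solving the chain
    requires `length` sequential sub-solutions. Returns (steps, start, answer).
    """
    start = rng.randint(1, 9)
    steps = [(rng.choice("+-*"), rng.randint(1, 9)) for _ in range(length)]
    value = start
    for op, x in steps:
        value = value + x if op == "+" else value - x if op == "-" else value * x
    return steps, start, value

def policy_solve(steps, start, rng, per_step_acc=0.95):
    """Stand-in for the model: solves each step correctly with some probability."""
    value = start
    for op, x in steps:
        if rng.random() > per_step_acc:      # simulate an occasional per-step mistake
            value += rng.choice([-1, 1])
        value = value + x if op == "+" else value - x if op == "-" else value * x
    return value

def curriculum_rl(episodes=2000, window=100, promote_at=0.8, seed=0):
    """Outcome-only reward (1 iff the final answer is correct) with an automatic curriculum."""
    rng = random.Random(seed)
    length, recent = 1, []
    for _ in range(episodes):
        steps, start, answer = compose_chain(length, rng)
        reward = 1.0 if policy_solve(steps, start, rng) == answer else 0.0
        # ... an RL policy update on `reward` would go here ...
        recent.append(reward)
        if len(recent) >= window:
            if sum(recent) / len(recent) >= promote_at:
                length += 1                  # lengthen chains only once the current
            recent = []                      # horizon is mostly solved
    return length

if __name__ == "__main__":
    print("final chain length reached:", curriculum_rl())
```

The sparse reward never credits intermediate steps, yet because the curriculum only advances when the current chain length is reliably solved, the policy always trains near its frontier — the mechanism behind the paper's claimed sample-complexity advantage over training directly on full-length chains.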