🤖 AI Summary
Researchers have identified a key limitation of the outcome-based reinforcement learning (RL) methods used to enhance large language model (LLM) reasoning: because these methods reward only final-answer correctness, they improve accuracy while causing a harmful reduction in response diversity. This diversity collapse, in which models generate increasingly similar outputs, undermines real-world effectiveness, since diverse reasoning paths are crucial for scaling LLM performance at test time by sampling many candidate solutions. By framing RL post-training as a sampling process, the study shows that the loss of diversity is not confined to solved problems but also spreads to unsolved ones, driven by the inherently limited set of distinct final answers in reasoning tasks.
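To make the test-time scaling point concrete, answer-level diversity can be summarized by how many distinct final answers appear among k sampled responses and whether any of them is correct. The minimal sketch below illustrates this; the function and variable names are illustrative and not taken from the paper.

```python
from collections import Counter

def diversity_and_pass_at_k(sampled_answers, correct_answer):
    """Summarize a batch of k sampled final answers for one problem.

    sampled_answers: final answers extracted from k model samples
    correct_answer:  the reference answer for the problem
    Returns (number of distinct answers, empirical pass@k as 0 or 1).
    """
    counts = Counter(sampled_answers)
    distinct = len(counts)                      # answer-level diversity
    pass_at_k = int(correct_answer in counts)   # 1 if any sample was correct
    return distinct, pass_at_k

# A collapsed policy repeats one (wrong) answer, so extra samples add nothing;
# a diverse policy with the same single-sample accuracy still benefits from sampling.
collapsed = ["42", "42", "42", "42"]
diverse   = ["42", "17", "36", "25"]
print(diversity_and_pass_at_k(collapsed, "36"))  # (1, 0)
print(diversity_and_pass_at_k(diverse, "36"))    # (4, 1)
```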
To address this, the authors propose outcome-based exploration techniques that incentivize diversity without compromising accuracy. Their approach introduces two algorithms: historical exploration, which grants upper confidence bound (UCB) style bonuses to answers that have rarely appeared in the training history, and batch exploration, which penalizes repetition within the current batch of sampled outputs to encourage varied reasoning routes. Applied to competition-level math problems with LLaMA and Qwen models, these methods improve accuracy while preserving output diversity. The work also provides a theoretical framework through a model of outcome-based bandits, offering a principled basis for trading off exploitation of known-correct answers against exploratory diversity. Together, these results point toward scalable RL strategies that make LLM reasoning more robust in practical deployments.
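The sketch below shows one way such exploration bonuses could be folded into the 0/1 outcome reward. The two bonus forms follow the summary's description (a UCB-style count-based bonus over the answer history, and a within-batch repetition penalty), but the constants, function names, and the additive combination are illustrative assumptions rather than the authors' exact algorithms.

```python
import math
from collections import Counter, defaultdict

# Per-problem history of how often each final answer has been produced so far.
history = defaultdict(Counter)

def shaped_rewards(problem_id, batch_answers, correct_answer,
                   c_hist=0.5, c_batch=0.5):
    """Add exploration bonuses to the 0/1 outcome reward for one sampled batch.

    batch_answers: final answers extracted from the current batch of samples
    c_hist:  weight of the UCB-style bonus based on historical answer counts
    c_batch: weight of the penalty for repeating an answer within the batch
    """
    hist = history[problem_id]
    total = sum(hist.values()) + len(batch_answers)
    batch_counts = Counter(batch_answers)

    rewards = []
    for ans in batch_answers:
        outcome = 1.0 if ans == correct_answer else 0.0
        # Historical exploration: rarely seen answers receive a larger bonus.
        n_ans = hist[ans] + batch_counts[ans]
        ucb_bonus = c_hist * math.sqrt(math.log(total + 1) / n_ans)
        # Batch exploration: answers repeated within this batch are penalized.
        repeat_penalty = c_batch * (batch_counts[ans] - 1) / len(batch_answers)
        rewards.append(outcome + ucb_bonus - repeat_penalty)

    hist.update(batch_answers)  # record this batch for future bonuses
    return rewards
```

In a training loop, these shaped rewards would stand in for the raw correctness signal in a standard policy-gradient update, so the policy is still pulled toward correct answers but retains an incentive to reach them through less common outputs.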