🤖 AI Summary
The paper shows that reinforcement learning's power to improve LLM reasoning arises from an emergent hierarchical decomposition: models must first master low-level procedural correctness, then shift into a phase where gains come from discovering and refining high-level strategic plans. This two-phase dynamic explains previously puzzling behaviors, such as "aha moments" (sudden jumps in performance), length-scaling effects, and entropy patterns, as natural consequences of separating planning from execution. The authors argue that common RL approaches (e.g., GRPO) are inefficient because they apply optimization pressure uniformly across tokens, diluting the learning signal when strategic tokens are the real bottleneck.
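To make the "uniform pressure" critique concrete, here is a minimal sketch, not the paper's code, of how a GRPO-style update broadcasts one group-normalized, sequence-level advantage to every token of a response; the function and variable names are illustrative assumptions.

```python
import torch

def grpo_token_advantages(group_rewards: torch.Tensor,
                          response_lengths: list[int]) -> list[torch.Tensor]:
    """Group-relative advantages, broadcast uniformly over each response's tokens.

    group_rewards: shape (G,), one scalar reward per sampled response in the group.
    response_lengths: token count of each of the G responses.
    """
    # Normalize rewards within the group, as GRPO does in place of a learned critic.
    adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
    # Every token in a response receives the same advantage, so strategic "planning"
    # tokens and routine "execution" tokens get identical optimization pressure.
    return [torch.full((length,), float(a)) for a, length in zip(adv, response_lengths)]
```

If most tokens are routine execution, the few strategic tokens carry only a diluted share of this uniform signal, which is the inefficiency the summary points to.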
To address this, they introduce HIerarchy-Aware Credit Assignment (HICRA), an algorithm that selectively concentrates optimization on high-impact planning tokens rather than treating all tokens equally. HICRA yields significant gains over strong baselines, showing that targeting the strategic bottleneck unlocks advanced reasoning more effectively. The paper also promotes semantic entropy (measuring meaning-level exploration) as a better guide for strategic exploration than token-level entropy. Implications are broad: hierarchy-aware credit assignment and semantic exploration metrics can improve sample efficiency, accelerate emergent problem-solving, and steer future RLHF and fine-tuning regimes toward explicitly separating planning and execution in LLMs.
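As a rough illustration of the idea, here is a sketch under assumed interfaces, not HICRA's actual implementation: up-weight the advantages of tokens tagged as planning tokens, and track exploration with an entropy over meaning-level clusters of sampled responses. The `planning_mask`, the `planning_weight` knob, and the clustering step feeding `semantic_entropy` are all assumptions made for illustration.

```python
import math
from collections import Counter
import torch

def hierarchy_aware_advantages(token_advantages: torch.Tensor,
                               planning_mask: torch.Tensor,
                               planning_weight: float = 2.0) -> torch.Tensor:
    """Concentrate credit on tokens flagged as strategic/planning.

    token_advantages: (T,) per-token advantages, e.g. from a GRPO-style baseline.
    planning_mask: (T,) bool, True where a token was identified as a planning token
        (how planning tokens are identified is assumed here).
    planning_weight: extra credit given to planning tokens (assumed, tunable knob).
    """
    weights = torch.ones_like(token_advantages)
    weights[planning_mask] = planning_weight
    return token_advantages * weights

def semantic_entropy(cluster_ids: list[int]) -> float:
    """Entropy over meaning-level clusters of sampled responses.

    cluster_ids: one label per sampled response; responses expressing the same
    high-level strategy share a label (the clustering method itself is assumed).
    """
    counts = Counter(cluster_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Measured this way, exploration stays high as long as the model keeps proposing genuinely different strategies, even after token-level entropy has flattened out, which is the intuition the summary attributes to semantic entropy.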