🤖 AI Summary
This paper questions whether popular “RL” post-training recipes for LLMs genuinely leverage reinforcement learning or merely rebrand outcome-driven supervised learning. Focusing on recent work such as DeepSeek R1 (which used GRPO), the authors dissect two common structural assumptions used to cast LLM fine-tuning as a Markov decision process (MDP): (1) treating each state as the concatenation of previously generated tokens (the context window) and each action as a single output token, and (2) splitting a scalar trajectory reward uniformly across all time steps. They show these assumptions produce a degenerate MDP in which credit assignment collapses, so the learning objective is effectively equivalent to supervised learning on final outcomes rather than true sequential decision-making.
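To make the collapse concrete, here is a minimal, runnable sketch (not taken from the paper; function names and shapes are illustrative) of what the objective looks like when a single outcome reward is split uniformly over a response's tokens: every token's log-probability is multiplied by the same scalar, so the loss reduces to an outcome-weighted sequence log-likelihood, i.e., supervised learning on final outcomes.

```python
import torch

def degenerate_mdp_loss(token_logps: torch.Tensor, outcome_reward: float) -> torch.Tensor:
    """token_logps: log pi(a_t | s_t) for each of the T generated tokens, shape (T,).
    The single trajectory reward R is split uniformly, r_t = R / T, and applied per token."""
    T = token_logps.shape[0]
    per_step_reward = outcome_reward / T              # uniform credit assignment
    per_token_terms = per_step_reward * token_logps   # r_t * log pi(a_t | s_t)
    return -per_token_terms.sum()

def outcome_weighted_sft_loss(token_logps: torch.Tensor, outcome_reward: float) -> torch.Tensor:
    """Outcome-weighted supervised objective on the same sampled trajectory."""
    return -(outcome_reward / token_logps.shape[0]) * token_logps.sum()

# The two objectives coincide: (R/T) * sum_t log pi(a_t | s_t) in both cases --
# a single scalar weight on the sequence log-likelihood, with no token-level
# credit assignment left to learn.
logps = torch.log_softmax(torch.randn(12, 50), dim=-1)[torch.arange(12), 0]
assert torch.allclose(degenerate_mdp_loss(logps, 1.0),
                      outcome_weighted_sft_loss(logps, 1.0))
```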
Empirically, using Qwen-2.5 on benchmarks such as GSM8K and Countdown, iterative supervised fine-tuning that incorporates both positive and negative samples matches the performance of GRPO-based training. The paper also argues that the degenerate MDP framing nudges models to produce longer intermediate token traces (feeding the “RL produces longer thinking traces” narrative) without demonstrating improved internal reasoning. Significance: the work calls for more careful MDP formulations, meaningful credit assignment, and proper controls when claiming RL-driven reasoning gains, and urges the community to re-evaluate claims that RL post-training inherently induces better chain-of-thought or reasoning capabilities.
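For the baseline those experiments compare against, the following is a minimal sketch of one plausible reading of “iterative SFT with positive and negative samples”: sampled completions are labeled +1/-1 by a task verifier, positives receive a standard maximum-likelihood update, and negatives receive the opposite push. The function names, the ±1 labeling, and the toy numbers are assumptions for illustration, not the paper's implementation.

```python
import torch

def signed_sft_loss(seq_logps: torch.Tensor, outcome_labels: torch.Tensor) -> torch.Tensor:
    """seq_logps: log-likelihood of each sampled completion under the model, shape (N,).
    outcome_labels: +1 if a verifier marks the completion correct, -1 otherwise.
    Positives get a standard SFT (maximum-likelihood) gradient, negatives the opposite
    push -- the same outcome-level signal the degenerate token-level MDP provides."""
    return -(outcome_labels * seq_logps).mean()

# Toy usage: four sampled completions for one prompt, two verified correct.
seq_logps = torch.tensor([-35.2, -41.7, -38.0, -44.9], requires_grad=True)
labels = torch.tensor([1.0, -1.0, 1.0, -1.0])
loss = signed_sft_loss(seq_logps, labels)
loss.backward()
# Gradient is -label / N, so a descent step raises the likelihood of verified-correct
# completions and lowers it for incorrect ones.
print(seq_logps.grad)   # tensor([-0.2500,  0.2500, -0.2500,  0.2500])
```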