RL's Razor: Why Online Reinforcement Learning Forgets Less (arxiv.org)

🤖 AI Summary
Researchers show that when adapting models to new tasks, on-policy reinforcement learning (RL) tends to preserve prior capabilities far better than supervised fine-tuning (SFT). The paper demonstrates, both empirically (with large language models and robotic foundation models) and theoretically, that the amount of forgetting is governed by the distributional shift between the fine-tuned policy and the base policy, measured as the KL divergence evaluated on the new-task distribution. Crucially, on-policy RL is implicitly biased toward solutions that minimize this KL divergence among all policies that solve the new task, whereas SFT can drive the model to distributions arbitrarily far from the original, producing greater catastrophic forgetting.

Why this matters: the finding, dubbed "RL's Razor", gives a principled explanation for why RL updates often retain prior knowledge and behave more conservatively in policy space. For practitioners, it implies that choosing on-policy RL methods (or explicitly regularizing the KL distance to the base policy) can be an effective way to reduce forgetting during adaptation or continual learning, improving the safety and robustness of deployed models. The paper's mix of theoretical argument and cross-domain experiments makes a strong case that the optimization dynamics of on-policy RL naturally favor KL-minimal solutions, with practical consequences for fine-tuning protocols in AI/ML.
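The KL quantity the summary refers to, and the suggested mitigation of explicitly regularizing it during SFT, are easy to sketch. Below is a minimal PyTorch illustration under stated assumptions: the function names, the token-level averaging, and the beta weight are illustrative choices, not the paper's exact protocol or implementation.

```python
import torch
import torch.nn.functional as F

def kl_to_base_on_new_task(policy_logits: torch.Tensor,
                           base_logits: torch.Tensor) -> torch.Tensor:
    """Average KL(pi_finetuned || pi_base) over tokens from new-task inputs.

    This is the shift measure the summary describes: both logit tensors have
    shape [batch, seq_len, vocab] and come from running the fine-tuned model
    and the frozen base model on the same new-task batch.
    """
    logp_new = F.log_softmax(policy_logits, dim=-1)
    logp_base = F.log_softmax(base_logits, dim=-1)
    # KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x)), per token position
    kl_per_token = (logp_new.exp() * (logp_new - logp_base)).sum(dim=-1)
    return kl_per_token.mean()

def sft_loss_with_kl_penalty(policy_logits: torch.Tensor,
                             base_logits: torch.Tensor,
                             target_ids: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Standard next-token SFT loss plus an explicit KL-to-base regularizer.

    The penalty (hypothetical weight `beta`) mimics the conservatism in
    policy space that on-policy RL is argued to provide implicitly.
    """
    ce = F.cross_entropy(
        policy_logits.reshape(-1, policy_logits.size(-1)),
        target_ids.reshape(-1),
    )
    return ce + beta * kl_to_base_on_new_task(policy_logits, base_logits)
```

In practice the base model would be kept frozen and run under `torch.no_grad()` to produce `base_logits`, so the penalty only pulls the fine-tuned policy back toward the base distribution rather than moving both.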