SFT, RL, and On-Policy Distillation Through a Distributional Lens (nrehiew.github.io)

🤖 AI Summary
A recent post examines post-training methods for language models through a distributional lens, asking what target distribution each technique pushes the model toward: supervised fine-tuning (SFT), reinforcement learning (RL), and on-policy distillation (OPD). In SFT the target is an external dataset the model is pulled toward directly; in RL the target is defined only implicitly through a reward signal, which makes it harder to pin down. This difference shapes both task performance and how prone each method is to catastrophic forgetting. OPD occupies a middle ground: a teacher model supplies the target distribution, but updates are computed on sequences the student samples itself. In experiments on a Minimal Code Editing task, OPD students trained against both an SFT teacher and an RL teacher outperformed their respective teachers, challenging the assumption that the teacher's own performance is what matters most. The author argues that while the teacher defines the target, on-policy sampling is what preserves the student's existing capabilities, potentially allowing aggressive specialization on a narrow task without substantial loss of generality and offering a way to mitigate forgetting during fine-tuning.
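To make the OPD mechanism concrete, below is a minimal PyTorch sketch of one on-policy distillation step. It is not taken from the post: it assumes both models return next-token logits of shape (batch, seq, vocab) and uses a per-token reverse KL from the student to the teacher on student-sampled completions; the function name `opd_step` and all hyperparameters are illustrative.

```python
# Minimal on-policy distillation (OPD) sketch, assuming `student` and `teacher`
# are callables mapping token ids (batch, seq) -> next-token logits
# (batch, seq, vocab). Loss choice and shapes are assumptions, not the post's code.
import torch
import torch.nn.functional as F

def opd_step(student, teacher, prompt_ids, optimizer, max_new_tokens=64):
    """One OPD update: sample a completion from the *student* (on-policy),
    then pull the student's per-token distribution toward the teacher's
    on those same student-generated tokens (reverse KL)."""
    student.eval()
    with torch.no_grad():
        # On-policy sampling: the trajectory comes from the student itself.
        # No KV cache here; the full prefix is re-encoded each step for clarity.
        ids = prompt_ids  # (batch, prompt_len)
        for _ in range(max_new_tokens):
            logits = student(ids)[:, -1, :]                       # (batch, vocab)
            next_id = torch.multinomial(F.softmax(logits, dim=-1), 1)
            ids = torch.cat([ids, next_id], dim=-1)

    student.train()
    # Logits at position t predict token t+1, so the completion tokens were
    # produced by positions prompt_len-1 .. seq_len-2.
    completion = slice(prompt_ids.shape[1] - 1, ids.shape[1] - 1)
    student_logits = student(ids)[:, completion, :]
    with torch.no_grad():
        teacher_logits = teacher(ids)[:, completion, :]

    # Reverse KL(student || teacher) over the sampled completion tokens:
    # F.kl_div(input, target) computes KL(target || input), so the teacher
    # goes in as `input` and the student (which receives gradients) as `target`.
    loss = F.kl_div(
        F.log_softmax(teacher_logits, dim=-1),
        F.log_softmax(student_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design point the sketch highlights is that the trajectories come from the student, not the teacher's dataset, which is the mechanism the post credits with preserving the student's general capabilities.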