On-Policy Distillation (thinkingmachines.ai)

🤖 AI Summary
Researchers describe and evaluate on-policy distillation, a post-training method that combines the on-policy relevance of reinforcement learning with the dense, per-token supervision of teacher-model distillation. Instead of learning from teacher trajectories alone (off-policy) or from sparse sequence-level rewards (RL), the student samples its own rollouts and a high-performing teacher grades each token by returning per-token log-probabilities. Training minimizes the per-token reverse KL between the student and teacher distributions (effectively using the negative reverse KL as the per-token advantage, with discount factor zero), pushing the student to match teacher behavior in the exact states it visits. This reduces the compounding errors common to off-policy distillation while avoiding the inefficiency of sparse RL rewards.

Technically, on-policy distillation is cheap and practical: students generate trajectories, a single forward pass of the large teacher returns log-probs for those tokens, and partial rollouts suffice because rewards are computed per token. Reverse KL is mode-seeking (encouraging the student to adopt the teacher's concrete behavior) and aligns naturally with RL-style losses; implementation can be a one-line swap in KL-regularized RL pipelines (demonstrated with the Tinker API). The authors replicate Qwen3-style results, matching reasoning-benchmark performance at far lower cost, and show the method is effective for math reasoning and instruction-following assistants without needing separate reward models.
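A minimal PyTorch sketch of the per-token objective as summarized above, assuming the per-token log-probabilities have already been gathered at the student's own sampled tokens; the function and tensor names are illustrative and not taken from the Tinker API, and the actual implementation may differ.

```python
import torch


def on_policy_distillation_loss(student_logprobs: torch.Tensor,
                                teacher_logprobs: torch.Tensor,
                                mask: torch.Tensor) -> torch.Tensor:
    """Per-token on-policy distillation loss (illustrative sketch).

    student_logprobs: log q(x_t | x_<t) under the student at the tokens the
        student itself sampled (requires grad).
    teacher_logprobs: log p(x_t | x_<t) under the teacher for the same tokens,
        from a single teacher forward pass (no grad needed).
    mask: 1 for generated tokens to train on, 0 for prompt/padding.

    The sampled-token estimate of the reverse KL at step t is
    log q(x_t) - log p(x_t); its negation is used as a per-token advantage
    (discount factor zero, so no credit assignment across tokens) in a
    policy-gradient-style surrogate.
    """
    # Negative reverse-KL estimate at each sampled token, treated as advantage.
    advantage = (teacher_logprobs - student_logprobs).detach()
    # REINFORCE-style surrogate: push up student log-prob where the teacher
    # assigns higher probability than the student, push it down otherwise.
    per_token_loss = -advantage * student_logprobs
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)


# Toy usage with random stand-in tensors: batch of 2 rollouts, 5 tokens each.
student_lp = torch.randn(2, 5, requires_grad=True)
teacher_lp = torch.randn(2, 5)
mask = torch.ones(2, 5)
loss = on_policy_distillation_loss(student_lp, teacher_lp, mask)
loss.backward()
```

Because the advantage is just a detached per-token scalar, this drops into a KL-regularized RL pipeline wherever a reward or advantage is normally supplied, which is the "one-line swap" the summary refers to.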