🤖 AI Summary
Researchers introduced Reinforcement Learning Pretraining (RLP), a new pretraining objective that treats chain-of-thought (CoT) generation as an explicit action and rewards each thought by how much it improves next-token prediction. At every token position the model samples an internal CoT, conditions its predictor on that CoT, and receives a dense, verifier-free reward equal to the information gain: the increase in the log-likelihood of the observed next token relative to a no-think EMA baseline, which serves as a dynamically updated advantage estimate. This reframes reasoning as an exploratory, self-supervised signal that can be computed directly from ordinary pretraining streams, avoiding curated verifier datasets and enabling position-wise credit assignment at scale.
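Concretely, the per-position reward is the log-likelihood gap r_t = log p_theta(x_t | x_<t, c_t) - log p_EMA(x_t | x_<t), where c_t is the sampled chain-of-thought and the baseline is an EMA copy of the model run without thinking. Below is a minimal sketch of that reward computation; the callables (`logprob_with_cot`, `logprob_no_think_ema`) are illustrative stand-ins for the two forward passes, not the paper's actual API.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not the paper's code):
#   logprob_with_cot(context, cot, next_token)  -> log p_theta(x_t | x_<t, c_t)
#   logprob_no_think_ema(context, next_token)   -> log p_ema(x_t | x_<t)
ThinkLogProb = Callable[[Sequence[int], Sequence[int], int], float]
BaseLogProb = Callable[[Sequence[int], int], float]


def rlp_reward(
    context: Sequence[int],
    cot: Sequence[int],
    next_token: int,
    logprob_with_cot: ThinkLogProb,
    logprob_no_think_ema: BaseLogProb,
) -> float:
    """Information-gain reward for one token position, as described above:
    how much the sampled chain-of-thought raises the log-likelihood of the
    observed next token relative to the no-think EMA baseline."""
    lp_think = logprob_with_cot(context, cot, next_token)  # thinking policy
    lp_base = logprob_no_think_ema(context, next_token)    # EMA, no CoT
    # Positive reward iff conditioning on the thought helped prediction.
    return lp_think - lp_base
```

Because the baseline is just an EMA copy of the same model evaluated without a thought, the reward is dense (available at every position of an ordinary pretraining stream) and requires no external verifier or curated answer set.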
Empirically, RLP yields large, durable gains: on Qwen3-1.7B it improves the pretraining-benchmark average by +19% over the base model and +17% over a compute-matched continuous-pretraining baseline, and after identical SFT+RLVR post-training the advantage persists (+7–8% relative). RLP also scales: applied to a 12B Nemotron model trained on roughly 200B fewer tokens, it produced about a 35% average improvement and a striking +23 percentage-point jump on science reasoning. Because RLP leverages ordinary corpora (web text, textbooks, SFT-style data), it promises a practical, scalable way to bake foundational multi-step reasoning into LLMs rather than adding it only during later alignment stages.