🤖 AI Summary
Researchers introduced Reinforcement Learning Pretraining (RLP), a new pretraining objective that treats chain-of-thought (CoT) generation as an explicit action and rewards each thought by how much it improves next-token prediction. At every token position the model samples an internal CoT, conditions its predictor on that CoT, and receives a dense, verifier-free reward equal to the information gain: the increase in the log-likelihood of the observed next token relative to a no-think EMA baseline, which serves as a dynamically updated advantage estimate. This reframes reasoning as an exploratory, self-supervised signal that can be computed directly from ordinary pretraining streams, avoiding curated verifier datasets and enabling position-wise credit assignment at scale.
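Concretely, the per-position reward is the log-likelihood gap r_t = log p_theta(x_t | x_<t, c_t) - log p_EMA(x_t | x_<t), where c_t is the sampled chain-of-thought and the baseline is an EMA copy of the model run without thinking. Below is a minimal sketch of that reward computation; the callables (`logprob_with_cot`, `logprob_no_think_ema`) are illustrative stand-ins for the two forward passes, not the paper's actual API.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not the paper's code):
#   logprob_with_cot(context, cot, next_token)  -> log p_theta(x_t | x_<t, c_t)
#   logprob_no_think_ema(context, next_token)   -> log p_ema(x_t | x_<t)
ThinkLogProb = Callable[[Sequence[int], Sequence[int], int], float]
BaseLogProb = Callable[[Sequence[int], int], float]


def rlp_reward(
    context: Sequence[int],
    cot: Sequence[int],
    next_token: int,
    logprob_with_cot: ThinkLogProb,
    logprob_no_think_ema: BaseLogProb,
) -> float:
    """Information-gain reward for one token position, as described above:
    how much the sampled chain-of-thought raises the log-likelihood of the
    observed next token relative to the no-think EMA baseline."""
    lp_think = logprob_with_cot(context, cot, next_token)  # thinking policy
    lp_base = logprob_no_think_ema(context, next_token)    # EMA, no CoT
    # Positive reward iff conditioning on the thought helped prediction.
    return lp_think - lp_base
```

Because the baseline is just an EMA copy of the same model evaluated without a thought, the reward is dense (available at every position of an ordinary pretraining stream) and requires no external verifier or curated answer set.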
Empirically, RLP yields large, durable gains: on Qwen3-1.7B it improves the pretraining-benchmark average by +19% over the base model and +17% over a compute-matched continuous-pretraining baseline, and after identical SFT+RLVR post-training the advantage persists (+7–8% relative). RLP also scales: applied to a 12B Nemotron model trained on roughly 200B fewer tokens, it produced about a 35% average improvement and a striking +23 percentage-point jump on science reasoning. Because RLP leverages ordinary corpora (web text, textbooks, SFT-style data), it promises a practical, scalable way to bake foundational multi-step reasoning into LLMs rather than adding it only during later alignment stages.