Scaling pretraining affects RL sample efficiency (www.runrl.com)

πŸ€– AI Summary
Researchers tested how supervised pretraining interacts with reinforcement learning by warm-starting tiny Tic-Tac-Toe transformers (two transformer blocks; 451k, 798k, and 1.79M parameters, with half the weights in a value head) on perfect minimax moves for 0–100 gradient steps of cross-entropy, then fine-tuning with PPO against a deterministic minimax opponent for up to 4,000 episodes.

Key findings: larger models reap far bigger RL savings from pretraining. At 100 pretrain steps, the 1.79M-parameter model needed roughly 3–4× fewer PPO episodes to reach a 97% draw-rate target (e.g., ~800 → ~200 episodes), while the smallest model (451k) saturates and gains less than 40% even at 100 steps. Crucially, small amounts of pretraining can hurt: 20-step checkpoints sometimes increased the PPO episodes required (medium model: 1,700 → 2,200; large model: ~1,100 → 1,300 for high-draw targets), with benefits only appearing past a threshold of roughly 40–80 pretraining steps.

Aggregating all runs yields a single log-quadratic frontier relating pretrain cross-entropy loss to the PPO episodes still needed to reach a skill threshold (concave in log loss, not a simple power law). Practically, this means pretrain loss can predict RL compute budgets, and that returns diminish once a model hits its capacity floor, so pretraining only reduces RL cost when the model has expressive headroom.

Limitations: a toy environment, five seeds, and flawless labels. The results are indicative, but they suggest frontier LLMs could use pretrain-loss-based budgeting to trade supervised compute against RLVR interaction efficiently.
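To make the two-phase setup concrete, here is a minimal, hypothetical sketch of the warm-start phase only: a linear softmax policy trained with cross-entropy on a couple of hand-verified minimax moves. The board encoding, learning rate, and two labeled positions are illustrative stand-ins and do not come from the paper, which pretrains small transformers on the full set of minimax labels before PPO fine-tuning.

```python
import numpy as np

# Boards are 9-vectors from the mover's perspective: +1 = mover, -1 = opponent, 0 = empty.
# The two labelled positions are hand-verified, illustrative stand-ins for a full
# minimax-labelled dataset (the actual experiment labels states with perfect minimax play):
#   - mover holds 0,1 vs opponent 3,4 -> playing cell 2 completes the top row and wins
#   - mover holds 4 vs opponent 0,1   -> playing cell 2 blocks the only immediate threat
BOARDS = np.array([
    [ 1,  1, 0, -1, -1, 0, 0, 0, 0],
    [-1, -1, 0,  0,  1, 0, 0, 0, 0],
], dtype=float)
BEST_MOVES = np.array([2, 2])


def _softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def warm_start(steps: int, lr: float = 0.5, seed: int = 0):
    """Cross-entropy 'pretraining' of a linear softmax policy on minimax moves."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(9, 9))   # logits = board @ W + b
    b = np.zeros(9)
    onehot = np.eye(9)[BEST_MOVES]
    for _ in range(steps):
        probs = _softmax(BOARDS @ W + b)
        grad = (probs - onehot) / len(BOARDS)   # d(cross-entropy)/d(logits)
        W -= lr * BOARDS.T @ grad
        b -= lr * grad.sum(axis=0)
    final_probs = _softmax(BOARDS @ W + b)
    ce = -np.mean(np.log(final_probs[np.arange(len(BOARDS)), BEST_MOVES]))
    return (W, b), ce


# In the experiment summarised above, the analogue of this loop runs for 0-100 gradient
# steps on a two-block transformer; PPO fine-tuning against a minimax opponent follows.
for steps in (0, 20, 100):
    _, ce = warm_start(steps)
    print(f"{steps:>3} warm-start steps -> pretrain cross-entropy {ce:.3f}")
```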
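The log-quadratic frontier itself is easy to sketch: fit a quadratic in the log of the pretrain cross-entropy loss to the number of PPO episodes still needed, then read off a budget for a new checkpoint. The numbers below are synthetic, generated from an assumed concave quadratic purely to illustrate the fitting and budgeting step; they are not the paper's measurements.

```python
import numpy as np

# Hypothetical frontier: remaining PPO episodes ~ quadratic in log(pretrain CE loss).
# The data is synthetic by construction (an assumed concave quadratic plus noise),
# standing in for the aggregated (pretrain loss, episodes-to-threshold) pairs.
rng = np.random.default_rng(0)
log_loss = np.linspace(np.log(0.05), np.log(2.0), 30)   # assumed range of pretrain losses
true_coeffs = (-80.0, 300.0, 2200.0)                    # assumed a, b, c (a < 0 -> concave)
episodes = np.polyval(true_coeffs, log_loss) + rng.normal(scale=40.0, size=log_loss.size)

# Fit the log-quadratic frontier: episodes ~ a*(log L)^2 + b*log L + c.
a, b, c = np.polyfit(log_loss, episodes, deg=2)
print(f"fitted frontier: episodes ~ {a:.0f}*(log L)^2 + {b:.0f}*log L + {c:.0f}")

# Budgeting: given a new checkpoint's pretrain loss, predict the PPO episodes
# still required to reach the target skill threshold.
new_checkpoint_loss = 0.3   # hypothetical pretrain cross-entropy
budget = np.polyval([a, b, c], np.log(new_checkpoint_loss))
print(f"predicted PPO budget at pretrain loss {new_checkpoint_loss}: ~{budget:.0f} episodes")
```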