Thinking through how pretraining vs. RL learn (www.dwarkesh.com)

🤖 AI Summary
A thought piece contrasts how pretraining (supervised next-token prediction) and reinforcement learning (policy-gradient-style RL) convert compute into information, arguing that RL is vastly less sample-efficient for most of training. The author frames learning efficiency as Bits/FLOP = Samples/FLOP × Bits/Sample and argues that we usually ignore the latter term. In supervised learning each token yields -log(p) bits (you are told the correct label), whereas RL with a single binary reward yields at most the binary entropy H(p) = -p log p - (1-p) log(1-p) bits. With a large vocabulary (~100k tokens) an untrained model has p ≈ 1/100k, so RL both needs to unroll long trajectories (tens to thousands of tokens) to collect one reward and gets vanishing information per sample; the resulting variance forces massive batch sizes (on the order of 300k rollouts to see a rare correct answer with high confidence). Because scaling laws tie each increment of improvement to exponentially more compute, the regime where RL's bits/sample rivals pretraining's is a thin slice near the end of training.

The practical upshot for RL with verifiable rewards (RLVR) is to keep models in a "Goldilocks" pass-rate zone. Effective strategies include pretraining and inference-time scaling (to raise baseline pass rates), curriculum learning, self-play (which keeps p ≈ 0.5), value functions or proxy evaluations that estimate rewards earlier (raising Samples/FLOP), and process-reward proxies that credit partial progress (raising Bits/Sample).

RL does produce fewer but highly task-relevant bits, such as reasoning traces and correction strategies, that pretraining alone does not capture; it also risks producing jagged heuristics, because rare, generalizable trajectories are seldom sampled. Addressing that requires denser intermediate signals or new training paradigms beyond brute-force RL scaling.
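To make the bits-per-sample gap concrete, here is a minimal Python sketch of the arithmetic the summary describes; the 95% confidence threshold used for the batch-size estimate is an illustrative assumption, not a figure from the post.

```python
import math

def supervised_bits(p: float) -> float:
    """Bits learned from one supervised token: -log2(p) of the correct label."""
    return -math.log2(p)

def rl_binary_reward_bits(p: float) -> float:
    """Upper bound on bits from one binary reward: the binary entropy H(p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def samples_to_see_success(p: float, confidence: float = 0.95) -> int:
    """Rollouts needed for at least one success with the given confidence:
    solve 1 - (1 - p)**n >= confidence for n."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

p = 1 / 100_000  # untrained model, ~100k-token vocabulary
print(f"supervised: {supervised_bits(p):.1f} bits/token")            # ~16.6 bits
print(f"RL reward ceiling: {rl_binary_reward_bits(p):.5f} bits")      # ~0.0002 bits
print(f"rollouts for one success @95%: {samples_to_see_success(p):,}")  # ~300k

p = 0.5  # the "Goldilocks" / self-play regime
print(f"RL reward ceiling at p=0.5: {rl_binary_reward_bits(p):.2f} bits")  # 1.0 bit
```

Run as-is, this reproduces the rough numbers cited above: ~16.6 bits per supervised token versus ~2×10⁻⁴ bits per binary reward at p = 1/100k, and roughly 300k rollouts before a single success is likely, versus a full bit per reward once the pass rate sits near 0.5.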