🤖 AI Summary
A thought piece contrasts how pretraining (supervised next-token prediction) and reinforcement learning (policy-gradient-style RL) convert compute into information, arguing that RL is vastly less sample-efficient for most of training. The author frames learning efficiency as Bits/FLOP = Samples/FLOP × Bits/Sample and notes that we usually ignore the latter term. In supervised learning each token yields -log(p) bits (you're told the correct label), whereas RL with a single binary reward yields at most the binary entropy H(p) = -p log p - (1-p) log(1-p) bits. With a large vocabulary (~100k tokens) an untrained model has p ≈ 1/100k, so RL both needs to unroll long trajectories (10s to 1000s of tokens) to obtain one reward and gets vanishing information per sample; gradient variance then forces massive batch sizes (on the order of 300k rollouts to see a rare correct outcome with high confidence). Since scaling laws make each increment of improvement exponentially more expensive in compute, the regime where RL's bits/sample rivals pretraining is a tiny slice near the end of training.
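A minimal numeric sketch of that comparison (in Python; the pass rate p = 1/100k and the 95% confidence threshold are assumptions taken from the summary's figures):

```python
import math

# Bits/FLOP = Samples/FLOP * Bits/Sample: a sketch of the Bits/Sample term,
# assuming the ~100k-token vocabulary and a near-uniform untrained model.
p = 1.0 / 100_000  # chance the untrained policy emits the correct token

# Supervised next-token prediction: the correct label is revealed, worth -log2(p) bits.
supervised_bits = -math.log2(p)  # ~16.6 bits per token

# Policy-gradient RL with a single binary reward: at most the binary entropy of p.
def binary_entropy(q: float) -> float:
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

rl_bits = binary_entropy(p)  # ~1.8e-4 bits per full rollout

# Variance side: rollouts needed to see at least one success with 95% confidence.
n_rollouts = math.log(0.05) / math.log(1 - p)  # ~3.0e5, the "order 300k" figure

print(f"supervised: {supervised_bits:.1f} bits/sample")
print(f"RL reward : {rl_bits:.1e} bits/sample")
print(f"rollouts for a 95% chance of one success: {n_rollouts:,.0f}")
```

At that pass rate a single binary reward carries roughly five orders of magnitude less information than one supervised token, and around 300k rollouts are needed before even one success is likely to appear.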
The practical upshot for RL and RL-with-verifiable-rewards (RLVR) work: keep models in a "Goldilocks" pass-rate zone. Effective strategies include pretraining and inference scaling (to boost baseline pass rates), curriculum learning, self-play (which keeps p ≈ 0.5), value functions or proxy evaluations that estimate rewards earlier in a trajectory (improving the Samples/FLOP term), and process-reward proxies that credit partial progress (improving the Bits/Sample term); a sketch of the pass-rate effect follows below. RL does produce fewer but highly task-relevant bits (reasoning traces and correction strategies) not captured by pretraining alone, yet it also risks producing jagged heuristics because rare, generalizable trajectories are seldom sampled. Addressing that requires denser intermediate signals or new training paradigms beyond brute-force RL scaling.
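To make the Goldilocks point concrete, a small sketch (pass rates chosen purely for illustration, not taken from the piece) of how much information one binary reward carries at different pass rates:

```python
import math

def binary_entropy(p: float) -> float:
    """Bits carried by one binary reward at pass rate p (max 1 bit at p = 0.5)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Illustrative pass rates: the reward signal is richest near 0.5, which is the
# zone curricula and self-play try to keep the model in.
for p in (1e-5, 1e-3, 0.05, 0.25, 0.5, 0.9, 0.999):
    print(f"pass rate {p:>7g}: {binary_entropy(p):.4f} bits per rollout")
```

The curve peaks at one bit per rollout at p = 0.5, which is why curricula and self-play that hold tasks near a 50% pass rate extract the most signal per sample.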