How Well Does RL Scale? (www.tobyord.com)

🤖 AI Summary
The author argues that recent capability gains in reasoning LLMs come from two distinct kinds of scaling: RL-scaling (more reinforcement-learning compute during training) and inference-scaling (more compute spent at deployment, e.g., longer chains of thought). Empirical traces from OpenAI's o1→o3 sequence and other labs show inference-scaling is far more effective per order of magnitude of compute: a 100× increase in inference compute commonly lifts reasoning-benchmark scores from ~20% to ~80%, while a 100× increase in RL training compute moves scores only from roughly ~33% to ~66%. On a logarithmic axis the RL-scaling slope is about half the inference slope, so matching a 3× inference boost takes ~10× more RL compute, and matching a 100× inference gain takes ~10,000× more RL compute.

That asymmetry has major practical consequences. Early RL improvements were cheap because RL compute started tiny relative to pretraining, but labs (e.g., xAI's Grok 4) are now approaching parity between RL and pretraining compute, so further RL scaling would inflate total training costs by orders of magnitude; the author estimates ~1,000,000× more RL compute would be needed to match a GPT-level pretraining jump. The net effect: RL's lasting legacy may be enabling effective inference-scaling (longer reasoning chains), while future capability growth is likely to rely on costly inference compute rather than further RL training, with big implications for deployment cost, safety, and governance.
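
To make the slope arithmetic concrete, here is a minimal Python sketch. It assumes benchmark scores grow roughly linearly in log(compute) for both kinds of scaling and that the RL slope is about half the inference slope, the ratio quoted above; the function name and the 0.5 slope ratio are illustrative assumptions, not values fit to the article's data.

```python
# Minimal sketch (illustrative, not the article's model): if score ~ slope * log(compute)
# for each kind of scaling and the RL slope is ~half the inference slope, then matching
# an X-fold inference boost requires roughly an X^2-fold increase in RL training compute.

def rl_compute_multiplier(inference_multiplier: float, slope_ratio: float = 0.5) -> float:
    """RL compute factor needed to match a given inference compute factor.

    Equal score gain means: slope_inf * log(X) = slope_rl * log(Y),
    so log(Y) = log(X) / slope_ratio, i.e. Y = X ** (1 / slope_ratio).
    """
    return inference_multiplier ** (1.0 / slope_ratio)

for x in (3, 100):
    print(f"{x}x inference ≈ {rl_compute_multiplier(x):,.0f}x RL compute")
# Output: 3x inference ≈ 9x RL compute; 100x inference ≈ 10,000x RL compute,
# matching the ~10x and ~10,000x figures in the summary.
```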