🤖 AI Summary
In the past year the field has quietly shifted from "scale up next-token pre-training" to leaning heavily on reinforcement learning (RL) post-training to unlock reasoning and agentic behaviors in frontier models. The striking claim: RL on long, machine-checkable tasks delivers orders of magnitude less learnable information per token (and per FLOP) than next-token prediction. A token in pre-training carries at most ~16 bits (and realistically ~3 bits late in training); GPT-4's scale (~10^12 params, ~10^13 tokens) implies roughly 3 bits of capacity per token. By contrast, modern RL reasoning episodes run tens of thousands to millions of tokens (DeepSeek-R1: 12k–32k; METR HCAST: ~16M tokens/task), yet often reveal only a single binary success/fail signal, i.e. less than 1 bit per 10k–1M generated tokens. That implies a roughly 10^3–10^6× drop in information delivered per unit of compute when training on long-horizon RL tasks.
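To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The constants (3 bits of learnable signal per pre-training token, a 1-bit terminal reward, and 10k- and 1M-token episodes bracketing the range above) are illustrative assumptions taken from the summary, not measurements; exact ratios depend on the figures you plug in.

```python
# Back-of-the-envelope comparison of learnable information per token:
# next-token pre-training vs. sparse-reward RL post-training.
# All constants are illustrative assumptions from the summary above.
import math

PRETRAIN_BITS_PER_TOKEN = 3.0   # assumed effective bits/token late in pre-training
RL_BITS_PER_EPISODE = 1.0       # a single binary success/fail reward per episode

# Assumed episode lengths (tokens) bracketing the "<1 bit per 10k-1M tokens" claim
episode_lengths = {
    "short reasoning episode (~10k tokens)": 10_000,
    "long-horizon agentic task (~1M tokens)": 1_000_000,
}

for label, tokens in episode_lengths.items():
    rl_bits_per_token = RL_BITS_PER_EPISODE / tokens
    ratio = PRETRAIN_BITS_PER_TOKEN / rl_bits_per_token
    print(f"{label}:")
    print(f"  RL reward signal : {rl_bits_per_token:.1e} bits/token")
    print(f"  pre-training     : {PRETRAIN_BITS_PER_TOKEN:.1f} bits/token")
    print(f"  gap              : ~10^{math.log10(ratio):.0f}x less information per token")
```

Under these assumptions the printed gaps land around 10^4×–10^6×, consistent with the rough 10^3–10^6× range quoted above.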
This matters because such extreme information inefficiency suggests practical and theoretical limits on how fast frontier capabilities can improve via RL alone. RL can rapidly drive superhuman performance on narrow tasks, but it is poor at breadth and transfer compared to self-supervised pre-training. Possible mitigations, such as denser feedback (multi-bit or continuous rewards), intermediate supervision, and hybrid curricula, could raise the information content of the training signal, but their effectiveness is uncertain. The takeaway: RL has been crucial for recent leaps in reasoning and agency, but sustaining frontier progress at scale will likely require new training paradigms or clever ways to increase the information delivered per episode.