Beyond Power Laws: Scaling Laws for Next-Token Prediction (francisbach.com)

🤖 AI Summary
Researchers behind an upcoming NeurIPS paper analyze a simple linear bigram next-token prediction model to derive “non-power-law” scaling laws that explain why optimization gets harder as vocabulary size grows. Focusing on a one-hot input/output squared-loss model with Zipfian token frequencies (π_i ∝ 1/i), they show the Hessian is diagonal with eigenvalues equal to token marginals and the loss decomposes into d² independent directions. Crucially, the initial optimality gap and per-iteration progress both depend on vocabulary size through harmonic numbers H_d, so the usual power-law asymptotics (which assume problem difficulty is dimension-independent) break down.

By normalizing the loss (r_d(k) = (F_k−F*)/(F_0−F*)) and picking η = H_d, they reduce the dynamics to r_d(k) = (1/H_d) ∑_{i=1}^d (1/i)(1−1/i)^{2k}, approximate that sum by an integral, and show meaningful limits arise only if the iteration count k is scaled with d. In particular, they obtain the heuristic scaling r_d(k) ≈ 1 − (log(2k))/log(d), capturing the slow, logarithmic progress as vocabulary grows.

The main implication for the AI/ML community is both practical and theoretical: Zipfian token statistics make vanilla gradient descent increasingly inefficient with vocabulary size, providing a clean setting where sign-based methods (a proxy for Adam) provably scale better. The work therefore offers a principled explanation for empirical observations that Adam/AdamW outperform SGD on large language models and argues that scaling laws must normalize both loss scale and time (k) against dimension to compare optimizers meaningfully.
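As a rough illustration (not code from the post), the closed-form gap r_d(k) = (1/H_d) ∑_{i=1}^d (1/i)(1−1/i)^{2k} can be evaluated numerically and compared against the heuristic 1 − log(2k)/log(d). The function names and the specific choices of d and k below are arbitrary, and the sketch assumes the setup described above (Zipf eigenvalues π_i = 1/(i·H_d), step size η = H_d):

```python
import numpy as np

def normalized_gap(d, k):
    """Closed-form normalized optimality gap r_d(k) for gradient descent
    with step size eta = H_d on the decoupled quadratic with Zipf
    eigenvalues pi_i = 1/(i * H_d). Sketch based on the summary above."""
    i = np.arange(1, d + 1, dtype=float)
    H_d = np.sum(1.0 / i)  # harmonic number H_d
    # Each direction contracts by (1 - eta * pi_i)^2 = (1 - 1/i)^2 per step,
    # weighted by its share 1/(i * H_d) of the initial optimality gap.
    return np.sum((1.0 / i) * (1.0 - 1.0 / i) ** (2 * k)) / H_d

def heuristic_gap(d, k):
    """Heuristic scaling r_d(k) ~ 1 - log(2k) / log(d) from the summary."""
    return 1.0 - np.log(2 * k) / np.log(d)

if __name__ == "__main__":
    # Illustrative values only: the gap shrinks logarithmically in k,
    # and larger vocabularies d make the same number of steps less effective.
    for d in (10_000, 1_000_000):
        for k in (10, 100, 1000):
            print(f"d={d:>9,d}  k={k:>5d}  "
                  f"exact={normalized_gap(d, k):.3f}  "
                  f"heuristic={heuristic_gap(d, k):.3f}")
```

Running this shows the exact sum tracking the 1 − log(2k)/log(d) heuristic: progress per decade of iterations is roughly constant, and increasing d raises log(d), so the same k buys proportionally less progress.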