🤖 AI Summary
Researchers behind an upcoming NeurIPS paper analyze a simple linear bigram next-token prediction model to derive “non-power-law” scaling laws that explain why optimization gets harder as vocabulary size grows. Focusing on a one-hot input/output squared-loss model with Zipfian token frequencies (π_i ∝ 1/i), they show the Hessian is diagonal with eigenvalues equal to the token marginals, so the loss decomposes into d² independent directions. Crucially, both the initial optimality gap and the per-iteration progress depend on vocabulary size through the harmonic number H_d = ∑_{i=1}^d 1/i, so the usual power-law asymptotics (which assume problem difficulty is dimension-independent) break down. By normalizing the loss (r_d(k) = (F_k−F*)/(F_0−F*)) and choosing step size η = H_d, they reduce the dynamics to r_d(k) = (1/H_d) ∑_{i=1}^d (1/i)(1−1/i)^{2k}, approximate that sum by an integral, and show that a meaningful limit arises only if the iteration count k is scaled with d. In particular they obtain the heuristic scaling r_d(k) ≈ 1 − log(2k)/log(d), capturing the slow, logarithmic progress as vocabulary size grows.
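To make the derivation concrete, here is a minimal numerical sketch of the quadratic model the summary describes: gradient descent on a diagonal quadratic with Zipfian eigenvalues π_i = 1/(i·H_d) and step size η = H_d. The zero initialization, unit targets, and the vocabulary size d below are illustrative assumptions, not details from the paper; the sketch only checks that the simulated dynamics match the closed-form r_d(k) and compares both against the heuristic 1 − log(2k)/log(d).

```python
import numpy as np

def r_closed_form(d, k):
    """Closed-form normalized gap: r_d(k) = (1/H_d) * sum_i (1/i) * (1 - 1/i)^(2k)."""
    i = np.arange(1, d + 1, dtype=float)
    H_d = np.sum(1.0 / i)
    return np.sum((1.0 / i) * (1.0 - 1.0 / i) ** (2 * k)) / H_d

def r_simulated(d, k):
    """Run plain gradient descent on the diagonal quadratic
    F(w) = 1/2 * sum_i pi_i * (w_i - 1)^2 with Zipfian pi_i = 1/(i * H_d),
    starting from w = 0 (assumed initialization), and return (F_k - F*)/(F_0 - F*)."""
    i = np.arange(1, d + 1, dtype=float)
    H_d = np.sum(1.0 / i)
    pi = 1.0 / (i * H_d)              # Hessian eigenvalues = Zipfian token marginals
    eta = H_d                         # the normalized step size from the summary
    w = np.zeros(d)                   # target is w* = 1 in every direction
    F0 = 0.5 * np.sum(pi * (w - 1.0) ** 2)
    for _ in range(k):
        grad = pi * (w - 1.0)         # gradient of the diagonal quadratic
        w = w - eta * grad
    Fk = 0.5 * np.sum(pi * (w - 1.0) ** 2)
    return Fk / F0

def r_heuristic(d, k):
    """Heuristic scaling r_d(k) ~ 1 - log(2k)/log(d) from the integral approximation."""
    return 1.0 - np.log(2 * k) / np.log(d)

if __name__ == "__main__":
    d = 50_000                        # stand-in "vocabulary size" for illustration
    for k in (1, 10, 100, 1000):
        print(f"k={k:5d}  closed-form={r_closed_form(d, k):.4f}  "
              f"simulated={r_simulated(d, k):.4f}  heuristic={r_heuristic(d, k):.4f}")
```

The simulated and closed-form values should agree exactly (each direction contracts by (1 − 1/i) per step), while the heuristic tracks them only roughly, as expected of a leading-order approximation.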
The main implication for the AI/ML community is both practical and theoretical: Zipfian token statistics make vanilla gradient descent increasingly inefficient as vocabulary size grows, and the model provides a clean setting in which sign-based methods (a proxy for Adam) provably scale better. The work thus offers a principled explanation for the empirical observation that Adam/AdamW outperform SGD on large language models, and argues that scaling laws must normalize both the loss scale and the iteration count k against dimension to compare optimizers meaningfully.
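The claimed advantage of sign-based updates can be illustrated on the same toy quadratic. The sketch below compares vanilla GD (η = H_d) with coordinate-wise sign descent under a simple harmonically decaying step size; the schedule, random targets, and dimensions are illustrative choices rather than the paper's exact construction. The point is only qualitative: after a fixed number of iterations, GD's normalized gap grows with d while sign descent's does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_gap(d, k, method):
    """Run `method` ('gd' or 'sign') on F(w) = 1/2 * sum_i pi_i * (w_i - t_i)^2
    with Zipfian pi_i = 1/(i * H_d), and report (F_k - F*)/(F_0 - F*)."""
    i = np.arange(1, d + 1, dtype=float)
    H_d = np.sum(1.0 / i)
    pi = 1.0 / (i * H_d)                  # curvature per direction (Zipf marginals)
    t = rng.normal(size=d)                # arbitrary per-direction targets (assumption)
    w = np.zeros(d)
    F0 = 0.5 * np.sum(pi * (w - t) ** 2)
    for step in range(k):
        grad = pi * (w - t)
        if method == "gd":
            w = w - H_d * grad                            # GD with step eta = H_d
        else:
            w = w - (1.0 / (step + 1)) * np.sign(grad)    # sign descent, decaying step
    Fk = 0.5 * np.sum(pi * (w - t) ** 2)
    return Fk / F0

if __name__ == "__main__":
    k = 100
    for d in (1_000, 10_000, 100_000):
        print(f"d={d:7d}  GD: {normalized_gap(d, k, 'gd'):.3f}   "
              f"sign: {normalized_gap(d, k, 'sign'):.3e}")
```

The mechanism is visible in the update rule: GD's progress in direction i is throttled by the tiny eigenvalue π_i, whereas the sign update moves every coordinate at the same rate regardless of curvature, so the Zipf tail does not slow it down.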