Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking (arxiv.org)

🤖 AI Summary
The paper introduces Li2, a provable mathematical framework that explains feature emergence during grokking (delayed generalization) in 2-layer nonlinear networks and derives scaling laws linking memorization and generalization to hyperparameters. Li2 decomposes training into three stages (Lazy learning, Independent feature learning, and Interactive feature learning) and shows how initially overfit top-layer weights, under weight decay, shape the backpropagated gradient G_F so that hidden units learn meaningful features rather than pure noise. This gives a first-principles account of when and which features emerge, why generalization can appear suddenly long after training error reaches zero, and how sample size, learning rate, and weight decay control that transition.

Technically, the authors prove that in the independent phase each hidden node's dynamics follow exact gradient ascent on a defined energy function E whose local maxima correspond to emergent, generalizable features; later, once hidden nodes interact, G_F adapts to focus on missing features, explaining the shift from memorization to structured representation. Analysis of group arithmetic tasks yields provable scaling laws for feature emergence and shows how optimizer design (e.g., Muon) can accelerate the desirable dynamics. The theory extends to deeper architectures, giving actionable insight into hyperparameter choices and optimizer effects in models that exhibit grokking.
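To make the setting concrete, below is a minimal, hypothetical PyTorch sketch of the standard grokking setup the paper analyzes: a 2-layer nonlinear network trained with weight decay on a group arithmetic task (modular addition). The modulus, width, learning rate, weight decay, and train/test split here are illustrative assumptions, not the authors' configuration; the sketch only demonstrates the phenomenon the Li2 framework explains, namely train accuracy saturating long before test accuracy.

```python
# Minimal sketch (not the paper's code): 2-layer MLP on (a + b) mod P with weight decay,
# the standard setup in which grokking / delayed generalization is observed.
import torch
import torch.nn as nn
import torch.nn.functional as F

P = 97                # modulus for the group arithmetic task a + b (mod P)
HIDDEN = 256          # width of the single nonlinear hidden layer
TRAIN_FRAC = 0.4      # fraction of the P*P pairs used for training (illustrative)
WEIGHT_DECAY = 1e-2   # weight decay is central to the dynamics described above
STEPS = 20000

# Build the full (a, b) -> (a + b) mod P dataset and split it at random.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(TRAIN_FRAC * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

def encode(idx):
    # One-hot encode both operands and concatenate them as the network input.
    a = F.one_hot(pairs[idx, 0], P).float()
    b = F.one_hot(pairs[idx, 1], P).float()
    return torch.cat([a, b], dim=1)

x_train, y_train = encode(train_idx), labels[train_idx]
x_test, y_test = encode(test_idx), labels[test_idx]

# 2-layer nonlinear network: hidden features (whose emergence the paper analyzes)
# followed by a linear top layer whose weights shape the backpropagated gradient.
model = nn.Sequential(nn.Linear(2 * P, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=WEIGHT_DECAY)

for step in range(STEPS):
    opt.zero_grad()
    loss = F.cross_entropy(model(x_train), y_train)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(x_train).argmax(1) == y_train).float().mean()
            test_acc = (model(x_test).argmax(1) == y_test).float().mean()
        # Grokking shows up as train_acc reaching ~1.0 long before test_acc does.
        print(f"step {step:6d}  train_acc {train_acc:.3f}  test_acc {test_acc:.3f}")
```

With a moderate weight decay and enough steps, the printed log typically shows training accuracy near 1.0 while test accuracy stays low for a long stretch before jumping, matching the memorization-then-generalization transition the summary describes; exact timing depends on the hyperparameters chosen above.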