🤖 AI Summary
ICLR 2025 authors released a companion website for the paper "Understanding Optimization in Deep Learning with Central Flows," presenting a new explanation for a now-familiar but poorly understood training phenomenon: gradient descent frequently exits the classical "stable" region (where the sharpness S(w) = λ1(H(w)) satisfies S(w) < 2/η), oscillates, and yet still converges. Empirical runs (e.g., a Vision Transformer on CIFAR-10 and several sequence and vision models) show the sharpness rising to the threshold 2/η, oscillations growing along the top Hessian eigenvector(s), transient loss spikes, and then a rapid drop in sharpness back below 2/η, an effect termed training at the edge of stability (EOS) that appears universal across architectures.
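This behavior can be probed directly. Below is a minimal sketch (not the authors' code) that runs full-batch gradient descent on a hypothetical tiny tanh regression problem and tracks S(w) against the 2/η threshold via power iteration on Hessian-vector products; the model, learning rate, and iteration counts are all illustrative assumptions. With a sufficiently large η, the logged sharpness typically rises toward 2/η (progressive sharpening) and then hovers near it.

```python
# Minimal EOS probe: full-batch GD on a tiny synthetic regression task,
# logging sharpness S(w) = lambda_1(H(w)) versus the threshold 2/eta.
# All hyperparameters are illustrative, not taken from the paper.
import torch

torch.manual_seed(0)
X = torch.randn(64, 8)
y = torch.randn(64, 1)
model = torch.nn.Sequential(
    torch.nn.Linear(8, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
params = [p for p in model.parameters()]
eta = 0.02  # hypothetical learning rate; larger values reach EOS sooner

def loss_fn():
    return torch.nn.functional.mse_loss(model(X), y)

def sharpness(iters=50):
    # Power iteration using Hessian-vector products from double backprop.
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params)  # H v
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    # Rayleigh quotient v^T H v with the (unit-norm) converged v.
    grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
    gv = sum((g * u).sum() for g, u in zip(grads, v))
    hv = torch.autograd.grad(gv, params)
    return sum((h * u).sum() for h, u in zip(hv, v)).item()

for step in range(2000):
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= eta * g  # plain full-batch gradient descent
    if step % 100 == 0:
        print(f"step {step:4d}  loss {loss.item():.4f}  "
              f"S(w) {sharpness():.2f}  2/eta {2 / eta:.1f}")
```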
The technical novelty is a third-order Taylor analysis: when GD oscillates with displacement x along the top eigenvector u around a center w̄, the gradient picks up a cubic-term contribution ≈ (1/2) x² ∇S(w̄), so each gradient-descent step implicitly takes a descent step on the sharpness itself, with effective step size (1/2) η x². Large oscillations therefore generate automatic negative feedback that reduces S(w) and returns the iterates to the stable region, explaining how deterministic GD can succeed despite local quadratic predictions of divergence. When multiple eigenvalues sit at the edge, the dynamics become higher-dimensional and potentially chaotic. Implications include a clearer mechanistic basis for learning-rate choice, Hessian-aware optimizers, and better theoretical grounding for SGD behavior in deep learning.
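For reference, the expansion behind that claim can be reconstructed from the quantities named above (w̄ the oscillation center, u the unit top eigenvector). This is a standard third-order Taylor step written out here for clarity, not a verbatim excerpt from the paper:

```latex
% Expand the gradient to third order around the oscillation center \bar{w},
% along the unit top eigenvector u, with S(w) = \lambda_1(H(w)):
\nabla L(\bar{w} + x\,u)
  \;\approx\; \nabla L(\bar{w}) \;+\; x\, H(\bar{w})\,u
            \;+\; \tfrac{1}{2} x^{2}\, \nabla^{3} L(\bar{w})[u,u]
  \;=\; \nabla L(\bar{w}) \;+\; x\, S(\bar{w})\,u
            \;+\; \tfrac{1}{2} x^{2}\, \nabla S(\bar{w})
% using H(\bar{w})u = S(\bar{w})u and the eigenvalue-perturbation identity
% \nabla S = \nabla^{3} L[u,u] for a simple top eigenvalue. Averaged over one
% oscillation the odd term in x cancels, so the update
% w \leftarrow w - \eta\,\nabla L(w) contains -\tfrac{\eta}{2} x^{2}\,\nabla S(\bar{w}):
% a descent step on S(w) with effective step size \tfrac{1}{2}\eta x^{2}.
```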