🤖 AI Summary
ICLR 2025 authors released a companion website for the paper "Understanding Optimization in Deep Learning with Central Flows," presenting a new explanation for a now-familiar but poorly understood training phenomenon: gradient descent frequently exits the classical "stable" region (where the sharpness S(w) = λ1(H(w)) satisfies S(w) < 2/η), oscillates, and yet still converges. Empirical runs (e.g., a Vision Transformer on CIFAR-10 and several sequence and vision models) show the sharpness rising to the threshold 2/η, oscillations growing along the top Hessian eigenvector(s), transient loss spikes, and then a rapid drop in sharpness back below 2/η, an effect termed training at the edge of stability (EOS) that appears universal across architectures.
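This behavior can be probed directly. Below is a minimal sketch (not the authors' code) that runs full-batch gradient descent on a hypothetical tiny tanh regression problem and tracks S(w) against the 2/η threshold via power iteration on Hessian-vector products; the model, learning rate, and iteration counts are all illustrative assumptions. With a sufficiently large η, the logged sharpness typically rises toward 2/η (progressive sharpening) and then hovers near it.

```python
# Minimal EOS probe: full-batch GD on a tiny synthetic regression task,
# logging sharpness S(w) = lambda_1(H(w)) versus the threshold 2/eta.
# All hyperparameters are illustrative, not taken from the paper.
import torch

torch.manual_seed(0)
X = torch.randn(64, 8)
y = torch.randn(64, 1)
model = torch.nn.Sequential(
    torch.nn.Linear(8, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
params = [p for p in model.parameters()]
eta = 0.02  # hypothetical learning rate; larger values reach EOS sooner

def loss_fn():
    return torch.nn.functional.mse_loss(model(X), y)

def sharpness(iters=50):
    # Power iteration using Hessian-vector products from double backprop.
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params)  # H v
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    # Rayleigh quotient v^T H v with the (unit-norm) converged v.
    grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
    gv = sum((g * u).sum() for g, u in zip(grads, v))
    hv = torch.autograd.grad(gv, params)
    return sum((h * u).sum() for h, u in zip(hv, v)).item()

for step in range(2000):
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= eta * g  # plain full-batch gradient descent
    if step % 100 == 0:
        print(f"step {step:4d}  loss {loss.item():.4f}  "
              f"S(w) {sharpness():.2f}  2/eta {2 / eta:.1f}")
```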
The technical novelty is a third-order Taylor analysis: when GD oscillates with displacement x along the top eigenvector u around a center w̄, the gradient picks up a cubic-term contribution ≈ (1/2) x² ∇S(w̄), so each gradient-descent step implicitly takes a descent step on the sharpness itself, with effective step size (1/2) η x². Large oscillations therefore generate automatic negative feedback that reduces S(w) and returns the iterates to the stable region, explaining how deterministic GD can succeed despite local quadratic predictions of divergence. When multiple eigenvalues sit at the edge, the dynamics become higher-dimensional and potentially chaotic. Implications include a clearer mechanistic basis for learning-rate choice, Hessian-aware optimizers, and better theoretical grounding for SGD behavior in deep learning.
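For reference, the expansion behind that claim can be reconstructed from the quantities named above (w̄ the oscillation center, u the unit top eigenvector). This is a standard third-order Taylor step written out here for clarity, not a verbatim excerpt from the paper:

```latex
% Expand the gradient to third order around the oscillation center \bar{w},
% along the unit top eigenvector u, with S(w) = \lambda_1(H(w)):
\nabla L(\bar{w} + x\,u)
  \;\approx\; \nabla L(\bar{w}) \;+\; x\, H(\bar{w})\,u
            \;+\; \tfrac{1}{2} x^{2}\, \nabla^{3} L(\bar{w})[u,u]
  \;=\; \nabla L(\bar{w}) \;+\; x\, S(\bar{w})\,u
            \;+\; \tfrac{1}{2} x^{2}\, \nabla S(\bar{w})
% using H(\bar{w})u = S(\bar{w})u and the eigenvalue-perturbation identity
% \nabla S = \nabla^{3} L[u,u] for a simple top eigenvalue. Averaged over one
% oscillation the odd term in x cancels, so the update
% w \leftarrow w - \eta\,\nabla L(w) contains -\tfrac{\eta}{2} x^{2}\,\nabla S(\bar{w}):
% a descent step on S(w) with effective step size \tfrac{1}{2}\eta x^{2}.
```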