Modern Optimizers – An Alchemist's Notes on Deep Learning (notes.kvfrans.com)

🤖 AI Summary
A deep-dive post surveys the recent wave of "spectral-whitening" optimizers that aim to beat Adam on the compute–performance Pareto frontier by using richer, layer-wise preconditioners instead of elementwise scaling. The core idea is to replace the Euclidean step metric with a whitening metric (mathematically, the square root of the Gauss-Newton matrix, or the closely related empirical Fisher), so that each parameter moves at a near-optimal local rate. This unifies several perspectives: Newton/Gauss-Newton second-order reasoning, natural-gradient (Fisher) geometry, and a spectral-norm viewpoint in which Kronecker factorizations yield updates that project gradients onto UV^T. The payoff is better conditioning, and potentially faster and more stable convergence than Adam, especially when compute is spent forming and inverting structured covariance estimates.

In practice, these methods approximate the full whitening metric blockwise, per dense layer, to keep costs tractable. Adam/RMSProp is the cheap elementwise baseline. Shampoo (and its variants) keeps Kronecker factors of the gradient covariance and periodically eigendecomposes or matrix-powers them to form inverses; SOAP rotates gradients into those eigenbases and runs an inner Adam; SPlus uses signs; and PSGD iteratively learns a symmetric positive-definite preconditioner via a Q^TQ parameterization.

The tradeoffs are clear: spectral whitening can equalize parameter sensitivities and align updates with natural gradients, but it incurs memory and compute for factor storage, eigendecompositions, or cached inverses, and it requires block approximations plus hybrid fallbacks (e.g., Adam for embeddings and layer norms). Overall, these methods offer principled, higher-fidelity preconditioning that can reduce training time when their extra compute is justified.
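To make the blockwise whitening idea concrete, here is a minimal sketch of a Shampoo-style step for one dense layer, assuming an (m, n) gradient matrix G. The function names, constants, and learning rate are illustrative assumptions, not code from the post; the point is that the layer is preconditioned by inverse fourth roots of two small Kronecker factors rather than by elementwise statistics.

```python
import jax.numpy as jnp

def matrix_power_sym(mat, power, eps=1e-6):
    # Symmetric matrix power via eigendecomposition; used for the inverse roots.
    evals, evecs = jnp.linalg.eigh(mat)
    evals = jnp.maximum(evals, eps)
    return (evecs * evals**power) @ evecs.T

def shampoo_step(G, L, R, lr=1e-3):
    # Accumulate the Kronecker factors of the gradient covariance, then apply
    # L^{-1/4} G R^{-1/4}, the blockwise stand-in for the inverse square root
    # of the full (mn x mn) covariance.
    L = L + G @ G.T                       # left factor, shape (m, m)
    R = R + G.T @ G                       # right factor, shape (n, n)
    update = matrix_power_sym(L, -0.25) @ G @ matrix_power_sym(R, -0.25)
    return -lr * update, L, R

# Usage: keep (L, R) as optimizer state for each dense layer.
m, n = 8, 4
G = 0.1 * jnp.ones((m, n))                # stand-in gradient for one layer
L, R = 1e-6 * jnp.eye(m), 1e-6 * jnp.eye(n)
delta, L, R = shampoo_step(G, L, R)
```

Real implementations amortize this cost, recomputing the eigendecompositions only periodically and caching the inverse roots, and, as the summary notes, fall back to Adam for parameters that are not plain matrices (embeddings, layer norms).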
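The spectral-norm viewpoint admits an even shorter sketch. If the Kronecker factors are built from the current gradient alone (L = GG^T, R = G^T G), the whitened step L^{-1/4} G R^{-1/4} collapses to UV^T, where G = U S V^T is the SVD: the gradient's directions are kept and its singular values are equalized. The exact-SVD form below is the definitional version and an assumption of this sketch; practical methods approximate the projection more cheaply.

```python
import jax.numpy as jnp

def spectral_update(G):
    # Replace G = U S V^T with U V^T: every singular direction of the
    # layer's gradient then moves at the same rate.
    U, _, Vt = jnp.linalg.svd(G, full_matrices=False)
    return U @ Vt

G = jnp.array([[2.0, 0.0, 0.0],
               [0.0, 0.5, 0.0]])
delta = -1e-3 * spectral_update(G)   # step whose singular values are all lr
```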
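Finally, the Q^TQ parameterization credited to PSGD can be shown in a few lines. The sketch only illustrates why the parameterization is convenient: P = Q^TQ is symmetric positive semi-definite for any Q (positive definite when Q is invertible), so Q can be adapted freely while P remains a valid preconditioner. The iterative criterion PSGD uses to fit Q is omitted, and the triangular Q below is an assumed structure for illustration.

```python
import jax.numpy as jnp

def precondition(Q, g):
    # P = Q^T Q is symmetric positive semi-definite for any Q, so the
    # preconditioner stays a valid metric no matter how Q is adapted.
    P = Q.T @ Q
    return P @ g

d = 4
Q = jnp.triu(0.01 * jnp.ones((d, d))) + jnp.eye(d)   # assumed triangular Q
g = jnp.ones(d)
step = -1e-3 * precondition(Q, g)
```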