Muon: An optimizer for hidden layers in neural networks (kellerjordan.github.io)

🤖 AI Summary
Muon is a newly developed optimizer for the hidden layers of neural networks. It takes the update produced by a standard method such as SGD with momentum and orthogonalizes it with a Newton-Schulz iteration before applying it to the weights, with the aim of improving training efficiency, particularly for large language models and convolutional networks. Empirical results indicate significant speedups: Muon cut the time to reach 94% accuracy on CIFAR-10 from 3.3 to 2.6 A100-seconds and outperformed existing methods such as AdamW on several benchmarks.

The significance of Muon lies in how orthogonalization changes the learning dynamics: by reducing the condition number of the update, it amplifies previously underrepresented parameter directions, which can aid convergence during training. Notably, the memory and computational overhead of the orthogonalization step remains modest, typically below 1% in large-scale training scenarios. These results position Muon as a notable contender among neural-network optimizers, with potential influence on future benchmarks in AI/ML research.
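To make the core idea concrete, here is a minimal NumPy sketch of the kind of Newton-Schulz orthogonalization step described above. It is an illustrative approximation, not the author's implementation: the quintic coefficients below are the tuned values reported in the Muon writeup, and the function name and step count are our own choices. The iteration drives the singular values of the update matrix toward 1 without ever computing an explicit SVD.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration.

    Each step applies an odd matrix polynomial that pushes the singular
    values of G toward 1 (into a band around 1, not exactly 1), so the
    returned matrix is an approximation of U @ V^T from G's SVD.
    Coefficients are the tuned values from the Muon writeup.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize so all singular values are <= 1 (Frobenius norm bounds
    # the spectral norm); this is required for the iteration to converge.
    X = G / (np.linalg.norm(G) + eps)
    # Work with the short-and-wide orientation so X @ X.T is the
    # smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

In an optimizer, a step like this would be applied to the momentum-accumulated gradient of each hidden-layer weight matrix before the weight update; because it is built from a handful of matrix multiplies, it runs cheaply on accelerators, which is consistent with the sub-1% overhead mentioned above.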