🤖 AI Summary
Researchers and open-source contributors have released Lion Schedule-Free (LionSF), a variant of the Lion optimizer that removes the need for an external learning-rate scheduler by scaling updates according to how strongly the momentum agrees in sign with the current gradient. In practice, LionSF adapts its effective learning rate automatically: when momentum and gradient align, updates are amplified; when they disagree, updates are attenuated. Lion itself differs from AdamW both in its defaults (β1, β2 = 0.9, 0.99 vs. AdamW's 0.9, 0.999) and in its tuning recommendations: a typical Lion learning rate is 3–10× smaller than AdamW's, and because the effective weight decay is lr · λ, the decoupled λ used with Lion should be 3–10× larger to keep the decay strength comparable. Cosine decay often helps ViT training, and reducing β2 or raising ε (e.g., β1 = 0.95, β2 = 0.98) can stabilize training.
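To make the mechanism concrete, here is a minimal sketch of one schedule-free Lion-style step. The base update follows the published Lion rule (signed interpolation of momentum and gradient, decoupled weight decay, EMA momentum); the `agreement` factor, taken here as the fraction of coordinates where momentum and gradient share a sign, is an illustrative assumption, since the summary does not spell out LionSF's exact scaling formula.

```python
import torch

@torch.no_grad()
def lion_sf_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=1e-2):
    """One Lion-style update with a hypothetical sign-agreement scale.

    The base update is the standard Lion rule; `agreement` is a stand-in
    for LionSF's scheduler-free scaling (amplify when momentum and the
    current gradient agree, attenuate when they disagree).
    """
    # Signed interpolation of momentum and gradient (standard Lion update direction).
    update = torch.sign(m.mul(beta1).add(grad, alpha=1 - beta1))

    # Hypothetical agreement factor in [0, 1]: fraction of coordinates
    # where momentum and the current gradient share a sign.
    agreement = (torch.sign(m) == torch.sign(grad)).float().mean()

    # Decoupled weight decay; effective decay strength is lr * weight_decay.
    param.mul_(1 - lr * weight_decay)
    # Scale the step by the agreement factor instead of an external schedule.
    param.add_(update, alpha=-lr * agreement.item())

    # Momentum update (EMA with beta2, as in standard Lion).
    m.mul_(beta2).add_(grad, alpha=1 - beta2)
    return param, m
```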
Empirical reports are promising but mixed: several language-modeling and text-to-image runs outperform Adam when tuned (notably with an ~3× smaller lr and large batches of ≥64), but some settings show negative results unless adjusted, including RL, certain feedforward or hybrid architectures, and runs sensitive to batch size or augmentation. The implementation is available via pip/conda (lion-pytorch), with an optional Triton-fused CUDA kernel for speed; see the usage sketch below. Overall, LionSF offers a simple, scheduler-free alternative that can simplify training pipelines and improve results in many large-batch, high-data regimes, but it still requires careful per-task tuning.
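For reference, a typical setup with the lion-pytorch package looks roughly like the following; the specific lr and weight-decay values are illustrative starting points derived from the 3–10× rules of thumb above, not tuned recommendations.

```python
# pip install lion-pytorch
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(128, 10)

# Rule of thumb from the summary: if an AdamW baseline used lr=1e-3 and
# weight_decay=0.1, a Lion run might start around lr=1e-4 (~10x smaller)
# and weight_decay=1.0 (~10x larger), keeping lr * weight_decay similar.
opt = Lion(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.99),   # Lion defaults (vs. AdamW's 0.9, 0.999)
    weight_decay=1.0,
    # use_triton=True,   # optional fused Triton kernel (requires triton)
)

# Standard training step.
x, y = torch.randn(8, 128), torch.randint(0, 10, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()
```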