🤖 AI Summary
Researchers evaluated how simple gating changes affect transformer attention and found a surprisingly powerful, low-cost improvement: applying a head-specific sigmoid gate right after Scaled Dot-Product Attention (SDPA) consistently boosts performance, training stability, and scaling. The paper systematically compares 30 gating variants across large-scale experiments (15B Mixture-of-Experts and 1.7B dense models trained on a 3.5 trillion-token corpus) and shows that this single modification lets models tolerate larger learning rates, train more stably, and scale better than ungated baselines. The authors also release code and model weights to encourage follow-up work.
Technically, the benefit comes from two complementary effects: (1) the gate injects non-linearity into the low-rank mapping performed by softmax attention, and (2) the sigmoid gate produces query-dependent, effectively sparse modulation of attention outputs. That induced sparsity helps avoid “attention sink” — where attention collapses onto a few tokens — and improves long-context extrapolation. Because the change is local (per-head gating after SDPA), it’s cheap to add to existing transformer stacks yet yields measurable robustness and generalization gains, making it a practical lever for researchers and practitioners tuning large language models.
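As a concrete illustration, the modification can be sketched in a few lines of PyTorch: compute standard SDPA per head, then multiply the output elementwise by a sigmoid gate projected from the same input (making it query-dependent). The class and parameter names below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Multi-head attention with a head-specific sigmoid gate applied
    to the SDPA output. The gate is projected from the input (query
    side), so the modulation is query-dependent. Shapes and names are
    a sketch of the idea, not the paper's reference code."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Per-head, per-channel gate; sigmoid keeps values in (0, 1)
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, n_heads, T, d_head) for per-head attention
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)  # standard SDPA
        # The single modification: query-dependent sigmoid gate on the
        # SDPA output, applied per head before the output projection.
        g = torch.sigmoid(self.gate(x))
        g = g.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        attn = g * attn
        attn = attn.transpose(1, 2).reshape(B, T, C)
        return self.out(attn)
```

Because the gate sits after SDPA and touches only per-head activations, it adds one linear projection of parameters and negligible compute, which is why it is cheap to retrofit into existing transformer blocks.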