The Curse of Depth in Large Language Models (arxiv.org)

🤖 AI Summary
Researchers from Westlake University and other institutions describe a "Curse of Depth" in Large Language Models (LLMs): nearly half of the layers in popular model families such as LLaMA and DeepSeek contribute less than expected, with the deeper layers being less effective than the earlier ones. They attribute this largely to Pre-Layer Normalization (Pre-LN), which stabilizes training but lets output variance grow exponentially with depth. As that variance grows, the derivative of the deeper Transformer blocks approaches an identity matrix, so those blocks contribute little to training and the compute spent on them is largely wasted. To address this, the researchers propose LayerNorm Scaling (LNS), which scales the output of layer normalization by the inverse square root of the layer's depth. Their experiments show that LNS improves pre-training performance across a range of model sizes and makes deeper layers contribute more effectively. The change is simple to implement, and the gains carry over to downstream tasks.
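The scaling rule can be illustrated with a minimal sketch in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names `layer_norm` and `layer_norm_scaling` are hypothetical, the normalization has no learned gain or bias, and layers are assumed to be indexed from 1.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and roughly unit variance.

    Simplified: no learned gain or bias (an assumption of this sketch).
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def layer_norm_scaling(x, layer_index, eps=1e-5):
    """Layer normalization followed by a 1/sqrt(layer_index) scale.

    layer_index is assumed to start at 1 for the first layer.
    """
    if layer_index < 1:
        raise ValueError("layer_index must be >= 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]


def variance(x):
    """Population variance of a vector."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


# Hypothetical hidden-state vector, used only for illustration.
hidden = [0.5, -1.2, 3.0, 0.7, -2.1, 1.4]

# Variance of the scaled output at a few depths.
variances = {l: variance(layer_norm_scaling(hidden, l)) for l in (1, 4, 16)}
for l, v in variances.items():
    print(f"layer {l:2d}: output variance = {v:.4f}")
```

Because the output is multiplied by 1/sqrt(l), its variance shrinks by a factor of 1/l at layer l, which is the mechanism the summary credits with keeping variance from growing unchecked in deeper layers.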