Attention Sinks and Compression Valleys in LLMs (arxiv.org)

🤖 AI Summary
Recent research connects two phenomena in large language models (LLMs) that had previously been studied in isolation: attention sinks and compression valleys. The study shows that both arise from massive activations in the residual stream. The authors prove theoretically that such activations force representational compression, deriving bounds on the resulting entropy reduction, and confirm the effect empirically across model sizes from 410M to 120B parameters. Targeted ablation studies further show that attention sinks and compression valleys emerge together when extreme activation norms appear in the middle layers.

Building on these results, the paper proposes a Mix-Compress-Refine view of information flow: transformer-based LLMs mix information broadly in early layers, compute over compressed representations in middle layers, and selectively refine them in later layers. This framework helps explain why embedding tasks peak at intermediate layers while generation tasks require full-depth processing, sharpening our understanding of task-dependent representations and the internal workings of LLMs.
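The core mechanism, that a single massively activated token compresses the representation, can be illustrated with a toy numpy sketch. Here "compression" is measured as the Shannon entropy of the normalized eigenvalue spectrum of the Gram matrix of token representations; this is a common matrix-based entropy proxy chosen for illustration, not necessarily the paper's exact metric, and the dimensions and scaling factor are arbitrary assumptions.

```python
import numpy as np

def spectral_entropy(X: np.ndarray) -> float:
    """Shannon entropy of the normalized eigenvalue spectrum of X @ X.T.

    Low entropy means a few directions dominate, i.e. the token
    representations are effectively compressed.
    """
    eig = np.linalg.eigvalsh(X @ X.T)
    eig = np.clip(eig, 0.0, None)          # guard tiny negative numerical noise
    p = eig / eig.sum()                     # normalize to a probability vector
    p = p[p > 1e-12]                        # drop zeros before taking logs
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
T, d = 64, 128                              # tokens x hidden dim (toy sizes)
X = rng.standard_normal((T, d))             # ordinary token representations
H_base = spectral_entropy(X)

X_sink = X.copy()
X_sink[0] *= 100.0                          # one token gets a massive activation norm
H_sink = spectral_entropy(X_sink)

# The dominant-norm token collapses the spectrum: entropy drops sharply.
print(f"entropy without massive activation: {H_base:.3f}")
print(f"entropy with massive activation:    {H_sink:.3f}")
```

Because the scaled token's outer product dominates the Gram matrix, almost all spectral mass concentrates on one eigenvalue, mirroring how a massive activation in the residual stream bounds the representational entropy from above.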