Spikes in LLMs Are Bias Vectors: Spike-Free Quantization (arxiv.org)

🤖 AI Summary
A new study argues that the large activation spikes seen in Large Language Models (LLMs) act as structural bias vectors tied to specific spike-carrying tokens, rather than as generic high-level biases. These spikes stretch the dynamic range a quantizer has to cover, which disrupts quantization and degrades model performance. The research examines how projection weights interact with spike-carrying tokens and finds that they tend to stabilize into constant vectors that influence the attention mechanism. It also shows that models keep these biases intact under perturbations such as Rotary Positional Embedding (RoPE) by confining them to "zones of rotational stability." To deal with the spikes, the researchers introduce INSERTQUANT, a post-training quantization framework that clamps spikes while preserving their function through pre-computed template vectors. The result is a spike-free architecture that improves low-bit quantization without sacrificing fidelity. According to the summary, INSERTQUANT matches the performance of leading per-tensor quantization methods for LLMs and also carries over to other modalities, such as Vision Transformers (ViTs). The work suggests new ways to improve model efficiency across applications.
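To make the "stretched dynamic range" point concrete, here is a minimal sketch in plain Python. It is not INSERTQUANT and does not reproduce the paper's template vectors. It only assumes a generic symmetric per-tensor quantizer, a hypothetical spike value of 1000.0, and a hypothetical clamp range of [-1, 1], and compares the error on the ordinary activations with and without the spike.

```python
import random


def quantize_dequantize(values, bits=4):
    """Symmetric per-tensor quantization: a single scale shared by all values."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the integer grid, clip to the representable range, map back.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(reference, approx):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


random.seed(0)
# Assumption: ordinary activations are drawn uniformly from [-1, 1].
normal = [random.uniform(-1.0, 1.0) for _ in range(1024)]
# Assumption: one spike-carrying position with a much larger magnitude.
spiked = normal + [1000.0]

# With the spike present, the shared scale is set by the spike alone.
dequant_spiked = quantize_dequantize(spiked)[:-1]
err_with_spike = mean_abs_error(normal, dequant_spiked)

# Clamping the spike (hypothetical range) lets the scale fit the ordinary values.
clamped = [max(-1.0, min(1.0, v)) for v in spiked]
dequant_clamped = quantize_dequantize(clamped)[:-1]
err_clamped = mean_abs_error(normal, dequant_clamped)

print("error on ordinary activations with spike:", err_with_spike)
print("error on ordinary activations after clamping:", err_clamped)
```

In this toy setup the spike forces every ordinary activation onto the zero bin, while clamping restores a usable grid. Plain clamping also throws away whatever the spike was doing for the model, which is the gap the summary says the pre-computed template vectors are meant to close.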