🤖 AI Summary
ButterflyQuant introduces a learnable, orthogonal “butterfly” rotation for ultra-low-bit LLM quantization, targeting the catastrophic performance drop that activation outliers cause in aggressive 2-bit schemes. Prior rotation-based fixes (e.g., QuIP/QuaRot) apply fixed Hadamard transforms to spread outliers before quantization, but the Hadamard transform’s discrete ±1 structure is non-differentiable and cannot adapt per layer. ButterflyQuant instead parameterizes orthogonal butterfly transforms with continuous Givens rotation angles, preserving orthogonality (and thus the invariance y = W x = (W Q^T)(Q x)) while enabling gradient-based learning that adapts to layer-specific outlier patterns. The transform costs O(n log n) compute, uses only (n log n)/2 learnable parameters, and is complemented by a uniformity regularizer on post-rotation activations that yields distributions more amenable to low-bit quantization.
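To make the construction concrete, here is a minimal PyTorch sketch (not from the paper) of an orthogonal butterfly transform with learnable Givens angles. The class name `ButterflyRotation` and the zero-angle (identity) initialization are illustrative assumptions, but the structure follows the description above: log2(n) stages, each applying n/2 paired 2×2 rotations, orthogonal for any angle values.

```python
import math
import torch
import torch.nn as nn


class ButterflyRotation(nn.Module):
    """Illustrative butterfly orthogonal transform with learnable Givens angles.

    For n = 2^k there are log2(n) stages; each stage applies n/2 independent
    2x2 rotations, so the transform has (n/2)*log2(n) angles and costs
    O(n log n) per application. It is orthogonal for any angle values, so
    y = W x = (W Q^T)(Q x) holds exactly while the angles are trained.
    """

    def __init__(self, n: int):
        super().__init__()
        assert n > 1 and (n & (n - 1)) == 0, "n must be a power of two"
        self.n = n
        self.num_stages = int(math.log2(n))
        # Zero angles give Q = I, a convenient (assumed) starting point.
        self.angles = nn.Parameter(torch.zeros(self.num_stages, n // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., n). Applies Q along the last dimension.
        batch = x.shape[:-1]
        for s in range(self.num_stages):
            stride = 1 << s
            blocks = self.n // (2 * stride)
            # Pair index i with i + stride inside blocks of size 2*stride.
            x = x.reshape(*batch, blocks, 2, stride)
            a, b = x[..., 0, :], x[..., 1, :]
            theta = self.angles[s].reshape(blocks, stride)
            c, t = torch.cos(theta), torch.sin(theta)
            x = torch.stack((c * a - t * b, t * a + c * b), dim=-2)
            x = x.reshape(*batch, self.n)
        return x


# Orthogonality holds for arbitrary angles: Q^T Q = I up to float error.
rot = ButterflyRotation(8)
with torch.no_grad():
    rot.angles.uniform_(-math.pi, math.pi)
Q_T = rot(torch.eye(8))  # row i is Q e_i, so this matrix is Q^T
assert torch.allclose(Q_T @ Q_T.T, torch.eye(8), atol=1e-5)
```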
Practically, learning the rotations requires only 128 calibration samples and converges in minutes on a single GPU, a negligible one-time cost for real-world deployment. On LLaMA-2-7B with 2-bit quantization, ButterflyQuant cuts perplexity to 15.4 versus 22.1 for QuaRot, a substantial accuracy recovery. The work suggests that layer-adaptive, differentiable orthogonal rotations are a scalable, low-cost route to practical ultra-low-bit LLMs, enabling broader deployment on consumer hardware and opening avenues for more flexible, learned pre-quantization transforms.
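A rough sketch of how such a short calibration step might look, reusing the `ButterflyRotation` sketch above. The min-max fake quantizer, straight-through estimator, Adam hyperparameters, and the kurtosis-based stand-in for the paper’s uniformity regularizer are all assumptions for illustration, not the paper’s exact procedure.

```python
import torch


def fake_quant(x: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Min-max uniform fake quantization with a straight-through estimator.

    Illustrative stand-in; the paper's quantizer may be configured differently.
    """
    qmax = 2 ** bits - 1
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp_min(1e-8) / qmax
    q = ((x - lo) / scale).round().clamp(0, qmax) * scale + lo
    return x + (q - x).detach()  # pass gradients straight through the rounding


def calibrate(rot, weight, acts, steps=200, lam=0.1, lr=1e-2):
    """Learn butterfly angles from a small calibration set (e.g. 128 samples).

    Minimizes the error of the quantized layer output (Q x)(W Q^T)^T against
    the full-precision output W x, plus a heavy-tail penalty on the rotated
    activations as a stand-in for the paper's uniformity regularizer.
    """
    opt = torch.optim.Adam(rot.parameters(), lr=lr)
    ref = acts @ weight.T                    # full-precision layer output
    eye = torch.eye(weight.shape[1])
    for _ in range(steps):
        q_t = rot(eye)                       # rows are Q e_i, i.e. Q^T
        w_q = fake_quant(weight @ q_t)       # quantized rotated weight W Q^T
        x_rot = rot(acts)                    # rotated activations Q x
        out = fake_quant(x_rot) @ w_q.T
        mse = (out - ref).pow(2).mean()
        # Kurtosis proxy: penalize heavy-tailed (outlier-prone) distributions.
        z = (x_rot - x_rot.mean()) / x_rot.std().clamp_min(1e-8)
        loss = mse + lam * z.pow(4).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return rot
```

A few hundred such steps over 128 calibration rows is the kind of minutes-scale, one-time cost the summary describes.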