🤖 AI Summary
Researchers released SINQ, an open-source, calibration-free post-training quantization (PTQ) technique that shrinks large language models so they can run at very low precision (≤4 bits) with far less perplexity degradation. SINQ adds a second-axis scale factor to standard uniform quantizers and uses a fast Sinkhorn‑Knopp–style normalization to find per-row and per-column scales that equalize variances. By minimizing a new per-matrix “matrix imbalance” proxy objective, SINQ reduces the harmful effect of outlier weights that would otherwise force shared scales to sacrifice precision. The method is layer-independent, plugs into existing PTQ workflows, and the authors provide code and example quantized models.
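To make the dual-scaling idea concrete, here is a minimal NumPy sketch of the general approach described above: alternately rescale rows and columns of a weight matrix until their spreads are balanced, quantize the balanced matrix with a plain uniform quantizer, and fold the two scale vectors back in at dequantization. The function names, the use of standard deviation as the balancing statistic, and the iteration count are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def sinkhorn_style_scales(W, n_iters=20, eps=1e-8):
    """Alternately normalize row and column spreads of W (Sinkhorn-Knopp-style).

    Returns (W_norm, row_scale, col_scale) such that
    W == np.outer(row_scale, col_scale) * W_norm (up to eps),
    with W_norm having roughly balanced per-row / per-column variance.
    """
    W_norm = W.astype(np.float64).copy()
    rows, cols = W_norm.shape
    row_scale = np.ones(rows)
    col_scale = np.ones(cols)
    for _ in range(n_iters):
        r = W_norm.std(axis=1) + eps          # per-row spread
        W_norm /= r[:, None]
        row_scale *= r
        c = W_norm.std(axis=0) + eps          # per-column spread
        W_norm /= c[None, :]
        col_scale *= c
    return W_norm, row_scale, col_scale

def uniform_quantize(W_norm, bits=4):
    """Plain symmetric uniform quantizer applied to the balanced matrix."""
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(W_norm).max() / qmax
    Q = np.clip(np.round(W_norm / step), -qmax - 1, qmax).astype(np.int8)
    return Q, step

# Demo: a random "weight matrix" with an injected outlier row.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))
W[3, :] *= 25.0                               # simulate an outlier row
W_norm, rs, cs = sinkhorn_style_scales(W)
Q, step = uniform_quantize(W_norm, bits=4)
W_hat = np.outer(rs, cs) * (Q * step)         # dequantize with both scale axes
print("reconstruction MSE:", np.mean((W - W_hat) ** 2))
```

The point of the two scale axes is visible in the demo: without the row/column balancing, the single shared step size would be dominated by the outlier row and the remaining weights would collapse to very few quantization levels.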
Technically, SINQ’s key contribution is an efficient algorithm for bi-directional scaling of linear-layer weight matrices that requires no calibration data yet preserves accuracy at large model scales. The paper reports notable WikiText2 and C4 perplexity improvements on Qwen3 and DeepSeek‑V2.5 over uncalibrated uniform quantization, with further gains when SINQ is combined with calibration or non-uniform quantizers. Practically, this means smaller memory and compute footprints, easier on-device or resource-constrained deployment of LLMs, and a simple drop-in tool for researchers and engineers trying to squeeze models onto less powerful hardware.