🤖 AI Summary
T‑MAC is an open, LUT‑based kernel library for low‑bit LLM inference that replaces dequantize‑then‑multiply workflows with direct lookup‑table matrix multiplications (int1/2/3/4 weights against int8/fp16/fp32 activations). The project has been open‑sourced, powers BitNet, has been integrated into llama.cpp for prefill, and its paper was accepted to EuroSys 2025. Benchmarks across Surface Laptop 7, M2‑Ultra, Jetson AGX Orin, Raspberry Pi 5, and Snapdragon X Elite show 3–5× token‑generation speedups over llama.cpp’s dequantization kernels (e.g., 20 t/s on a single core and 48 t/s on four cores for 3B BitNet on the Surface Laptop; ~11 t/s on Raspberry Pi 5), dramatic prefill gains (Llama‑2‑7B W2: 50.1 t/s at 4 threads vs. a 12.0 t/s baseline), and the ability to meet real‑time targets with far fewer CPU cores.
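To make the lookup‑table idea concrete, here is a minimal C++ sketch of a 1‑bit‑weight GEMV in which every group of four activations gets a precomputed 16‑entry table of partial sums, so each weight group costs one table lookup instead of four multiply‑adds. The names and layout are illustrative assumptions only, not T‑MAC’s actual kernels, which pack int8‑quantized tables into SIMD registers (e.g., NEON table lookups) for throughput.

```cpp
#include <cstdint>
#include <vector>

constexpr int G = 4;  // weight bits consumed per table lookup

// Build one 16-entry table per group of G activations:
// lut[group][idx] = sum_j (bit j of idx ? +a[j] : -a[j]),
// i.e. the partial dot product for every possible 4-bit weight pattern.
static void build_lut(const float* a, int k, std::vector<float>& lut) {
    const int groups = k / G;
    lut.assign(static_cast<size_t>(groups) << G, 0.0f);
    for (int g = 0; g < groups; ++g) {
        for (int idx = 0; idx < (1 << G); ++idx) {
            float s = 0.0f;
            for (int j = 0; j < G; ++j)
                s += ((idx >> j) & 1) ? a[g * G + j] : -a[g * G + j];
            lut[(g << G) + idx] = s;
        }
    }
}

// GEMV y = W * a with 1-bit weights (bit 0 -> -1, bit 1 -> +1).
// w_idx holds one 4-bit group index per byte for clarity; each output row is
// accumulated purely from table lookups -- no multiply-accumulate on weights.
static void lut_gemv(const uint8_t* w_idx, const float* a,
                     float* y, int m, int k) {
    std::vector<float> lut;
    build_lut(a, k, lut);   // table cost is amortized over all m output rows
    const int groups = k / G;
    for (int i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int g = 0; g < groups; ++g)
            acc += lut[(g << G) + (w_idx[i * groups + g] & 0xF)];
        y[i] = acc;
    }
}
```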
Technically, T‑MAC implements mixed‑precision GEMM (mpGEMM) through table lookups rather than heavy multiply‑accumulate (FMA) work, delivers gains at multi‑batch sizes (N>1), and supports GPTQ/GGUF quantization formats (W1(.58)A8, W2A16, W4A16, etc.). It also lowers power and energy per token: on Jetson AGX Orin, T‑MAC on CPU reaches 15.6 t/s at 10.4 W (0.66 J/token) versus 2.12 J/token for llama.cpp on CPU, and on some chips (Snapdragon X Elite) the CPU with T‑MAC outperforms the NPU. An optional fast‑aggregation mode can add another 10–20% speedup. The tradeoffs: building requires TVM/LLVM and per‑platform tuning, and gains vary with CPU memory bandwidth and architecture, but the approach makes high‑quality low‑bit LLM inference far more practical on edge and ARM platforms.
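One way to see why the same tables extend to W2/W4 formats is bit‑serial decomposition: each extra weight bit adds one more lookup pass over the same activation tables. The sketch below reuses `lut_gemv` from above and is a hedged illustration of the general technique, not T‑MAC’s exact kernel; it assumes a symmetric 2‑bit level set {-3, -1, +1, +3} and omits the per‑group scales and zero points that real GPTQ/GGUF formats carry.

```cpp
// A 2-bit weight w = 2*p1 + p0 with p1, p0 in {-1, +1} takes values in
// {-3, -1, +1, +3}; each bit-plane is an ordinary 1-bit lookup-table GEMV,
// so a W2 kernel is just two table-lookup passes combined with a shift.
static void lut_gemv_w2(const uint8_t* w_lsb, const uint8_t* w_msb,
                        const float* a, float* y, int m, int k) {
    std::vector<float> y_lsb(m), y_msb(m);
    lut_gemv(w_lsb, a, y_lsb.data(), m, k);   // least-significant bit-plane
    lut_gemv(w_msb, a, y_msb.data(), m, k);   // most-significant bit-plane
    for (int i = 0; i < m; ++i)
        y[i] = 2.0f * y_msb[i] + y_lsb[i];    // real formats multiply in per-group scales here
}
```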