🤖 AI Summary
MACKO-SpMV (Mutually Aligned Compressed coordinates Kernel Optimised SpMV) is a new sparse matrix format plus CUDA kernel and PyTorch-friendly library, optimized for the moderate-to-high sparsities (roughly 20–90%) common in neural network pruning. The authors claim substantial memory and throughput wins over dense cuBLAS and classic CSR: on consumer GPUs (RTX 2080/3090/4090), fp16 runs show ~1.5× memory reduction with 1.3–1.5× speedup at 50% density, and ~6.25× memory reduction with 3.5–4.5× speedup at 10% density. In practical terms, an fp16 Llama-2-7B pruned with Wanda saw memory drop from 13.59GB to 8.87GB (50% density) and to 2.67GB (10%), while tokens/sec rose from 66.5 to 98.6 and 255.0 respectively.
Technically, MACKO uses a compressed coordinate-style layout and a custom SpMV kernel compiled on first import (via `torch.utils.cpp_extension.load_inline`), exposing `compress()` and `multiply()` APIs for Torch tensors and supporting `torch.compile`. The format is portable across GPUs, but the kernel needs extra tuning for server cards (H100, V100), and further work is planned on more dtypes (bfloat16, fp8, int8, float32), batch routing, and server-GPU profiling. The repo is pip-installable and open-source (paper arXiv:2511.13061); contributions are invited, especially for optimizing GEMV on H100.
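To make the compress/multiply flow concrete, here is a minimal CPU sketch of a compressed coordinate-style SpMV in NumPy. This is an illustration of the general idea only, not the actual MACKO layout or kernel (whose coordinate packing is tuned for GPU memory alignment); the function names `compress` and `multiply` mirror the library's API, but their signatures here are assumptions.

```python
import numpy as np

def compress(dense, dtype=np.float16):
    """Toy compressed-coordinate layout: per-row column indices + fp16 values.
    Illustrative only -- NOT the real MACKO format."""
    rows = []
    for r in range(dense.shape[0]):
        cols = np.nonzero(dense[r])[0].astype(np.uint16)  # column coordinates
        vals = dense[r, cols].astype(dtype)               # surviving weights
        rows.append((cols, vals))
    return rows, dense.shape

def multiply(compressed, x):
    """SpMV y = A @ x over the compressed rows, accumulating in fp32."""
    rows, (m, _) = compressed
    y = np.zeros(m, dtype=np.float32)
    for r, (cols, vals) in enumerate(rows):
        y[r] = np.dot(vals.astype(np.float32), x[cols].astype(np.float32))
    return y

# 50%-dense example, matching the paper's mid-sparsity regime
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8)).astype(np.float32)
A[rng.random(A.shape) < 0.5] = 0.0  # prune half the weights
x = rng.standard_normal(8).astype(np.float32)

comp = compress(A)
y = multiply(comp, x)
assert np.allclose(y, A @ x, atol=1e-2)  # fp16 storage costs a little precision
```

The storage saving comes from keeping only nonzero values (in fp16) plus their compact column coordinates, which at 10% density is far smaller than the dense weight matrix; the GPU kernel's job is to make the irregular gather (`x[cols]`) fast.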