🤖 AI Summary
HPC-AI’s team demonstrates that 2:4 semi-structured sparsity (exactly two of every four consecutive weights are zero) can deliver practical, hardware-accelerated LLM inference gains on NVIDIA GPUs. Using post-training pruning (SparseGPT) plus an open-source stack (llm-compressor, vLLM) and NVIDIA’s CUTLASS sparse GEMM, they report a 1.27× end-to-end speedup (97.6 s vs 123.7 s on a 1,000-prompt benchmark with 1024-token inputs and outputs) and a 1.22× FP8 sparse GEMM speedup over cuBLAS FP8 (≈2× over cuBLAS BF16). The work leverages the native 2:4 support in NVIDIA Ampere/Hopper sparse tensor cores and their high FP8 throughput, showing that semi-structured sparsity unlocks real throughput improvements with modest engineering.
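To make the 2:4 constraint concrete, here is a minimal PyTorch sketch (not from the post; the function name and shapes are illustrative) that applies plain magnitude-based 2:4 pruning: within every group of four consecutive weights along the input dimension, the two smallest-magnitude entries are zeroed.

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in every group of 4 along the last dim."""
    assert weight.shape[-1] % 4 == 0, "2:4 pattern needs the last dim divisible by 4"
    groups = weight.reshape(-1, 4)                      # view the matrix as groups of 4
    keep = groups.abs().topk(k=2, dim=-1).indices       # keep the 2 largest |w| per group
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(8, 16)
w_sparse = prune_2_4(w)
# every consecutive group of 4 now has at most 2 nonzeros
assert (w_sparse.reshape(-1, 4).ne(0).sum(dim=-1) <= 2).all()
```

This fixed 2-of-4 structure is what lets the sparse tensor cores skip the zeroed multiplications with a compact metadata index, instead of the irregular memory access patterns unstructured sparsity would require.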
Technically, SparseGPT performs one-shot pruning by approximating the Hessian inverse of each weight matrix to decide which weights to zero, making post-training sparsification scalable to very large models; in their example, a Llama-3-8B checkpoint shrinks from roughly 30 GB to about 6.1 GB. Important tradeoffs remain: pruning without fine-tuning can degrade accuracy (their 5-shot MMLU: dense 0.667 vs SparseGPT 0.404), while fine-tuned sparse models recover much of the loss (≈0.604). Bottom line: 2:4 sparsity is a practical, GPU-friendly path to faster, smaller LLMs using mature open-source tools, useful for teams prioritizing inference cost and throughput, but expect to invest in fine-tuning when accuracy matters.
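The "approximate the Hessian inverse" step can be sketched with the OBS-style saliency w² / [H⁻¹]ⱼⱼ, where H is a damped Hessian estimated from calibration activations. The toy below (names, shapes, and the damping factor are hypothetical, and it omits SparseGPT's column-by-column weight updates and blocking) only shows how that saliency, rather than raw magnitude, picks which 2 of every 4 weights survive.

```python
import torch

def sparsegpt_style_2_4_mask(weight: torch.Tensor, calib_x: torch.Tensor,
                             damp: float = 0.01) -> torch.Tensor:
    """Toy 2:4 mask from the saliency w^2 / diag(H^-1), with H = X^T X (damped).
    Real SparseGPT also corrects the remaining weights; this is selection only."""
    # weight: (out_features, in_features); calib_x: (num_samples, in_features)
    H = calib_x.T @ calib_x
    H += damp * H.diagonal().mean() * torch.eye(H.shape[0])   # damping for invertibility
    hinv_diag = torch.linalg.inv(H).diagonal()                # [H^-1]_jj per input column
    saliency = weight.pow(2) / hinv_diag                      # higher = costlier to prune
    groups = saliency.reshape(-1, 4)
    keep = groups.topk(k=2, dim=-1).indices                   # keep 2 most salient per 4
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return mask.reshape(weight.shape)

w = torch.randn(64, 128)
x = torch.randn(256, 128)       # calibration activations feeding this layer
w_pruned = w * sparsegpt_style_2_4_mask(w, x)
```

Weighting by the inverse-Hessian diagonal is why a calibration set matters: a small weight on a high-activation input column can be more important to keep than a large weight on a rarely used one, which is part of why one-shot pruning degrades accuracy less than naive magnitude pruning, yet still benefits from fine-tuning afterwards.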