🤖 AI Summary
A research team released a throughput-optimized "megakernel" for tensor-parallel Llama-70B inference on H100s and open-sourced the code (research-quality, unsupported). Integrated into the Tokasaurus engine, the megakernel uses an on-GPU interpreter running on each SM to execute fine-grained instructions and aggressively pipeline loads, compute, stores, and communication. On the ShareGPT 65,536-prompt benchmark it outperforms SGLang by >22% end-to-end, demonstrating that fusing whole-model logic into a single GPU-resident kernel can substantially increase large-batch throughput.
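To make the interpreter idea concrete, here is a minimal sketch (not the actual Tokasaurus/megakernel code): one persistent thread block per SM walks a shared stream of fine-grained instructions instead of launching one kernel per operation. The `Opcode` names, the `Instr` layout, and the strided block-to-instruction assignment are hypothetical stand-ins for the real scheduler and fused kernels.

```cuda
// Minimal sketch of an on-GPU interpreter: persistent blocks consume a stream
// of fine-grained instructions. Names and scheduling are illustrative only.
#include <cstdio>
#include <cuda_runtime.h>

enum class Opcode : int {
    RmsAllGather, QkvRope, AttnTranspose, OProjResidual, MlpReduceScatter
};

struct Instr {
    Opcode op;
    int layer;  // transformer layer this instruction belongs to
};

__device__ void run_instr(const Instr& ins) {
    // Placeholder: a real fused instruction would pipeline its loads,
    // tensor-core compute, stores, and NVLink communication here.
}

__global__ void megakernel_interpreter(const Instr* stream, int n_instr) {
    // One persistent block per SM; each block strides through the instruction
    // stream rather than exiting and relaunching a kernel per operation.
    for (int i = blockIdx.x; i < n_instr; i += gridDim.x) {
        run_instr(stream[i]);
        __syncthreads();  // instruction boundary within this block
    }
}

int main() {
    // Toy two-layer schedule mirroring the fused instruction types above.
    const int n_layers = 2, per_layer = 5;
    Instr host_stream[n_layers * per_layer];
    int k = 0;
    for (int l = 0; l < n_layers; ++l) {
        host_stream[k++] = {Opcode::RmsAllGather, l};
        host_stream[k++] = {Opcode::QkvRope, l};
        host_stream[k++] = {Opcode::AttnTranspose, l};
        host_stream[k++] = {Opcode::OProjResidual, l};
        host_stream[k++] = {Opcode::MlpReduceScatter, l};
    }

    Instr* dev_stream = nullptr;
    cudaMalloc(&dev_stream, sizeof(host_stream));
    cudaMemcpy(dev_stream, host_stream, sizeof(host_stream), cudaMemcpyHostToDevice);

    megakernel_interpreter<<<132, 256>>>(dev_stream, k);  // 132 SMs on an H100 SXM
    cudaDeviceSynchronize();
    cudaFree(dev_stream);
    printf("executed %d instructions\n", k);
    return 0;
}
```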
Technically, the work extends prior low-latency megakernel ideas to large-batch, tensor-parallel workloads by (1) defining fused instructions for blocks like RMS-norm+all-gather, QKV+RoPE, attention+distributed-transpose, O-projection+residual, and MLP+reduce-scatter; (2) using sequence-parallel TP but switching the O projection to data-parallel replication, which replaces a costly reduce-scatter with a distributed transpose and cuts network traffic ~8x at the cost of ~9 GB/GPU (~15% less maximum batch size); and (3) overlapping resources at multiple levels: within SMs (inter-instruction pipelining to keep tensor cores busy), across SMs (scheduling compute-bound and memory-bound work concurrently), and across GPUs (background "storer" threads to hide NVLink transfers). The result shows the interpreter/megakernel abstraction transfers from the latency to the throughput regime and highlights practical trade-offs (memory overhead, fragility across compilers and GPU setups) for teams aiming to wring more performance from H100s.
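A quick back-of-the-envelope check of the ~9 GB/GPU replication cost, assuming standard Llama-70B dimensions (hidden size 8192, 80 layers), bf16 weights, and 8-way tensor parallelism; these inputs and the arithmetic are illustrative assumptions, not figures taken from the release.

```cuda
// Estimate the memory cost of replicating the O-projection weights on every
// GPU instead of sharding them across the tensor-parallel group.
#include <cstdio>

int main() {
    const double hidden = 8192.0, layers = 80.0, bytes_per_param = 2.0, tp = 8.0;
    const double full_gb    = hidden * hidden * layers * bytes_per_param / 1e9;  // all O-proj weights
    const double sharded_gb = full_gb / tp;                                      // usual TP shard per GPU
    // Replication makes every GPU hold the full matrix instead of a 1/8 shard.
    printf("full %.1f GB, shard %.1f GB, extra per GPU %.1f GB\n",
           full_gb, sharded_gb, full_gb - sharded_gb);  // ~10.7, ~1.3, ~9.4 GB
    return 0;
}
```

Under these assumptions the extra weight memory comes out to roughly 9.4 GB per GPU, in line with the ~9 GB figure the summary cites.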