🤖 AI Summary
A research team released a throughput-optimized "megakernel" for tensor-parallel Llama-70B inference on H100s and open-sourced the code (research-quality, unsupported). Integrated into the Tokasaurus engine, the megakernel uses an on-GPU interpreter running on each SM to execute fine-grained instructions and aggressively pipeline loads, compute, stores, and communication. On the ShareGPT 65,536-prompt benchmark it outperforms SGLang by >22% end-to-end, demonstrating that fusing whole-model logic into a single GPU-resident kernel can substantially increase large-batch throughput.
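To make the interpreter idea concrete, here is a minimal sketch (not the actual Tokasaurus/megakernel code): one persistent thread block per SM walks a shared stream of fine-grained instructions instead of launching one kernel per operation. The `Opcode` names, the `Instr` layout, and the strided block-to-instruction assignment are hypothetical stand-ins for the real scheduler and fused kernels.

```cuda
// Minimal sketch of an on-GPU interpreter: persistent blocks consume a stream
// of fine-grained instructions. Names and scheduling are illustrative only.
#include <cstdio>
#include <cuda_runtime.h>

enum class Opcode : int {
    RmsAllGather, QkvRope, AttnTranspose, OProjResidual, MlpReduceScatter
};

struct Instr {
    Opcode op;
    int layer;  // transformer layer this instruction belongs to
};

__device__ void run_instr(const Instr& ins) {
    // Placeholder: a real fused instruction would pipeline its loads,
    // tensor-core compute, stores, and NVLink communication here.
}

__global__ void megakernel_interpreter(const Instr* stream, int n_instr) {
    // One persistent block per SM; each block strides through the instruction
    // stream rather than exiting and relaunching a kernel per operation.
    for (int i = blockIdx.x; i < n_instr; i += gridDim.x) {
        run_instr(stream[i]);
        __syncthreads();  // instruction boundary within this block
    }
}

int main() {
    // Toy two-layer schedule mirroring the fused instruction types above.
    const int n_layers = 2, per_layer = 5;
    Instr host_stream[n_layers * per_layer];
    int k = 0;
    for (int l = 0; l < n_layers; ++l) {
        host_stream[k++] = {Opcode::RmsAllGather, l};
        host_stream[k++] = {Opcode::QkvRope, l};
        host_stream[k++] = {Opcode::AttnTranspose, l};
        host_stream[k++] = {Opcode::OProjResidual, l};
        host_stream[k++] = {Opcode::MlpReduceScatter, l};
    }

    Instr* dev_stream = nullptr;
    cudaMalloc(&dev_stream, sizeof(host_stream));
    cudaMemcpy(dev_stream, host_stream, sizeof(host_stream), cudaMemcpyHostToDevice);

    megakernel_interpreter<<<132, 256>>>(dev_stream, k);  // 132 SMs on an H100 SXM
    cudaDeviceSynchronize();
    cudaFree(dev_stream);
    printf("executed %d instructions\n", k);
    return 0;
}
```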
Technically, the work extends prior low-latency megakernel ideas to large-batch, tensor-parallel workloads by (1) defining fused instructions for blocks like RMS-norm+all-gather, QKV+RoPE, attention+distributed-transpose, O-projection+residual, and MLP+reduce-scatter; (2) using sequence-parallel TP but switching the O projection to data-parallel replication, which replaces a costly reduce-scatter with a distributed transpose and cuts network traffic ~8x at the cost of ~9 GB/GPU (~15% less maximum batch size); and (3) overlapping resources at multiple levels: within SMs (inter-instruction pipelining to keep tensor cores busy), across SMs (scheduling compute-bound and memory-bound work concurrently), and across GPUs (background "storer" threads to hide NVLink transfers). The result shows the interpreter/megakernel abstraction transfers from the latency to the throughput regime and highlights practical trade-offs (memory overhead, fragility across compilers and GPU setups) for teams aiming to wring more performance from H100s.
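A quick back-of-the-envelope check of the ~9 GB/GPU replication cost, assuming standard Llama-70B dimensions (hidden size 8192, 80 layers), bf16 weights, and 8-way tensor parallelism; these inputs and the arithmetic are illustrative assumptions, not figures taken from the release.

```cuda
// Estimate the memory cost of replicating the O-projection weights on every
// GPU instead of sharding them across the tensor-parallel group.
#include <cstdio>

int main() {
    const double hidden = 8192.0, layers = 80.0, bytes_per_param = 2.0, tp = 8.0;
    const double full_gb    = hidden * hidden * layers * bytes_per_param / 1e9;  // all O-proj weights
    const double sharded_gb = full_gb / tp;                                      // usual TP shard per GPU
    // Replication makes every GPU hold the full matrix instead of a 1/8 shard.
    printf("full %.1f GB, shard %.1f GB, extra per GPU %.1f GB\n",
           full_gb, sharded_gb, full_gb - sharded_gb);  // ~10.7, ~1.3, ~9.4 GB
    return 0;
}
```

Under these assumptions the extra weight memory comes out to roughly 9.4 GB per GPU, in line with the ~9 GB figure the summary cites.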