Creating custom kernels for the AMD MI300 (huggingface.co)

🤖 AI Summary
Hugging Face and AMD released a set of open-source, MI300X-optimized GPU kernels (hf-rocm-kernels) designed to speed up serving Llama 3.1 405B in FP8 with vLLM on an 8× MI300X node. The work focuses on kernel-level wins, which are often overlooked compared with model or quantization changes, and delivers three hand-tuned kernels: a fused residual-add + RMSNorm + FP8 conversion, a fused SwiGLU + FP8 conversion, and a "skinny" GEMM. Measured in a decoding regime (input size 1, output size 128, median over 30 runs), these kernels yield significant latency improvements; they can be used standalone, reproduced with the provided benchmarks and container, and will be integrated into AMD's vLLM fork.

Technically, the team profiled vLLM to identify bottlenecks and found that GEMMs and inter-GPU communication dominated latency, while RMSNorm and SwiGLU together accounted for roughly 15%, making them high-impact targets alongside GEMM optimizations. The kernels exploit MI300X hardware characteristics to balance compute, memory, and synchronization costs: 304 CUs (8 XCDs of 38 CUs each), 64-thread warps, 256 VGPRs per thread, 16 warps per CU, 32 KB of L1, 64 KB of shared memory, 4 MB of L2, 256 MB of Infinity Cache, and 192 GB of VRAM.

This effort broadens performant non-NVIDIA inference options, provides a template for custom kernel development (including compute-and-communicate kernels), and demonstrates how modest kernel-level gains translate into substantial latency and energy savings for large-model inference at scale.
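For context, the first fused kernel collapses three memory-bound elementwise steps (residual add, RMSNorm, FP8 cast) into a single pass over the activations. Below is a minimal PyTorch sketch of the unfused math it replaces; the function name, epsilon, and FP8 format are illustrative assumptions, not details from the post:

```python
import torch

def residual_rmsnorm_fp8(x, residual, weight, eps=1e-6):
    # Residual add; the sum is also carried forward as the next residual.
    hidden = x + residual
    # RMSNorm: scale each row by the reciprocal root-mean-square.
    variance = hidden.to(torch.float32).pow(2).mean(-1, keepdim=True)
    normed = hidden.to(torch.float32) * torch.rsqrt(variance + eps) * weight
    # FP8 cast for the following GEMM (real kernels also apply a
    # quantization scale, omitted here for brevity).
    return normed.to(torch.float8_e4m3fn), hidden
```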
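Likewise, the SwiGLU + FP8 kernel fuses the MLP activation with the quantization needed for the next matmul. A sketch of the unfused equivalent, assuming a Llama-style MLP where the gate and up projections arrive concatenated along the last dimension (that layout, and the FP8 format, are assumptions):

```python
import torch
import torch.nn.functional as F

def swiglu_fp8(gate_up: torch.Tensor) -> torch.Tensor:
    gate, up = gate_up.chunk(2, dim=-1)
    # SwiGLU: SiLU (swish) on the gate half, elementwise product with the up half.
    activated = F.silu(gate) * up
    # FP8 cast for the down projection (scale handling omitted).
    return activated.to(torch.float8_e4m3fn)
```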
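Finally, the "skinny" GEMM targets the decode-time shape regime: with a batch of one token, M is tiny while N and K stay large, so a general GEMM kernel leaves most of the 304 CUs idle and the multiply becomes memory-bandwidth-bound. The shapes below are purely illustrative, not taken from the post:

```python
import torch

# Decode step for a single sequence: one row of activations against a
# large weight matrix, the case a skinny-GEMM kernel is tuned for.
M, K, N = 1, 16384, 16384  # hypothetical hidden sizes
a = torch.randn(M, K, device="cuda", dtype=torch.float16)  # "cuda" also maps to ROCm/HIP devices
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c = a @ b  # shape (1, N): 1 row of output work spread across the whole GPU
```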