🤖 AI Summary
HipKittens is a new, opinionated set of C++-embedded programming primitives plus a collection of state-of-the-art AMD GPU kernels designed to unlock AMD's peak AI performance without resorting to raw assembly. The project targets the MI355X and other CDNA GPUs, which on paper offer competitive or superior peak matrix throughput and much larger memory capacity than recent NVIDIA parts (e.g., 2.5 PFLOPs of BF16 on the MI355X vs. 2.2 PFLOPs on NVIDIA, and 288 GB of memory vs. 180 GB). The authors show that existing AMD software (AITER, PyTorch kernels, Triton, Mojo, TileLang, Composable Kernel) often reaches only a fraction of peak: AITER and PyTorch attention backward hit roughly 30% and 24% of state-of-the-art, and Mojo's MHA about 50%, leaving much of the hardware's performance untapped.
Technically, HipKittens argues that tile-based abstractions (tile types, bulk compute ops, composable load/store) generalize across GPUs, while backend implementations and scheduling must be architecture-specific to handle CDNA quirks such as swizzling, bank-conflict avoidance, register scheduling, and chiplet layouts. The authors find that wave specialization underperforms on CDNA3/4; reasoning at tile granularity instead simplifies development while still attaining peak throughput: their ~500-line attention forward kernel, GEMM hot loop, and attention backward/rotary/fused kernels outperform the available AMD baselines, including hand-written AITER assembly. The work suggests a practical path toward a unified, high-performance multi-silicon programming model by separating high-level tile interfaces from hardware-specific implementations.