🤖 AI Summary
AMD’s HipKittens project publishes an opinionated set of programming primitives that unlock the raw compute of CDNA GPUs: hardware that is fast on paper but underutilized by current AMD toolchains. The team lays out a pragmatic alternative to NVIDIA-style wave specialization: optimized register tiles, 8-wave and 4-wave kernel patterns, and chiplet-aware cache reuse that schedules work both within and across processors. HipKittens exposes a PyTorch-like tile API that directly wraps CDNA assembly and HIP, so developers can control register allocation, use small matrix-core instructions, and leverage AMD’s TMA-like direct global→shared loads to build deep pipelines without relying on undocumented or missing compiler/ISA features.
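HipKittens’ real API is C++/HIP; as a rough, language-agnostic illustration of the tile programming model the summary describes (kernels composed of tile loads, matrix-core multiply-accumulates, and tile stores rather than scalar loops), here is a minimal pure-Python sketch. All names here (`load_tile`, `mma`, `store_tile`, `TILE`) are invented for illustration and are not HipKittens identifiers.

```python
# Hypothetical sketch of tile-granularity programming: the kernel body only
# touches fixed-size "register tiles", mirroring how a tile API maps work to
# matrix-core instructions. Names and shapes are illustrative, not HipKittens'.

TILE = 4  # tile edge; real kernels use hardware matrix-core shapes (e.g. 16x16)

def load_tile(m, r0, c0):
    # Copy a TILE x TILE block of matrix m into a "register tile".
    return [[m[r0 + i][c0 + j] for j in range(TILE)] for i in range(TILE)]

def mma(acc, a, b):
    # acc += a @ b on TILE x TILE tiles, standing in for a matrix-core op.
    for i in range(TILE):
        for j in range(TILE):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(TILE))

def store_tile(m, t, r0, c0):
    # Write the accumulator tile back to the output matrix.
    for i in range(TILE):
        for j in range(TILE):
            m[r0 + i][c0 + j] = t[i][j]

def matmul_tiled(a, b, n):
    c = [[0.0] * n for _ in range(n)]
    for r0 in range(0, n, TILE):          # one "block" per output tile
        for c0 in range(0, n, TILE):
            acc = [[0.0] * TILE for _ in range(TILE)]
            for k0 in range(0, n, TILE):  # the dimension a deep pipeline overlaps
                mma(acc, load_tile(a, r0, k0), load_tile(b, k0, c0))
            store_tile(c, acc, r0, c0)
    return c
```

On a GPU, the inner `k0` loop is where load/compute pipelining happens: the next tiles are fetched (via TMA-like global→shared loads) while the current `mma` executes.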
Technically, HipKittens responds to CDNA specifics: the MI355X has 256 compute units split into eight 32-CU chiplets (XCDs) with a disaggregated L2 plus last-level cache, a larger per-processor register file but smaller shared SRAM than comparable NVIDIA parts, and different matrix-instruction and data-layout primitives. The key problems it solves are explicit register scheduling (working around HIPCC’s AGPR/VGPR limitations), per-shape swizzle patterns that avoid bank and phase conflicts (64-bit and 128-bit instruction granularities require different layouts), and NUMA-style scheduling across chiplets. Empirically, HipKittens’ 4/8 and 4/12 producer/consumer configurations sustain high TFLOPs without classic wave specialization, demonstrating that AMD’s architectural trade-offs (more registers, fine-grained tensor ops, chiplets) demand new kernel design patterns rather than ported NVIDIA approaches.
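To make the swizzle problem concrete, here is a toy model of shared-memory bank conflicts, assuming a simplified scratchpad with 32 banks of one word each. HipKittens’ per-shape swizzles are more involved (the 64-bit vs 128-bit access granularities mentioned above need different layouts); this sketch only shows why some swizzle is needed when lanes read down a column.

```python
# Toy bank-conflict model: 32 banks, row-major tile of width 32. With a naive
# layout, 32 lanes reading one column all hit the same bank; an XOR swizzle
# permutes columns per row so the same access touches 32 distinct banks.
# The bank/width constants and XOR pattern are illustrative assumptions.

BANKS = 32
WIDTH = 32  # elements per row of the shared tile

def bank_linear(row, col):
    # Naive row-major placement: bank = word address mod BANKS.
    return (row * WIDTH + col) % BANKS

def bank_swizzled(row, col):
    # XOR swizzle: permute the column within each row by the row index.
    return (row * WIDTH + (col ^ (row % BANKS))) % BANKS

def conflict_degree(bank_fn, col):
    # 32 lanes each read one element of column `col`, rows 0..31; the degree
    # is the worst-case number of lanes serialized on a single bank.
    banks = [bank_fn(row, col) for row in range(32)]
    return max(banks.count(b) for b in set(banks))

print(conflict_degree(bank_linear, 0))    # 32: every lane hits bank 0
print(conflict_degree(bank_swizzled, 0))  # 1: banks are all distinct
```

Row accesses remain conflict-free under the swizzle (XOR is a bijection within each row), which is why XOR-style swizzles are the standard trick; the hard part on CDNA, per the summary, is that each tile shape and instruction width needs its own pattern.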