🤖 AI Summary
AMD’s HipKittens project publishes an opinionated set of programming primitives that unlock the raw compute of CDNA GPUs: hardware that is fast on paper but underutilized by current AMD toolchains. The team lays out a pragmatic alternative to NVIDIA-style wave specialization: optimized register tiles, 8-wave and 4-wave kernel patterns, and chiplet-aware cache reuse that schedules work both within and across processors. HipKittens exposes a PyTorch-like tile API that directly wraps CDNA assembly and HIP, so developers can control register allocation, use small matrix-core instructions, and leverage AMD’s TMA-like direct global→shared loads to build deep pipelines without relying on undocumented or missing compiler/ISA features.
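HipKittens’ real API is C++/HIP; as a rough, language-agnostic illustration of the tile programming model the summary describes (kernels composed of tile loads, matrix-core multiply-accumulates, and tile stores rather than scalar loops), here is a minimal pure-Python sketch. All names here (`load_tile`, `mma`, `store_tile`, `TILE`) are invented for illustration and are not HipKittens identifiers.

```python
# Hypothetical sketch of tile-granularity programming: the kernel body only
# touches fixed-size "register tiles", mirroring how a tile API maps work to
# matrix-core instructions. Names and shapes are illustrative, not HipKittens'.

TILE = 4  # tile edge; real kernels use hardware matrix-core shapes (e.g. 16x16)

def load_tile(m, r0, c0):
    # Copy a TILE x TILE block of matrix m into a "register tile".
    return [[m[r0 + i][c0 + j] for j in range(TILE)] for i in range(TILE)]

def mma(acc, a, b):
    # acc += a @ b on TILE x TILE tiles, standing in for a matrix-core op.
    for i in range(TILE):
        for j in range(TILE):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(TILE))

def store_tile(m, t, r0, c0):
    # Write the accumulator tile back to the output matrix.
    for i in range(TILE):
        for j in range(TILE):
            m[r0 + i][c0 + j] = t[i][j]

def matmul_tiled(a, b, n):
    c = [[0.0] * n for _ in range(n)]
    for r0 in range(0, n, TILE):          # one "block" per output tile
        for c0 in range(0, n, TILE):
            acc = [[0.0] * TILE for _ in range(TILE)]
            for k0 in range(0, n, TILE):  # the dimension a deep pipeline overlaps
                mma(acc, load_tile(a, r0, k0), load_tile(b, k0, c0))
            store_tile(c, acc, r0, c0)
    return c
```

On a GPU, the inner `k0` loop is where load/compute pipelining happens: the next tiles are fetched (via TMA-like global→shared loads) while the current `mma` executes.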
Technically, HipKittens responds to CDNA specifics: the MI355X has 256 compute units split into eight 32-CU chiplets (XCDs) with a disaggregated L2 plus last-level cache, a larger per-processor register file but smaller shared SRAM than comparable NVIDIA parts, and different matrix-instruction and data-layout primitives. The key problems it solves are explicit register scheduling (working around HIPCC’s AGPR/VGPR limitations), per-shape swizzle patterns that avoid bank and phase conflicts (64-bit and 128-bit instruction granularities require different layouts), and NUMA-style scheduling across chiplets. Empirically, HipKittens’ 4/8 and 4/12 producer/consumer configurations sustain high TFLOPs without classic wave specialization, demonstrating that AMD’s architectural trade-offs (more registers, fine-grained tensor ops, chiplets) demand new kernel design patterns rather than ported NVIDIA approaches.
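To make the swizzle problem concrete, here is a toy model of shared-memory bank conflicts, assuming a simplified scratchpad with 32 banks of one word each. HipKittens’ per-shape swizzles are more involved (the 64-bit vs 128-bit access granularities mentioned above need different layouts); this sketch only shows why some swizzle is needed when lanes read down a column.

```python
# Toy bank-conflict model: 32 banks, row-major tile of width 32. With a naive
# layout, 32 lanes reading one column all hit the same bank; an XOR swizzle
# permutes columns per row so the same access touches 32 distinct banks.
# The bank/width constants and XOR pattern are illustrative assumptions.

BANKS = 32
WIDTH = 32  # elements per row of the shared tile

def bank_linear(row, col):
    # Naive row-major placement: bank = word address mod BANKS.
    return (row * WIDTH + col) % BANKS

def bank_swizzled(row, col):
    # XOR swizzle: permute the column within each row by the row index.
    return (row * WIDTH + (col ^ (row % BANKS))) % BANKS

def conflict_degree(bank_fn, col):
    # 32 lanes each read one element of column `col`, rows 0..31; the degree
    # is the worst-case number of lanes serialized on a single bank.
    banks = [bank_fn(row, col) for row in range(32)]
    return max(banks.count(b) for b in set(banks))

print(conflict_degree(bank_linear, 0))    # 32: every lane hits bank 0
print(conflict_degree(bank_swizzled, 0))  # 1: banks are all distinct
```

Row accesses remain conflict-free under the swizzle (XOR is a bijection within each row), which is why XOR-style swizzles are the standard trick; the hard part on CDNA, per the summary, is that each tile shape and instruction width needs its own pattern.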