Loads and Loads of Fluffy Kittens (hazyresearch.stanford.edu)

🤖 AI Summary
ThunderKittens’ follow-up announces a set of principled guidelines and new multi-GPU compute–communication kernels that fuse collectives (all-gather, reduce-scatter, all-reduce) with heavy compute (GEMMs, attention, expert dispatch) in a few dozen lines of device code, matching or exceeding hand-optimized implementations. The team distills three repeatable levers for high utilization on modern multi-GPU systems (HGX H100/B200 with NVSwitch): pick the right transfer mechanism, pick the right overlapping schedule, and communicate at tile granularity.

Practical findings: the copy engine reaches peak NVLink bandwidth only for very large chunks (~256 MB), TMA can saturate NVLink with tiny messages (≈2 KB) while occupying few SMs (≈15 of 148 on B200), and register-level instructions are required for in-network reductions and element-level transfers. They identify two scheduling families: intra-SM overlapping, which assigns warps to communication so the tensor cores run uninterrupted (a GEMM + reduce-scatter kernel achieved <1% non-overlapped communication), and inter-SM overlapping, which dedicates SMs to communication to exploit NVSwitch in-network acceleration and avoid repeated remote L2 traversals for workloads like attention. Tiles remain effective and simplify fusion without hurting bandwidth as long as TMA or coalesced remote accesses are used.

The net implication: libraries should dynamically combine transfer primitives, overlapping styles, and tiling to get near-optimal performance across diverse parallelism strategies, and ThunderKittens provides an API that does this with minimal device-code complexity.
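To make the transfer-mechanism tradeoff concrete, here is a minimal host-side sketch (not from the post, and not the ThunderKittens API) of the coarse-grained copy-engine path: a cudaMemcpyPeerAsync on its own stream overlaps with compute kernels on another stream, but per the findings above it only approaches peak NVLink bandwidth when the chunks are very large (~256 MB). The function name, device IDs, and buffer names are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Coarse-grained overlap via the copy engine: the DMA transfer runs on its own
// stream and occupies no SMs, so compute kernels enqueued on compute_stream run
// concurrently. Peer access between device 0 and device 1 is assumed to have
// been enabled with cudaDeviceEnablePeerAccess beforehand.
void overlap_with_copy_engine(float* dst_on_gpu1, const float* src_on_gpu0,
                              size_t chunk_bytes,
                              cudaStream_t compute_stream,
                              cudaStream_t copy_stream) {
    cudaMemcpyPeerAsync(dst_on_gpu1, /*dstDevice=*/1,
                        src_on_gpu0, /*srcDevice=*/0,
                        chunk_bytes, copy_stream);
    // ... launch compute kernels on compute_stream here; they overlap with the
    // transfer, but transfer bandwidth only peaks when chunk_bytes is very large.
}
```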
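And a minimal device-side sketch of the intra-SM overlapping idea, written as plain CUDA rather than the ThunderKittens API: most warps in a block do compute on their own chunk while one dedicated warp pushes finished chunks into a peer GPU's buffer (assumed to be peer-mapped via cudaDeviceEnablePeerAccess), so the communication hides behind compute still running on the other warps. The kernel, flag scheme, and sizes are illustrative assumptions, with simple FMA work standing in for the tensor-core GEMM/attention described in the post.

```cuda
#include <cuda_runtime.h>

constexpr int WARPS_PER_BLOCK = 8;    // launch with blockDim.x = 8 * 32 = 256
constexpr int CHUNK           = 1024; // elements owned by each compute warp

__global__ void overlap_compute_and_push(const float* __restrict__ in,
                                         float* __restrict__ out_local,  // this GPU
                                         float* __restrict__ out_peer) { // peer-mapped
    // One flag per compute warp, set once its chunk is finished.
    __shared__ int chunk_done[WARPS_PER_BLOCK - 1];

    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (threadIdx.x < WARPS_PER_BLOCK - 1) chunk_done[threadIdx.x] = 0;
    __syncthreads();

    size_t block_base = (size_t)blockIdx.x * (WARPS_PER_BLOCK - 1) * CHUNK;

    if (warp < WARPS_PER_BLOCK - 1) {
        // Compute warps: stand-in work for the tensor-core GEMM / attention.
        size_t base = block_base + (size_t)warp * CHUNK;
        for (int i = lane; i < CHUNK; i += 32)
            out_local[base + i] = in[base + i] * 2.0f + 1.0f;
        __threadfence();                        // publish results before the flag
        if (lane == 0) atomicExch(&chunk_done[warp], 1);
    } else {
        // Communication warp: as each chunk finishes, store it directly into the
        // peer GPU's buffer, overlapping the transfer with ongoing compute.
        for (int c = 0; c < WARPS_PER_BLOCK - 1; ++c) {
            if (lane == 0)
                while (atomicAdd(&chunk_done[c], 0) == 0) { /* spin */ }
            __syncwarp();
            __threadfence();                    // observe the published results
            size_t base = block_base + (size_t)c * CHUNK;
            for (int i = lane; i < CHUNK; i += 32)
                out_peer[base + i] = out_local[base + i];  // remote store over NVLink
        }
    }
}
```

The same structure generalizes to the fused kernels the post describes: swap the FMA loop for tensor-core tiles and the remote stores for TMA or reduction traffic, and the communication warp keeps the interconnect busy without ever stalling the compute warps.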