🤖 AI Summary
A new deep-dive post (the first in a planned series) lays out the hardware concepts and programming techniques behind state-of-the-art NVIDIA GPU matrix-multiplication (matmul) kernels. The author focuses on the Hopper H100, explains why matmul matters (transformers spend most of their FLOPs in matmuls), and promises a self-contained walkthrough from CUDA programming primitives to SOTA asynchronous kernels. Upcoming installments will cover Blackwell, multi-GPU kernels, microbenchmarking, and GPU memory-consistency models: topics that matter to anyone squeezing performance from modern ML hardware.
Technically, the post builds a practical mental model of the GPU: a hierarchical memory system (HBM device memory, L2, L1/shared memory (SMEM), and registers) and the compute stack (132 SMs enabled on the H100 out of the full GH100 die's 8 GPCs × 18 SMs = 144, each SM holding tensor cores, CUDA cores, LD/ST units, and warp schedulers). Key performance points include GMEM coalescing; SMEM's 32 banks × 32-bit width (bank conflicts vs. multicast); the Tensor Memory Accelerator (TMA) for asynchronous GMEM↔SMEM transfers and swizzling; and the "speed of light" throughput ceiling (peak FLOP/s = clock frequency × number of tensor cores × FLOPs per tensor core per cycle), which is sensitive to power/thermal throttling of the clock. The post also previews algorithmic techniques, namely warp-tiling for synchronous kernels and asynchronous Hopper optimizations (tensor cores, TMA overlap, Hilbert tiling), equipping ML engineers to design near-SOTA matmul kernels and extend those ideas to other GPU workloads.
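To make the SMEM banking point concrete, here is a minimal CUDA sketch (not code from the post, and the 32×32 tile is an assumed example): SMEM is split into 32 banks of 32 bits each, with bank = (byte address / 4) mod 32, so a warp walking down a column of a 32×32 float tile serializes in one bank while a row walk spreads across all 32 banks.

```cuda
// Minimal bank-conflict sketch; launch as bank_demo<<<1, 32>>>(out) with out holding 32 floats.
__global__ void bank_demo(float* out) {
    __shared__ float tile[32][32];          // padding to [32][33] would remove the conflict below
    int lane = threadIdx.x & 31;            // lane id within the warp

    tile[0][lane] = (float)lane;            // row write: lanes hit banks 0..31 -> conflict-free
    tile[lane][0] = (float)lane;            // column write: every lane hits bank 0 -> 32-way conflict
    __syncwarp();

    float broadcast = tile[0][0];           // all lanes read one address -> multicast, no conflict
    float column    = tile[lane][0];        // column read: all lanes in bank 0 -> serialized
    out[lane] = column + broadcast;
}
```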
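The "speed of light" ceiling is plain arithmetic, so a worked example helps. The constants below are commonly cited H100 SXM figures assumed for illustration (roughly a 1.83 GHz tensor-core boost clock, 4 tensor cores per SM across 132 SMs, and 1024 dense BF16 FLOPs per tensor core per cycle); they are not taken from the post itself.

```cuda
// Hedged back-of-the-envelope: peak FLOP/s = clk × #TC × FLOPs/TC/cycle.
// All constants are assumptions for an H100 SXM; throttling lowers clk_hz
// and therefore the ceiling.
#include <cstdio>

int main() {
    const double clk_hz       = 1.83e9;    // assumed tensor-core boost clock
    const int    tensor_cores = 132 * 4;   // 132 SMs × 4 tensor cores per SM
    const double flops_per_tc = 1024.0;    // assumed dense BF16 FLOPs per TC per cycle

    double peak = clk_hz * tensor_cores * flops_per_tc;
    printf("speed of light: %.0f TFLOP/s (BF16 dense)\n", peak / 1e12);  // ~989 TFLOP/s
    return 0;
}
```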