Exceeding SOTA Matrix Multiplication on Nvidia Blackwell (www.modular.com)

🤖 AI Summary
Researchers building a matmul kernel for NVIDIA’s new Blackwell (B200) GPUs used its Cluster Launch Control (CLC) and other architectural features to push past the previous state of the art. By combining a persistent-kernel tile scheduler (keeping CTAs resident on SMs), a hardware-driven CLC scheduler (a scheduler warp that deposits work coordinates in cluster-shared memory), pipelined CLC fetches, a circular TMEM buffer that overlaps MMA and epilogue work, and a thread-block swizzle that boosts L2 reuse, they raised performance a further ~15% to 1,772.9 TFLOPS (100.6% of cuBLAS) on a 4096×4096×4096 problem, and matched or exceeded SOTA on production workloads after autotuning.

Technically, the key innovation is hardware/software co-design: Blackwell’s silicon scheduler implements a producer–consumer model in which CTAs fetch work from a cluster-wide queue without full kernel relaunches, while pipelined shared-memory slots keep the coordinator from stalling. Treating TMEM as a circular buffer eliminates idle producer/consumer warps and lets tensor-core MMAs run concurrently with output writes; block_swizzle patterns trade tile ordering for higher L2 reuse.

Practical implications: peak performance now depends on picking the MMA instruction shape, pipeline depth, and swizzle size per problem shape. Mojo’s kbench autotuner finds those settings and produced up to +6% over SOTA on Gemma-3-27B workloads. The work underscores that future GPU performance gains will require increasingly sophisticated, architecture-aware scheduling and autotuning.
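To make the persistent-kernel, producer–consumer scheduling pattern concrete, here is a minimal plain-CUDA sketch. It is not the article's Mojo kernel: the hardware CLC queue is stood in for by a global atomic counter, and the kernel name and `next_tile` parameter are illustrative assumptions. A scheduler thread per CTA claims the next tile id and deposits it in shared memory, and the rest of the block consumes it, so CTAs stay resident instead of being relaunched per tile.

```cuda
// Minimal sketch (assumed names, not the article's code): a persistent matmul
// kernel whose CTAs stay resident and pull tile coordinates from a shared work
// queue. On Blackwell, CLC replaces the global atomic below with a hardware
// queue whose response lands in cluster-shared memory.
__global__ void persistent_matmul(const float* A, const float* B, float* C,
                                  int tiles_m, int tiles_n,
                                  unsigned int* next_tile)  // global work counter (hypothetical)
{
    __shared__ unsigned int tile_id;   // "work coordinates" slot, analogous to the CLC response
    const unsigned int total_tiles = (unsigned int)(tiles_m * tiles_n);
    (void)A; (void)B; (void)C;         // mainloop elided in this sketch

    while (true) {
        if (threadIdx.x == 0) {
            // Scheduler role: claim the next tile (the CLC query in the real kernel).
            tile_id = atomicAdd(next_tile, 1u);
        }
        __syncthreads();                       // consumers wait for the deposited coordinates
        if (tile_id >= total_tiles) break;     // queue drained: the CTA retires

        unsigned int tile_m = tile_id / tiles_n;
        unsigned int tile_n = tile_id % tiles_n;
        // ... mainloop: stage A/B tiles, issue MMAs, run the epilogue for (tile_m, tile_n) ...
        (void)tile_m; (void)tile_n;
        __syncthreads();                       // make the shared slot safe to overwrite next round
    }
}
```

Launched with a grid sized to the number of SMs rather than one block per output tile, this avoids per-tile launch overhead; the hardware CLC version additionally pipelines the next fetch (into multiple shared-memory slots) so the scheduler never stalls the mainloop.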
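The thread-block swizzle is easiest to see in isolation. The sketch below uses the common "grouped" remapping (a standard CUTLASS/Triton-style scheme, assumed here rather than taken from the article): consecutive tile ids are packed into groups of `group_m` rows, so tiles scheduled close together in time touch overlapping rows of A and nearby columns of B, raising L2 hit rates. `group_m` plays the role of the swizzle size an autotuner such as kbench would pick per problem shape.

```cuda
// Grouped block swizzle (illustrative): map a linear tile id to (m, n) tile
// coordinates so that each group of `group_m` rows is fully traversed before
// moving on, improving temporal locality of A/B tiles in L2.
__host__ __device__ inline void swizzle_tile(unsigned int tile_id,
                                             unsigned int tiles_m, unsigned int tiles_n,
                                             unsigned int group_m,   // swizzle size (autotuned)
                                             unsigned int* tile_m, unsigned int* tile_n)
{
    unsigned int tiles_per_group = group_m * tiles_n;
    unsigned int group_id        = tile_id / tiles_per_group;
    unsigned int first_m         = group_id * group_m;
    unsigned int remaining       = tiles_m - first_m;
    unsigned int rows_in_group   = (remaining < group_m) ? remaining : group_m;  // last group may be short
    unsigned int local           = tile_id % tiles_per_group;
    *tile_m = first_m + (local % rows_in_group);
    *tile_n = local / rows_in_group;
}
```

In the scheduler loop above, the consumer would call `swizzle_tile(tile_id, ...)` in place of the plain row-major div/mod to get its tile coordinates.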