🤖 AI Summary
The post walks through building a high-performance Tensor Core matrix‑multiply kernel for NVIDIA's Ada architecture (RTX 4090), starting from a naive PTX mma implementation and ending at a production‑class kernel that matches cuBLAS for a specific problem (M=N=K=4096, fp16 inputs with fp32 accumulation). Starting at 29.4 TFLOP/s (19% of cuBLAS), the author incrementally applies well‑known CUTLASS techniques—vectorized 128‑bit (uint4) loads, a permuted shared‑memory layout to eliminate bank conflicts, the ldmatrix and cp.async primitives, n‑stage global→shared pipelining, and larger tile sizes—ultimately reaching 153.6 TFLOP/s in 895 µs (100% of cuBLAS, ~93% of measured RTX 4090 peak).
This matters to ML/AI practitioners who need custom, latency‑sensitive, or specialty GEMMs (e.g., kernels for attention or nonstandard shapes): it demonstrates that near‑peak Tensor Core performance requires tight coordination of instruction choice (m16n8k16 mma.sync), data layout, and asynchronous memory movement, not just issuing mma ops. The author also derives a useful hardware latency bound (m16n8k16 ≈ 32 cycles, inferred from peak throughput), explains the warp‑to‑fragment mapping and why naive loads cause shared‑memory stalls, and documents a reproducible benchmarking setup (boost‑clock locking, Nsight Compute). The code prioritizes clarity over generality, making it a practical tutorial for anyone wanting to squeeze Ada Tensor Cores in custom kernels.
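The ~32‑cycle bound mentioned above falls out of simple throughput arithmetic. A minimal sketch, assuming commonly cited RTX 4090 figures (128 SMs, ~2.52 GHz boost, ~165 TFLOP/s dense fp16 peak, 4 tensor‑core partitions per SM — these numbers are assumptions, not taken from the summary itself):

```python
# Back-of-envelope derivation of the m16n8k16 mma latency bound.
# Assumed RTX 4090 figures (hypothetical inputs, not from the summary):
PEAK_TFLOPS = 165.2        # dense fp16 with fp32 accumulate, TFLOP/s
NUM_SMS = 128
BOOST_GHZ = 2.52
PARTITIONS_PER_SM = 4      # one tensor-core pipeline per warp scheduler

# FLOPs in one m16n8k16 mma: 2 * M * N * K (one multiply + one add per MAC)
flops_per_mma = 2 * 16 * 8 * 16  # = 4096

# Peak FLOPs retired per SM partition per cycle
flops_per_partition_cycle = (
    PEAK_TFLOPS * 1e12 / (NUM_SMS * BOOST_GHZ * 1e9 * PARTITIONS_PER_SM)
)

# At peak throughput, one mma must retire every this-many cycles,
# so its pipelined latency cannot exceed this bound.
cycles_per_mma = flops_per_mma / flops_per_partition_cycle
print(f"{flops_per_partition_cycle:.0f} FLOP/partition/cycle "
      f"-> {cycles_per_mma:.1f} cycles per mma")
```

With these figures each partition retires ~128 FLOP/cycle, giving 4096 / 128 ≈ 32 cycles per mma — consistent with the bound the author derives.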