The math behind tiled vs. naive matrix multiplication in CUDA (alvinwan.com)

🤖 AI Summary
A Guide to Machine Learning post breaks down why tiled (blocked) matrix multiplication in CUDA dramatically outperforms the naive inner-product approach: it cuts memory traffic rather than compute. Using a small 8×8 example with block sizes b = 1, 2, 4, the author shows fetch counts falling from 1,024 (naive) to 576 (row reuse) to 256 (4×4 tiles). More generally, for A ∈ R^{m×k} and B ∈ R^{k×n}, a naive GEMM performs 2mnk memory fetches, while tiling with block size b reduces that to 2mnk/b, a b-fold reduction in memory accesses. Tiling also exposes parallelism (blocks compute independently) and improves reuse of values held in fast on-chip storage, which is why dense workloads like transformer attention and MLPs, which are memory-bandwidth bound at inference, benefit so much.

The post then ties the math to hardware limits: on an NVIDIA V100 (900 GB/s memory bandwidth, 112 TFLOPS), moving 3.4 GB of FP16 weights takes roughly 3.8 ms versus roughly 1.2 ms to compute, so bandwidth dominates. Using b = 4 would cut memory transfer time 4×, to about 0.95 ms.

Shared memory size, however, constrains how large b can be: storing b rows and b columns of a tile may exceed the 96 KB of shared memory, forcing a smaller b or multi-stage blocking (splitting the k dimension into ℓ chunks), which trades extra writes and accumulations for a smaller on-chip footprint. The result is a practical recipe: pick b (and the inner chunk size ℓ) to balance reduced DRAM traffic, available shared memory, and write/accumulation overhead for optimal throughput.
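To make the b-fold reduction concrete, here is a minimal sketch of a shared-memory tiled GEMM kernel. It is not the post's code: the kernel name `tiledGemm`, the `TILE` size of 16, and the launch configuration are illustrative assumptions, and the matrices are taken to be row-major floats.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // block size b: each thread block computes a TILE x TILE output tile

// C = A * B, with A (M x K), B (K x N), C (M x N), all row-major float.
// Each thread block stages one TILE x TILE tile of A and one of B in shared
// memory per step along the k dimension, so every element fetched from DRAM is
// reused TILE times instead of once -- the source of the b-fold cut in traffic.
__global__ void tiledGemm(const float* A, const float* B, float* C,
                          int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // output row this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // output column this thread owns
    float acc = 0.0f;

    // Walk the k dimension in TILE-wide chunks (the post's inner chunks).
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Cooperative loads; zero-pad when a tile runs off the matrix edge.
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // Each staged element is read TILE times from fast shared memory.
        for (int i = 0; i < TILE; ++i)
            acc += As[threadIdx.y][i] * Bs[i][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}

// Illustrative launch: one thread per output element, TILE x TILE threads per block.
// dim3 threads(TILE, TILE);
// dim3 blocks((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
// tiledGemm<<<blocks, threads>>>(dA, dB, dC, M, N, K);
```

At TILE = 16 the two staging buffers use 2 × 16 × 16 × 4 bytes = 2 KB of shared memory per block, comfortably within the 96 KB limit the summary mentions; growing TILE increases reuse but eats shared memory, which is exactly the trade-off that caps b.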