Matrix Core Programming on AMD CDNA3 and CDNA4 Architecture (salykova.github.io)

🤖 AI Summary
A practical guide to programming AMD Matrix Cores from HIP/LLVM, focusing on modern low-precision formats (FP16, FP8, FP6, FP4), the new block-exponent (microscaling) support in CDNA™ 4, and the LLVM intrinsics needed to emit MFMA instructions. The post walks through data layouts and FPx format details, including E4M3/E5M2 variants (E4M3FN vs. FNUZ, and the OCP-standard formats used on CDNA4) and E8M0, an exponent-only format used as the scale type for block scaling. It also includes runnable code examples and links to ROCm docs and ISA reference material.

Why it matters: Matrix Cores drive large throughput gains for AI/HPC by moving multiply-accumulate into dedicated hardware while keeping accumulators in FP32 to limit error. On MI325X (CDNA3), FP16 gives roughly 8× and FP8 roughly 16× the FP32 throughput; CDNA4 boosts FP16/FP8 up to 2× further and introduces FP6/FP4, enabling up to ~64× gains over FP32. CDNA4 also adds many new MFMA shapes (e.g., 16×16×128 and 32×32×64 for FP4/FP6/FP8), independent A/B operand types, and block scaling. The LLVM intrinsics follow the pattern __builtin_amdgcn_mfma_ODType_MxNxKInDType(a, b, c, ...), and MFMA is a wavefront-level operation whose operands are distributed across the threads of a wave (a runnable sketch appears at the end of this summary). The post even gives the peak-throughput formula, 2·M·N·K · matrix_cores · clock(MHz) / cycles / 10^6, and works it for an FP16 MFMA example (32×32×8 → ~1307 TFLOPS on MI325X), making it immediately useful for performance tuning and low-precision model work.
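Working that formula for the FP16 example reproduces the quoted figure. This is a sketch using publicly listed MI325X specs plus two assumed values (304 CUs × 4 Matrix Cores per CU = 1216 cores, ~2100 MHz peak engine clock, and 32 cycles per 32×32×8 FP16 MFMA; check the ISA reference linked in the post for authoritative cycle counts):

$$
\text{peak TFLOPS} = \frac{2 \cdot M \cdot N \cdot K \cdot N_{\text{cores}} \cdot f_{\text{MHz}}}{\text{cycles} \cdot 10^{6}} = \frac{2 \cdot 32 \cdot 32 \cdot 8 \cdot 1216 \cdot 2100}{32 \cdot 10^{6}} \approx 1307
$$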
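To make the intrinsic pattern concrete, here is a minimal HIP sketch of that 32×32×8 FP16 MFMA: one wavefront (launched as a 32×2 thread block) computes the full 32×32 output tile, with each lane holding 4 FP16 elements of A, 4 of B, and 16 FP32 accumulators. The index arithmetic assumes row-major 32×8 A, 8×32 B, and 32×32 D, and follows the CDNA3 MFMA register layout as documented in AMD's matrix-core materials; treat the mapping as illustrative rather than authoritative.

```cpp
#include <hip/hip_runtime.h>

// Clang extended vector types matching the intrinsic's operand shapes:
// v4f16 for the A/B fragments, v16f32 for the accumulator.
typedef _Float16 float16x4 __attribute__((ext_vector_type(4)));
typedef float floatx16 __attribute__((ext_vector_type(16)));

// D = A * B with A: 32x8, B: 8x32, D: 32x32 (all row-major).
// Launch with a dim3(32, 2) block so the 64 threads form one wavefront.
__global__ void mfma_fp16_32x32x8(const _Float16* A, const _Float16* B, float* D) {
    float16x4 a, b;
    floatx16 acc = {}; // FP32 accumulators start at zero

    for (int i = 0; i < 4; ++i) {
        // Lanes 0..31 cover the 32 rows of A; threadIdx.y picks which half of K.
        a[i] = A[threadIdx.x * 8 + threadIdx.y * 4 + i];
        // Lanes 0..31 cover the 32 columns of B; registers step through rows of B.
        b[i] = B[(threadIdx.y * 4 + i) * 32 + threadIdx.x];
    }

    // Wavefront-level matrix op; the trailing zeros are the cbsz/abid/blgp modifiers.
    acc = __builtin_amdgcn_mfma_f32_32x32x8f16(a, b, acc, 0, 0, 0);

    // Scatter the 16 per-lane accumulators back into the 32x32 output tile.
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            D[(j * 8 + threadIdx.y * 4 + i) * 32 + threadIdx.x] = acc[i + 4 * j];
}
```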