🤖 AI Summary
AMD published a practical guide to programming Matrix Cores on CDNA™3 and the new CDNA™4 GPUs, showing how to use low-precision types (FP16, FP8, FP6, FP4) and CDNA4's block exponent scaling inside HIP kernels. The post explains MFMA semantics (D := A*B + C, typically accumulated in FP32 to limit error), the data layouts and LLVM/HIP compiler intrinsics needed to invoke MFMA instructions, and the concrete performance and accuracy trade-offs of mixed-precision math. This matters for AI/ML because CDNA4 dramatically raises matrix throughput relative to FP32: on CDNA3 (MI325X), FP16 is roughly 8× and FP8 roughly 16× FP32 throughput, while CDNA4 (MI355X) delivers about 16× for FP16, 32× for FP8, and up to 64× for FP6/FP4, enabling much higher FLOP/s for training and inference whenever precision can be reduced.
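For orientation, here is a minimal HIP sketch of what a single MFMA call looks like, using the FP16 16×16×16 shape referenced below. The kernel name, buffer arguments, and the per-lane fragment mapping in the comments are assumptions based on the CDNA ISA documentation, not code taken from the summarized post.

```cpp
// Minimal HIP sketch of a single MFMA call (D := A*B + C with FP32 accumulation),
// using the FP16 16x16x16 shape. The per-lane fragment layout described in the
// comments is an assumption based on the CDNA ISA documentation. Compile with
// hipcc --offload-arch=gfx90a (or gfx942) and launch exactly one 64-lane
// wavefront, e.g. <<<1, 64>>>.
#include <hip/hip_runtime.h>

using float16x4 = __attribute__((__vector_size__(4 * sizeof(__fp16)))) __fp16;
using floatx4   = __attribute__((__vector_size__(4 * sizeof(float)))) float;

__global__ void mfma_16x16x16_fp16(const __fp16* A,  // 16x16, row-major
                                   const __fp16* B,  // 16x16, row-major
                                   float* D) {       // 16x16, row-major
  const int lane = threadIdx.x;  // 0..63: one wavefront issues one MFMA
  const int mn   = lane % 16;    // row of A / column of B and D owned by this lane
  const int kgrp = lane / 16;    // 0..3: each group of 16 lanes covers 4 k values

  float16x4 a_frag, b_frag;
  floatx4   acc = {0.f, 0.f, 0.f, 0.f};  // C fragment; zero here, so D = A*B

  for (int i = 0; i < 4; ++i) {
    const int k = 4 * kgrp + i;
    a_frag[i] = A[mn * 16 + k];  // A(mn, k)
    b_frag[i] = B[k * 16 + mn];  // B(k, mn)
  }

  // One matrix-core instruction for the whole wavefront: FP16 in, FP32 accumulate.
  acc = __builtin_amdgcn_mfma_f32_16x16x16f16(a_frag, b_frag, acc, 0, 0, 0);

  for (int i = 0; i < 4; ++i) {
    const int row = 4 * kgrp + i;
    D[row * 16 + mn] = acc[i];   // D(row, mn)
  }
}
```

The three trailing immediate arguments (cbsz, abid, blgp) are operand broadcast/swizzle modifiers and are simply left at 0 here.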
Key technical points: MFMA instructions come in many M×N×K shapes (e.g., 16×16×16, 32×32×16, and 16×16×128 for the lowest-precision types), each with a known cycle count, so theoretical peak TFLOP/s can be estimated as 2·M·N·K · num_matrix_cores · clock / cycles_per_instruction. CDNA4 adds independent FP8/FP6/FP4 operand support and block exponent scaling (microscaling) using special 8-bit E8M0 scale factors. FP8 comes in E4M3 and E5M2 variants (with FN/FNUZ subtleties); intrinsics use names like __builtin_amdgcn_mfma_f32_32x32x16_fp8_fp8 and execute at wavefront granularity, distributing the A/B/C/D fragments across the threads of a wavefront. The blog includes code examples and points to the ROCm/ISA documentation for full details.
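As a quick sanity check on the peak-throughput formula, the following self-contained C++ helper evaluates 2·M·N·K · num_matrix_cores · clock / cycles; the concrete numbers fed to it are hypothetical placeholders, not MI325X/MI355X specifications.

```cpp
// A small, self-contained check of the peak-throughput estimate quoted above:
//   peak FLOP/s = 2*M*N*K * num_matrix_cores * clock / cycles_per_instruction.
// The inputs in main() are placeholder values for illustration only; real shape,
// cycle-count, matrix-core-count, and clock figures should come from the ISA
// documentation and product specifications.
#include <cstdio>

double peak_tflops(int M, int N, int K,
                   int num_matrix_cores,    // matrix cores across the whole GPU
                   double clock_ghz,        // sustained clock in GHz
                   int cycles_per_mfma) {   // cycles per MFMA of this shape
  const double flops_per_mfma = 2.0 * M * N * K;  // one mul + one add per MAC
  // GHz * FLOP/cycle gives GFLOP/s; divide by 1e3 to report TFLOP/s.
  return flops_per_mfma * num_matrix_cores * clock_ghz / cycles_per_mfma / 1e3;
}

int main() {
  // Hypothetical inputs: a 32x32x16 MFMA, 1024 matrix cores, 2.0 GHz, 16 cycles.
  std::printf("estimated peak: %.0f TFLOP/s\n",
              peak_tflops(32, 32, 16, 1024, 2.0, 16));
  return 0;
}
```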