My First Multi-GPU Kernel: Writing All-to-All for AMD MI300X (gau-nernst.github.io)

🤖 AI Summary
At the AMD Distributed Challenge, the author built their first multi‑GPU kernel on an AMD MI300X, implementing the all‑to‑all dispatch/combine primitives used by Mixture‑of‑Experts (MoE) layers.

The first step was replacing the slow Python loops in the reference PyTorch kernel with a sort‑based grouping strategy: flatten the top‑k expert indices, argsort to group tokens by expert, run the grouped GEMM (simulated with a pointwise op), then scatter‑reduce the results back to their original token positions (a PyTorch sketch of this grouping follows below). That PyTorch‑only rewrite cut the runtime from 93,540 μs to 1,311 μs. The author then moved to custom HIP kernels to go beyond what PyTorch exposes: true in‑kernel remote memory access and overlap of compute with inter‑GPU communication.

Technically, the author uses GPU peer‑to‑peer (P2P) IPC handles (cudaIpcGetMemHandle / cudaIpcOpenMemHandle, which HIP mirrors as hipIpcGetMemHandle / hipIpcOpenMemHandle) and a symmetric heap: a same‑size allocation on every rank whose handles are shared, so kernels can compute remote pointers. They describe the "translate" trick that maps an object's offset from the local heap base onto each remote base address (sketched after this summary), and note the constraints: allocations must be identical across ranks, and on AMD the memory must be fine‑grained. The result: custom kernels can directly dereference remote buffers from inside device code, enabling flexible non‑uniform all‑to‑all patterns, fewer DMA/communication kernels, and opportunities to exploit partial sorts and better overlap of communication with the grouped GEMM in large MoE deployments.
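Below is a minimal PyTorch sketch of the sort‑based grouping described above. The function name, shapes, and the pointwise stand‑in for the grouped GEMM are illustrative assumptions, not the author's reference kernel.

```python
import torch

def moe_dispatch_combine(x, topk_idx, topk_weight, num_experts):
    """x: (T, D) tokens; topk_idx/topk_weight: (T, K) expert ids and gate weights.
    Hypothetical helper illustrating the sort-based grouping, not the author's code."""
    T, D = x.shape
    K = topk_idx.shape[1]

    # 1. Flatten top-k indices: each (token, expert) pair becomes one row.
    flat_idx = topk_idx.reshape(-1)                                     # (T*K,)
    token_of = torch.arange(T, device=x.device).repeat_interleave(K)    # (T*K,)

    # 2. Argsort by expert id so rows of the same expert are contiguous.
    order = torch.argsort(flat_idx)
    grouped_tokens = x[token_of[order]]                                 # (T*K, D)

    # 3. Grouped-GEMM stand-in: a pointwise op per expert slice
    #    (a real MoE layer would run one GEMM per expert here).
    counts = torch.bincount(flat_idx, minlength=num_experts)
    out = torch.empty_like(grouped_tokens)
    start = 0
    for e, n in enumerate(counts.tolist()):
        out[start:start + n] = grouped_tokens[start:start + n] * (e + 1)
        start += n

    # 4. Scatter-reduce back: weight each expert output and sum per original token.
    w = topk_weight.reshape(-1)[order].unsqueeze(1)                     # (T*K, 1)
    y = torch.zeros_like(x)
    y.index_add_(0, token_of[order], out * w)
    return y

# Example usage (illustrative sizes):
# x = torch.randn(8, 16); idx = torch.randint(0, 4, (8, 2)); w = torch.rand(8, 2)
# y = moe_dispatch_combine(x, idx, w, num_experts=4)
```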
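The "translate" trick amounts to offset arithmetic against exchanged heap base addresses. In the post this happens inside device code; the host‑side Python sketch below shows only the arithmetic, with every name and address made up for illustration.

```python
def translate(local_addr: int, local_heap_base: int, remote_heap_base: int) -> int:
    """Map an address inside the local symmetric heap onto the same object's
    address on a remote rank: same offset, different base (illustrative only)."""
    offset = local_addr - local_heap_base
    assert offset >= 0, "address must lie inside the local symmetric heap"
    return remote_heap_base + offset

# Example: a buffer carved out of the local heap at offset 4096 lives at the
# same offset on every rank, so a kernel can dereference it on a peer GPU.
local_heap_base  = 0x7F0000000000   # made-up base of this rank's heap
remote_heap_base = 0x7E8000000000   # peer's heap base, obtained via its IPC handle
my_buf = local_heap_base + 4096
peer_buf = translate(my_buf, local_heap_base, remote_heap_base)
```

This is why the allocations must be identical across ranks: only then does a local offset identify the same object everywhere.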