Collective Matrix Multiplication – JAX Pallas:Mosaic GPU (docs.jax.dev)

🤖 AI Summary
JAX developers implemented a Pallas:MGPU kernel that overlaps inter-GPU communication with the local matrix multiply to handle a common sharding mismatch: activations sharded along the data-parallel batch axis while weights are tensor-partitioned across output features (e.g., at the start of a Transformer MLP). Instead of performing a blocking NCCL all-gather of the activations followed by a cuBLAS matmul, the kernel runs a ring all-gather, streaming each LHS shard to the next GPU while concurrently computing A_shard @ B_local.

By issuing asynchronous TMA copies from SMEM during the matmul pipeline and coordinating them with lightweight semaphores, the implementation keeps all traffic on NVLink (no host staging) and hides communication behind compute. Technically, the kernel is persistent (launched with one CTA per SM), uses three Pallas threads for the warp-specialized matmul (reusing hopper_matmul_mgpu), and relies on plgpu.remote_ref for direct access to remote GMEM plus pl.semaphore_signal / pl.semaphore_wait for cross-device synchronization.

Benchmarks (f16, per-shard M=1024, K=N=4096) show substantial gains: on 8 GPUs the kernel runs in 436µs at ~64% tensor-core utilization, versus 565µs at 49% for the reference, approaching a computed lower bound once ~6µs of memory-fence overhead per round is accounted for. The approach meaningfully improves hardware utilization for large-scale model sharding, though some optimization headroom remains (semaphore and fence costs, backpressure trade-offs).
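To make the contrast concrete, here is a minimal plain-JAX sketch of both the blocking reference and the ring decomposition that the kernel fuses. This is not the Pallas:MGPU kernel itself; the function names, axis name "x", and shapes are illustrative assumptions. In the real kernel the per-step copy is an asynchronous TMA write into the neighbor's buffer via plgpu.remote_ref, paced with pl.semaphore_signal / pl.semaphore_wait, so the transfer overlaps the local matmul by construction rather than depending on XLA's scheduler.

```python
# Illustrative sketch only: blocking all-gather reference vs. ring decomposition.
from functools import partial

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P

try:
    from jax import shard_map          # recent JAX
except ImportError:
    from jax.experimental.shard_map import shard_map  # older JAX

num_devices = jax.device_count()
mesh = Mesh(jax.devices(), ("x",))
specs = dict(mesh=mesh, in_specs=(P("x", None), P(None, "x")), out_specs=P(None, "x"))


@partial(shard_map, **specs)
def reference(lhs_shard, rhs_shard):
    # Blocking all-gather of the LHS over the tensor axis, then one local matmul.
    lhs = jax.lax.all_gather(lhs_shard, "x", axis=0, tiled=True)
    return lhs @ rhs_shard


@partial(shard_map, **specs)
def ring_collective_matmul(lhs_shard, rhs_shard):
    # Each step multiplies the LHS block currently held while passing it one hop
    # around the ring; the Pallas:MGPU kernel fuses this loop into one persistent
    # kernel so the copy and the matmul overlap.
    idx = jax.lax.axis_index("x")
    m = lhs_shard.shape[0]
    out = jnp.zeros((m * num_devices, rhs_shard.shape[1]), lhs_shard.dtype)

    def step(i, carry):
        out, block = carry
        src = (idx - i) % num_devices   # device this block originated from
        out = jax.lax.dynamic_update_slice(out, block @ rhs_shard, (src * m, 0))
        block = jax.lax.ppermute(
            block, "x", [(j, (j + 1) % num_devices) for j in range(num_devices)])
        return out, block

    out, _ = jax.lax.fori_loop(0, num_devices, step, (out, lhs_shard))
    return out
```

In this high-level form the ppermute and the matmul are separate XLA ops, so overlap is at the compiler's discretion; the point of the Pallas:MGPU kernel is to make that overlap explicit with async copies and semaphores inside a single persistent kernel.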