🤖 AI Summary
The author has started Penny, a personal reimplementation of NCCL intended as a drop-in communications backend for LLM serving. The immediate goal is a fast, correct AllReduce (the dominant collective for inference) that can be swapped in for NCCL with a minimal performance hit. Rather than redoing every low-level RDMA/NVLink detail, they chose NVIDIA’s NVSHMEM (an OpenSHMEM-based library with a GPU device API) to enable in-kernel GPU-to-GPU transfers, focusing engineering effort on algorithms and correctness rather than reinventing DMA plumbing.
Technically, the worklog covers GPU topology (DGX nodes with NVLink and multiple NICs, preferring the InfiniBand NICs for internode links), NVSHMEM’s symmetric-heap model and its constraints (every process allocates same-sized buffers at the same offsets, and buffers must be allocated with nvshmem_malloc and registered), and the two transfer paradigms, put vs. get. The author prefers put and uses nvshmemx_putmem_block; the block- and warp-scoped variants trade throughput against resource use and require synchronization. For flexible multi-process startup, they initialize NVSHMEM with a UUID distributed via an NCCL-backed PyTorch dist all_gather, then call nvshmemx_init_attr. A simple exchange kernel demonstrates in-kernel puts into the symmetric heap, followed by a sync and a local copy; next steps are a single-node AllReduce and multi-node scaling. The approach highlights how NVSHMEM’s device API enables compact, high-performance collectives tailored to LLM inference without rebuilding NIC-level drivers.
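To make the exchange pattern concrete, here is a minimal sketch (not Penny’s actual code) of the flow the worklog describes: allocate a buffer on the symmetric heap, issue a block-scoped put into a neighbor’s copy of that buffer from inside a kernel, barrier, then copy the received data out locally. The ring-neighbor choice, buffer size, kernel name, and the use of the default nvshmem_init() bootstrap (rather than the worklog’s unique-ID path through nvshmemx_init_attr) are assumptions made to keep the example self-contained.

```cuda
// exchange.cu -- illustrative sketch of the symmetric-heap put/sync/copy flow.
// Build roughly: nvcc -rdc=true exchange.cu -o exchange -lnvshmem -lcuda
// (exact link flags vary by NVSHMEM version and install).
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void exchange_kernel(float *sym_buf, const float *local_src,
                                size_t count, int peer) {
    // Block-scoped put: every thread of the block calls it with the same
    // arguments, and the block cooperates on one transfer into the peer's
    // copy of sym_buf (same symmetric offset on every PE).
    nvshmemx_putmem_block(sym_buf, local_src, count * sizeof(float), peer);
}

int main() {
    nvshmem_init();  // default bootstrap; the worklog uses a unique-ID init instead
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();
    // One GPU per PE, selected by the PE's rank within its node.
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    const size_t count = 1 << 20;  // arbitrary payload size for illustration
    const size_t bytes = count * sizeof(float);

    // Symmetric allocation: every PE must allocate the same size here.
    float *sym_buf = (float *)nvshmem_malloc(bytes);
    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaMemset(src, 0, bytes);  // stand-in for a real payload

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int peer = (my_pe + 1) % n_pes;  // write into the next PE's buffer (ring)
    exchange_kernel<<<1, 256, 0, stream>>>(sym_buf, src, count, peer);

    // Stream-ordered barrier: all PEs' puts complete before anyone reads,
    // then copy the received data out of the symmetric heap locally.
    nvshmemx_barrier_all_on_stream(stream);
    cudaMemcpyAsync(dst, sym_buf, bytes, cudaMemcpyDeviceToDevice, stream);
    cudaStreamSynchronize(stream);

    printf("PE %d of %d: received buffer from PE %d\n",
           my_pe, n_pes, (my_pe - 1 + n_pes) % n_pes);

    nvshmem_free(sym_buf);
    cudaFree(src);
    cudaFree(dst);
    cudaStreamDestroy(stream);
    nvshmem_finalize();
    return 0;
}
```

Put is a natural fit here because the producer knows when its data is ready and can push it immediately; the block-scoped variant lets a whole thread block cooperate on a single transfer, which is where the throughput-versus-resource trade-off mentioned above comes in.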