🤖 AI Summary
The author has started Penny, a personal reimplementation of NCCL intended as a drop-in communications backend for LLM serving. The immediate goal is a fast, correct AllReduce (the dominant collective for inference) that can be swapped in for NCCL with a minimal performance hit. Rather than redoing every low-level RDMA/NVLink detail, they chose NVIDIA’s NVSHMEM (an OpenSHMEM-based library with a GPU device API) to enable in-kernel GPU-to-GPU transfers, focusing engineering effort on algorithms and correctness rather than reinventing DMA plumbing.
Technically, the worklog covers GPU topology (DGX nodes with NVLink and multiple NICs, preferring the InfiniBand NICs for internode links), NVSHMEM’s symmetric-heap model and its constraints (every process allocates same-sized buffers at the same offsets, and buffers must be allocated with nvshmem_malloc and registered), and the two transfer paradigms, put vs. get. The author prefers put and uses nvshmemx_putmem_block; the block- and warp-scoped variants trade throughput against resource use and require synchronization. For flexible multi-process startup, they initialize NVSHMEM with a UUID distributed via an NCCL-backed PyTorch dist all_gather, then call nvshmemx_init_attr. A simple exchange kernel demonstrates in-kernel puts into the symmetric heap, followed by a sync and a local copy; next steps are a single-node AllReduce and multi-node scaling. The approach highlights how NVSHMEM’s device API enables compact, high-performance collectives tailored to LLM inference without rebuilding NIC-level drivers.
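To make the exchange pattern concrete, here is a minimal sketch (not Penny’s actual code) of the flow the worklog describes: allocate a buffer on the symmetric heap, issue a block-scoped put into a neighbor’s copy of that buffer from inside a kernel, barrier, then copy the received data out locally. The ring-neighbor choice, buffer size, kernel name, and the use of the default nvshmem_init() bootstrap (rather than the worklog’s unique-ID path through nvshmemx_init_attr) are assumptions made to keep the example self-contained.

```cuda
// exchange.cu -- illustrative sketch of the symmetric-heap put/sync/copy flow.
// Build roughly: nvcc -rdc=true exchange.cu -o exchange -lnvshmem -lcuda
// (exact link flags vary by NVSHMEM version and install).
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void exchange_kernel(float *sym_buf, const float *local_src,
                                size_t count, int peer) {
    // Block-scoped put: every thread of the block calls it with the same
    // arguments, and the block cooperates on one transfer into the peer's
    // copy of sym_buf (same symmetric offset on every PE).
    nvshmemx_putmem_block(sym_buf, local_src, count * sizeof(float), peer);
}

int main() {
    nvshmem_init();  // default bootstrap; the worklog uses a unique-ID init instead
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();
    // One GPU per PE, selected by the PE's rank within its node.
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    const size_t count = 1 << 20;  // arbitrary payload size for illustration
    const size_t bytes = count * sizeof(float);

    // Symmetric allocation: every PE must allocate the same size here.
    float *sym_buf = (float *)nvshmem_malloc(bytes);
    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaMemset(src, 0, bytes);  // stand-in for a real payload

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int peer = (my_pe + 1) % n_pes;  // write into the next PE's buffer (ring)
    exchange_kernel<<<1, 256, 0, stream>>>(sym_buf, src, count, peer);

    // Stream-ordered barrier: all PEs' puts complete before anyone reads,
    // then copy the received data out of the symmetric heap locally.
    nvshmemx_barrier_all_on_stream(stream);
    cudaMemcpyAsync(dst, sym_buf, bytes, cudaMemcpyDeviceToDevice, stream);
    cudaStreamSynchronize(stream);

    printf("PE %d of %d: received buffer from PE %d\n",
           my_pe, n_pes, (my_pe - 1 + n_pes) % n_pes);

    nvshmem_free(sym_buf);
    cudaFree(src);
    cudaFree(dst);
    cudaStreamDestroy(stream);
    nvshmem_finalize();
    return 0;
}
```

Put is a natural fit here because the producer knows when its data is ready and can push it immediately; the block-scoped variant lets a whole thread block cooperate on a single transfer, which is where the throughput-versus-resource trade-off mentioned above comes in.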