Torchcomms: A modern PyTorch communications API (pytorch.org)

🤖 AI Summary
Meta announced torchcomms, an experimental, open-source communications API for PyTorch Distributed (PTD), along with a new backend stack, NCCLX and its CTran transport, designed to scale model training to 100k+ GPUs. The release provides foundational, object-oriented communicator APIs that bind each communicator to a single device, initialize backends eagerly, and manage resources explicitly. Meta also open-sourced NCCLX/CTran (used in production for Llama3/Llama4 services), added native RCCL support, and upgraded Gloo features, enabling multi-vendor, heterogeneous deployments and inviting community feedback as the API evolves.

Technically, torchcomms emphasizes large-scale, device-centric patterns: one-sided RDMA semantics, zero-copy/NVLink transfers, custom collective algorithms, network traffic load balancing, and GPU-resident collectives. New primitives include window APIs for remote Put/Get on registered GPU/CPU buffers, transport APIs for direct RDMA point-to-point writes, and batch semantics for concurrent operations. The design targets fault tolerance at scale, explicit communicator/resource hints for scaling beyond NCCL limits, and integration with DeviceMesh and torchtitan (FSDP2 compatibility). Breaking changes are possible at this early stage, but the long-term plan is to migrate c10d onto torchcomms, making distributed PyTorch more extensible, performant, and resilient for next-generation large-scale AI systems.
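
To make the communicator model concrete, here is a minimal sketch of the described pattern: a communicator object bound to a single device, initialized eagerly, and torn down explicitly. It is built on today's torch.distributed (c10d) APIs; the DeviceComm class and its methods are illustrative assumptions, not the actual torchcomms interface.

```python
# Illustrative sketch only: "DeviceComm" is a hypothetical wrapper, not the real
# torchcomms API. It mimics the pattern the post describes (a communicator bound
# to one device, eager initialization, explicit resource release) on top of the
# existing torch.distributed (c10d) APIs.
import os

import torch
import torch.distributed as dist


class DeviceComm:
    """Hypothetical communicator bound to a single GPU, initialized eagerly."""

    def __init__(self, device: torch.device, backend: str = "nccl"):
        self.device = device
        torch.cuda.set_device(device)
        # On recent PyTorch versions, passing device_id binds the process group
        # to this device and initializes the NCCL communicator eagerly rather
        # than lazily on the first collective.
        dist.init_process_group(backend=backend, device_id=device)

    def all_reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        assert tensor.device == self.device, "communicator is bound to one device"
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor

    def shutdown(self) -> None:
        # Explicit resource management instead of relying on interpreter teardown.
        dist.destroy_process_group()


if __name__ == "__main__":
    # Expects torchrun-style env vars (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT).
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    comm = DeviceComm(torch.device(f"cuda:{local_rank}"))
    x = torch.ones(4, device=comm.device)
    comm.all_reduce(x)  # x now holds world_size on every rank
    comm.shutdown()
```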
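
The window Put/Get primitive can likewise be pictured as a registered buffer that peers write to and read from without the owner posting a matching receive. The mock below only simulates that one-sided semantics locally; the Window class and its put/get methods are hypothetical names, not the real torchcomms window or transport APIs.

```python
# Conceptual mock only: simulates one-sided Put/Get on a registered buffer, the
# pattern the post describes for window APIs. There is no actual RDMA or remote
# transfer here; names and semantics are illustrative assumptions.
from dataclasses import dataclass

import torch


@dataclass
class Window:
    """A registered buffer that peers can target with one-sided operations."""

    buffer: torch.Tensor

    def put(self, src: torch.Tensor, offset: int = 0) -> None:
        # One-sided write: the writing peer copies data into the registered
        # buffer; the owner does not participate in this call.
        self.buffer[offset : offset + src.numel()].copy_(src)

    def get(self, numel: int, offset: int = 0) -> torch.Tensor:
        # One-sided read from the registered buffer.
        return self.buffer[offset : offset + numel].clone()


if __name__ == "__main__":
    # The owner registers a 16-element buffer; a peer writes 4 elements into it.
    window = Window(buffer=torch.zeros(16))
    window.put(torch.arange(4, dtype=torch.float32))
    print(window.get(4))  # tensor([0., 1., 2., 3.])
```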