🤖 AI Summary
A recent development has enabled NVIDIA's NCCL (NVIDIA Collective Communications Library) to support distributed machine learning over direct-connect RDMA (Remote Direct Memory Access) mesh configurations, a setup NVIDIA does not officially support. The custom NCCL network plugin routes traffic between nodes on different subnets, closing a gap left by the standard plugins, which assume either a switched InfiniBand fabric or fall back to slower TCP/IP networking; neither accommodates a direct-cabled RDMA mesh. On a three-node triangle mesh of NVIDIA DGX Spark workstations connected by 100 Gbps RDMA links, the plugin sustained an effective bandwidth of over 8 GB/s while running distributed inference on the Mistral-7B large language model.
For the AI/ML community, this matters because it simplifies memory management and improves performance in distributed systems without requiring extra configuration or kernel modules. The DGX Spark's unified memory architecture lets RDMA operate directly on GPU-accessible memory, yielding GPUDirect-like performance and higher throughput. By sidestepping the deadlocks and routing limitations that the standard transports hit on this topology, the plugin opens up new avenues for optimizing computational resources in AI applications, particularly in environments where high-speed, low-latency inter-node communication is critical. The project demonstrates that innovation can flourish by challenging the constraints of "supported configurations."