🤖 AI Summary
gdrcopy is a user-space library that exposes GPU memory to the CPU using NVIDIA GPUDirect RDMA APIs, so applications can perform CPU-driven copies into and out of GPU buffers with very low software overhead. Rather than going through cudaMemcpy, which commonly costs ~6–7 µs per invocation, gdrcopy maps the GPU buffer's virtual address range into host address space and lets the CPU read and write it as if it were host memory (with caveats). That makes it attractive for latency-sensitive AI/ML workloads (e.g., tight inference paths, RDMA-linked accelerators, or custom staging between host and device) where minimizing per-copy invocation cost matters.
The library and its benchmarks reveal the trade-offs: initial buffer pinning can be expensive (roughly 10 µs–1 ms depending on size), but small-copy API calls are extremely cheap (gdr_copy_to_mapping ≈ 0.09 µs for tiny writes in tests), and host→device writes enjoy high sustained rates via write-combining (measured ~6–9.6 GB/s on tested hardware). Device→host reads are much slower (e.g., ~530 MB/s in a sample test) because reads from the GPU BAR mapping cannot be prefetched over PCIe. gdrcopy requires an NVIDIA Data Center or RTX GPU (Kepler or newer), a compatible NVIDIA driver with CUDA ≥ 6.0, and a kernel module (packaged as DKMS or kmod); supported platforms include x86_64, ppc64le, and arm64 on major Linux distros.
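To make the flow described above concrete, here is a minimal sketch of the canonical pin/map/copy sequence using the functions from gdrcopy's `gdrapi.h` (`gdr_open`, `gdr_pin_buffer`, `gdr_map`, `gdr_copy_to_mapping`, `gdr_copy_from_mapping`). It is a sketch, not a definitive implementation: error handling is abbreviated, and it assumes an NVIDIA GPU with the `gdrdrv` kernel module loaded and a cudaMalloc'd buffer that is GPU-page aligned.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <gdrapi.h>

int main(void) {
    /* GPU_PAGE_SIZE (64 KiB) comes from gdrapi.h; pinning works at
     * GPU-page granularity, so allocate at least one page. */
    const size_t size = GPU_PAGE_SIZE;
    char host_buf[64] = "hello from the CPU";

    void *d_ptr;
    if (cudaMalloc(&d_ptr, size) != cudaSuccess) return 1;

    gdr_t g = gdr_open();                 /* opens /dev/gdrdrv */
    if (!g) return 1;

    gdr_mh_t mh;
    /* Pinning is the expensive one-time step (~10 µs–1 ms depending on
     * size); amortize it by keeping the handle alive across many copies.
     * Note: the address passed here must be GPU-page aligned. */
    if (gdr_pin_buffer(g, (unsigned long)d_ptr, size, 0, 0, &mh) != 0)
        return 1;

    void *map_ptr;
    /* Map the pinned BAR1 region into host virtual address space. */
    if (gdr_map(g, mh, &map_ptr, size) != 0) return 1;

    /* CPU-driven host->device write: tiny call overhead, and the
     * write-combining mapping gives high sustained bandwidth. */
    gdr_copy_to_mapping(mh, map_ptr, host_buf, sizeof host_buf);

    /* CPU-driven device->host read: correct but much slower, since
     * reads over the BAR cannot be prefetched across PCIe. */
    char check[64];
    gdr_copy_from_mapping(mh, check, map_ptr, sizeof check);
    printf("%s\n", check);

    /* Tear down in reverse order. */
    gdr_unmap(g, mh, map_ptr, size);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cudaFree(d_ptr);
    return 0;
}
```

Compiling requires linking against the library and the CUDA runtime (e.g., `-lgdrapi -lcudart`), and running requires the `gdrdrv` module, so this only executes on a machine with a supported NVIDIA GPU.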