Pitfalls of Unified Memory Models in GPUs (www.infoq.com)

🤖 AI Summary
A talk examining "Pitfalls of Unified Memory Models in GPUs" highlights surprising, real-world problems that arise when developers rely on CUDA's managed (unified) memory. The speaker recounts writing a tiny CUDA memset/memcpy variant using cudaMallocManaged and finding that it "sometimes works, sometimes doesn't" depending on the hardware and the driver/software stack. The root causes lie in the differences between CPU and GPU programming models (GPUs assume massive implicit concurrency and factory-like specialization of work), the variety of CUDA memory types (device-only cudaMalloc, host-mapped pinned allocations, and managed/unified memory), and the runtime's opaque migration of pages between host and device. Tools like strace reveal driver and OS involvement (file descriptors, system calls) that can hide costly migrations and synchronization, producing nondeterministic correctness and performance. A minimal sketch of this failure mode appears in the first example below.

For the AI/ML community this matters because large models and data pipelines are extremely sensitive to memory locality and predictable throughput: GPUs expect high compute-to-data ratios, and unexpected page faults or driver-mediated copies can stall hundreds of threads at once.

Key technical implications: use streams (cudaStream_t) to control ordering and overlap; prefer explicit device allocations or pinned host memory for performance-critical paths (second example below); and profile to detect hidden migrations. Unified memory is convenient for correctness, but it can mask expensive migrations and nondeterministic behavior, so treat it cautiously for production training or inference workloads.
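The "sometimes works, sometimes doesn't" behavior hinges on when the host is allowed to touch managed memory. A minimal sketch, with an illustrative fill kernel and sizes not taken from the talk: on hardware or drivers without concurrent managed access (pre-Pascal GPUs, or Windows), a CPU read of managed memory while a kernel is in flight faults, while on newer Linux setups the same unsynchronized code may happen to succeed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy "memset" kernel: each thread writes one element.
__global__ void fill(int *buf, int value, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = value;
}

int main() {
    const size_t n = 1 << 20;
    int *buf = nullptr;

    // Managed allocation: one pointer valid on host and device; the driver
    // migrates pages on demand, invisibly to the program.
    cudaMallocManaged(&buf, n * sizeof(int));

    fill<<<(unsigned)((n + 255) / 256), 256>>>(buf, 42, n);

    // Without this synchronize, the host read below races the kernel.
    // On hardware/drivers without concurrent managed access the read
    // segfaults; elsewhere it may silently succeed.
    cudaDeviceSynchronize();

    printf("buf[0] = %d\n", buf[0]);
    cudaFree(buf);
    return 0;
}
```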
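And a sketch of the explicit alternative recommended for performance-critical paths: a device-only buffer, a pinned host staging buffer, and an asynchronous copy ordered on a stream. The function name and layout are illustrative, and error handling is omitted for brevity.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Illustrative upload path: explicit allocations, explicit ordering.
void upload(const int *host_src, size_t n) {
    int *pinned = nullptr, *dev = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMallocHost(&pinned, n * sizeof(int)); // pinned host memory: DMA-able
    cudaMalloc(&dev, n * sizeof(int));        // device-only: never migrates

    memcpy(pinned, host_src, n * sizeof(int));

    // The copy is asynchronous and ordered on `stream`, so it can overlap
    // with work on other streams; nothing is left to on-demand paging.
    cudaMemcpyAsync(dev, pinned, n * sizeof(int),
                    cudaMemcpyHostToDevice, stream);

    // ... launch kernels that consume `dev` on the same stream ...

    cudaStreamSynchronize(stream);
    cudaFree(dev);
    cudaFreeHost(pinned);
    cudaStreamDestroy(stream);
}
```

If managed memory must stay, cudaMemPrefetchAsync(ptr, bytes, device, stream) moves pages to the target device ahead of the kernel launch instead of faulting them in one page at a time.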