Understanding GPU Architecture (cvw.cac.cornell.edu)

🤖 AI Summary
This roadmap introduces GPU architecture for developers preparing applications to run on GPUs, emphasizing how GPUs differ from CPUs and what those differences mean in practice for GPGPU. It assumes no prior parallel-programming background and uses NVIDIA CUDA sample programs as hands-on exercises.

By the end, readers should be able to name and compare GPU components (e.g., streaming multiprocessors/SMs, warps and SIMT scheduling, many lightweight cores), describe the memory hierarchy (registers, shared memory, L1/L2 caches, global memory) together with its sizes and speeds on specific NVIDIA devices, and explain how those features shape program design.

For the AI/ML community, the guide clarifies why GPUs excel at the throughput-oriented, data-parallel workloads common in training and inference, and how factors like memory bandwidth, occupancy, warp divergence, and latency hiding affect performance. Practically, it teaches which classes of software map well to GPUs, how to structure kernels and data movement for high utilization (see the sketch below), and how to use the CUDA Toolkit, on systems like Frontera or any CUDA-enabled GPU, to profile and tune code. This foundational knowledge helps ML engineers make better architecture choices, optimize kernels, and translate algorithmic ideas into efficient, hardware-aware implementations.
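To make the kernel-and-data-movement pattern concrete, here is a minimal sketch of a SAXPY kernel using a grid-stride loop. It is not taken from the Cornell roadmap or the CUDA samples; the kernel name, sizes, and launch parameters are illustrative. It shows the three pieces the summary mentions: explicit transfers between host and device global memory, a data-parallel kernel where every thread does uniform work (so warps do not diverge), and a block size chosen as a starting point for occupancy tuning.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride loop: a fixed-size grid covers any n, and keeping many
// threads resident per SM lets the scheduler hide global-memory latency.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        y[i] = a * x[i] + y[i];  // uniform work per thread: no warp divergence
    }
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffers
    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Explicit data movement: host -> device global memory
    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block is a common starting point for occupancy tuning
    const int block = 256;
    const int grid = (n + block - 1) / block;
    saxpy<<<grid, block>>>(n, 2.0f, dx, dy);
    cudaDeviceSynchronize();

    // Device -> host, then spot-check one result (2*1 + 2 = 4)
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);

    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return 0;
}
```

Even in this small sketch, the two cudaMemcpy calls will usually dominate the runtime, which is why the roadmap stresses memory bandwidth and data movement as first-order design concerns rather than afterthoughts.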