A Gentle Introduction to CUDA PTX (philipfabianek.com)

🤖 AI Summary
This NVIDIA-focused primer explains Parallel Thread Execution (PTX) — the virtual ISA that sits between CUDA C++ and GPU hardware — and provides a hands-on playground (add_kernel.ptx + main.cu) so developers can write, load, and JIT-run PTX using the low-level CUDA Driver API (cuLaunchKernel). The post shows why PTX matters: it provides forward compatibility across GPU generations, exposes cutting-edge instructions (e.g., wgmma warpgroup matrix ops) before C++ wrappers exist, and lets frameworks like Triton emit PTX and rely on the driver to generate hardware-specific SASS.

Technically, the article walks through a complete element-wise vector-add kernel in PTX, highlighting the two-stage compilation model (PTX → ptxas → SASS), the file preamble (.version, .target sm_70, .address_size 64), and the explicit kernel signature (.visible .entry with .param types). It details register allocation (.reg .b64, .f32, .pred), common opcode syntax (output operand before inputs), and the core instructions used in the example: mov and mad.lo.s32 to compute the global thread index, setp and @%p bra for bounds checks, mul.wide.s32 and add.s64 for byte offsets, and ld.global.f32 / st.global.f32 for memory I/O.

The repo includes build/run commands (nvcc main.cu -o ptx_runner -lcuda; ./ptx_runner) and is a practical starting point for performance analysis, hardware feature access, and learning how PTX maps high-level kernels to GPU execution.
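A minimal sketch of what such a vector-add kernel can look like in PTX — the kernel name, register counts, and .version number are illustrative, not the article's exact listing:

```ptx
.version 7.0
.target sm_70
.address_size 64

.visible .entry add_kernel(
    .param .u64 param_a,
    .param .u64 param_b,
    .param .u64 param_c,
    .param .u32 param_n
)
{
    .reg .pred %p<2>;
    .reg .b32  %r<6>;
    .reg .f32  %f<4>;
    .reg .b64  %rd<11>;

    // idx = blockIdx.x * blockDim.x + threadIdx.x
    mov.u32    %r1, %ctaid.x;
    mov.u32    %r2, %ntid.x;
    mov.u32    %r3, %tid.x;
    mad.lo.s32 %r4, %r1, %r2, %r3;

    // bounds check: threads past the end of the arrays do nothing
    ld.param.u32 %r5, [param_n];
    setp.ge.s32  %p1, %r4, %r5;
    @%p1 bra     DONE;

    // load the pointer parameters and convert to global address space
    ld.param.u64       %rd1, [param_a];
    ld.param.u64       %rd2, [param_b];
    ld.param.u64       %rd3, [param_c];
    cvta.to.global.u64 %rd4, %rd1;
    cvta.to.global.u64 %rd5, %rd2;
    cvta.to.global.u64 %rd6, %rd3;

    // byte offset = idx * sizeof(float), widened to 64 bits
    mul.wide.s32 %rd7, %r4, 4;
    add.s64      %rd8, %rd4, %rd7;
    add.s64      %rd9, %rd5, %rd7;
    add.s64      %rd10, %rd6, %rd7;

    // c[idx] = a[idx] + b[idx]
    ld.global.f32 %f1, [%rd8];
    ld.global.f32 %f2, [%rd9];
    add.f32       %f3, %f1, %f2;
    st.global.f32 [%rd10], %f3;

DONE:
    ret;
}
```

Note the opcode conventions the article highlights: every instruction writes its output operand first, and the type suffix (.s32, .f32, .u64) is explicit on each operation rather than inferred from the registers.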
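On the host side, loading that file and letting the driver JIT it to SASS takes only a handful of Driver API calls. A sketch of the pattern (error checking omitted; the add_kernel.ptx filename, kernel name, and launch configuration are assumptions, not the repo's exact main.cu):

```cpp
#include <cuda.h>      // CUDA Driver API; link with -lcuda
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // Read the PTX source as plain text.
    std::ifstream file("add_kernel.ptx");
    std::stringstream ss;
    ss << file.rdbuf();
    std::string ptx = ss.str();

    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    // The driver JIT-compiles the PTX to SASS for this specific GPU here.
    CUmodule mod;      cuModuleLoadData(&mod, ptx.c_str());
    CUfunction kernel; cuModuleGetFunction(&kernel, mod, "add_kernel");

    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);
    CUdeviceptr da, db, dc;
    cuMemAlloc(&da, bytes); cuMemAlloc(&db, bytes); cuMemAlloc(&dc, bytes);
    cuMemcpyHtoD(da, ha.data(), bytes);
    cuMemcpyHtoD(db, hb.data(), bytes);

    // Kernel arguments are passed as an array of pointers to the values.
    void* args[] = { &da, &db, &dc, &n };
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    cuLaunchKernel(kernel, blocks, 1, 1, threads, 1, 1,
                   0 /* shared mem */, nullptr /* stream */, args, nullptr);
    cuCtxSynchronize();

    cuMemcpyDtoH(hc.data(), dc, bytes);
    printf("c[0] = %f\n", hc[0]);

    cuMemFree(da); cuMemFree(db); cuMemFree(dc);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

The key step is cuModuleLoadData: it accepts PTX text directly, which is exactly how frameworks like Triton hand generated kernels to the driver at runtime.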