Continuous Nvidia CUDA Profiling in Production (www.polarsignals.com)

🤖 AI Summary
Polar Signals today released, as part of parca-agent v0.43.0, what they believe is the first open-source, low-overhead NVIDIA CUDA profiler designed for always-on production use. It avoids the invasiveness and heavy overhead of tools like Nsight by combining a small injected shim library (parcagpu), NVIDIA’s CUPTI, lightweight USDT probes, eBPF, and the kernel’s perf event ring buffer to stream per-kernel timing and context into parca-agent with minimal copies and no filesystem or network serialization.

Technically, parcagpu is injected into CUDA applications via CUDA_INJECTION64_PATH and subscribes to CUPTI runtime callbacks and activity records. It exposes two USDT probes, cuda_correlation (fired on launch) and kernel_executed (fired on completion), which eBPF programs attach to; the probes are discovered via dlopen and the .note.stapsdt section. The eBPF programs write probe data directly into perf ring buffers for a near-zero-copy handoff to the Go agent. Correlation IDs match the CPU stack trace captured at launch to the hardware-timestamped GPU execution record. The system supports both regular launches and CUDA Graph replay (graph/node IDs), emits device and stream labels, and works on AMD64 and ARM64.

For ML/AI teams, this enables safe continuous profiling in production to pinpoint kernel hotspots, stream-level parallelism, and graph inefficiencies without the high overhead of traditional profilers. To enable it, run parca-agent with --instrument-cuda-launch and set CUDA_INJECTION64_PATH to libparcagpucupti.so.
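
The correlation-ID matching described in the summary can be illustrated with a small sketch. This is not parca-agent's code (the agent is written in Go): the `LaunchEvent` and `KernelRecord` types, their fields, and the `attribute_gpu_time` function are hypothetical names invented here, standing in for the data carried by the cuda_correlation and kernel_executed probes. It only shows the general idea of joining a CPU stack captured at launch with a GPU timing record that shares the same correlation ID.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchEvent:
    """Hypothetical stand-in for what a launch-time probe reports."""
    correlation_id: int
    cpu_stack: tuple  # CPU frames at launch, outermost first


@dataclass(frozen=True)
class KernelRecord:
    """Hypothetical stand-in for what a completion-time probe reports."""
    correlation_id: int
    kernel_name: str
    start_ns: int  # GPU timestamp at kernel start
    end_ns: int    # GPU timestamp at kernel end


def attribute_gpu_time(launches, kernels):
    """Sum GPU time per (CPU stack + kernel name), joined on correlation ID."""
    stacks = {event.correlation_id: event.cpu_stack for event in launches}
    totals = defaultdict(int)
    for record in kernels:
        stack = stacks.get(record.correlation_id)
        if stack is None:
            # No launch was seen for this ID, so the record cannot be attributed.
            continue
        totals[stack + (record.kernel_name,)] += record.end_ns - record.start_ns
    return dict(totals)
```

Appending the kernel name as the leaf frame is one plausible way to let a flame graph show GPU time underneath the CPU code that launched it; whether the real agent shapes its samples this way is an assumption.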