Parrot – A C++ library for fused array operations using CUDA/Thrust (nvlabs.github.io)

0 points 11 hours ago

🤖 AI Summary

Parrot is a C++ library for GPU-accelerated, fused array operations built on CUDA/Thrust. It provides chainable, high-level array primitives with "fused evaluation" semantics: operations that can be fused are combined, which avoids unnecessary intermediate allocations and kernel launches. The API is idiomatic C++ (e.g., range().as<float>().reshape(...), maxr<2>().replicate(...), exp(), sum<2>()), so common patterns such as a row-wise softmax can be expressed succinctly and executed efficiently on the GPU. This matters for the AI/ML community because fusion reduces memory bandwidth pressure and launch overhead, two major bottlenecks in GPU numerical code, so kernels derived from chained expressions often run much faster than separately composed Thrust kernels or code that materializes temporaries. Because Parrot builds on CUDA/Thrust, it is easy to drop into existing C++/CUDA toolchains, where it serves as a middle ground between hand-written kernels and heavier frameworks. The project provides a softmax benchmark and performance comparison to illustrate these gains. Developers working on custom data pipelines, inference kernels, or compact ML runtimes can use Parrot to simplify code while improving throughput and memory efficiency.
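
To make the fusion idea concrete, here is a minimal CPU-only sketch in plain standard C++. It is not Parrot's API and not GPU code. The function names `softmax_unfused` and `softmax_fused` are hypothetical and chosen only for illustration. The first version writes a full temporary after every step, roughly what happens when each operation is a separate kernel. The second does the same arithmetic row by row with no intermediate arrays, which is the effect fusion aims for.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Unfused row-wise softmax: every step materializes a temporary buffer,
// analogous to launching one kernel per operation.
// (Hypothetical helper for illustration; not part of Parrot.)
std::vector<float> softmax_unfused(const std::vector<float>& x,
                                   std::size_t rows, std::size_t cols) {
    assert(cols > 0 && x.size() == rows * cols);
    std::vector<float> row_max(rows), shifted(x.size()), exps(x.size());
    std::vector<float> row_sum(rows), out(x.size());

    for (std::size_t r = 0; r < rows; ++r)            // pass 1: row maxima
        row_max[r] = *std::max_element(x.begin() + r * cols,
                                       x.begin() + (r + 1) * cols);
    for (std::size_t i = 0; i < x.size(); ++i)        // pass 2: subtract max
        shifted[i] = x[i] - row_max[i / cols];
    for (std::size_t i = 0; i < x.size(); ++i)        // pass 3: exponentiate
        exps[i] = std::exp(shifted[i]);
    for (std::size_t r = 0; r < rows; ++r)            // pass 4: row sums
        row_sum[r] = std::accumulate(exps.begin() + r * cols,
                                     exps.begin() + (r + 1) * cols, 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i)        // pass 5: normalize
        out[i] = exps[i] / row_sum[i / cols];
    return out;
}

// Fused row-wise softmax: same arithmetic, but only the output buffer is
// allocated and each row is finished before moving to the next.
// (Hypothetical helper for illustration; not part of Parrot.)
std::vector<float> softmax_fused(const std::vector<float>& x,
                                 std::size_t rows, std::size_t cols) {
    assert(cols > 0 && x.size() == rows * cols);
    std::vector<float> out(x.size());
    for (std::size_t r = 0; r < rows; ++r) {
        const float* in = x.data() + r * cols;
        float* o = out.data() + r * cols;
        const float m = *std::max_element(in, in + cols);
        float sum = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {      // shift, exp, and sum together
            o[c] = std::exp(in[c] - m);
            sum += o[c];
        }
        for (std::size_t c = 0; c < cols; ++c)        // normalize in place
            o[c] /= sum;
    }
    return out;
}
```

Both functions take a row-major buffer plus its shape and should return matching results up to floating-point rounding. The unfused version allocates three element-sized temporaries and two per-row buffers besides the output, and the fused one allocates only the output. On a GPU the same difference shows up as extra global-memory traffic and extra kernel launches.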
