Show HN: I Parallelized RNN Training from O(T) to O(log T) Using CUDA (dhruvmsheth.github.io)

🤖 AI Summary
A Caltech CS179 project implements the “Were RNNs All We Needed?” idea by reworking GRU/LSTM gates so they depend only on the current input, yielding simplified minGRU/minLSTM recurrences of the form h_t = A_t · h_{t-1} + B_t. That lets solving the recurrence be framed as composing affine transforms (A, B) under the associative operator (A_2 A_1, A_2 B_1 + B_2), so a parallel prefix (Blelloch) scan cuts the dependency chain from O(T) serial steps to O(log T) steps on GPUs. The author built a CUDA scan implementation (with log-space numerics for stability) and released the code on GitHub to validate the paper’s claim and measure practical speedups.

The project highlights both the promise and the engineering reality: by fusing gate computations into one large kernel and using shared-memory tiling, the pipeline avoids the thousands of tiny kernel launches that plague naive GPU implementations. Profiling on an i9-12900K + RTX 4090 shows mixed results: for long sequences the GPU scan wins (T=65,536: CPU scan ≈ 10,989 ms vs GPU scan ≈ 5,330 ms), but for mid-length sequences the GPU can lag due to per-step matvec launches (e.g., T=4,096: CPU scan ≈ 300 ms vs GPU scan ≈ 340 ms).

The main remaining bottleneck was the per-step output projection (72% of runtime); replacing those many matvec kernels with a single cuBLAS GEMM could further cut latency. The work demonstrates that RNN recurrences can be made highly parallel and computationally competitive, while underscoring the practical GPU optimization tradeoffs that determine real-world gains.
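To make the associative-operator idea concrete, here is a minimal sketch of composing affine steps with a parallel scan, using Thrust's inclusive_scan as a stand-in for the project's hand-written Blelloch scan. It assumes elementwise (diagonal) gates as in minGRU, so each hidden unit scans independently; the Affine/Compose names are illustrative and not taken from the repo.

```cuda
// Sketch under the assumptions above: each scan element is the affine step
// h -> a*h + b for one hidden unit at one timestep.
#include <thrust/device_vector.h>
#include <thrust/scan.h>

struct Affine { float a, b; };  // encodes h_t = a * h_{t-1} + b

// Associative composition: applying (a1, b1) then (a2, b2) gives
// h -> a2*(a1*h + b1) + b2 = (a2*a1)*h + (a2*b1 + b2).
struct Compose {
    __host__ __device__
    Affine operator()(const Affine& x, const Affine& y) const {
        return Affine{ y.a * x.a, y.a * x.b + y.b };
    }
};

// After the scan, element t holds the composition of steps 1..t, so
// h_t = result[t].a * h_0 + result[t].b (just result[t].b when h_0 = 0).
void scan_hidden_states(thrust::device_vector<Affine>& steps) {
    thrust::inclusive_scan(steps.begin(), steps.end(), steps.begin(), Compose());
}
```

Because Compose is associative, the scan can be evaluated in O(log T) parallel steps; a minGRU would fill each Affine with a = 1 - z_t and b = z_t * h̃_t before scanning.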