We reverse-engineered Flash Attention 4 (modal.com)

🤖 AI Summary
Researchers reverse-engineered Flash Attention 4 (FA4), the new CUDA kernel optimized for NVIDIA’s Blackwell GPUs and reported to deliver roughly a 20% speedup over cuDNN attention. With FA4’s source public, the authors traced its tile-based, bf16 attention pipeline: query tiles are loaded into shared memory, keys/values are streamed in, Tensor Cores compute unnormalized scores, a Softmax warp normalizes scores held in tensor memory, a Correction warp conditionally rescales the output accumulators, and an Epilogue warp writes results back to global memory. The biggest innovation isn’t a single math trick but a much more complex asynchronous pipeline: work is split across warp-specialized stages (a warp being a group of 32 threads), with the warp schedulers rapidly switching between pipeline steps to hide latency and maximize hardware utilization.

Two notable technical advances were uncovered: a fast approximate exponentiation implemented as a cubic polynomial, letting the Softmax stage use abundant CUDA cores instead of scarce special function units (SFUs), and a smarter online softmax that only rescales when the running maximum changes enough to matter, cutting rescale operations by roughly 10×. These changes improve throughput and reduce SFU contention while preserving acceptable numerical stability for large-scale generative workloads.

The reverse engineering highlights a broader shift in GPU programming toward manually managed asynchrony and tile/warp specialization, raising the bar for inference engine authors and motivating lower-level DSLs and libraries (CUTLASS, CuTe, CuTile) to tame this complexity.
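
To make the streaming dataflow concrete, here is a minimal sketch in plain CUDA. It assumes fp32 instead of bf16, one thread per query row, and none of FA4's shared-memory tiling, warp specialization, tensor memory, or Tensor Core MMAs; it only mirrors the structure described above: keys and values flow past a resident query while an online softmax maintains a running maximum, denominator, and output accumulator, and an epilogue normalizes and writes the row back.

```cuda
// Drastically simplified scalar model of streaming attention with an online
// softmax. Assumptions (not FA4): fp32, one thread per query row, head_dim <=
// 128, no tiling, no warp specialization, no tensor memory, no Tensor Cores.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void attention_rowwise(const float *Q, const float *K, const float *V,
                                  float *O, int n, int d) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= n) return;

    float m = -INFINITY;    // running row maximum
    float l = 0.0f;         // running softmax denominator
    float acc[128] = {0};   // running (unnormalized) output row

    for (int k = 0; k < n; ++k) {              // "stream" keys/values past the query
        float s = 0.0f;
        for (int j = 0; j < d; ++j) s += Q[q * d + j] * K[k * d + j];
        s *= rsqrtf((float)d);                 // scaled dot-product score

        float m_new = fmaxf(m, s);
        float corr  = expf(m - m_new);         // rescale factor for old accumulators
        float p     = expf(s - m_new);         // exponentiated score

        l = l * corr + p;
        for (int j = 0; j < d; ++j)
            acc[j] = acc[j] * corr + p * V[k * d + j];
        m = m_new;
    }
    for (int j = 0; j < d; ++j)
        O[q * d + j] = acc[j] / l;             // epilogue: normalize and write back
}

int main() {
    const int n = 256, d = 64;
    size_t bytes = (size_t)n * d * sizeof(float);
    float *Q, *K, *V, *O;
    cudaMallocManaged(&Q, bytes); cudaMallocManaged(&K, bytes);
    cudaMallocManaged(&V, bytes); cudaMallocManaged(&O, bytes);
    for (int i = 0; i < n * d; ++i) {
        Q[i] = (float)rand() / RAND_MAX - 0.5f;
        K[i] = (float)rand() / RAND_MAX - 0.5f;
        V[i] = (float)rand() / RAND_MAX - 0.5f;
    }
    attention_rowwise<<<(n + 127) / 128, 128>>>(Q, K, V, O, n, d);
    cudaDeviceSynchronize();
    printf("O[0][0] = %f\n", O[0]);
    return 0;
}
```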
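
The exponentiation trick can be imitated with ordinary fused multiply-adds. The sketch below splits the argument into integer and fractional parts, approximates 2^f with a cubic polynomial evaluated on CUDA cores, and restores the integer part through the float exponent field; the coefficients are a truncated Taylor series of 2^f chosen for illustration, not FA4's actual fitted constants.

```cuda
// Sketch of a cubic-polynomial exp2 that runs on CUDA cores (FMAs) instead of
// the SFU's MUFU.EX2. Coefficients are a truncated Taylor series of 2^f on
// [0,1) -- illustrative only, not FA4's fitted values.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

__device__ __forceinline__ float exp2_cubic(float x) {
    float i = floorf(x);                       // integer part
    float f = x - i;                           // fractional part in [0, 1)
    // 2^f = e^(f ln 2) ~= 1 + f*ln2 + (f*ln2)^2/2 + (f*ln2)^3/6
    float p = fmaf(fmaf(fmaf(0.05550411f, f, 0.24022651f), f,
                        0.69314718f), f, 1.0f);
    return ldexpf(p, (int)i);                  // multiply by 2^i via the exponent bits
}

__global__ void compare(const float *x, float *approx, float *exact, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    approx[t] = exp2_cubic(x[t]);
    exact[t]  = exp2f(x[t]);                   // the SFU-backed baseline
}

int main() {
    const int n = 1024;
    float *x, *a, *e;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&e, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = -20.0f + 40.0f * i / n;  // plausible score range
    compare<<<(n + 255) / 256, 256>>>(x, a, e, n);
    cudaDeviceSynchronize();
    float max_rel = 0.0f;
    for (int i = 0; i < n; ++i)
        max_rel = fmaxf(max_rel, fabsf(a[i] - e[i]) / e[i]);
    printf("max relative error: %g\n", max_rel);  // under 1% for these coefficients
    return 0;
}
```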
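
The selective rescaling can likewise be modeled as a guarded update of the running maximum. In the toy below, the margin constant and the single-threaded setup are assumptions for illustration, not FA4's actual policy: the running maximum is allowed to lag, so exponentials may slightly exceed 1, which the fp32 accumulator absorbs, and the expensive correction step fires far less often than in an eager online softmax.

```cuda
// Sketch of "lazy" online-softmax rescaling: the running maximum (and with it
// the correction of everything accumulated so far) is only updated when a new
// score beats it by more than RESCALE_MARGIN. The margin is a hypothetical
// value, not FA4's; it just has to keep exp(s - m) small enough that the fp32
// accumulator cannot overflow.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define RESCALE_MARGIN 8.0f   // hypothetical; exp(8) ~ 3e3 is harmless in fp32

__global__ void lazy_online_softmax(const float *scores, int n,
                                    float *denom, int *lazy, int *eager) {
    // single-thread toy: one softmax row, streamed score by score
    float m = scores[0];        // lagging maximum used for the exponentials
    float m_true = scores[0];   // true running maximum (what an eager kernel tracks)
    float l = 0.0f;             // running denominator
    int lazy_rescales = 0, eager_rescales = 0;

    for (int k = 0; k < n; ++k) {
        float s = scores[k];
        if (s > m_true) { m_true = s; ++eager_rescales; }  // eager policy rescales here
        if (s > m + RESCALE_MARGIN) {                      // lazy policy: only when it matters
            l *= expf(m - s);   // correct everything accumulated so far
            m = s;
            ++lazy_rescales;
        }
        l += expf(s - m);       // may slightly exceed 1.0; fp32 absorbs it
    }
    // softmax stays exact for any consistent reference m: p_i = exp(s_i - m) / l
    *denom = l;
    *lazy = lazy_rescales;
    *eager = eager_rescales;
}

int main() {
    const int n = 4096;
    float *scores, *denom; int *lazy, *eager;
    cudaMallocManaged(&scores, n * sizeof(float));
    cudaMallocManaged(&denom, sizeof(float));
    cudaMallocManaged(&lazy, sizeof(int));
    cudaMallocManaged(&eager, sizeof(int));
    for (int i = 0; i < n; ++i)
        scores[i] = 10.0f * rand() / RAND_MAX;   // toy attention scores
    lazy_online_softmax<<<1, 1>>>(scores, n, denom, lazy, eager);
    cudaDeviceSynchronize();
    printf("rescales: %d lazy vs %d eager (denominator %.3f)\n", *lazy, *eager, *denom);
    return 0;
}
```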