🤖 AI Summary
Researchers reverse-engineered Flash Attention 4 (FA4), the new CUDA attention kernel optimized for NVIDIA’s Blackwell GPUs and reported to deliver roughly a 20% speedup over cuDNN attention. With FA4’s source public, the authors traced its tile-based bf16 attention pipeline: query tiles are loaded into shared memory, key/value tiles are streamed in, Tensor Cores compute unnormalized scores, a Softmax warp normalizes those scores in tensor memory, a Correction warp conditionally rescales partial outputs, and an Epilogue warp writes results back to global memory. The biggest innovation isn’t a single math trick but a much more complex asynchronous pipeline: work is split into warp-specialized stages (each warp being a group of 32 threads with a distinct role), with the hardware warp schedulers rapidly switching between pipeline steps to hide latency and maximize hardware utilization.
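To make the warp-specialization idea concrete, here is a minimal toy CUDA kernel in which warp 0 acts as a dedicated loader that stages tiles of data into shared memory while the remaining warps consume them. This is only a sketch of the role split under simplifying assumptions: the real FA4 pipeline uses TMA, asynchronous barriers, tensor memory, and Tensor Core MMAs rather than plain loads and block-wide `__syncthreads()`, and every name and size below is illustrative rather than taken from the FA4 source.

```cuda
// Toy illustration of warp specialization: warp 0 is a dedicated "load" warp
// that stages one tile of the input into shared memory per iteration, while
// the remaining warps reduce the staged tile. Names and sizes are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE      = 256;            // elements staged per iteration
constexpr int NUM_WARPS = 4;              // 1 load warp + 3 compute warps
constexpr int THREADS   = NUM_WARPS * 32;

__global__ void warp_specialized_sum(const float* in, int n, float* out) {
    __shared__ float tile[TILE];          // staging buffer in shared memory
    __shared__ float partial[THREADS];

    const int warp = threadIdx.x / 32;
    const int lane = threadIdx.x % 32;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        if (warp == 0) {
            // Load warp: copy one tile from global to shared memory.
            for (int i = lane; i < TILE; i += 32) {
                int idx = t * TILE + i;
                tile[i] = (idx < n) ? in[idx] : 0.0f;
            }
        }
        __syncthreads();                  // hand the tile to the compute warps
        if (warp != 0) {
            // Compute warps: reduce the tile that was just staged.
            for (int i = threadIdx.x - 32; i < TILE; i += THREADS - 32)
                acc += tile[i];
        }
        __syncthreads();                  // buffer can be reused next round
    }

    partial[threadIdx.x] = acc;
    __syncthreads();
    if (threadIdx.x == 0) {               // serial final reduction for clarity
        float s = 0.0f;
        for (int i = 0; i < THREADS; ++i) s += partial[i];
        *out = s;
    }
}

int main() {
    const int n = 1 << 16;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    warp_specialized_sum<<<1, THREADS>>>(in, n, out);
    cudaDeviceSynchronize();
    printf("sum = %.1f (expected %d)\n", *out, n);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

In FA4 itself, per the write-up, the stages overlap asynchronously with multiple tiles in flight rather than lock-stepping at a block-wide barrier, which is what lets the schedulers hide memory latency behind Tensor Core work.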
Two notable technical advances were uncovered: a fast approximate exponential implemented as a cubic polynomial, so the Softmax stage can run on the abundant CUDA cores instead of the scarce special function units (SFUs), and a smarter online softmax that only rescales its running sums when the running maximum changes enough to matter, cutting rescale operations by roughly 10×. Together these changes improve throughput and reduce SFU contention while preserving acceptable numerical stability for large-scale generative workloads. The reverse engineering highlights a broader shift in GPU programming toward manually managed asynchrony and tile/warp specialization, raising the bar for inference-engine authors and motivating lower-level DSLs and libraries (CUTLASS, CuTe, CuTile) to tame this complexity.
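The exponentiation trick can be sketched as follows: rewrite e^x as 2^(x·log2 e), split the exponent into integer and fractional parts, evaluate a short polynomial for the fractional part with FMA instructions on the CUDA cores, and fold the integer part back in through the float exponent bits. The cubic coefficients, test range, and error check below are an illustrative fit, not the constants used in FA4.

```cuda
// Sketch of the "exponential on CUDA cores" idea: approximate e^x without
// the SFU's MUFU.EX2 instruction by evaluating a cubic polynomial for the
// fractional part of x*log2(e) with three FMAs, then scaling by 2^n.
// Coefficients are an illustrative fit, not the ones used in FA4.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__device__ __forceinline__ float exp_cubic(float x) {
    const float LOG2E = 1.4426950408889634f;
    float y = x * LOG2E;                  // e^x = 2^y
    float n = floorf(y);                  // integer part of the exponent
    float f = y - n;                      // fractional part in [0, 1)
    // Cubic fit of 2^f on [0, 1): three FMAs on the CUDA cores.
    float p = fmaf(f, 0.07944154f, 0.22449434f);
    p = fmaf(f, p, 0.69606564f);
    p = fmaf(f, p, 1.0f);
    return ldexpf(p, (int)n);             // multiply by 2^n via exponent bits
}

__global__ void compare(const float* x, float* approx, float* exact, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        approx[i] = exp_cubic(x[i]);
        exact[i]  = expf(x[i]);
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *a, *e;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&e, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = -20.0f + 40.0f * i / n;  // softmax-ish range
    compare<<<(n + 255) / 256, 256>>>(x, a, e, n);
    cudaDeviceSynchronize();
    float max_rel = 0.0f;
    for (int i = 0; i < n; ++i)
        max_rel = fmaxf(max_rel, fabsf(a[i] - e[i]) / e[i]);
    printf("max relative error: %g\n", max_rel);  // on the order of 1e-4 for this fit
    cudaFree(x); cudaFree(a); cudaFree(e);
    return 0;
}
```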
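And here is a toy of the rescaling heuristic: a scalar (head_dim = 1) online softmax that processes a row of scores in tiles but only moves its reference maximum, and hence only pays the exponential rescale of the running numerator and denominator, when a tile's maximum exceeds the current reference by more than a threshold. Skipping the update is mathematically exact as long as the same reference is used consistently; the threshold just bounds how far exp(score − m) can drift above 1. The threshold value, tile size, and one-thread-per-row layout are assumptions for the sketch, which only approximates what the write-up attributes to FA4's Correction warp.

```cuda
// Sketch of lazy rescaling in online softmax: keep the reference max m fixed
// unless a tile's maximum exceeds it by more than a threshold, so the running
// numerator and denominator are rescaled far less often. Scalar values
// (head_dim = 1), one thread per row; all constants are illustrative.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>
#include <math_constants.h>

constexpr int TILE = 64;

__global__ void lazy_online_softmax(const float* scores, const float* values,
                                    int rows, int cols, float threshold,
                                    float* out, int* rescales) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;

    const float* s = scores + r * cols;
    const float* v = values + r * cols;
    float m   = -CUDART_INF_F;  // reference max used inside the exponent
    float l   = 0.0f;           // running denominator: sum_j exp(s_j - m)
    float acc = 0.0f;           // running numerator:   sum_j exp(s_j - m) * v_j
    int count = 0;

    for (int t0 = 0; t0 < cols; t0 += TILE) {
        int t1 = min(t0 + TILE, cols);
        float tile_max = -CUDART_INF_F;
        for (int j = t0; j < t1; ++j) tile_max = fmaxf(tile_max, s[j]);

        // Only move the reference max (and rescale) if the change matters.
        if (tile_max > m + threshold) {
            float scale = expf(m - tile_max);   // correction factor
            l   *= scale;
            acc *= scale;
            m = tile_max;
            ++count;
        }
        for (int j = t0; j < t1; ++j) {
            float p = expf(s[j] - m);           // may slightly exceed 1.0
            l   += p;
            acc += p * v[j];
        }
    }
    out[r] = acc / l;           // exact softmax-weighted average of v
    rescales[r] = count;
}

int main() {
    const int rows = 4, cols = 4096;
    float *s, *v, *out; int *resc;
    cudaMallocManaged(&s, rows * cols * sizeof(float));
    cudaMallocManaged(&v, rows * cols * sizeof(float));
    cudaMallocManaged(&out, rows * sizeof(float));
    cudaMallocManaged(&resc, rows * sizeof(int));
    for (int i = 0; i < rows * cols; ++i) {
        s[i] = sinf(0.001f * i) * 8.0f;         // scores wander but stay bounded
        v[i] = cosf(0.01f * i);
    }
    lazy_online_softmax<<<1, rows>>>(s, v, rows, cols, /*threshold=*/2.0f,
                                     out, resc);
    cudaDeviceSynchronize();
    for (int r = 0; r < rows; ++r)
        printf("row %d: out = %+.6f, rescales = %d (of %d tiles)\n",
               r, out[r], resc[r], cols / TILE);
    cudaFree(s); cudaFree(v); cudaFree(out); cudaFree(resc);
    return 0;
}
```

With threshold set to 0 the loop degenerates to the usual online softmax that rescales whenever the maximum grows at all; raising the threshold trades a bounded amount of exponent headroom for far fewer rescales, which is the effect the ≈10× figure refers to.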