🤖 AI Summary
Researchers introduced Sparse VideoGen2, a training-free framework that speeds up inference for video diffusion models by combining a lightweight Semantic-Aware Permutation with efficient dynamic attention kernels. Instead of relying on static sparse patterns (local windows or strided attention) that misidentify important tokens and cause irregular memory access, the method rearranges tokens on the fly (per timestep and per layer) so semantically similar tokens are contiguous in memory. Q and K/V tokens are permuted independently to improve selection accuracy, placing the method on the Pareto frontier of visual fidelity versus generation speed.
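To make the idea concrete, here is a minimal sketch of a semantic-aware permutation: tokens are clustered with a few k-means steps, then sorted by cluster ID so that similar tokens become contiguous in memory, with queries and keys/values permuted independently. The function name, cluster count, and k-means details are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def semantic_permutation(x: torch.Tensor, num_clusters: int, iters: int = 10):
    """Cluster tokens with a few k-means steps and return a permutation that
    places same-cluster tokens contiguously. x: (num_tokens, dim).
    Illustrative sketch; not the authors' implementation."""
    n, _ = x.shape
    # Initialize centroids from randomly chosen tokens.
    centroids = x[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(iters):
        # Assign each token to its nearest centroid.
        assign = torch.cdist(x, centroids).argmin(dim=1)
        # Recompute centroids, keeping the old one if a cluster is empty.
        for c in range(num_clusters):
            mask = assign == c
            if mask.any():
                centroids[c] = x[mask].mean(dim=0)
    # Sorting by cluster id yields a permutation with contiguous clusters.
    perm = torch.argsort(assign, stable=True)
    cluster_sizes = torch.bincount(assign, minlength=num_clusters)
    return perm, cluster_sizes

# Q and K/V can use independently computed permutations, as long as the
# attention output is scattered back to the original token order afterwards.
tokens_q = torch.randn(4096, 128)   # toy query token features
tokens_kv = torch.randn(4096, 128)  # toy key/value token features
perm_q, _ = semantic_permutation(tokens_q, num_clusters=32)
perm_kv, sizes_kv = semantic_permutation(tokens_kv, num_clusters=32)
q_sorted = tokens_q[perm_q]         # similar queries now sit in contiguous blocks
kv_sorted = tokens_kv[perm_kv]
```

Because the permutation is recomputed per timestep and per layer, the clustering step has to stay cheap relative to the attention it prunes; a handful of k-means iterations, as sketched here, is one way to keep that overhead small.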
On the kernel side, the team built custom CUDA implementations of dynamic block-size attention compatible with FlashAttention-2 and FlashAttention-3. These kernels handle variable-sized clusters, decoupling kernel efficiency from K/V cluster size so throughput stays high even with many small clusters, while larger query block sizes sustain high TFLOPS. Together, semantic clustering plus hardware-aware, dynamic block-sparse attention turns theoretical sparsity into practical speedups and less wasted compute. The approach was validated on Wan 2.1 and HunyuanVideo benchmarks, showing significant inference acceleration without retraining, making it immediately useful for deploying faster high-quality video generation on GPUs.
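The sketch below emulates the dynamic block-sparse pattern in plain PyTorch rather than CUDA: each fixed-size query block ranks the variable-sized K/V clusters by centroid similarity and attends only to the top-k of them. The function name, the centroid-based scoring, and the top-k selection rule are assumptions for illustration; the real kernels fuse this into FlashAttention-style tiles.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, q_block, kv_cluster_sizes, topk):
    """Toy dense emulation of dynamic block-sparse attention: each query block
    attends only to the top-k K/V clusters ranked by centroid similarity.
    q, k, v: (num_tokens, dim), already permuted so clusters are contiguous.
    Illustrative sketch; the actual kernels operate on FlashAttention tiles."""
    n, _ = q.shape
    # Cluster boundaries implied by the variable cluster sizes.
    ends = torch.cumsum(kv_cluster_sizes, dim=0)
    starts = ends - kv_cluster_sizes
    # One centroid per K/V cluster for cheap relevance scoring.
    centroids = torch.stack(
        [k[int(s):int(e)].mean(dim=0) for s, e in zip(starts, ends)]
    )
    out = torch.empty_like(q)
    for qs in range(0, n, q_block):
        qe = min(qs + q_block, n)
        q_blk = q[qs:qe]
        # Rank K/V clusters by similarity to the query block's mean vector.
        scores = centroids @ q_blk.mean(dim=0)
        sel = torch.topk(scores, k=min(topk, centroids.shape[0])).indices
        # Gather the selected (variable-sized) clusters and attend densely over them.
        idx = torch.cat([torch.arange(int(starts[c]), int(ends[c])) for c in sel])
        attn = F.scaled_dot_product_attention(
            q_blk.unsqueeze(0), k[idx].unsqueeze(0), v[idx].unsqueeze(0)
        )
        out[qs:qe] = attn.squeeze(0)
    return out
```

A usage note under the same assumptions: feeding in the permuted `q_sorted` and `kv_sorted` from the previous sketch, with `kv_cluster_sizes` from the K/V clustering, reproduces the overall flow of permute, select clusters per query block, attend, then un-permute the output back to the original token order.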