FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution (zhuang2002.github.io)

🤖 AI Summary
FlashVSR introduces a distillation pipeline that converts a heavy full-attention diffusion teacher into a sparse, causal, one-step student tailored for streaming video super-resolution (VSR). The student runs autoregressively with a KV cache and locality-constrained sparse attention: each query attends only within a local spatial window and then sparsely to the top-k most relevant regions. This design both eliminates redundant computation and prevents high-resolution artifacts caused by positional-encoding periodicity (e.g., RoPE wrap-around when inference extends beyond the training range), enabling real-time, perceptually strong upscaling on ultra-high-resolution inputs.

On the decoder side, a Tiny Conditional (TC) Decoder conditions HR reconstruction on both the latents and the low-resolution frames, simplifying the inverse mapping and yielding a ~7× decoding speedup over the WanVAE decoder with visually indistinguishable output.

To support large-scale training, the authors assembled VSR-120K (≈120k video clips averaging over 350 frames, plus 180k HR images), curated with LAION-Aesthetic and MUSIQ quality filters and RAFT motion checks; only >1080p videos with sufficient temporal dynamics were kept. The dataset will be open-sourced in a future release.

Together, these contributions bring diffusion-quality VSR into a practical, streaming-capable regime by closing the train–inference resolution gap and dramatically cutting inference cost.
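To make the locality-constrained sparse attention concrete, here is a minimal single-head sketch: each query block attends to its neighboring key blocks plus the top-k key blocks ranked by block-mean similarity. The block pooling, the `local_blocks`/`top_k` parameters, and the omission of causal KV-cache bookkeeping are all simplifying assumptions for illustration, not FlashVSR's exact implementation.

```python
# Sketch of locality-constrained sparse attention with top-k block selection.
# Assumes single-head attention and seq_len divisible by block_size.
import torch
import torch.nn.functional as F

def sparse_local_topk_attention(q, k, v, block_size=64, local_blocks=1, top_k=4):
    """q, k, v: (seq_len, dim). Each query block attends to its local window
    of key blocks plus the top_k non-local blocks ranked by the similarity
    between block-mean queries and block-mean keys."""
    n, d = q.shape
    nb = n // block_size
    qb = q[: nb * block_size].view(nb, block_size, d)
    kb = k[: nb * block_size].view(nb, block_size, d)
    vb = v[: nb * block_size].view(nb, block_size, d)

    # Block-level summaries: mean-pool queries and keys within each block.
    q_mean = qb.mean(dim=1)                      # (nb, d)
    k_mean = kb.mean(dim=1)                      # (nb, d)
    scores = q_mean @ k_mean.t() / d ** 0.5      # (nb, nb) block relevance

    out = torch.empty_like(qb)
    for i in range(nb):
        # Local window: blocks within local_blocks of block i.
        lo, hi = max(0, i - local_blocks), min(nb, i + local_blocks + 1)
        local = set(range(lo, hi))
        # Sparse part: top-k most relevant blocks outside the local window.
        masked = scores[i].clone()
        masked[list(local)] = float("-inf")
        extra = set(masked.topk(min(top_k, nb - len(local))).indices.tolist())
        idx = sorted(local | extra)
        ks = kb[idx].reshape(-1, d)              # gather selected key blocks
        vs = vb[idx].reshape(-1, d)
        attn = F.softmax(qb[i] @ ks.t() / d ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out.view(nb * block_size, d)

# Toy usage: 512 tokens, 64-dim features.
q = torch.randn(512, 64); k = torch.randn(512, 64); v = torch.randn(512, 64)
y = sparse_local_topk_attention(q, k, v, block_size=64, local_blocks=1, top_k=2)
```

The payoff is that each query block touches a constant number of key blocks rather than the full sequence, which is what keeps cost roughly linear as resolution (and thus token count) grows.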
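The TC Decoder idea, HR reconstruction conditioned on both the latent and the LR frame, can be sketched as a small residual decoder. The channel counts, depth, and upsampling modes below are illustrative assumptions; only the conditioning structure follows the summary.

```python
# Sketch of a "tiny conditional" decoder: because the upsampled LR frame is
# given as a condition, the network only has to predict residual detail
# rather than invert the full latent-to-pixel mapping.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConditionalDecoder(nn.Module):
    def __init__(self, latent_ch=16, base_ch=64, scale=8):
        super().__init__()
        self.scale = scale  # spatial downsampling of latents vs. HR output
        self.latent_in = nn.Conv2d(latent_ch, base_ch, 3, padding=1)
        self.lr_in = nn.Conv2d(3, base_ch, 3, padding=1)
        self.body = nn.Sequential(
            nn.Conv2d(2 * base_ch, base_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(base_ch, base_ch, 3, padding=1), nn.SiLU(),
        )
        self.to_rgb = nn.Conv2d(base_ch, 3, 3, padding=1)

    def forward(self, latent, lr_frame):
        # Bring both inputs to HR resolution, fuse, and predict a residual.
        hr_size = (latent.shape[-2] * self.scale, latent.shape[-1] * self.scale)
        lr_up = F.interpolate(lr_frame, size=hr_size, mode="bilinear",
                              align_corners=False)
        z = F.interpolate(self.latent_in(latent), size=hr_size, mode="nearest")
        feats = self.body(torch.cat([z, self.lr_in(lr_up)], dim=1))
        return lr_up + self.to_rgb(feats)
```

Because the decoder only adds detail on top of the upsampled LR frame, it can be far shallower than a full VAE decoder, which is the intuition behind the reported ~7× speedup.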
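Finally, the VSR-120K curation gates read naturally as a per-clip predicate. The helper callables (`musiq_score`, `aesthetic_score`, `raft_mean_flow`) and every threshold below are hypothetical stand-ins for the MUSIQ, LAION-Aesthetic, and RAFT checks the summary names; the actual cutoffs are not stated here.

```python
# Sketch of the kind of filtering pipeline described for VSR-120K: keep only
# >1080p clips that pass quality and motion checks. All thresholds and helper
# functions are illustrative assumptions.
def keep_clip(clip, musiq_score, aesthetic_score, raft_mean_flow,
              min_height=1080, min_musiq=40.0, min_aesthetic=4.5,
              min_flow=1.0):
    """Return True if the clip passes resolution, quality, and motion gates."""
    if clip.height < min_height:
        return False                           # resolution gate: >1080p only
    if musiq_score(clip) < min_musiq:          # no-reference quality filter
        return False
    if aesthetic_score(clip) < min_aesthetic:  # LAION-Aesthetic-style filter
        return False
    # Motion gate: mean optical-flow magnitude must show real temporal
    # dynamics, so near-static clips are discarded.
    return raft_mean_flow(clip) >= min_flow
```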