Characterizing the Latency and Power Regimes of Open Text-to-Video Models (arxiv.org)

🤖 AI Summary
Researchers released a systematic study quantifying the latency and energy costs of open text-to-video (T2V) models, built around a compute-bound analytical model that predicts how compute and power scale with spatial resolution, temporal length, and the number of denoising steps. Validating the model on WAN2.1-T2V, they show empirically that runtime and energy grow roughly quadratically with both the spatial and temporal dimensions and linearly with the number of denoising steps: doubling resolution or clip length can quadruple cost, while halving the denoising steps roughly halves it. The paper also benchmarks six diverse open T2V models under their default settings to produce comparative runtime and energy profiles.

This work is significant because it turns qualitative concerns about “video’s huge energy bill” into quantitative scaling laws and reproducible measurements, giving researchers and practitioners concrete trade-offs between fidelity, clip length, and computational budget. Key technical implications include prioritizing algorithmic and architectural optimizations (fewer diffusion steps, spatiotemporal sparsity, distillation, or frame interpolation) and guiding hardware and deployment choices for sustainable generative-video systems. The study serves as both a benchmark reference and a roadmap for reducing the environmental and economic costs of T2V research and products.
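To make the reported scaling concrete, here is a minimal sketch (mine, not the paper’s; the function name and the baseline normalization are hypothetical) of a cost model that is quadratic in spatial size, quadratic in clip length, and linear in denoising steps:

```python
def relative_cost(spatial: float, frames: float, steps: float) -> float:
    """Relative T2V generation cost under the reported scaling laws:
    quadratic in spatial size, quadratic in temporal length, linear in
    denoising steps. Inputs are multiples of an arbitrary baseline."""
    return (spatial ** 2) * (frames ** 2) * steps

base = relative_cost(spatial=1.0, frames=1.0, steps=1.0)

# Doubling spatial resolution roughly quadruples cost.
print(relative_cost(2.0, 1.0, 1.0) / base)   # 4.0

# Doubling clip length roughly quadruples cost.
print(relative_cost(1.0, 2.0, 1.0) / base)   # 4.0

# Halving the number of denoising steps roughly halves cost.
print(relative_cost(1.0, 1.0, 0.5) / base)   # 0.5
```

Only the ratios are meaningful here; absolute runtime and energy depend on hardware- and model-specific constants, which the study measures empirically rather than assuming.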