Vista: A Test-Time Self-Improving Video Generation Agent (arxiv.org)

0 points 274 days ago

🤖 AI Summary

VISTA (Video Iterative Self-improvemenT Agent) is a multi-agent, test-time system that autonomously improves text-to-video outputs by iteratively rewriting prompts. Given a user idea, VISTA first decomposes it into a structured temporal plan, generates several candidate videos, and selects the best through a robust pairwise tournament. Three specialized agents then critique the winner for visual, audio, and contextual fidelity, and a reasoning agent synthesizes those critiques to rewrite the prompt for the next generation cycle. On single- and multi-scene benchmarks, VISTA yields consistent gains where prior test-time methods were uneven, achieving up to a 60% pairwise win rate against state-of-the-art baselines and winning human preference tests 66.4% of the time.

For the AI/ML community, this demonstrates a practical, system-level route to closing the loop between generation and evaluation in the complex, multi-modal domain of video. The key technical ideas (task decomposition into temporal plans, tournament-based candidate selection, specialized critics, and a central reasoning agent) offer a blueprint for automated prompt engineering and model-agnostic improvement at test time. The approach can reduce manual prompting, improve fidelity to user intent, and potentially extend to other multi-modal generation tasks, though it adds test-time compute and system complexity that will influence deployment trade-offs.
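The loop described in the summary (plan, generate, select by tournament, critique, rewrite) can be sketched roughly as follows. This is only an illustrative skeleton, not the paper's implementation: every name here (`pairwise_tournament`, `self_improve`, and the `plan`, `generate`, `prefer`, `critics`, `rewrite` callables) is hypothetical, the single-elimination bracket is one possible reading of "pairwise tournament", and carrying the previous winner into the next round is an added assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

V = TypeVar("V")  # whatever object represents a generated video


def pairwise_tournament(candidates: Sequence[V], prefer: Callable[[V, V], bool]) -> V:
    """Single-elimination bracket; prefer(a, b) is True when a beats b."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool: List[V] = list(candidates)
    while len(pool) > 1:
        next_round: List[V] = []
        # Compare neighbours pairwise and keep each winner.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if prefer(a, b) else b)
        # With an odd pool size, the last candidate advances unopposed.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


def self_improve(
    idea: str,
    plan: Callable[[str], str],                  # idea -> structured prompt (hypothetical)
    generate: Callable[[str], V],                # prompt -> video (hypothetical)
    prefer: Callable[[V, V], bool],              # pairwise judge (hypothetical)
    critics: Sequence[Callable[[V], str]],       # e.g. visual, audio, context critics
    rewrite: Callable[[str, List[str]], str],    # reasoning step (hypothetical)
    rounds: int = 3,
    num_candidates: int = 4,
) -> Tuple[V, str]:
    """Run a fixed number of generate/select/critique/rewrite cycles."""
    if rounds < 1 or num_candidates < 1:
        raise ValueError("rounds and num_candidates must be positive")
    prompt = plan(idea)
    best: Optional[V] = None
    for _ in range(rounds):
        candidates: List[V] = [generate(prompt) for _ in range(num_candidates)]
        if best is not None:
            # Assumption: the previous winner competes again, so a bad
            # rewrite cannot make the returned result worse.
            candidates.append(best)
        best = pairwise_tournament(candidates, prefer)
        critiques = [critic(best) for critic in critics]
        prompt = rewrite(prompt, critiques)
    assert best is not None
    return best, prompt
```

In a real system, `generate` would call a text-to-video model and `prefer`, the critics, and `rewrite` would each be backed by a multimodal model. Passing them in as plain callables keeps the sketch model-agnostic, in line with the summary's framing.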
