Can today’s AI video models accurately model how the real world works? (arstechnica.com)

🤖 AI Summary
DeepMind’s new paper “Video Models are Zero-shot Learners and Reasoners” puts the Veo 3 video model through thousands of synthetic trials across dozens of tasks, spanning perception, physical modeling, manipulation, and reasoning, to test whether generative video systems are learning anything like a real-world “world model.” The authors claim impressive zero-shot abilities (solving tasks the model wasn’t explicitly trained on) and argue that video models could become unified, generalist vision foundation models. In practice, Veo 3 shines on many low- and mid-level tasks: it reliably synthesizes short action videos (e.g., robotic hands opening jars, throwing and catching) and performs near-perfectly on deblurring, denoising, inpainting, and edge detection. But those wins coexist with highly inconsistent performance on higher-level reasoning and on tasks requiring robust physical understanding, and the paper’s own evaluation suggests overall capability is still weak, framed by the authors as roughly an “8 percent” passing grade.

The takeaway for the AI/ML community is twofold: video models do capture useful dynamics and visual priors that enable strong zero-shot performance on many generative and image-restoration tasks, yet they remain brittle at sustained reasoning, long-horizon physical prediction, and generalization across diverse scenarios. Progress will hinge on tougher benchmarks, better grounding of physical priors, multimodal integration, and architectures that preserve long-term coherence, rather than on optimistic extrapolation from isolated successes.
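The evaluation protocol the summary describes (many sampled generations per synthetic trial, scored by a per-task checker) is easy to picture as a harness. Below is a minimal Python sketch, not the paper's actual code: `Trial`, `generate_video`, `is_success`, and the pass@k-style scoring are all illustrative assumptions standing in for a real video-model API and task-specific graders.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = bytes  # placeholder type for one decoded video frame


@dataclass
class Trial:
    prompt: str        # text instruction, e.g. "open the jar"
    first_frame: Frame  # conditioning image handed to the video model
    target: object     # ground truth consumed by the per-task checker


def pass_rate(
    trials: List[Trial],
    generate_video: Callable[[str, Frame], List[Frame]],  # model under test
    is_success: Callable[[Frame, object], bool],          # per-task checker
    samples_per_trial: int = 10,
) -> float:
    """Fraction of trials where at least one of k sampled videos ends in a
    frame the checker accepts (a pass@k-style score over synthetic trials)."""
    passed = 0
    for trial in trials:
        for _ in range(samples_per_trial):
            frames = generate_video(trial.prompt, trial.first_frame)
            if frames and is_success(frames[-1], trial.target):
                passed += 1
                break  # count each trial at most once
    return passed / len(trials)


# Toy usage: a stand-in "model" that echoes the input frame and a checker
# that never passes; real use would wire in an actual video-generation API.
if __name__ == "__main__":
    trials = [Trial("open the jar", b"\x00", target=None)]
    echo_model = lambda prompt, frame: [frame]
    print(pass_rate(trials, echo_model, lambda f, t: False))  # -> 0.0
```

Scoring only the final frame is one plausible simplification; tasks that test dynamics rather than end states would instead need a checker over the whole frame sequence.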