🤖 AI Summary
At the Sequoia AI Ascent 2026 conference, Nvidia's Jim Fan described a shift in robotics from Vision-Language-Action (VLA) models to what he terms World Action Models (WAMs). Fan argues that robotics is following a trajectory similar to that of large language models (LLMs): broad pre-training, then action fine-tuning, and finally reinforcement learning as a refinement stage. The WAM approach models physical interactions and actions directly, in contrast to VLAs, which Fan critiques as overly reliant on language processing.
Notably, Nvidia's WAMs build on AI-generated video models, which learn physics implicitly by predicting visual outcomes, grounding a robot's understanding of physical interaction in prediction rather than in language. Fan also argued that traditional teleoperation-based data collection is becoming obsolete; he advocates instead for egocentric video capture, which can scale dramatically by embedding data collection into everyday activity. Nvidia's new Dream Dojo platform extends this with neural simulation, letting robots learn and operate in virtual environments without hand-built physics engines. Taken together, this transition reshapes what robotic systems can do and points to an imminent evolution of the field itself.
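The core idea that a model can "learn physics by predicting outcomes" can be illustrated with a toy sketch. This is not Nvidia's method, model, or data pipeline; it is a minimal, hypothetical analogue: a linear "world model" for a 1-D point mass that, given the current state and an action, is trained only to predict the next state, and ends up recovering the underlying dynamics from prediction error alone.

```python
# Toy illustration (NOT Nvidia's WAM): a model trained purely on
# next-state prediction implicitly recovers the underlying physics.
# Here the "world" is a 1-D point mass pushed by a scalar force.
import numpy as np

rng = np.random.default_rng(0)

def step(pos, vel, action, dt=0.1):
    """Ground-truth physics: Euler integration of a forced point mass."""
    vel = vel + action * dt
    pos = pos + vel * dt
    return pos, vel

# Collect (state, action) -> next_state transitions from random interaction.
X, y = [], []
pos, vel = 0.0, 0.0
for _ in range(2000):
    action = rng.uniform(-1.0, 1.0)
    nxt_pos, nxt_vel = step(pos, vel, action)
    X.append([pos, vel, action])
    y.append([nxt_pos, nxt_vel])
    pos, vel = nxt_pos, nxt_vel

X, y = np.array(X), np.array(y)

# Fit a linear "world model" by least squares: next_state ≈ X @ W.
# No physics is hard-coded; the model only minimizes prediction error.
W, *_ = np.linalg.lstsq(X, y, rcond=None)

# The learned model now matches the true dynamics on an unseen query:
pred = np.array([1.0, 2.0, 0.5]) @ W      # state (pos=1, vel=2), action 0.5
true = np.array(step(1.0, 2.0, 0.5))
print(pred, true)
```

Real video-based world models replace the 1-D state with raw pixels and the linear map with a large generative network, but the training signal is the same in spirit: predict what happens next, and the dynamics come along for free.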