🤖 AI Summary
This year saw a leap in consumer-accessible video generation: OpenAI's Sora, Google DeepMind's Veo 3, and Runway's Gen‑4 can now produce clips that rival real footage, and Sora and Veo 3 are available to paying users inside the ChatGPT and Gemini apps. That burst of capability is already reshaping creative workflows (Netflix used AI-generated VFX in a series) while flooding feeds with both impressive content and low-quality fakes, raising concerns about misinformation, copyright, and biased training data scraped from the web. Wider availability means anyone can iterate on prompts until they get usable results, but creators must now compete with "AI slop," and generating video consumes far more energy than generating images or text.
Under the hood, most modern systems are latent diffusion transformers. Diffusion models learn to reverse noise: starting from static, they iteratively "denoise" it into an image, guided by a text encoder's embedding of the prompt so the output matches what was asked for. To make video tractable, models run diffusion in a compressed latent space (latent diffusion) to cut compute, then use transformers to model long spatiotemporal sequences so that objects, lighting, and motion stay consistent across frames. OpenAI slices video into small space-time cubes for transformer processing; DeepMind's Veo 3 compresses audio and video together so the diffusion process generates synced sound and visuals. Despite the efficiency gains from working in latent space, video generation remains compute-heavy, and researchers are now applying diffusion to text generation as an alternative to autoregressive transformers, a sign these techniques will keep spreading across multimodal AI.
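To make the denoising idea concrete, here is a minimal toy sketch of a DDPM-style reverse loop run over a compressed video latent rather than raw pixels. The shapes, noise schedule, and the stand-in `denoiser` function are illustrative assumptions, not any particular model's implementation.

```python
import numpy as np

# Toy DDPM-style reverse process over a *video latent* (latent diffusion):
# (frames, channels, height, width) in compressed latent space, far smaller
# than the raw pixel video. All values below are illustrative.
T = 50                                   # number of denoising steps
latent_shape = (16, 4, 32, 32)           # (frames, latent channels, H, W)

betas = np.linspace(1e-4, 0.02, T)       # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, t, prompt_embedding):
    """Stand-in for the learned network that predicts the noise in x_t,
    conditioned on the timestep and a text-prompt embedding.
    In a real system this would be the (latent diffusion) transformer."""
    return np.zeros_like(x_t)            # placeholder: predicts "no noise"

prompt_embedding = np.random.randn(512)  # pretend output of a text encoder

# Start from pure static and iteratively denoise toward a clean latent.
x = np.random.randn(*latent_shape)
for t in reversed(range(T)):
    eps_hat = denoiser(x, t, prompt_embedding)
    # Standard DDPM update: estimate the less-noisy latent x_{t-1} from x_t.
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
    noise = np.random.randn(*latent_shape) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * noise

# x is now a "clean" video latent; a separate decoder would map it back to
# pixels (and, in a Veo-3-style setup, to synced audio as well).
```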
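The "space-time cubes" amount to chopping that latent into small blocks spanning a few frames and a few latent pixels, then flattening each block into a token the transformer can attend over. A rough sketch, with arbitrary patch sizes:

```python
import numpy as np

def to_spacetime_patches(latent, pt=2, ph=2, pw=2):
    """Slice a video latent (frames, channels, H, W) into space-time cubes of
    pt frames x ph x pw latent pixels, flattened into a token sequence.
    Patch sizes here are arbitrary, chosen only for illustration."""
    f, c, h, w = latent.shape
    assert f % pt == 0 and h % ph == 0 and w % pw == 0
    x = latent.reshape(f // pt, pt, c, h // ph, ph, w // pw, pw)
    # Group the cube dimensions together: one token per (t, y, x) cube.
    x = x.transpose(0, 3, 5, 2, 1, 4, 6)         # (T', H', W', c, pt, ph, pw)
    tokens = x.reshape(-1, c * pt * ph * pw)     # (num_tokens, token_dim)
    return tokens

latent = np.random.randn(16, 4, 32, 32)          # same toy latent as above
tokens = to_spacetime_patches(latent)
print(tokens.shape)                              # (2048, 32)
```

Attention across this flat token sequence (plus positional information about where each cube sits in space and time) is what lets the model keep objects, lighting, and motion coherent from frame to frame.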