🤖 AI Summary
Emu3.5 is a new end-to-end multimodal "world model" that natively ingests and generates interleaved vision-and-language sequences. Trained with a unified next-token prediction objective on a corpus of more than 10 trillion tokens, drawn primarily from sequential video frames paired with their transcripts, the model produces mixed image/text streams and was further post-trained with large-scale reinforcement learning to strengthen multimodal reasoning and generation. A key systems innovation, Discrete Diffusion Adaptation (DiDA), replaces slow token-by-token autoregressive decoding with bidirectional, parallel prediction in a discrete-diffusion style, yielding roughly a 20× speedup in per-image inference with no measurable quality loss. Emu3.5 demonstrates long-horizon vision-language generation, any-to-image (X2I) synthesis, complex text-rich image creation, spatiotemporally consistent scene modeling, and open-world embodied manipulation.
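To make the decoding contrast concrete, here is a minimal toy sketch, not the paper's actual DiDA procedure, comparing token-by-token autoregressive generation with a parallel, confidence-based unmasking loop in the spirit of discrete diffusion. The model stub, token counts, and unmasking schedule are all illustrative assumptions.

```python
# Toy contrast: sequential autoregressive decoding vs. a parallel,
# discrete-diffusion-style refinement loop. Everything here (dummy_logits,
# MASK, the schedule) is an illustrative assumption, not Emu3.5's DiDA.
import numpy as np

VOCAB, IMG_TOKENS, MASK = 1024, 16, -1
rng = np.random.default_rng(0)

def dummy_logits(tokens):
    """Stand-in for the model: random per-position logits.
    A real model would condition on the full (bidirectional) context."""
    return rng.normal(size=(len(tokens), VOCAB))

def autoregressive_decode():
    # One forward pass per generated token: IMG_TOKENS sequential steps.
    out = []
    for _ in range(IMG_TOKENS):
        logits = dummy_logits(out + [MASK])
        out.append(int(logits[-1].argmax()))
    return out

def parallel_diffusion_decode(num_steps=4):
    # Start from an all-masked canvas; each pass unmasks the most
    # confident positions, so the image emerges in a few parallel
    # passes instead of IMG_TOKENS sequential ones.
    tokens = [MASK] * IMG_TOKENS
    for step in range(num_steps):
        logits = dummy_logits(tokens)
        conf = logits.max(axis=1)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        k = max(1, len(masked) // (num_steps - step))  # equal share per step
        for i in sorted(masked, key=lambda i: -conf[i])[:k]:
            tokens[i] = int(logits[i].argmax())
    return tokens

print("Autoregressive:", autoregressive_decode())
print("Parallel (DiDA-style):", parallel_diffusion_decode())
```

In a real system the number of refinement passes and the unmasking schedule set the speed/quality trade-off; the reported ~20× speedup comes from cutting the number of sequential model passes needed per image.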
For the AI/ML community this matters because Emu3.5 shows that unified, video-derived next-token training plus RL fine-tuning can produce a single model that reasons, generates, and plans across time and modality, abilities important for agents, video understanding, and consistent multimodal editing. DiDA's parallel decoding is a practical step toward real-time multimodal generation. Empirically, Emu3.5 matches Gemini 2.5 Flash Image on image generation/editing benchmarks and outperforms it on interleaved generation tasks; the authors have open-sourced the model to accelerate research and enable reproducible comparison.