🤖 AI Summary
BAAI’s Emu3.5 is a large-scale "native" multimodal world model that learns to predict the next state across vision and language through a single, unified next-token-prediction objective. Trained end-to-end on a corpus of more than 10 trillion tokens dominated by sequential video frames and their transcripts, Emu3.5 accepts interleaved vision–language inputs and directly generates interleaved vision–language outputs. It was further fine-tuned with large-scale reinforcement learning to improve multimodal reasoning and generation, and BAAI has open-sourced the model to accelerate community research.
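To make the "single next-token objective" concrete, here is a minimal sketch of a causal decoder trained over one shared discrete vocabulary that covers both visual and text tokens. The model, vocabulary size, and dimensions are illustrative assumptions, not Emu3.5's actual architecture or code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch only: one shared discrete vocabulary covering image tokens and text tokens.
VOCAB_SIZE = 65_536  # assumed size, not Emu3.5's real tokenizer

class TinyDecoder(nn.Module):
    """Toy causal decoder standing in for the unified multimodal transformer."""
    def __init__(self, vocab=VOCAB_SIZE, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                        # (batch, seq)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)                           # (batch, seq, vocab)

def next_token_loss(model, interleaved):
    """Same objective for every modality: predict token t+1 from tokens <= t."""
    logits = model(interleaved[:, :-1])
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        interleaved[:, 1:].reshape(-1),
    )

# Usage: an interleaved sequence of already-tokenized text and image ids.
batch = torch.randint(0, VOCAB_SIZE, (2, 128))
loss = next_token_loss(TinyDecoder(), batch)
loss.backward()
```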
Key technical advances include a unified architecture and tokenizer for visual and textual tokens; diverse pretraining data spanning video-interleaved, vision–text pair, any-to-image, and text-only corpora; and a novel inference speedup called Discrete Diffusion Adaptation (DiDA) that converts autoregressive token-by-token decoding into bidirectional parallel prediction, yielding roughly a 20× per-image acceleration without measurable performance loss. Emu3.5 demonstrates long-horizon vision–language generation, any-to-image (X2I) synthesis including complex text-rich images, and generalizable world-model abilities such as spatiotemporally consistent exploration and open-world embodied manipulation. Empirically it matches state-of-the-art image generation and editing (comparable to Gemini 2.5 Flash Image, a.k.a. Nano Banana) while outperforming competitors on interleaved multimodal tasks, positioning Emu3.5 as a practical foundation for embodied and interactive multimodal AI.
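The source of the speedup is amortization: instead of one forward pass per image token, a few bidirectional passes predict and refine all positions at once. The sketch below illustrates that contrast using a generic confidence-based parallel refinement loop; it shows bidirectional parallel prediction in general, with hypothetical `ar_model`/`bidir_model` callables that return per-position logits, and is not the paper's actual DiDA procedure.

```python
import torch

MASK_ID = 0  # assumed id for a "not yet decided" image token

@torch.no_grad()
def decode_autoregressive(ar_model, prefix, num_image_tokens):
    """Baseline: one forward pass per generated image token."""
    seq = prefix
    for _ in range(num_image_tokens):
        logits = ar_model(seq)                         # (1, len, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, prefix.size(1):]

@torch.no_grad()
def decode_parallel(bidir_model, prefix, num_image_tokens, steps=8):
    """All image positions predicted in parallel over a few refinement steps.

    Each step writes in the top-k most confident predictions, with k growing
    until the full image is committed, so the image costs `steps` forward
    passes instead of `num_image_tokens` passes.
    """
    canvas = torch.full((1, num_image_tokens), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = bidir_model(torch.cat([prefix, canvas], dim=1))
        probs = logits[:, prefix.size(1):].softmax(-1)
        conf, pred = probs.max(-1)                     # per-position confidence
        keep = int(num_image_tokens * (step + 1) / steps)
        top = conf.topk(keep, dim=-1).indices
        canvas.scatter_(1, top, pred.gather(1, top))
    return canvas
```

With a few dozen refinement steps replacing thousands of per-token passes, a roughly order-of-magnitude per-image speedup of the kind the summary cites for DiDA becomes plausible.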