🤖 AI Summary
Researchers introduced MMaDA-Parallel, a new class of multimodal diffusion language models that run text and image generation in parallel to avoid a key failure of sequential, autoregressive “thinking-aware” pipelines: error propagation from intermediate reasoning that becomes misaligned with the final image. To quantify this, they published ParaBench, a benchmark that evaluates both text and image outputs and shows that the degradation in prior models is strongly correlated with poor alignment between the generated reasoning and the resulting visuals. MMaDA-Parallel maintains continuous, bidirectional interaction between modalities across the entire denoising trajectory and improves Output Alignment by 6.9% over the prior state of the art, Bagel.
Technically, the model is trained with supervised fine-tuning using a uniform mask predictor that predicts masked image and text responses in parallel, then further optimized with Parallel Reinforcement Learning (ParaRL), which applies semantic rewards along the trajectory to enforce cross-modal consistency. During sampling, the model decodes image and text jointly (parallel decoding), enabling intertwined updates that keep the reasoning and the pixels synchronized. The team released code and two 8B checkpoints (MMaDA-Parallel-A using the Amused-VQ tokenizer and MMaDA-Parallel-M using Magvitv2), plus training/inference scripts (Torch 2.3.1+). Validation so far is on synthetic domains (environments, still life, architecture, landscapes); performance on real-world photos and faces remains to be tested as the datasets are expanded.
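The joint decoding loop can be pictured with a short sketch: both the reasoning text and the image tokens start fully masked, and each denoising step reveals a few high-confidence tokens in each modality from a single forward pass over the concatenated sequence, so text and image keep conditioning on each other throughout the trajectory. This is an illustrative sketch only; `model`, `MASK_ID`, the sequence lengths, and the linear unmasking budget are assumptions, not the released MMaDA-Parallel interface.

```python
# Minimal sketch of parallel text+image masked-diffusion decoding, assuming a
# hypothetical joint mask predictor `model` that returns per-token logits over
# a shared vocabulary for the concatenated sequence. Names and shapes are
# illustrative, not the released MMaDA-Parallel API.
import torch

MASK_ID = 0  # hypothetical mask-token id shared by both modalities

@torch.no_grad()
def parallel_decode(model, prompt_ids, text_len=128, image_len=1024, steps=16):
    device = prompt_ids.device
    # Start with fully masked text and image responses.
    text = torch.full((1, text_len), MASK_ID, device=device)
    image = torch.full((1, image_len), MASK_ID, device=device)

    for _ in range(steps):
        # One forward pass over the joint sequence: the model sees the prompt,
        # the partially decoded reasoning text, and the partially decoded image
        # tokens at once, so each modality conditions on the other.
        joint = torch.cat([prompt_ids, text, image], dim=1)
        logits = model(joint)  # (1, seq_len, vocab)
        p = prompt_ids.shape[1]
        text_logits = logits[:, p:p + text_len]
        image_logits = logits[:, -image_len:]

        # Reveal the most confident still-masked tokens in each modality,
        # using a simple linear per-step unmasking budget.
        for seq, lgts, length in ((text, text_logits, text_len),
                                  (image, image_logits, image_len)):
            probs, preds = lgts.softmax(-1).max(-1)      # (1, length)
            still_masked = seq.eq(MASK_ID)
            budget = min(max(1, length // steps), int(still_masked.sum()))
            conf = probs.masked_fill(~still_masked, -1.0)
            top = conf.topk(budget, dim=-1).indices
            seq.scatter_(1, top, preds.gather(1, top))   # unmask in place

    return text, image  # decoded reasoning tokens and image tokens
```

ParaRL would then attach its semantic-consistency rewards to intermediate states of this same trajectory rather than only to the final text and image, which is what keeps the two streams aligned during training.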