🤖 AI Summary
OneFlow is a new non-autoregressive multimodal model that treats sequence generation as a series of insertions of both text tokens and image embeddings, enabling variable-length, concurrent, mixed-modal generation. Unlike traditional autoregressive pipelines, OneFlow can insert and denoise images anywhere inside a text stream, supporting parallel, interleaved text–image generation. The architecture also adapts classifier-free guidance (CFG) to image understanding, letting users trade off fidelity and specificity in multimodal outputs. During sampling, OneFlow exhibits hierarchical behavior that resembles implicit visual reasoning: it constructs intermediate visual and textual steps before producing a final answer, without explicit chain-of-thought prompts.
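To make the insertion-based decoding idea concrete, here is a minimal toy sketch. It is not OneFlow's actual algorithm or training objective; `predict_insertions` and the `dummy` predictor below are hypothetical stand-ins for the model. The sketch only illustrates the control flow: start from an empty sequence and repeatedly apply batches of position–token insertions in parallel until the model proposes none.

```python
def insertion_generate(predict_insertions, max_steps=8):
    """Toy sketch of insertion-based (non-autoregressive) decoding.

    `predict_insertions` is a hypothetical stand-in for the model: it maps
    the current sequence to a list of (position, token) pairs to insert.
    Generation stops when the model proposes no further insertions, which
    is how variable-length output arises without left-to-right decoding.
    """
    seq = []
    for _ in range(max_steps):
        inserts = predict_insertions(seq)
        if not inserts:
            break
        # Apply all insertions as one parallel step; process highest
        # positions first so earlier insertion indices stay valid.
        for pos, tok in sorted(inserts, reverse=True):
            seq.insert(pos, tok)
    return seq

def dummy(seq):
    """Hypothetical predictor that grows 'a cat sat' out of order."""
    if not seq:
        return [(0, "sat")]
    if "cat" not in seq:
        return [(0, "a"), (0, "cat")]
    return []

print(insertion_generate(dummy))  # -> ['a', 'cat', 'sat']
```

In OneFlow the inserted items can be image embeddings as well as text tokens, which is what allows images to appear mid-stream rather than only at a designated position.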
In benchmarks OneFlow is competitive with state-of-the-art models for both image generation and image understanding; controlled experiments show it scales better than Transfusion (a hybrid autoregressive–diffusion approach) during multimodal pretraining. The authors also find that mixed-modal training consistently boosts performance on both generation and understanding tasks. Practical implications include faster parallel generation, richer interleaved multimodal outputs, and a new avenue for classifier-free guidance in vision-language systems, with the caveat that higher CFG weights produce longer, more detailed captions but increase hallucination risk. Overall, OneFlow points to a promising direction for efficient, flexible multimodal generation and implicit reasoning in joint image-text models.
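The CFG-weight tradeoff mentioned above follows from the standard classifier-free guidance formula, which extrapolates from unconditional toward conditional model scores. This sketch shows the generic formula only, not OneFlow's specific adaptation of it to image understanding:

```python
def cfg_logits(cond, uncond, w):
    """Classifier-free guidance: extrapolate from unconditional toward
    conditional logits with guidance weight w.
    w = 0 -> purely unconditional; w = 1 -> purely conditional;
    w > 1 -> amplifies whatever distinguishes the condition."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]

# With w > 1, the gap between conditional and unconditional scores is
# widened, which makes outputs more condition-specific and detailed but
# also over-sharpens them, the likely source of the hallucination risk.
print(cfg_logits([2.0, 0.5], [1.0, 1.0], 2.0))  # -> [3.0, 0.0]
```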