🤖 AI Summary
This paper challenges a core convention in denoising diffusion generative models: instead of training networks to predict the noise added to an image (the common epsilon-parameterization), the authors argue for directly predicting the clean image (x0). They ground this in the manifold assumption—natural images lie on a low-dimensional manifold while noised quantities do not—so predicting clean data reduces the effective dimensionality the model must learn. The result is a conceptual "back to basics" shift that lets relatively compact networks operate effectively in high-resolution, high-dimensional settings where noise-targeted models can struggle or fail.
Technically, the authors demonstrate that simple pixel-space Transformers with large patch sizes (16 and 32), with no tokenizer, no pretraining, and no auxiliary losses, dubbed JiT (Just image Transformers), are strong generative models under this clean-target diffusion paradigm. On ImageNet at 256×256 and 512×512 resolution they report competitive results and more stable behavior in regimes where predicting noised quantities is prone to catastrophic failure. For the AI/ML community this implies a pragmatic rethinking of both architecture and training objective: aligning targets with the data manifold can reduce required capacity, simplify training pipelines, and revive straightforward Transformer architectures for high-resolution image synthesis.
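For illustration, a tokenizer-free, pixel-space Transformer of this kind might look like the sketch below: raw p×p pixel patches become tokens, and a linear head regresses the clean pixels of each patch. The layer sizes, the use of `nn.TransformerEncoder`, and the timestep embedding are assumptions made for clarity, not the authors' JiT implementation.

```python
import torch
import torch.nn as nn

class PixelPatchTransformer(nn.Module):
    """Sketch of a pixel-space Transformer over large patches.

    Patchify -> linear embed (+ position and timestep embeddings)
    -> Transformer blocks -> linear head predicting clean pixels per patch.
    Sizes are illustrative, not the paper's configuration.
    """
    def __init__(self, img_size=256, patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        self.patch = patch
        n_patches = (img_size // patch) ** 2
        patch_dim = 3 * patch * patch                  # raw RGB pixels per patch
        self.embed = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, patch_dim)          # regress clean pixels (x0)

    def forward(self, x_t, t):
        b, c, h, w = x_t.shape
        p = self.patch
        # (B, C, H, W) -> (B, N, C*p*p): flatten each p x p patch into a token
        tokens = x_t.unfold(2, p, p).unfold(3, p, p)            # B, C, H/p, W/p, p, p
        tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        z = self.embed(tokens) + self.pos + self.t_embed(t.float().view(b, 1, 1))
        z = self.blocks(z)
        out = self.head(z)                                      # clean-pixel prediction per patch
        # (B, N, C*p*p) -> (B, C, H, W)
        out = out.view(b, h // p, w // p, c, p, p)
        return out.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
```

The design choice mirrored here is that there is no learned tokenizer or latent autoencoder anywhere in the loop: tokens are just pixel patches, and the output head writes clean pixels straight back into image space.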