1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations (arxiv.org)

🤖 AI Summary
VibeToken is a resolution-agnostic autoregressive (AR) approach to image synthesis that generates images at arbitrary resolutions and aspect ratios. At its core is a 1D Transformer-based image tokenizer that encodes an image into a compact sequence of 32 to 256 tokens. The generator built on it, VibeToken-Gen, produces 1024x1024 images from just 64 tokens and reaches a generation FID (gFID) of 3.94. By comparison, leading diffusion models require 1,024 tokens and score a worse (higher) gFID of 5.87. The main practical benefit is computational. In a traditional fixed-resolution AR model, compute grows quadratically with resolution, reaching roughly 11 trillion floating-point operations (FLOPs) for a 1024x1024 image. VibeToken-Gen instead holds a constant 179 billion FLOPs, a reported efficiency gain of about 63.4 times. That reduction could make AR models more practical for real-time applications and production settings.
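The efficiency figures in the summary can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable and function names are illustrative and do not come from the paper. Note that the rounded "11 trillion" figure yields a slightly lower ratio than the reported 63.4x, which suggests (as an assumption, not a confirmed detail) that the unrounded baseline is closer to 11.35 trillion FLOPs.

```python
# Figures as quoted in the summary (rounded there; names are illustrative).
FIXED_RES_AR_FLOPS = 11e12    # fixed-resolution AR model at 1024x1024
VIBETOKEN_GEN_FLOPS = 179e9   # VibeToken-Gen, constant across resolutions
REPORTED_SPEEDUP = 63.4       # efficiency ratio quoted in the summary


def flop_ratio(baseline: float, candidate: float) -> float:
    """Return how many times more FLOPs the baseline needs than the candidate."""
    return baseline / candidate


# Ratio implied by the rounded "11 trillion" figure.
ratio_from_rounded = flop_ratio(FIXED_RES_AR_FLOPS, VIBETOKEN_GEN_FLOPS)

# Baseline cost that would reproduce the reported 63.4x exactly.
implied_baseline = REPORTED_SPEEDUP * VIBETOKEN_GEN_FLOPS

# Token-count reduction versus the 1,024-token diffusion baseline.
token_reduction = 1024 / 64

print(f"ratio from rounded figures: {ratio_from_rounded:.1f}x")
print(f"baseline implied by 63.4x:  {implied_baseline / 1e12:.2f} trillion FLOPs")
print(f"token reduction:            {token_reduction:.0f}x")
```

The token count drops by a factor of 16, while the quoted FLOP saving is much larger; the summary does not break down where the rest of the saving comes from.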