Google PASTA: A collaborative approach to image generation (research.google)

🤖 AI Summary
Google unveiled PASTA (Preference Adaptive and Sequential Text-to-image Agent), an RL agent that turns image generation into a multi-turn collaborative conversation: it proposes slate-based prompt expansions, shows four candidate images, and adapts to the user's selections to converge on their intent. To train this behavior at scale, researchers collected a foundational dataset of over 7,000 sequential human-rater interactions (prompt expansions from Gemini Flash, images rendered with SDXL) and used it to build a user simulator that generated more than 30,000 synthetic interaction trajectories.

PASTA's user model combines a utility predictor with a choice model, both built on CLIP encoders and trained with an expectation-maximization procedure that discovers latent "user types," letting the agent predict preferences and personalize suggestions on the fly. The agent itself is value-based, trained with implicit Q-learning (IQL) to select optimal slates of prompt expansions from a candidate generator; its objective is to maximize cumulative user satisfaction across turns.

Ablations showed that simulated data alone didn't beat the baseline, and real data alone improved some metrics without surpassing it; training on a mix of real and simulated interactions produced the best results. In head-to-head evaluations, raters preferred PASTA's final images 85% of the time, notably for abstract prompts. Google open-sourced both the sequential rater dataset and the simulated dataset, offering a practical framework for making generative models more interactive, preference-aware, and robust to diverse user intents.
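The two core mechanisms named above, IQL's expectile-regression loss and greedy slate selection, can be sketched in a few lines. The expectile loss is the standard IQL value objective; the additive slate scoring and all function names here are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss used by IQL's value network.

    `diff` is Q(s, a) - V(s). Positive residuals (value under-estimates)
    are weighted by tau, negative ones by 1 - tau, so for tau > 0.5 the
    fit skews toward an upper expectile of the Q distribution.
    """
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))

def select_slate(q_values, candidate_slates):
    """Greedily pick the slate of prompt expansions with the highest
    summed Q-values (a simplifying assumption about how PASTA scores
    slates; the actual scoring may account for within-slate diversity).
    """
    scores = [sum(q_values[i] for i in slate) for slate in candidate_slates]
    return candidate_slates[int(np.argmax(scores))]
```

For example, with per-expansion values `[0.1, 0.9, 0.5]` and candidate slates `(0, 1)` and `(1, 2)`, the second slate wins with a score of 1.4. The asymmetric weighting is what lets IQL estimate the value of near-optimal actions from logged interactions without ever querying out-of-dataset actions.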