🤖 AI Summary
Mano is a new GUI automation agent built on a multi-modal foundation model pre-trained on large-scale web and computer-system data. It is designed to overcome the shortcomings general-purpose VLMs show when interacting with graphical user interfaces: low input resolution, domain mismatch, and weak sequential decision-making. The authors pair this backbone with a high-fidelity simulated environment for synthetic data generation, a three-stage training pipeline (supervised fine-tuning → offline reinforcement learning → online reinforcement learning), and a verification/error-recovery module that detects and corrects failures during execution.
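To make the staged pipeline concrete, here is a minimal Python sketch of how SFT, offline RL, and online RL could be chained over the same policy object. The `Trajectory`, `StubPolicy`, and `train_three_stage` names and the reward-weighted update are illustrative assumptions, not Mano's actual training code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Illustrative stand-ins only: the real system fine-tunes a multi-modal backbone;
# this stub "policy" just records which stage touched which data.

@dataclass
class Trajectory:
    observations: List[str]      # e.g. screenshots / UI trees (simplified to strings)
    actions: List[str]           # e.g. click / type / scroll commands
    reward: float = 0.0          # task-level outcome signal

@dataclass
class StubPolicy:
    updates: List[str] = field(default_factory=list)

    def sft_update(self, traj: Trajectory) -> None:
        self.updates.append(f"sft on {len(traj.actions)} steps")

    def rl_update(self, traj: Trajectory, weight: float) -> None:
        self.updates.append(f"rl (w={weight:.2f}) on {len(traj.actions)} steps")

def train_three_stage(
    policy: StubPolicy,
    demos: List[Trajectory],                    # curated / simulator-generated demonstrations
    logged: List[Trajectory],                   # previously collected trajectories with rewards
    rollout: Callable[[StubPolicy], Trajectory],  # runs the current policy in the simulator
    online_iters: int = 3,
) -> StubPolicy:
    # Stage 1: supervised fine-tuning on demonstrations (behavior cloning).
    for traj in demos:
        policy.sft_update(traj)
    # Stage 2: offline RL on logged trajectories, weighted by task outcome.
    for traj in logged:
        policy.rl_update(traj, weight=traj.reward)
    # Stage 3: online RL, collecting fresh rollouts in the simulator and updating on them.
    for _ in range(online_iters):
        traj = rollout(policy)
        policy.rl_update(traj, weight=traj.reward)
    return policy

if __name__ == "__main__":
    demo = Trajectory(["home screen"], ["click('Login')"], reward=1.0)
    fake_rollout = lambda p: Trajectory(["login form"], ["type('user')"], reward=0.5)
    trained = train_three_stage(StubPolicy(), [demo], [demo], fake_rollout)
    print(trained.updates)
```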
The result is state-of-the-art performance on GUI benchmarks such as Mind2Web and OSWorld, with notable gains in task success rate and operational accuracy. Technically, Mano demonstrates how domain-specific pretraining, iterative RL refinement (offline then online), and carefully designed holistic rewards can close the gap between VLM perception and multi-step control. The verification module and simulation-driven dataset are particularly significant: they reduce brittleness in real-world workflows and make RL-based recovery practical. For the AI/ML community, Mano highlights a replicable recipe for deploying multi-modal models in interactive, sequential decision-making tasks and underscores the value of pairing simulation with staged RL to produce robust, deployable GUI agents.
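As a rough illustration of how per-step verification and a blended ("holistic") reward might interact at execution time, here is a minimal sketch; the hook names (`observe`, `propose_action`, `apply_action`, `verify`, `task_succeeded`) and the particular reward mix are assumptions, not details from the paper.

```python
from typing import Callable, List, Tuple

# Minimal sketch (not Mano's code) of a verified execution loop that also yields
# a blended reward mixing step-level verification accuracy with final task success.

def run_with_verification(
    goal: str,
    observe: Callable[[], str],                 # capture current GUI state
    propose_action: Callable[[str, str], str],  # agent: (goal, observation) -> action or "DONE"
    apply_action: Callable[[str], None],        # execute the action on the GUI
    verify: Callable[[str, str, str], bool],    # did (before, action, after) have the intended effect?
    task_succeeded: Callable[[str], bool],      # final check on the end state
    max_steps: int = 20,
    step_weight: float = 0.3,                   # assumed mixing coefficient, not from the paper
) -> Tuple[List[str], float]:
    executed: List[str] = []
    verified = 0
    attempted = 0
    for _ in range(max_steps):
        before = observe()
        action = propose_action(goal, before)
        if action == "DONE":
            break
        apply_action(action)
        after = observe()
        attempted += 1
        if verify(before, action, after):
            verified += 1
            executed.append(action)
        # On a failed check the loop simply continues: the next iteration
        # re-observes and re-plans, which is the error-recovery behavior.
    step_accuracy = verified / attempted if attempted else 0.0
    success = 1.0 if task_succeeded(observe()) else 0.0
    holistic_reward = step_weight * step_accuracy + (1.0 - step_weight) * success
    return executed, holistic_reward
```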