🤖 AI Summary
Google today released the Gemini 2.5 Computer Use model, a specialized variant of Gemini 2.5 Pro that lets agents interact directly with graphical user interfaces (web and mobile) via a new computer_use tool in the Gemini API. The model yields function-style outputs (click, type, scroll, drag, etc.) after analyzing a user request, a screenshot of the UI, and a short action history; client-side code executes the action and returns a new screenshot and URL, creating an iterative loop until the task completes. Google reports that Gemini 2.5 Computer Use outperforms competing systems on multiple web and mobile control benchmarks (including Browserbase’s Online-Mind2Web) while delivering lower latency, and shows strong promise for mobile UIs (not yet optimized for desktop OS control).
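The screenshot→action→execute loop described above can be sketched roughly as follows. This is a minimal illustration, not the real Gemini API: `model_step` and `execute_action` are hypothetical stand-ins for the model call and the client-side executor (which in practice might drive Playwright or Browserbase), and the action names are assumed examples.

```python
from dataclasses import dataclass

# Hypothetical action record; the real computer_use tool emits
# function-style outputs such as click, type, scroll, drag.
@dataclass
class Action:
    name: str   # e.g. "click", "type", "done"
    args: dict

def model_step(request, screenshot, history):
    # Stand-in for the model call: here it just follows a scripted
    # plan, indexed by how many actions have already been taken.
    plan = [
        Action("click", {"x": 120, "y": 300}),
        Action("type", {"text": request}),
        Action("done", {}),
    ]
    return plan[len(history)]

def execute_action(action, state):
    # Stand-in for client-side execution (e.g. via Playwright):
    # apply the action, then return a fresh "screenshot" and URL.
    state["log"].append(action.name)
    return f"screenshot-after-{action.name}", state["url"]

def run_agent(request):
    state = {"url": "https://example.com", "log": []}
    screenshot, history = "initial-screenshot", []
    # Iterate until the model signals that the task is complete.
    while True:
        action = model_step(request, screenshot, history)
        if action.name == "done":
            return state["log"]
        screenshot, _url = execute_action(action, state)
        history.append(action)

print(run_agent("gemini 2.5"))  # ['click', 'type']
```

The key design point is that the model never touches the UI itself: each turn it only sees the request, the latest screenshot, and a short action history, and the client owns execution and state.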
Technically, the API lets developers exclude or extend the set of UI functions and require end-user confirmation for sensitive steps (e.g., purchases), and it integrates with developer tooling like Playwright and Browserbase for local or cloud execution. Safety is emphasized: safety behavior is trained into the model and reinforced by an out-of-model per-step safety service, plus developer-configurable system instructions that block or require confirmation for high-risk actions (bypassing CAPTCHAs, compromising security, controlling medical devices, etc.). The model is available in public preview via the Gemini API on Google AI Studio and Vertex AI, and early deployments already power UI testing, automation agents, and several internal Google projects.
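A developer-side confirmation gate of the kind described could look like the sketch below. The policy sets and action names are purely illustrative assumptions; the actual API expresses these controls through its own configuration and system instructions.

```python
# Hypothetical policy tables (illustrative names, not the real API):
# actions that are always refused vs. ones deferred to the end user.
BLOCKED = {"bypass_captcha", "control_medical_device"}
NEEDS_CONFIRMATION = {"purchase", "send_email"}

def gate_action(action_name, confirm):
    """Return True if the action may proceed.

    confirm: a callable that asks the end user to approve a
    sensitive step and returns their decision as a bool.
    """
    if action_name in BLOCKED:
        return False  # high-risk: never execute
    if action_name in NEEDS_CONFIRMATION:
        return confirm(action_name)  # sensitive: ask the user
    return True  # ordinary UI action: proceed

# A benign action passes, a blocked one never runs, and a
# sensitive one defers to the user's answer.
print(gate_action("click", confirm=lambda a: True))           # True
print(gate_action("bypass_captcha", confirm=lambda a: True))  # False
print(gate_action("purchase", confirm=lambda a: False))       # False
```

Such a gate would sit between the model's proposed action and the client-side executor, complementing the in-model safety training and the per-step safety service the post describes.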