🤖 AI Summary
Google DeepMind’s new gemini-2.5-computer-use-preview-10-2025 was put through real-world web tests by Stagehand and Browserbase, and the results are striking: across ~200 experiment runs (~4,000 browser hours), the Gemini 2.5 computer-use model outperformed models from other major providers on accuracy, speed, and cost, with every model accessed through its public API under identical constraints (OnlineMind2Web capped at 50 steps, WebVoyager at 75). Browserbase’s cloud browser infrastructure and observability let the team parallelize runs, condensing what would be ~18 browser hours of sequential execution into about 20 minutes, so iteration, failure inspection, and large-scale evals become practical rather than months-long projects.
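Conceptually, that speedup comes from fanning benchmark tasks out across many concurrent cloud browser sessions rather than running them one at a time. The TypeScript sketch below illustrates the idea only; the `EvalTask` shape and `runEval` stub are hypothetical stand-ins, and the real scheduling is handled by the Stagehand Evals CLI against Browserbase sessions.

```typescript
// Illustrative fan-out of benchmark tasks across concurrent browser sessions.
// The task shape and runEval stub are hypothetical; Stagehand's Evals CLI
// handles the real scheduling against Browserbase.
type EvalTask = { id: string; url: string; instruction: string; maxSteps: number };
type EvalResult = { id: string; success: boolean };

// Stub standing in for "launch a cloud browser session and run one eval".
async function runEval(task: EvalTask): Promise<EvalResult> {
  console.log(`running ${task.id} on ${task.url} (max ${task.maxSteps} steps)`);
  return { id: task.id, success: true };
}

// Run tasks in chunks so at most `concurrency` sessions are live at once.
// Roughly 50x concurrency is what turns ~18 browser hours into ~20 minutes.
async function runBatch(tasks: EvalTask[], concurrency: number): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (let i = 0; i < tasks.length; i += concurrency) {
    const chunk = tasks.slice(i, i + concurrency);
    results.push(...(await Promise.all(chunk.map(runEval))));
  }
  return results;
}
```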
The report also highlights the core challenges of web-agent benchmarking: site volatility, captchas, cookie popups, and obsolete tasks. Rather than quietly pruning broken tasks, it argues for transparency. To that end, the team published 3,772 human-verified evals with traces and scores, open-sourced the Stagehand Evals CLI, and shipped day-one TypeScript/Python support and templates for the new model so others can reproduce, inspect, and compare agents. The practical implications: builders can deploy more capable, faster browser agents sooner, while the community gets reproducible tooling and raw data to stabilize benchmarks and accelerate agentic browsing development.
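As a concrete picture of what that day-one support enables, a minimal Stagehand session pointed at the new model might look like the sketch below. This assumes Stagehand's documented configuration pattern; the exact `modelName` string is an inference from the report's model identifier, so check the published templates for the confirmed value.

```typescript
import { Stagehand } from "@browserbasehq/stagehand";

// Minimal sketch of a Stagehand session using the new Gemini model.
// The modelName value is an assumption inferred from the report's model
// string; the published templates carry the exact configuration.
async function main() {
  const stagehand = new Stagehand({
    env: "BROWSERBASE", // run on Browserbase's cloud browsers, not locally
    modelName: "google/gemini-2.5-computer-use-preview-10-2025", // assumed identifier
  });
  await stagehand.init();

  const page = stagehand.page;
  await page.goto("https://news.ycombinator.com");
  // Natural-language action: the model grounds this to a concrete UI interaction.
  await page.act("click the first story on the page");

  await stagehand.close();
}

main();
```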