Benchmarking leading AI agents against Google reCAPTCHA v2 (research.roundtable.ai)

🤖 AI Summary
Researchers benchmarked three state-of-the-art AI agents, Anthropic's Claude Sonnet 4.5, Google's Gemini 2.5 Pro, and OpenAI's GPT-5, on Google reCAPTCHA v2 using the Browser Use framework. Across 388 image attempts over 75 trials on Google's reCAPTCHA demo page, Claude Sonnet 4.5 led with a 60% overall success rate, Gemini 2.5 Pro followed at 56%, and GPT-5 lagged at 28%.

Performance varied sharply by challenge type: all three agents did best on Static challenges and worst on Cross-tile challenges. Per-type rates were 47.1% static / 21.2% reload / 0.0% cross-tile for Sonnet; 56.3% / 13.3% / 1.9% for Gemini; and 22.7% / 2.1% / 1.1% for GPT-5.

The study attributes GPT-5's poor showing not to perception alone but to agentic behavior: long, verbose "Thinking" traces, slow stepwise reasoning, and repetitive edits led to timeouts and failure loops, especially when reCAPTCHA refreshed images (Reload) or required handling occluded, boundary-spanning objects (Cross-tile).

These dynamics point to two lessons for AI/ML practitioners. First, deep reasoning without fast, confident action can perform worse than shallower but decisive policies in real-time, interactive settings. Second, agent architectures (memory/state handling, planning/verification loops) must be robust to dynamic web interfaces. For CAPTCHA designers, the results suggest perceptual robustness remains useful; for agent developers, optimizing latency, action commitment, and handling of changing state is critical for reliable agentic performance.
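For context, a run in this style can be set up with the open-source browser-use Python package (the Browser Use framework the study names). The sketch below is an assumption about what such a harness looks like, not the authors' exact code: the task prompt, model choice, and step budget are illustrative, and browser-use API details vary across versions.

```python
# Minimal sketch of a browser agent run against Google's public reCAPTCHA demo,
# assuming the browser-use package and a LangChain chat model. Illustrative only.
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI


async def main() -> None:
    agent = Agent(
        # Hypothetical task prompt; the study's actual instructions are not shown here.
        task=(
            "Go to https://www.google.com/recaptcha/api2/demo, "
            "solve the reCAPTCHA challenge, and submit the form."
        ),
        llm=ChatOpenAI(model="gpt-4o"),  # swap in whichever model is under test
    )
    # run() drives an observe-think-act loop over the live page; slow or
    # indecisive steps here are exactly where the study reports timeouts.
    history = await agent.run(max_steps=25)
    print("Final result:", history.final_result())


if __name__ == "__main__":
    asyncio.run(main())
```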
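The latency and action-commitment lesson also suggests a generic mitigation pattern, again not taken from the paper: give each reasoning step a hard deadline and fall back to a cheap, decisive default action when the model's proposal misses it, rather than stalling while the challenge state changes underneath the agent. The helper and fallback names below are hypothetical.

```python
# Generic per-step deadline pattern (illustrative, not from the study): prefer a
# committed default action over an unbounded "Thinking" stall.
import asyncio
from typing import Awaitable, Callable


async def act_with_deadline(
    propose_action: Callable[[], Awaitable[str]],
    fallback_action: str,
    deadline_s: float = 5.0,
) -> str:
    """Return the model's proposed action, or the fallback if it misses the deadline."""
    try:
        return await asyncio.wait_for(propose_action(), timeout=deadline_s)
    except asyncio.TimeoutError:
        # Commit to something decisive instead of looping on a refreshed page.
        return fallback_action


async def demo() -> None:
    async def slow_reasoner() -> str:
        await asyncio.sleep(10)  # stands in for a long reasoning trace
        return "click tile 7"

    action = await act_with_deadline(slow_reasoner, fallback_action="submit")
    print(action)  # prints "submit": the deadline fired first


if __name__ == "__main__":
    asyncio.run(demo())
```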