The bitter lesson for web agents (yutori.com)

🤖 AI Summary
Researchers argue that modern web agents that act like humans, by visually rendering pages and reading screenshots, generalize far better than DOM-only agents because the web's implementations are extremely heterogeneous. They demonstrate this with concrete failures: nba.com is a React SPA that fills in scores via an AJAX JSON endpoint after the initial HTML load (so the initial DOM contains only a skeleton), while a Squarespace product page hides stock information inside a server-rendered JSON blob that is not visually obvious in the raw DOM. Heuristic DOM parsing (looking for <button>/<a> tags, onclick handlers, ARIA roles, and cursor styles) also misses real interactive elements; for example, an arXiv "Export BibTeX" control is implemented as a non-interactive <span>. Multimodal embedding experiments further show that DOM outliers (unusual HTML) do not always correspond to visual outliers, so DOM structure is a brittle signal. The practical costs are also large: sampling 250 pages produced ~39.6M GPT-token-equivalent DOM tokens (avg ~158k tokens per page; max ~962k), implying roughly 18M tokens per day for a Scout agent at 115 interactions/day, an expensive and fragile pipeline. The takeaway for the AI/ML community is clear: DOM parsing buys structure but is costly and brittle at scale, while vision-based, human-like interaction and multimodal models yield more robust generalization across the "Cambrian Explosion" of web implementations. Yutori positions its Scouts product around this insight, favoring visual agents to minimize site-specific fixes and scale reliably.
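
To make the heuristic-parsing failure concrete, here is a minimal sketch of the kind of tag-and-attribute rules described above and how a JavaScript-wired control slips past them. Python and BeautifulSoup are assumptions for illustration, as are the rule set and the class name in the HTML snippet; the post's actual code is not shown in the summary.

```python
from bs4 import BeautifulSoup

# Hypothetical DOM heuristics of the kind the summary describes: treat an element
# as interactive if it is a <button>/<a>-style tag, has an onclick attribute, an
# interactive ARIA role, or a pointer cursor. Illustrative only, not the post's code.

INTERACTIVE_TAGS = {"button", "a", "input", "select", "textarea"}
INTERACTIVE_ROLES = {"button", "link", "checkbox", "menuitem", "tab"}

def looks_interactive(el) -> bool:
    if el.name in INTERACTIVE_TAGS:
        return True
    if el.get("onclick"):
        return True
    if el.get("role") in INTERACTIVE_ROLES:
        return True
    if "cursor: pointer" in (el.get("style") or ""):
        return True
    return False

# A control like arXiv's "Export BibTeX", rendered as a bare <span> whose click
# handler is attached later in JavaScript, defeats every rule above.
el = BeautifulSoup('<span class="bib-export">Export BibTeX</span>', "html.parser").span
print(looks_interactive(el))  # False: the heuristic misses a genuinely clickable control
```

A vision-based agent sidesteps this entirely by acting on the rendered page, where the control looks clickable regardless of how the markup is written.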