🤖 AI Summary
Stanford’s team announced Agent S3, a streamlined successor to Agent S2 that raises computer-use agent performance on OSWorld from 48.8% to 62.6% for single-run, 100-step tasks, surpassing the previous state of the art of 61.4% set by Claude Sonnet 4.5. With the newly introduced Behavior Best-of-N (bBoN), accuracy climbs to 69.9%, within a few points of human-level performance (72%). S3 removes the prior hierarchical manager–worker overhead, adds a native coding agent for generating and executing code, and improves single-run reliability by about 13%. bBoN also generalizes across environments: WindowsAgentArena rises from 50.2% to 56.6% and AndroidWorld from 68.1% to 71.6%.
The technical novelty is bBoN: instead of trusting one rollout, S3 runs multiple diverse executions, compresses each run into concise "facts" that form a behavior narrative, and uses a fact-grounded judge to compare the narratives and pick the best attempt. This targets the core bottleneck for computer-use agents (CUAs), namely high variance in long-horizon tasks where small errors compound, and it demonstrates a practical scaling axis beyond bigger models: diversity and selection of agent behaviors. Results improve with more runs (best with 10 runs, where GPT-5 reaches 69.9% and GPT-5 Mini 60.2%), showing that agentic systems can gain reliability and real-world utility through ensemble-style behavioral selection and interpretable judging.
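The selection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Agent S3's implementation: the `Step` type, `to_narrative`, `behavior_best_of_n`, and `keyword_judge` are all hypothetical names, and the toy keyword judge merely stands in for the model-based, fact-grounded judge the summary mentions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    """One agent step (hypothetical structure, for illustration only)."""
    action: str       # what the agent did
    observation: str  # what changed on screen afterwards


def to_narrative(steps: Sequence[Step]) -> List[str]:
    """Compress a rollout into one short 'fact' per step."""
    return [f"{s.action} -> {s.observation}" for s in steps]


def behavior_best_of_n(
    rollouts: Sequence[Sequence[Step]],
    judge: Callable[[str, List[List[str]]], int],
    task: str,
) -> int:
    """Return the index of the rollout the judge prefers."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    narratives = [to_narrative(r) for r in rollouts]
    choice = judge(task, narratives)
    if not 0 <= choice < len(rollouts):
        raise ValueError("judge returned an out-of-range index")
    return choice


def keyword_judge(task: str, narratives: List[List[str]]) -> int:
    """Toy stand-in for a model-based judge (assumption, not the real judge).

    Scores each narrative by how many of its facts mention a task word,
    then returns the index of the highest-scoring narrative.
    """
    words = {w.lower() for w in task.split()}

    def score(facts: List[str]) -> int:
        return sum(any(w in f.lower() for w in words) for f in facts)

    scores = [score(n) for n in narratives]
    return scores.index(max(scores))


# Two hypothetical rollouts for the same task; only the second saves the file.
rollouts = [
    [Step("click File", "menu opened"), Step("click Close", "window closed")],
    [
        Step("click File", "menu opened"),
        Step("click Save As", "dialog opened"),
        Step("type report.pdf and press Enter", "report.pdf saved"),
    ],
]
best = behavior_best_of_n(rollouts, keyword_judge, "save report.pdf")
print(f"selected rollout: {best}")
```

Passing the judge in as a callable keeps the selection logic separate from however the comparison is actually made, so a real system could swap in a model-backed judge that reads all narratives side by side without changing the surrounding loop.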