What does OSWorld tell us about AI's ability to use computers? (epoch.ai)

🤖 AI Summary
OSWorld is a 361-task benchmark that measures an AI's ability to perform everyday computer tasks inside an Ubuntu VM. Models act by emitting Python/pyautogui code, and a task counts as solved if the target machine state is reached, by any route; non-GUI work is not penalized. Most tasks are short and realistic: the median task takes about 6 atomic actions, only ~12% require more than 20 steps, and the set spans browsers, text editors, spreadsheets, image editors, and multi-app workflows.

Several technical details matter for interpreting scores. Roughly 15% of tasks are terminal-only, and ~30% can be solved by substituting terminal use or Python scripting (e.g., openpyxl or pandas) for GUI actions. About 8% of tasks are intentionally impossible, ~10% rely on live web data, and roughly 10% contain serious errors. The authors have actively revised the benchmark, with a major July update plus further edits touching ~10% of tasks.

The implications cut both ways. Saturating OSWorld would indicate competence at simple, realistic Linux/open-source workflows, not necessarily genuine GUI-driven desktop fluency or cross-OS skill. Reported leaderboard gains can reflect better terminal, scripting, or code-execution tooling, and/or benchmark edits and ambiguity resolution, because many tasks are under-specified and scores hinge partly on how instructions are interpreted. OSWorld is useful but imperfect: its instability, live-data dependency, scripting shortcuts, and ambiguous instructions mean progress on it should be read cautiously, and building a realistic, rigorous computer-use benchmark remains a hard research problem.
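To make the interaction model concrete: an OSWorld agent's turn amounts to emitting pyautogui calls against the VM's screen, and the grader then inspects the resulting machine state rather than the click trace. Below is a minimal sketch of such an action sequence; the coordinates and URL are purely illustrative, not from the article.

    import pyautogui

    # Click where the browser's address bar happens to be (illustrative coordinates)
    pyautogui.click(400, 60)

    # Type a URL character by character, then navigate
    pyautogui.write("https://example.com", interval=0.05)
    pyautogui.press("enter")

    # Ctrl+S to save the page; the evaluator later checks the file on disk,
    # not whether these particular clicks and keystrokes were used
    pyautogui.hotkey("ctrl", "s")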
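The scripting shortcut is easiest to see with a spreadsheet task. Because only the final file state is checked, an agent can skip the spreadsheet GUI entirely and edit the workbook with openpyxl. A hedged sketch follows, assuming a hypothetical task ("bold the header row of budget.xlsx and append a total for column C") with a numeric column C; the filename and layout are invented for illustration.

    from openpyxl import load_workbook
    from openpyxl.styles import Font

    wb = load_workbook("budget.xlsx")  # hypothetical task file
    ws = wb.active

    # Bold every cell in the header row (row 1)
    for cell in ws[1]:
        cell.font = Font(bold=True)

    # Sum column C below the header and append the total as a new row
    total = sum(c.value or 0 for c in ws["C"][1:])
    ws.cell(row=ws.max_row + 1, column=3, value=total)

    wb.save("budget.xlsx")

A GUI agent would need a dozen clicks and keystrokes for the same result; under OSWorld's state-based grading both routes score identically, which is why roughly 30% of tasks admit this kind of substitution.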