🤖 AI Summary
Mind2Web presents a multimodal dataset and annotation protocol aimed at building generalist web agents that can perform user tasks and generalize across pages, websites, and domains. Each example pairs a natural-language task description with a precise action sequence of (Operation, Target Element) steps, drawn from four primitive operations (Click, Hover, Type, and Select), plus rich webpage snapshots (MHTML, a DOM snapshot with layout and style information, and a screenshot), HAR files for network replay, and a complete interaction trace. The dataset is explicitly organized to evaluate cross-task, cross-website, and cross-domain generalization, letting researchers measure how agents transfer learned behaviors to new tasks on the same sites, to new websites within a seen domain, and to entirely new domains.
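To make the (Operation, Target Element) format concrete, here is a minimal Python sketch of how one example could be represented. The class and field names (`ActionStep`, `target_element`, and so on) are illustrative, not the dataset's actual schema, and the sample task is hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Operation(Enum):
    """The four primitive operations used in Mind2Web action steps."""
    CLICK = "CLICK"
    HOVER = "HOVER"
    TYPE = "TYPE"
    SELECT = "SELECT"

@dataclass
class ActionStep:
    """One (Operation, Target Element) step. The target is identified at the
    element level; `value` carries text for TYPE or the option for SELECT."""
    operation: Operation
    target_element: str          # e.g. a serialized DOM locator (illustrative)
    value: Optional[str] = None

@dataclass
class Mind2WebExample:
    """Illustrative container for one task; the real dataset additionally ships
    MHTML, DOM snapshots with layout/style, screenshots, HAR files, and a trace."""
    task_description: str
    website: str
    domain: str
    actions: list[ActionStep] = field(default_factory=list)

# Hypothetical example in the spirit of the dataset's tasks.
example = Mind2WebExample(
    task_description="Find one-way flights from New York to Toronto.",
    website="example-airline.com",
    domain="Travel",
    actions=[
        ActionStep(Operation.CLICK, "input#tripTypeOneWay"),
        ActionStep(Operation.TYPE, "input#originAirport", value="New York"),
    ],
)
```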
Data were collected on Amazon Mechanical Turk in three phases: workers first propose feasible tasks, then demonstrate them while a Playwright-based annotation tool records interaction traces and page snapshots, and finally the authors verify that the recorded actions and task descriptions are correct. Technically, Mind2Web enables research on element-grounded language understanding, imitation learning and offline RL for web control, multimodal state representations (DOM, pixels, and network traffic), and reproducible environment replay via HAR files and interaction traces. By providing element-level action targets and diverse snapshot modalities, it supports training agents that must reason about layout, semantics, and dynamic page behavior, all key steps toward robust, generalist web automation and assistant systems.
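Because actions are grounded at the element level and HAR files capture network traffic, a recorded demonstration can in principle be replayed offline. Below is a minimal sketch using Playwright's Python API and reusing the `Operation`/`ActionStep` sketch above; the HAR path, start URL, and selectors are placeholders, and this is one plausible replay harness rather than the authors' actual tooling.

```python
from playwright.sync_api import sync_playwright

def replay(steps, har_path="session.har", start_url="https://example.com"):
    """Replay (Operation, Target Element) steps against a page whose network
    traffic is served from a recorded HAR file instead of the live site."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        # Serve matching requests from the HAR recording rather than the network.
        context.route_from_har(har_path)
        page = context.new_page()
        page.goto(start_url)
        for step in steps:
            sel = step.target_element
            if step.operation is Operation.CLICK:
                page.click(sel)
            elif step.operation is Operation.HOVER:
                page.hover(sel)
            elif step.operation is Operation.TYPE:
                page.fill(sel, step.value)
            elif step.operation is Operation.SELECT:
                page.select_option(sel, step.value)
        browser.close()

replay(example.actions)
```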