Open Data Synthesis for Deep Research (arxiv.org)

🤖 AI Summary
Researchers introduce InfoSeek, a scalable synthetic-data framework designed to teach LLMs to do "Deep Research"—multi-step question answering that requires decomposing problems, coordinating hierarchical reasoning, and synthesizing evidence into verifiable answers. They formalize these tasks as Hierarchical Constraint Satisfaction Problems (HCSPs), arguing that existing QA benchmarks (e.g., Natural Questions, HotpotQA) and many synthetic datasets either lack the structural depth and verifiability that real research-style queries demand or offer shortcuts that sidestep hierarchical reasoning. InfoSeek builds "Research Trees" by having two agents recursively extract and blur intermediate nodes from large-scale webpages, then converts those trees into natural-language questions that force models to traverse the full hierarchy.

Technically, InfoSeek produces over 50K training examples, a curated test set, and reasoning trajectories generated through rejection sampling, while preserving meta-information such as intermediate steps and retrieval labels. Models fine-tuned on InfoSeek outperformed strong baselines: notably, 3B models optimized with InfoSeek beat much larger 32B models and some commercial APIs (e.g., Gemini 2.5 Flash) on the BrowseComp-Plus benchmark, approaching the performance of top-tier APIs (e.g., Gemini 2.5 Pro). By exposing structured trajectories and retrieval signals, InfoSeek also enables advanced optimizations—compound reward design and trajectory-level exploration—making it a practical resource for pushing LLMs toward robust, verifiable multi-step research reasoning.
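To make the HCSP framing concrete, here is a minimal sketch (not the paper's actual implementation; all names, the post-order step labeling, and the `alpha` weighting are illustrative assumptions) of a Research Tree node, where a parent entity is only identifiable once its child sub-questions are resolved, together with a toy compound reward that scores the final answer alongside intermediate steps:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ResearchNode:
    """A node in a hypothetical Research Tree: a hidden entity plus
    clues that only pin it down once all child sub-questions are
    solved — the hierarchical-CSP property."""
    entity: str
    clues: List[str] = field(default_factory=list)
    children: List["ResearchNode"] = field(default_factory=list)

def depth(node: ResearchNode) -> int:
    """Tree depth = number of reasoning hops a solver must chain."""
    return 1 + max((depth(c) for c in node.children), default=0)

def intermediate_answers(node: ResearchNode) -> List[str]:
    """Post-order list of entities: the step-level labels that
    InfoSeek-style data would preserve alongside the final answer."""
    out: List[str] = []
    for c in node.children:
        out.extend(intermediate_answers(c))
    out.append(node.entity)  # root entity (final answer) comes last
    return out

def compound_reward(pred_steps, gold_steps, alpha=0.5):
    """Illustrative compound reward: weight final-answer correctness
    against the fraction of correct intermediate steps."""
    final_ok = 1.0 if pred_steps[-1] == gold_steps[-1] else 0.0
    hits = sum(p == g for p, g in zip(pred_steps[:-1], gold_steps[:-1]))
    inter_frac = hits / max(len(gold_steps) - 1, 1)
    return alpha * final_ok + (1 - alpha) * inter_frac
```

For example, a root question about entity `"X"` constrained by sub-answers `"A"` and `"B"` (the latter itself depending on `"C"`) yields a depth-3 tree whose intermediate labels can score a model's full trajectory, not just its final answer.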