BenchmarkQED: Automated Benchmarking of RAG Systems (www.microsoft.com)

🤖 AI Summary
Microsoft Research has released BenchmarkQED, an open-source suite that automates large-scale benchmarking of retrieval-augmented generation (RAG) systems. BenchmarkQED bundles three components: AutoQ for synthesizing queries across a local-to-global spectrum, AutoE for LLM-as-a-Judge evaluation on quality metrics (comprehensiveness, diversity, empowerment, and relevance, plus correctness when ground truth exists), and AutoD for dataset sampling and summarization that aligns topical breadth and depth. The toolkit integrates with GraphRAG-style methods and supports rigorous, counterbalanced comparisons (pairwise LLM judgments aggregated as win rates), enabling repeatable evaluation across models, metrics, and datasets.

In experiments on an AP News dataset and a podcast transcript dataset (both released), the team evaluated multiple RAG approaches, including Vector RAG with 8k-, 120k-, and 1M-token context windows, GraphRAG variants, and three published baselines, using GPT-4-series models for generation and GPT-4.1 for judging. LazyGraphRAG configurations (varying query budget, b50/b200, and chunk size, c200/c600) won virtually all 96 comparisons, often with statistical significance; the strongest was LGR_b200_c200. Notably, LazyGraphRAG outperformed Vector RAG with a 1M-token context window on most metrics, especially on global queries that require cross-document reasoning. By automating query generation, evaluation, and dataset normalization, BenchmarkQED provides a reproducible framework for stress-testing RAG strengths and failure modes, and it is available on GitHub for community use.
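For readers curious how this kind of pairwise, counterbalanced evaluation is aggregated, here is a minimal Python sketch (not BenchmarkQED's actual API; the `judge_pair` stub and all names are hypothetical): each pair of systems is compared on each query, every pair is shown to the judge in both orders to cancel position bias, and verdicts are rolled up into win rates per metric.

```python
import random
from itertools import combinations


def judge_pair(query: str, answer_a: str, answer_b: str, metric: str) -> str:
    """Stand-in for an LLM-as-a-Judge call; returns 'A', 'B', or 'tie'."""
    return random.choice(["A", "B", "tie"])  # stub for demonstration only


def counterbalanced_win_rates(queries, answers, metric):
    """Compare every pair of systems on every query, presenting each pair
    to the judge in both orders (counterbalancing) to cancel position bias,
    then aggregate verdicts into win rates per system pair."""
    results = {}
    for sys_a, sys_b in combinations(answers, 2):
        wins_a = wins_b = ties = 0
        for q in queries:
            for flipped in (False, True):  # counterbalance presentation order
                first, second = (sys_b, sys_a) if flipped else (sys_a, sys_b)
                verdict = judge_pair(q, answers[first][q], answers[second][q], metric)
                if verdict == "tie":
                    ties += 1
                elif (verdict == "A") != flipped:  # map the verdict back to the true system
                    wins_a += 1
                else:
                    wins_b += 1
        total = wins_a + wins_b + ties
        results[(sys_a, sys_b)] = {
            "win_rate_a": wins_a / total,
            "win_rate_b": wins_b / total,
            "tie_rate": ties / total,
        }
    return results


# Hypothetical usage with two systems and one query.
queries = ["What themes recur across the news dataset?"]
answers = {
    "LazyGraphRAG": {queries[0]: "answer text from LazyGraphRAG ..."},
    "VectorRAG": {queries[0]: "answer text from Vector RAG ..."},
}
print(counterbalanced_win_rates(queries, answers, metric="comprehensiveness"))
```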