Multi-Domain Rubrics Requiring Professional Knowledge to Answer and Judge (arxiv.org)

🤖 AI Summary
ProfBench is a new benchmark and evaluation framework targeting real-world, professional-domain tasks that demand domain knowledge, document synthesis, and report-style answers. The authors release a curated set of over 7,000 response–criterion pairs created and judged by human experts with professional credentials (Physics PhD, Chemistry PhD, Finance MBA, Consulting MBA). The dataset is designed to go beyond short QA and math/programming tests, focusing on whether models can read professional documents, apply domain reasoning, and generate comprehensive, actionable outputs.

To scale evaluation affordably, the team builds LLM-based automatic judges (LLM‑Judges) that are calibrated to reduce self‑enhancement bias and cut human-evaluation cost by 2–3 orders of magnitude; code and data are open. Results show ProfBench is challenging even for top models: GPT‑5‑high attains only 65.9% overall, with clear gaps between proprietary and open‑weight models. The study also highlights the importance of “extended thinking” (multi-step reasoning/longer-context strategies) for complex professional tasks.

Implications: ProfBench fills a critical evaluation gap for domain-heavy LLM applications, provides a scalable human-grounded judging pipeline, and offers a realistic stress test for model robustness, factuality, and reasoning in professional settings.
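To make the rubric-based setup concrete, here is a minimal sketch of how an LLM-Judge might score a report against a list of rubric criteria. This is not the released ProfBench pipeline: the judge model name, prompt wording, yes/no protocol, and unweighted scoring are illustrative assumptions; the paper's open-sourced code defines the actual judging prompts and calibration.

```python
# Minimal sketch (NOT the official ProfBench judge): score one report against
# a rubric of criteria via an OpenAI-compatible chat API. Model name, prompt
# text, and the YES/NO protocol are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_MODEL = "gpt-4o-mini"  # placeholder judge model, not the paper's choice


def judge_criterion(report: str, criterion: str) -> bool:
    """Ask the judge whether the report satisfies a single rubric criterion."""
    prompt = (
        "You are grading a professional report against one rubric criterion.\n"
        f"Criterion: {criterion}\n\n"
        f"Report:\n{report}\n\n"
        "Answer with exactly YES or NO."
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")


def rubric_score(report: str, criteria: list[str]) -> float:
    """Fraction of rubric criteria the judge marks as satisfied (unweighted)."""
    if not criteria:
        return 0.0
    hits = sum(judge_criterion(report, c) for c in criteria)
    return hits / len(criteria)


if __name__ == "__main__":
    example_criteria = [
        "States the key assumption behind the cash-flow estimate.",
        "Cites at least one figure from the provided source documents.",
    ]
    print(rubric_score("...model-generated report text...", example_criteria))
```

In practice a pipeline like this is what makes evaluation cheap enough to run at scale: each response–criterion pair becomes one judge call, and calibration against the human expert labels is needed to keep self-enhancement bias in check.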