🤖 AI Summary
OpenEstimate is a new, multi-domain benchmark that tests large language models on real-world numerical estimation tasks where answers are inherently uncertain. Rather than asking for single “correct” responses, the benchmark requires models to synthesize background knowledge and report probabilistic priors, which are then evaluated for accuracy and calibration against samples from the true distribution of interest. The paper argues this fills an important evaluation gap: most LM tests use well-defined questions, while many deployed applications (healthcare, finance, forecasting) need reliable uncertainty-aware estimates.
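As a rough illustration of what “reporting a prior” could look like, the sketch below represents an elicited prior as a normal distribution and scores it against samples from the true distribution using the average log-density, a proper scoring rule. This is an assumption-laden toy, not OpenEstimate’s actual interface; the class and variable names (`ElicitedPrior`, `true_samples`) are hypothetical.

```python
from dataclasses import dataclass

import numpy as np
from scipy import stats


@dataclass
class ElicitedPrior:
    """A model's reported prior over a numerical quantity, here a normal."""
    mean: float
    std: float

    def log_score(self, samples: np.ndarray) -> float:
        # Average log-density of ground-truth samples under the elicited
        # prior; higher is better (a proper scoring rule for densities).
        return float(np.mean(stats.norm.logpdf(samples, self.mean, self.std)))


# Hypothetical example: a prior elicited from an LM for some real-world
# quantity, scored against samples from that quantity's true distribution.
prior = ElicitedPrior(mean=50.0, std=5.0)
true_samples = np.random.default_rng(0).normal(62.0, 4.0, size=1_000)
print(f"log score: {prior.log_score(true_samples):.2f}")
```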
The authors evaluate six frontier LMs and find that elicited priors are frequently inaccurate and overconfident. Performance improves only modestly with different uncertainty-elicitation techniques and is largely insensitive to sampling strategy, extra chain-of-thought reasoning, or prompt tweaks. Technically, OpenEstimate measures both point-estimate usefulness and probabilistic calibration, offering an extensible platform for domain-specific tasks and richer supervision signals. The results highlight a pressing need for model improvements in probabilistic estimation (better calibration, structured probabilistic outputs, and training or fine-tuning regimes that target distributional reasoning) if LMs are to be trusted in high-stakes, uncertainty-heavy settings.
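A similarly hedged sketch of the two evaluation axes described above: point-estimate usefulness, measured here as relative error of the prior mean, and calibration, measured here as empirical coverage of the prior’s central credible interval, which falls well below the nominal level for an overconfident prior. The metric choices and function names are illustrative assumptions, not the benchmark’s exact definitions.

```python
import numpy as np
from scipy import stats


def point_error(prior_mean: float, samples: np.ndarray) -> float:
    # Relative error of the prior's mean against the empirical true mean.
    return abs(prior_mean - samples.mean()) / abs(samples.mean())


def interval_coverage(prior_mean: float, prior_std: float,
                      samples: np.ndarray, level: float = 0.9) -> float:
    # Fraction of true samples inside the prior's central credible interval;
    # a well-calibrated prior covers roughly `level` of them, while an
    # overconfident (too-narrow) prior covers far fewer.
    lo, hi = stats.norm.interval(level, loc=prior_mean, scale=prior_std)
    return float(np.mean((samples >= lo) & (samples <= hi)))


samples = np.random.default_rng(1).normal(62.0, 4.0, size=1_000)
print(point_error(50.0, samples))             # point-estimate usefulness
print(interval_coverage(50.0, 5.0, samples))  # << 0.9 signals overconfidence
```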