🤖 AI Summary
AA-Omniscience is a new benchmark that evaluates large language models' factual recall and self-knowledge across domains rather than just general capabilities. It contains 6,000 questions drawn from authoritative academic and industry sources, spanning 42 economically relevant topics across six domains. The core metric is the Omniscience Index (bounded −100 to 100), which penalizes hallucinations and rewards abstention when the model is uncertain; a score of 0 means the model answers correctly exactly as often as it answers incorrectly. This design measures both accuracy and calibration, forcing models to trade off risky guessing against safe abstention.
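To make the scoring concrete, here is a minimal sketch of how such an index could be computed. The weighting is an assumption consistent with the description above (correct answers +1, hallucinated answers −1, abstentions 0 but still counted in the total), not the official AA-Omniscience formula:

```python
from dataclasses import dataclass

@dataclass
class EvalCounts:
    correct: int
    incorrect: int
    abstained: int

def omniscience_index(counts: EvalCounts) -> float:
    """Illustrative Omniscience-Index-style score (assumed weighting, not official).

    Correct answers score +1, incorrect answers (hallucinations) score -1,
    and abstentions score 0 while still counting toward the total. This
    reproduces the stated properties: the score is bounded in [-100, 100],
    abstaining is safer than guessing wrong, and a score of 0 means the
    model answered correctly exactly as often as it answered incorrectly.
    """
    total = counts.correct + counts.incorrect + counts.abstained
    if total == 0:
        return 0.0
    return 100.0 * (counts.correct - counts.incorrect) / total

# Example: 40% correct, 40% incorrect, 20% abstained -> index of 0.
print(omniscience_index(EvalCounts(correct=2400, incorrect=2400, abstained=1200)))  # 0.0
```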
Evaluations reveal persistent factuality and calibration weaknesses in frontier models: Claude 4.1 Opus leads with an Omniscience Index of 4.8 and is one of only three models scoring above zero. Performance varies sharply by domain, with different labs leading in each of the six areas, so overall leaderboard rank is a poor proxy when domain-specific knowledge reliability matters. The practical implications for deployment are clear: models need better uncertainty estimation, abstention policies, and domain-specific retrieval or fine-tuning, and model selection should be driven by use-case demands rather than aggregate capability metrics.
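As a rough illustration of the abstention-policy point, the sketch below wraps generation with a confidence threshold. The `generate` and `confidence_of` hooks and the 0.75 threshold are hypothetical placeholders, not any specific vendor API; in practice confidence might come from self-reported certainty, token logprobs, or a separate verifier model:

```python
from typing import Callable

def answer_or_abstain(
    question: str,
    generate: Callable[[str], str],
    confidence_of: Callable[[str, str], float],
    threshold: float = 0.75,
) -> str:
    """Hypothetical abstention wrapper for deployment.

    Under an Omniscience-style metric, abstaining costs nothing while a
    wrong answer is penalized, so low-confidence drafts are withheld.
    """
    draft = generate(question)
    if confidence_of(question, draft) < threshold:
        return "I don't know."  # abstain rather than risk a hallucination penalty
    return draft

# Example with stub hooks (illustrative only): confidence 0.4 < 0.75, so the model abstains.
if __name__ == "__main__":
    stub_generate = lambda q: "Paris"
    stub_confidence = lambda q, a: 0.4
    print(answer_or_abstain("What is the capital of France?", stub_generate, stub_confidence))
```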