UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs (arxiv.org)

🤖 AI Summary
Researchers have introduced UnpredictaBench, a benchmark that evaluates how well large language models (LLMs) can sample from a specified underlying distribution. LLMs are increasingly used in settings such as economic simulations, where their tendency to converge on a small set of plausible answers undermines their ability to model the unpredictability of real-world systems. UnpredictaBench isolates this capability: each problem asks a model to generate outputs that follow a target distribution, specified as anything from a standard statistical distribution to a natural-language description of a random process. The benchmark comprises 448 distinct problems and introduces an evaluation metric, KS@N, based on the Kolmogorov-Smirnov statistical test, which measures how closely model outputs align with ground-truth samples. The results expose clear limits in distributional sampling: scores vary widely across models, and many models score below 40% on KS@100 even when reasoning is used to try to improve results. Even foundational distributional tasks therefore remain challenging for LLMs, a gap that must be closed before they can be applied more broadly in complex simulations and other tasks that depend on randomness and variability.
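The summary does not spell out how KS@N is computed, so the following is only a minimal sketch under stated assumptions: that a problem counts as solved when a two-sample Kolmogorov-Smirnov test fails to reject, at a hypothetical significance level `alpha`, the hypothesis that the first N model outputs and the ground-truth samples share a distribution, and that KS@N is the fraction of problems solved. The function names `ks_statistic`, `ks_passes`, and `ks_at_n` are illustrative and are not taken from the paper.

```python
import bisect
import math


def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    # The largest gap between two step functions occurs at a sample point.
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / len(xs)
        fy = bisect.bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def ks_passes(model_samples, reference_samples, alpha=0.05):
    """Assumed pass rule: the statistic stays within the large-sample critical value."""
    n, m = len(model_samples), len(reference_samples)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(model_samples, reference_samples) <= critical


def ks_at_n(problems, n, alpha=0.05):
    """Assumed aggregate: fraction of problems whose first n model samples pass."""
    passed = sum(ks_passes(model[:n], ref, alpha) for model, ref in problems)
    return passed / len(problems)


# Hypothetical data: ground truth is an evenly spaced grid on (0, 1).
reference = [(i + 0.5) / 1000 for i in range(1000)]
spread = [(i + 0.5) / 100 for i in range(100)]  # covers the whole range
collapsed = [0.5] * 100                         # always the same "plausible" answer

score = ks_at_n([(spread, reference), (collapsed, reference)], n=100)
print(score)  # 0.5: only the spread-out sampler passes
```

The collapsed sampler illustrates the failure mode the summary describes: every individual answer is reasonable, yet the set of answers does not resemble the target distribution, so the test rejects it.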