🤖 AI Summary
Researchers propose statistically reliable ways to use large language model (LLM)-generated relevance labels for information retrieval (IR) evaluation by placing principled confidence intervals (CIs) around IR metrics. Because synthetic annotations are cheap but error-prone, often with systematic biases, the paper introduces two methods, prediction-powered inference and conformal risk control, that combine a small set of trusted human annotations with mass-produced LLM labels. These methods quantify and correct for annotation errors, so reported metrics come with strong theoretical guarantees rather than a false sense of accuracy.
Technically, prediction-powered inference calibrates a model of the LLM's annotation errors on the small human-labeled seed set and uses it to correct the metric estimate, while conformal risk control, newly adapted here to ranking metrics, produces query- and document-level CIs that account for both the bias and the variance of LLM labels. Experiments show these CIs capture the true uncertainty more reliably than a standard empirical bootstrap, enabling accurate metric estimation with far fewer manual labels. The contribution makes low-cost, large-scale IR evaluation feasible and trustworthy for research and low-resource applications, offering a practical way to scale evaluation while maintaining statistical rigor.
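To make the prediction-powered idea concrete, here is a minimal sketch of a PPI-style point estimate and CI for a mean per-query metric, assuming human and LLM labels have already been reduced to per-query metric values; the function name, inputs, and the simple mean-metric setting are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np
from scipy.stats import norm

def ppi_mean_ci(metric_llm_unlabeled, metric_llm_labeled, metric_human_labeled, alpha=0.05):
    """Prediction-powered CI for a mean IR metric (illustrative sketch).

    metric_llm_unlabeled : per-query metric values from LLM labels on the
                           large, human-unlabeled query set (length N).
    metric_llm_labeled   : per-query metric values from LLM labels on the
                           small human-annotated seed set (length n).
    metric_human_labeled : per-query metric values from human labels on the
                           same seed set (length n).
    """
    g = np.asarray(metric_llm_unlabeled, dtype=float)  # cheap, large sample
    f = np.asarray(metric_llm_labeled, dtype=float)    # LLM metric on the seed
    y = np.asarray(metric_human_labeled, dtype=float)  # human metric on the seed
    N, n = len(g), len(y)

    rectifier = f - y                        # systematic LLM error, estimated on the seed
    theta_pp = g.mean() - rectifier.mean()   # bias-corrected point estimate

    # Variance combines the large-sample LLM term and the small-sample rectifier term.
    se = np.sqrt(g.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = norm.ppf(1 - alpha / 2)
    return theta_pp, (theta_pp - z * se, theta_pp + z * se)
```

Similarly, a generic conformal risk control calibration in the style of Angelopoulos et al. can be sketched as picking the smallest threshold on LLM relevance scores whose finite-sample-corrected calibration risk stays below a target level. The false-negative-rate loss and the threshold grid below are assumptions; the paper's ranking-metric tailoring is more elaborate.

```python
import numpy as np

def crc_calibrate_threshold(cal_scores, cal_relevant, alpha=0.1, grid=200):
    """Generic conformal risk control calibration (illustrative sketch).

    cal_scores   : list over calibration queries; each entry holds the LLM
                   relevance scores in [0, 1] for that query's documents.
    cal_relevant : matching list of boolean human relevance labels.
    alpha        : target bound on the expected per-query false-negative rate.

    Returns the smallest lambda such that keeping documents with
    score >= 1 - lambda satisfies the risk guarantee E[FNR] <= alpha.
    """
    n = len(cal_scores)

    def fnr(lam):
        # Per-query fraction of human-relevant documents excluded at this lambda.
        losses = []
        for s, r in zip(cal_scores, cal_relevant):
            s, r = np.asarray(s, dtype=float), np.asarray(r, dtype=bool)
            if r.sum() == 0:
                losses.append(0.0)
                continue
            kept = s >= 1.0 - lam
            losses.append(1.0 - (kept & r).sum() / r.sum())
        return float(np.mean(losses))

    for lam in np.linspace(0.0, 1.0, grid):
        # CRC finite-sample correction: loss bounded by 1, n calibration queries.
        if (n / (n + 1)) * fnr(lam) + 1.0 / (n + 1) <= alpha:
            return lam
    return 1.0
```

In both sketches the human-labeled seed does the statistical work, estimating the LLM's systematic error (PPI) or the calibrated risk level (CRC), while the cheap LLM labels supply scale.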