🤖 AI Summary
Modern LLMs break the old engineering assumptions of deterministic, enumerable testing: the same prompt can yield different outputs, natural language creates an essentially infinite input space, and the core model often sits outside the quick edit-compile-test loop. The paper argues we must stop chasing a single "golden" output and instead engineer probabilistic bounds on risk by (1) constraining system behavior, (2) gathering representative, scenario-based evidence, and (3) mapping fixes to realistic intervention levers. That shift matters because it turns invisible drift into measurable degradation and forces teams to reason about frequency, severity, and fixability rather than unchecked correctness claims.
Practically, correctness becomes property-based: structural contracts (schemas, types), invariance/metamorphic relations (paraphrase equivalence, monotonicity), and robustness/ablation tests (placebo resistance, sufficiency/necessity). Acceptance is probabilistic: use sample-size rules (e.g., the "rule of three" gives a ~3/N upper bound on the failure rate when N trials show zero failures, so ~300 clean trials bound failure near 1% at 95% confidence), interval estimators (Wilson, Clopper-Pearson), and controls for dependence and multiple comparisons (vary seeds and time; apply Bonferroni or FDR corrections). Partition the input space into a scenario grid (task, domain, language, length, retrieval quality, adversarial pressure) to target sampling, align with the production mix, and detect drift. Quantify risk as frequency × cost, price abstention against mistaken answers, and calibrate confidence using proxies (sample agreement, independent verifiers, citation grounding) with simple calibrators (logistic or isotonic regression). Finally, map error classes to intervention points, from output constraints to retrieval/prompt engineering, validators, and model adaptation, applying the minimal effective fix while documenting claims in a concise safety case.
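As a rough illustration of the acceptance arithmetic above, here is a minimal Python sketch of the rule-of-three bound and the Wilson score interval; the function names and the 2-failures-in-300 example are illustrative, not taken from the paper.

```python
import math

def rule_of_three_upper(n_trials: int) -> float:
    """Approximate 95% upper bound on the failure rate when zero failures
    are observed in n_trials independent pass/fail evaluations."""
    return 3.0 / n_trials

def wilson_interval(failures: int, n_trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the failure rate (z = 1.96 gives ~95% confidence)."""
    p_hat = failures / n_trials
    denom = 1 + z**2 / n_trials
    center = (p_hat + z**2 / (2 * n_trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n_trials + z**2 / (4 * n_trials**2)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)

# 300 failure-free trials bound the failure rate near 1% at 95% confidence.
print(rule_of_three_upper(300))   # 0.01
# With 2 failures in 300 trials, the interval widens to roughly (0.002, 0.024).
print(wilson_interval(2, 300))
```

The Wilson interval is usually preferred over the naive p ± z·sqrt(p(1-p)/n) form at small failure counts, which is exactly the regime these LLM acceptance tests operate in.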