🤖 AI Summary
Modern LLMs break the old engineering assumptions of deterministic, enumerable testing: the same prompt can yield different outputs, natural language creates an essentially infinite input space, and the core model often sits outside the quick edit-compile-test loop. The paper argues we must stop chasing a single "golden" output and instead engineer probabilistic bounds on risk by (1) constraining system behavior, (2) gathering representative, scenario-based evidence, and (3) mapping fixes to realistic intervention levers. That shift matters because it turns invisible drift into measurable degradation and forces teams to reason about frequency, severity, and fixability rather than unchecked correctness claims.
Practically, correctness becomes property-based: structural contracts (schemas, types), invariance/metamorphic relations (paraphrase equivalence, monotonicity), and robustness/ablation tests (placebo resistance, sufficiency/necessity). Acceptance is probabilistic: use sample-size rules (e.g., the "rule of three" gives a ~3/N upper bound on the failure rate when N trials show zero failures, so ~300 clean trials bound failure near 1% at 95% confidence), interval estimators (Wilson, Clopper-Pearson), and controls for dependence and multiple comparisons (vary seeds and time; apply Bonferroni or FDR corrections). Partition the input space into a scenario grid (task, domain, language, length, retrieval quality, adversarial pressure) to target sampling, align with the production mix, and detect drift. Quantify risk as frequency × cost, price abstention against mistaken answers, and calibrate confidence using proxies (sample agreement, independent verifiers, citation grounding) with simple calibrators (logistic or isotonic regression). Finally, map error classes to intervention points, from output constraints to retrieval/prompt engineering, validators, and model adaptation, applying the minimal effective fix while documenting claims in a concise safety case.
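As a rough illustration of the acceptance arithmetic above, here is a minimal Python sketch of the rule-of-three bound and the Wilson score interval; the function names and the 2-failures-in-300 example are illustrative, not taken from the paper.

```python
import math

def rule_of_three_upper(n_trials: int) -> float:
    """Approximate 95% upper bound on the failure rate when zero failures
    are observed in n_trials independent pass/fail evaluations."""
    return 3.0 / n_trials

def wilson_interval(failures: int, n_trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the failure rate (z = 1.96 gives ~95% confidence)."""
    p_hat = failures / n_trials
    denom = 1 + z**2 / n_trials
    center = (p_hat + z**2 / (2 * n_trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n_trials + z**2 / (4 * n_trials**2)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)

# 300 failure-free trials bound the failure rate near 1% at 95% confidence.
print(rule_of_three_upper(300))   # 0.01
# With 2 failures in 300 trials, the interval widens to roughly (0.002, 0.024).
print(wilson_interval(2, 300))
```

The Wilson interval is usually preferred over the naive p ± z·sqrt(p(1-p)/n) form at small failure counts, which is exactly the regime these LLM acceptance tests operate in.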