LLM Output Drift in Financial Workflows: Validation and Mitigation (arXiv)

🤖 AI Summary
Researchers measured “output drift” — nondeterministic changes in LLM outputs that undermine auditability — across five models (7B–120B) on regulated financial tasks and found counterintuitive results: smaller models (Granite-3-8B, Qwen2.5-7B) produced 100% consistent outputs with greedy decoding (T=0.0), while a 120B model (GPT-OSS-120B) showed just 12.5% consistency (95% CI: 3.5–36.0%), a highly significant difference (p<0.0001).

The team ran 480 total trials (n=16 per condition) on three financial workflows (reconciliations, regulatory reporting, client communications) and observed task-dependent sensitivity: structured SQL tasks stayed stable even with some sampling (T=0.2), whereas retrieval-augmented generation (RAG) tasks exhibited 25–75% drift.

To mitigate risks they introduce a finance-calibrated deterministic test harness (T=0.0, fixed seeds, SEC 10-K structure-aware retrieval ordering), task-specific invariant checks for RAG/JSON/SQL using ±5% materiality thresholds and SEC citation validation, a three-tier model classification for risk-appropriate deployment, and an audit-ready attestation system with dual-provider validation. Cross-provider tests show deterministic behavior can transfer between local and cloud deployments. The framework maps to FSB/BIS/CFTC requirements, offering a practical compliance pathway and challenging the assumption that larger LLMs are always preferable for regulated production use.
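The core measurements above — per-condition consistency rates and ±5% materiality checks — can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's actual harness; the function names, the exact-string-match definition of consistency, and the sample figures are assumptions for demonstration.

```python
from collections import Counter

def consistency_rate(outputs):
    """Fraction of trials whose output matches the modal (most common) output.

    Exact string match is an assumed criterion; the paper's harness may use
    task-specific invariants (JSON/SQL/RAG) instead.
    """
    counts = Counter(outputs)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(outputs)

def within_materiality(reference, candidate, threshold=0.05):
    """Check a numeric figure against a relative materiality threshold (±5%)."""
    if reference == 0:
        return candidate == 0
    return abs(candidate - reference) / abs(reference) <= threshold

# Hypothetical condition with n=16 trials: 14 identical outputs, 2 drifted.
trials = [1203400.0] * 14 + [1203000.0, 1188100.0]
rate = consistency_rate(trials)            # 14/16 = 0.875
material_ok = within_materiality(1203400.0, 1188100.0)  # ~1.3% deviation
```

Under a definition like this, a model is "100% consistent" when every one of the 16 trials in a condition produces the modal output, and a drifted figure may still pass the materiality check if it stays within the ±5% band.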