LLMs Don't Know Their Own Decision Boundaries (arxiv.org)

🤖 AI Summary
Researchers evaluated whether large language models can reliably explain their own predictions via self-generated counterfactual explanations (SCEs): text edits that would make the model predict a different outcome. They measured two core properties: validity (the counterfactual actually flips the model's prediction) and minimality (the edit changes the input no more than necessary).

Across multiple LLMs, datasets, and evaluation settings, the authors found a consistent validity–minimality trade-off. When free to produce counterfactuals, models usually generate valid SCEs but with large, non-minimal edits that reveal little about decision boundaries; when explicitly asked to be minimal, models often make tiny edits that fail to change the prediction. In short, SCEs are either uninformative or incorrect.

This result matters because SCEs are a natural, intuitive explainability tool for human–AI collaboration. The study shows they can give a false sense of understanding, either by hiding the true decision boundary behind sweeping changes or by offering misleading "minimal" changes that don't actually affect behavior. Technically, the finding implies that LLMs do not reliably model or report their own decision boundaries, so practitioners should be cautious about relying on self-explanations in high-stakes settings and should consider external, model-agnostic counterfactual generators, rigorous intervention tests, or calibrated uncertainty estimates. The authors publish code and data to reproduce the analyses.
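As a rough illustration of the evaluation protocol described in the summary, the sketch below scores a single SCE on both properties: it checks whether the model's own prediction flips on the edited input (validity) and uses a character-level similarity ratio as a crude minimality proxy. The names `evaluate_sce`, `predict`, and `generate_sce`, and the toy stand-ins under `__main__`, are hypothetical illustrations under these assumptions, not the authors' released code, which likely uses task-specific prompts and metrics.

```python
# A minimal sketch (not the authors' released code) of how one SCE can be
# scored for validity and minimality. `predict` and `generate_sce` are
# hypothetical callables standing in for LLM API calls; the toy versions
# under __main__ exist only so the example runs end to end.
from difflib import SequenceMatcher
from typing import Callable


def evaluate_sce(
    text: str,
    predict: Callable[[str], str],            # text -> predicted label
    generate_sce: Callable[[str, str], str],  # (text, current label) -> edited text
) -> dict:
    """Score one self-generated counterfactual explanation."""
    original_label = predict(text)
    counterfactual = generate_sce(text, original_label)
    new_label = predict(counterfactual)

    # Validity: the edit must actually flip the model's own prediction.
    valid = new_label != original_label

    # Minimality proxy: 1 - similarity ratio, so 0.0 means "no change"
    # and values near 1.0 mean the input was largely rewritten.
    edit_size = 1.0 - SequenceMatcher(None, text, counterfactual).ratio()

    return {"valid": valid, "edit_size": round(edit_size, 3)}


if __name__ == "__main__":
    # Toy stand-ins: a keyword "sentiment classifier" and a crude rewriter.
    def toy_predict(text: str) -> str:
        return "positive" if "great" in text else "negative"

    def toy_rewrite(text: str, label: str) -> str:
        if label == "positive":
            return text.replace("great", "awful")
        return text + " It was great."

    # Prints something like {'valid': True, 'edit_size': 0.2}:
    # the edit flips the prediction, but minimality still has to be judged.
    print(evaluate_sce("The movie was great.", toy_predict, toy_rewrite))
```

The paper's finding, in these terms, is that models tend to score well on one of the two numbers at the expense of the other: unconstrained prompts yield `valid=True` with large `edit_size`, while "be minimal" prompts yield small `edit_size` with `valid=False`.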