AI Summary
Researchers introduce Self-Correction Bench, a focused evaluation exposing a systematic failure mode in large language models: they can fix errors presented as external input but frequently fail to correct identical mistakes in their own outputs, a phenomenon the authors call the Self-Correction Blind Spot. Using controlled error injection at three complexity levels, the paper evaluates 14 open-source non-reasoning LLMs and reports an average blind spot rate of 64.5%. The benchmark quantifies how often models ignore or fail to revise their own flawed chains of thought even though they can revise the same content when it is presented as external input.
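To make the measurement concrete, here is a minimal sketch of how a blind-spot rate could be computed, assuming an OpenAI-style chat client; the prompts, the substring-based correctness check, and the function names are illustrative assumptions, not the authors' actual protocol.

```python
from openai import OpenAI  # assumed client; any chat-completion API works similarly

client = OpenAI()

def corrects_error(model: str, flawed_solution: str, correct_answer: str,
                   as_external: bool) -> bool:
    """Check whether the model fixes an injected error.

    as_external=True frames the flawed text as someone else's answer;
    as_external=False places it in the assistant turn, as if the model
    had produced it itself and is asked to continue.
    """
    if as_external:
        messages = [{"role": "user",
                     "content": f"Review this solution and fix any mistakes:\n\n{flawed_solution}"}]
    else:
        messages = [{"role": "user", "content": "Solve the problem step by step."},
                    {"role": "assistant", "content": flawed_solution},  # the model's "own" output
                    {"role": "user", "content": "Please continue."}]
    reply = client.chat.completions.create(model=model, messages=messages)
    # Crude check for illustration: does the corrected answer appear in the reply?
    return correct_answer in reply.choices[0].message.content

def blind_spot_rate(model: str, cases: list[tuple[str, str]]) -> float:
    """Share of injected errors the model fixes externally but not in its own output."""
    blind = fixable = 0
    for flawed_solution, correct_answer in cases:
        if corrects_error(model, flawed_solution, correct_answer, as_external=True):
            fixable += 1
            if not corrects_error(model, flawed_solution, correct_answer, as_external=False):
                blind += 1
    return blind / fixable if fixable else 0.0
```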
The work links the blind spot to the training distribution: human demonstrations used in supervised fine-tuning rarely include explicit error-correction examples, whereas models trained with RL-style outcome feedback learn corrective behavior better. Importantly, the authors show this capability is latent: appending a minimal "Wait" prompt before asking the model to self-correct reduces blind spots by 89.3%, suggesting simple prompting strategies can unlock dormant self-revision skills. The finding is significant for AI/ML practitioners because it identifies a reproducible safety risk in deployment and offers both evaluation tooling and a low-cost mitigation pathway to improve trustworthiness in safety-critical applications.
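The "Wait" intervention itself is easy to sketch. Assuming a completion-style endpoint where text can be appended to the model's own draft (the paper's exact decoding setup is not specified in this summary), it might look like this:

```python
from openai import OpenAI  # assumed client, as in the sketch above

client = OpenAI()

def continue_with_wait(model: str, question: str, flawed_draft: str) -> str:
    """Append a single "Wait" after the model's own (possibly flawed) draft and
    let it keep generating, nudging it to re-examine the preceding reasoning.
    A plain completion call is used so the cue reads as the model's own text,
    not as a new user message."""
    prompt = f"{question}\n\n{flawed_draft}\nWait"
    resp = client.completions.create(model=model, prompt=prompt, max_tokens=512)
    return flawed_draft + "\nWait" + resp.choices[0].text
```

The design choice here is that the cue is inserted into the model's own output stream rather than issued as an external instruction, which is what distinguishes the mitigation from simply asking the model to double-check its work.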