Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs (arxiv.org)

🤖 AI Summary
Researchers introduced JensUn, a new unlearning method for large language models that aims for stable, durable removal of targeted knowledge while preserving overall model utility. JensUn uses the Jensen–Shannon divergence as the training objective on both the forget set and the retain set, which the authors report yields more stable training dynamics and a better forget–utility trade-off than commonly used loss functions. In experiments, JensUn not only removes targeted facts more effectively but is also robust to "benign relearning": the forgetting is close to irreversible under ordinary fine-tuning attempts to recover the erased knowledge.

The paper also raises the bar for evaluating unlearning. It provides LKF, a curated dataset of lesser-known facts that serve as realistic deletion targets, and proposes two stricter evaluation practices: using a large LLM as a semantic judge instead of surface metrics like ROUGE, and testing worst-case unlearning across paraphrases and varied input formats. Together, these stricter evaluations reveal that many prior methods perform worse under realistic, adversarial conditions than standard benchmarks suggest.

The work has practical implications for privacy, safety, and regulatory compliance in deployed LLMs, offering both a technically grounded objective and a tougher benchmark suite for future unlearning research.
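For concreteness, here is a minimal PyTorch sketch of what a JSD-based forget/retain objective can look like. The choice of target distributions (a fixed refusal answer for the forget set, a frozen copy of the original model for the retain set), the `lam` weighting, and the HuggingFace-style `model(...).logits` interface are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def js_divergence(logits_p: torch.Tensor, logits_q: torch.Tensor) -> torch.Tensor:
    """JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.

    Unlike gradient ascent on the negative log-likelihood, the JSD is
    symmetric and bounded (by log 2), one plausible source of the stabler
    training dynamics the authors report.
    """
    p = F.softmax(logits_p, dim=-1)
    q = F.softmax(logits_q, dim=-1)
    log_m = torch.log(0.5 * (p + q) + 1e-12)  # epsilon guards against log(0)
    # F.kl_div(input, target) computes KL(target || dist with log-probs input).
    kl_pm = F.kl_div(log_m, p, reduction="batchmean")
    kl_qm = F.kl_div(log_m, q, reduction="batchmean")
    return 0.5 * (kl_pm + kl_qm)

def unlearning_loss(model, forget_inputs, refusal_logits,
                    retain_inputs, reference_logits, lam: float = 1.0):
    # Forget term: pull the model's next-token distribution on forget-set
    # prompts toward an uninformative target (assumed here to be the logits
    # of a fixed refusal / "I don't know" answer).
    forget_term = js_divergence(model(**forget_inputs).logits, refusal_logits)
    # Retain term: keep the distribution on retain-set prompts close to a
    # frozen copy of the original model (again an assumed anchor).
    retain_term = js_divergence(model(**retain_inputs).logits, reference_logits)
    return forget_term + lam * retain_term
```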
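The stricter evaluation protocol is also easy to sketch: a fact counts as forgotten only if no paraphrase or input format elicits it, with a large LLM judging answers semantically rather than by token overlap. `generate` and `judge` below are hypothetical callables standing in for the unlearned model and the judge model; neither is an API from the paper:

```python
from typing import Callable, Iterable, List, Tuple

def fact_leaks(generate: Callable[[str], str],
               judge: Callable[[str, str], bool],
               fact: str,
               paraphrases: Iterable[str]) -> bool:
    """True if ANY phrasing of the question makes the model reveal the fact.

    `judge(answer, fact)` should return True when the answer still
    semantically conveys the fact; an LLM judge catches reworded leaks
    that surface metrics like ROUGE miss.
    """
    return any(judge(generate(question), fact) for question in paraphrases)

def worst_case_forget_rate(generate: Callable[[str], str],
                           judge: Callable[[str, str], bool],
                           eval_set: List[Tuple[str, List[str]]]) -> float:
    """Fraction of facts that stay forgotten under the worst-case criterion."""
    forgotten = [not fact_leaks(generate, judge, fact, paraphrases)
                 for fact, paraphrases in eval_set]
    return sum(forgotten) / len(forgotten)
```

Taking the worst case over paraphrases, rather than averaging, is what makes the benchmark adversarial: a method gets no credit for a fact it suppresses in one phrasing but reveals in another.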