🤖 AI Summary
The paper shows that memorization in transformers can be isolated in weight space using a decomposition grounded in the loss landscape’s curvature: training points that are memorized sit in much sharper-curvature directions than non-memorized points, so ordering weight components by curvature yields a basis that separates out memorization-related structure. Using this basis, the authors design a curvature-based weight-editing procedure that suppresses unwanted recitation of memorized training data far more effectively than a recent unlearning method (BalancedSubnet) while better preserving overall language-model quality (lower post-edit perplexity). The technique transfers across model families (LMs and ViTs) and is validated quantitatively against memorized outputs.
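To make the idea concrete, here is a minimal sketch of a curvature-ordered weight edit. It is not the paper's actual decomposition: it uses a diagonal-Fisher (squared-gradient) surrogate for curvature and simply zeroes the sharpest-curvature weight components, and all function names and the `top_frac` parameter are illustrative.

```python
import torch

def estimate_curvature(model, data_loader, loss_fn, n_batches=16):
    """Rough per-parameter curvature via averaged squared gradients
    (a diagonal-Fisher stand-in for the paper's curvature decomposition)."""
    curv = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    for i, (x, y) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                curv[name] += p.grad.detach() ** 2
    return {name: c / n_batches for name, c in curv.items()}

@torch.no_grad()
def edit_sharp_directions(model, curvature, top_frac=1e-3):
    """Zero the weight components with the sharpest estimated curvature --
    a crude analogue of removing the memorization-associated subspace."""
    for name, p in model.named_parameters():
        c = curvature[name].flatten()
        k = max(1, int(top_frac * c.numel()))
        idx = torch.topk(c, k).indices  # sharpest-curvature components
        p.view(-1)[idx] = 0.0
```

The paper orders components in a curvature-derived basis rather than zeroing raw weights, so this sketch only conveys the shape of the procedure: estimate curvature per component, rank, and edit the sharp end of the spectrum.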
Beyond a tool for selective unlearning, the work reveals that some tasks—specifically fact retrieval and arithmetic—depend on narrow, idiosyncratic directions in weight space: editing the curvature-based subspace disproportionately lowers performance on these tasks, while open-book fact lookup and general logical reasoning remain intact. The authors support this with a correlation between task examples’ activation strength in the edited subspace and post-edit performance drops. The result both provides a practical method for reducing memorization/privacy risk and offers mechanistic evidence that certain “reasoning” behaviors rely on highly specialized weight directions rather than distributed, general-purpose computation.
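A hedged sketch of the kind of correlation check described above, assuming you already have, per task example, an activation-energy score in the edited subspace and accuracy before and after the edit (the function and argument names are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr

def subspace_energy_vs_drop(subspace_energy, acc_before, acc_after):
    """Correlate how strongly each example activates the edited subspace
    with how much its performance drops after the weight edit."""
    drop = np.asarray(acc_before) - np.asarray(acc_after)
    r, p_value = pearsonr(np.asarray(subspace_energy), drop)
    return r, p_value
```

A positive, significant correlation is the signature the summary refers to: examples that lean most on the edited directions are the ones that degrade most.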