🤖 AI Summary
Researchers prove that transformer language models map discrete input token sequences to unique continuous hidden-state sequences (i.e., the mapping is injective), and that this invertibility holds at initialization and is preserved through training. They counter the common intuition that non-linearities and normalization necessarily create collisions by providing a formal proof, then validating it empirically with billions of collision tests across six state-of-the-art LMs, finding no collisions. Building on this, the authors present SipIt, an algorithm that provably and efficiently reconstructs the exact input text from a model's hidden activations, with linear-time guarantees, demonstrating exact invertibility in practice.
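To make the empirical claim concrete, here is a minimal sketch of such a collision test (not the paper's exact protocol): embed many distinct prompts with a pretrained LM and check whether any two produce bit-identical hidden states. It assumes the Hugging Face `transformers` library; the model name and prompt set are illustrative placeholders, not those used in the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper tests six larger LMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# Illustrative prompt set; the paper runs billions of comparisons.
prompts = [f"example prompt number {i}" for i in range(1000)]

seen = {}
collisions = 0
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        h = model(**ids).last_hidden_state   # shape (1, seq_len, d_model)
        # Hash the exact float bytes of the final token's hidden state,
        # so only bit-identical states count as a collision.
        key = h[0, -1].numpy().tobytes()
        if key in seen and seen[key] != p:
            collisions += 1
        seen[key] = p

print(f"collisions found: {collisions}")  # injectivity predicts 0
```

Injectivity predicts zero collisions here, matching what the paper reports across its far larger-scale tests.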
The result has direct technical and operational implications: injectivity means hidden states can leak full input content, raising privacy and model-inversion concerns, but it also enables new tools for interpretability, debugging, dataset attribution, and transparency, because internal activations can be mapped back to the original tokens. Practically, SipIt shows that, given access to activations, exact recovery is feasible and efficient. This reframes how the community should think about internal representations and defenses against leakage, and how to leverage invertibility for the safe deployment and introspection of language models.
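The recovery procedure can be illustrated with a brute-force variant of the idea behind SipIt (a sketch of the principle, not the paper's optimized algorithm): recover tokens left to right, at each position searching the vocabulary for the unique token whose hidden state reproduces the observed activation. It assumes white-box access to the model that produced the activations; `model` and `tok` are as in the sketch above, and the matching layer and tolerance are illustrative choices.

```python
import torch

def recover_text(target_h: torch.Tensor, model, tok) -> str:
    """Recover input text from observed hidden states.

    target_h: (seq_len, d_model) last-layer activations produced by `model`.
    """
    recovered = []  # token ids reconstructed so far
    with torch.no_grad():
        for pos in range(target_h.shape[0]):
            for cand in range(tok.vocab_size):
                # Append the candidate token to the recovered prefix
                # and re-run the model on that sequence.
                ids = torch.tensor([recovered + [cand]])
                h = model(input_ids=ids).last_hidden_state[0, pos]
                # Injectivity implies exactly one candidate reproduces the
                # observed state (up to numerical tolerance).
                if torch.allclose(h, target_h[pos], atol=1e-6):
                    recovered.append(cand)
                    break
    return tok.decode(recovered)
```

Each position requires at most one pass over the vocabulary, so the number of model calls grows linearly with sequence length; the paper's SipIt achieves its linear-time guarantee with a much more efficient per-position search while preserving exactness.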
        