Can Transformers Do Everything, and Undo It Too? (astro-eric.github.io)

🤖 AI Summary
Two new theoretical papers provoked debate by proving different functional properties of Transformers: one shows that the Transformer map from continuous input embeddings to continuous outputs can be surjective (it can reach any point in output space), while the other shows that the map from discrete token sequences to the model's output embedding is (almost always) injective. Taken at face value, these results sound contradictory ("can output anything" versus "different inputs always give different outputs"), but they examine different functions. The surjectivity proof works in a continuous-to-continuous setting (no discretization or autoregressive decoding), while the injectivity result concerns the discrete-tokens-to-continuous-embeddings map and excludes the autoregressive token-decoding step. Neither paper proves that an actual autoregressive LLM's map from input token sequences to output token sequences is bijective.

Why this matters: surjectivity implies, in principle, that inputs producing jailbreaks or dangerous behaviors exist whenever inputs can be arbitrary continuous signals (a real concern for diffusion models or robot controllers that consume continuous inputs). Injectivity suggests that outputs uniquely encode their inputs (a privacy risk), but only existentially: inversion may be computationally or statistically infeasible, and a discrete-to-continuous map cannot be surjective in any case. Both papers also give algorithms to recover embeddings from outputs when model parameters are known, but recovering the original tokens (or mounting a practical inversion or jailbreak) remains nontrivial.

The practical takeaway: these results flag theoretical safety and privacy vectors, especially in continuous-control generative systems, but they do not imply that current tokenizer-based, autoregressive LLMs are inherently lossless or trivially invertible in deployed settings.
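To make the distinction concrete, here is a minimal, illustrative sketch (not the algorithm from either paper) of the white-box setting both results live in: a Transformer viewed as a continuous map from input embeddings to an output embedding, plus a gradient-descent search for a preimage of an observed output when the parameters are known. The toy `encoder`, the mean-pooling into a single output embedding, the loss, and the optimizer settings are all assumptions chosen for illustration.

```python
# Illustrative sketch only: white-box "embedding inversion" by gradient descent.
# This is not the exact procedure from either paper; it just shows the setting:
# given model parameters and an observed continuous output, search the
# continuous input-embedding space for a preimage.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq_len = 64, 8

# Stand-in "Transformer": a small encoder mapping continuous input embeddings
# (seq_len x d_model) to a continuous output embedding (mean-pooled).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
encoder.eval()

def f(x: torch.Tensor) -> torch.Tensor:
    # Continuous-to-continuous map: embeddings in, pooled output embedding out.
    return encoder(x).mean(dim=1)

# Observed output produced by some unknown input embedding x_true.
x_true = torch.randn(1, seq_len, d_model)
with torch.no_grad():
    y_obs = f(x_true)

# Gradient-descent search for an input embedding whose output matches y_obs.
x_hat = torch.randn(1, seq_len, d_model, requires_grad=True)
opt = torch.optim.Adam([x_hat], lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = ((f(x_hat) - y_obs) ** 2).mean()
    loss.backward()
    opt.step()

print(f"final output-matching loss: {loss.item():.6f}")
# A small loss means we found *some* preimage of y_obs in continuous space,
# not necessarily x_true -- and mapping a recovered embedding back to discrete
# tokens is a separate, harder problem, which is why neither result implies
# deployed autoregressive LLMs are trivially invertible.
```

The sketch also makes the contrast visible: `f` takes a continuous tensor, so asking whether it is surjective makes sense; the injectivity result instead concerns the map from discrete token IDs (which would first pass through an embedding table) into this continuous space.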