Your Transformer is Secretly an EOT Solver (elonlit.com)

🤖 AI Summary
Researchers showed that scaled dot‑product attention (softmax over query–key dot products with temperature τ = √d_k) is exactly the unique solution of a one‑sided entropic optimal transport (EOT) problem. In this formulation each query supplies unit source mass, the target marginal is left free, the transport cost is −q_i·k_j, and the objective minimizes expected cost minus an entropy term with regularizer ε = τ. Solving this row‑constrained, entropy‑regularized convex program yields the familiar row‑stochastic attention matrix A (softmax rows), with existence and uniqueness guaranteed by strict convexity.

The paper also proves, via a “needle‑in‑a‑haystack” adversarial construction that breaks global softmax normalization, that no subquadratic attention algorithm (one inspecting only o(n²) query–key pairs) can be asymptotically accurate on all inputs.

The equivalence matters because it reframes attention as principled optimal inference and opens it to tools from optimal transport and information geometry: Sinkhorn algorithms, entropy‑regularization insights, and manifold‑aware learning interpretations. Companion results show that the backward pass matches an advantage‑based policy gradient and induces a Fisher‑information geometry, giving a unified forward‑inference / backward‑learning picture. Practical implications include principled regularization, new OT‑rooted algorithmic approximations, clearer limits on subquadratic shortcuts, and opportunities to analyze robustness and interpretability through transport and geometric lenses.
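As a concrete illustration of the claimed equivalence (not code from the paper), the sketch below numerically solves the one‑sided EOT program described above: for each query i it minimizes Σ_j A_ij C_ij + ε Σ_j A_ij log A_ij over the probability simplex, with cost C_ij = −q_i·k_j and the column (target) marginal left free, then checks that the optimizer matches softmax(QKᵀ/√d_k). The solver choice (exponentiated gradient per row) and all function names are illustrative assumptions, not anything specified in the paper.

```python
import numpy as np

def attention(Q, K, tau):
    """Standard scaled dot-product attention weights: row-wise softmax(Q K^T / tau)."""
    scores = Q @ K.T / tau
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(scores)
    return P / P.sum(axis=1, keepdims=True)

def one_sided_eot(Q, K, eps, iters=500, lr=0.1):
    """Numerically minimize <A, C> + eps * sum_ij A_ij log A_ij with cost C_ij = -q_i . k_j,
    each row of A constrained to the simplex (unit source mass per query) and the column
    marginal left free. Solved here by exponentiated-gradient (mirror) descent per row;
    by strict convexity the unique optimum is the closed form softmax(Q K^T / eps)."""
    C = -(Q @ K.T)
    n, m = C.shape
    A = np.full((n, m), 1.0 / m)  # start from uniform rows
    for _ in range(iters):
        grad = C + eps * (np.log(A) + 1.0)    # gradient of the regularized objective
        logits = np.log(A) - lr * grad        # multiplicative (mirror) update in log-space
        logits -= logits.max(axis=1, keepdims=True)
        A = np.exp(logits)
        A /= A.sum(axis=1, keepdims=True)     # re-project each row onto the simplex
    return A

rng = np.random.default_rng(0)
d_k = 16
Q = rng.normal(size=(5, d_k))   # 5 queries
K = rng.normal(size=(8, d_k))   # 8 keys
tau = np.sqrt(d_k)

A_attn = attention(Q, K, tau)
A_eot = one_sided_eot(Q, K, eps=tau)
print("max abs difference:", np.abs(A_attn - A_eot).max())  # ~1e-15: same matrix
```

Because only the row marginal is constrained, no Sinkhorn alternation is needed: a single row normalization of the Gibbs kernel exp(−C/ε) already gives the optimum, which is exactly what the softmax computes.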