Why can't transformers learn multiplication? (arxiv.org)

🤖 AI Summary
Researchers reverse-engineered a Transformer that learns multi-digit multiplication via an implicit "chain of thought" and identified why standard models fail at this seemingly simple arithmetic. Using logit attributions and linear probes, they show that the successful model encodes the necessary long-range dependencies. Mechanistically, its attention heads form a directed acyclic graph (DAG) that caches and later retrieves pairwise partial products, implementing the algorithm across token positions. Geometrically, partial products are represented in attention heads as Minkowski sums of digit embeddings, while digits themselves are encoded in a Fourier-like basis; these representations are compact and well suited to arithmetic but do not emerge in standard fine-tuned models.

The paper shows that standard fine-tuning typically converges to a local optimum lacking these long-range structures, which explains the widespread failure on multiplication. Crucially, adding an auxiliary loss that trains a linear probe to predict the running sum supplies the right inductive bias and enables the model to learn multiplication. This work highlights a concrete pitfall in learning long-range algorithmic dependencies with Transformers and points to practical remedies: explicit inductive biases, auxiliary objectives, and interpretability tools can steer models toward the algorithmic representations (DAG-like attention patterns and Fourier/Minkowski geometric encodings) needed for reliable symbolic computation.
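To make the geometric claim concrete, here is a toy sketch (not the paper's code): each digit gets cos/sin features at a few frequencies over the digits 0-9, and a digit pair (a, b) is represented as the vector sum of the two digit embeddings, so the set of all pair representations is the Minkowski sum of the two embedding sets. The function names and the frequency choice are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fourier_digit_embedding(d: int, freqs=(1, 2, 5), base: int = 10) -> np.ndarray:
    """Embed a digit d in a Fourier-like basis: cos/sin features at a few
    frequencies over Z_base. The frequency set here is an arbitrary example."""
    feats = []
    for k in freqs:
        angle = 2 * np.pi * k * d / base
        feats.extend([np.cos(angle), np.sin(angle)])
    return np.array(feats)

def partial_product_rep(a: int, b: int) -> np.ndarray:
    """Represent the digit pair (a, b) as the sum of the two digit embeddings.
    Over all pairs, these points form the Minkowski sum of the embedding sets,
    from which a linear readout can recover the partial product a * b."""
    return fourier_digit_embedding(a) + fourier_digit_embedding(b)

if __name__ == "__main__":
    reps = {(a, b): partial_product_rep(a, b) for a in range(10) for b in range(10)}
    print(reps[(3, 7)])
```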
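The auxiliary objective can likewise be sketched as a linear probe on hidden states trained to predict the running sum alongside the usual next-token loss. The snippet below is a minimal PyTorch sketch under assumptions of my own (a classification-style probe, a flat `aux_weight` of 0.1, and the tensor shapes noted in the comments); the paper's exact probe target, loss form, and weighting may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RunningSumProbe(nn.Module):
    """Linear probe from a hidden state to the running (partial) sum,
    trained jointly with the language-modeling loss as an auxiliary objective."""
    def __init__(self, d_model: int, n_sum_classes: int):
        super().__init__()
        self.readout = nn.Linear(d_model, n_sum_classes)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) -> logits over possible running-sum values
        return self.readout(hidden)

def combined_loss(lm_logits, target_tokens, probe_logits, running_sum_targets,
                  aux_weight: float = 0.1):
    """Next-token cross-entropy plus a weighted auxiliary probe loss.
    Shapes: lm_logits (B, T, V), target_tokens (B, T),
            probe_logits (B, T, S), running_sum_targets (B, T)."""
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), target_tokens.flatten())
    aux_loss = F.cross_entropy(probe_logits.flatten(0, 1), running_sum_targets.flatten())
    return lm_loss + aux_weight * aux_loss
```

The design intent, per the summary, is that forcing intermediate states to be linearly decodable into the running sum biases training toward the long-range structure that plain fine-tuning fails to find.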