A Minimal Route to Transformer Attention (www.neelsomaniblog.com)

🤖 AI Summary
The post shows that the Transformer attention mechanism can be derived from a short list of natural assumptions rather than invented ad hoc. Starting from the requirement that a per-position output must be permutation-invariant over past tokens, the author invokes the Deep Sets decomposition to write the output as an aggregation over each token's embedding plus a relevance score. By further assuming (1) the outer aggregator is the identity, (2) a token's contribution scales linearly with its relevance, and (3) content vectors are linear transforms of embeddings, attention reduces to computing linear "value" vectors and weighting them by relevance.

Relevance is constrained to be computable in parallel on GPU-friendly hardware, so it is formed by dot products between learned linear projections (queries and keys), scaled by 1/√d_k and normalized with softmax, yielding exactly scaled dot-product attention and the familiar output y_i = Σ_j softmax(q_i·k_j/√d_k) · (W_v x_j).

This derivation matters because it clarifies which parts of attention are inevitable under simple symmetry and efficiency constraints (permutation invariance, separability, linear projections, parallelizability) and which are design choices (the identity aggregator, dot-product similarity, softmax normalization). The analysis highlights practical trade-offs, such as why dot-product similarity plus scaling favors parallel hardware and low sequential depth, and points to clear axes for innovation: alternate aggregators, non-dot-product similarity measures, different normalizations, or ways to reintroduce positional structure.
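To make the derived formula concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention with a causal mask restricting each position to past (and current) tokens. This is an illustration under stated assumptions, not the post's implementation: the function name causal_attention, the weight matrices W_q, W_k, W_v, and the tensor shapes are all hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over past (and current) tokens.

    x:        (T, d_model) token embeddings
    W_q, W_k: (d_model, d_k) learned projections for queries and keys
    W_v:      (d_model, d_v) learned projection for values
    Returns y of shape (T, d_v) with
        y_i = sum_j softmax_j(q_i . k_j / sqrt(d_k)) * (W_v x_j).
    """
    T = x.shape[0]
    d_k = W_k.shape[1]

    q = x @ W_q                                # (T, d_k) queries
    k = x @ W_k                                # (T, d_k) keys
    v = x @ W_v                                # (T, d_v) values: linear transforms of embeddings

    scores = (q @ k.T) / np.sqrt(d_k)          # (T, T) relevance of token j to position i
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)   # each position attends only to j <= i

    weights = softmax(scores, axis=-1)         # normalize relevance with softmax
    return weights @ v                         # weighted sum of value vectors

# Tiny usage example with random weights (illustrative shapes only).
rng = np.random.default_rng(0)
T, d_model, d_k, d_v = 5, 8, 4, 4
x = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
y = causal_attention(x, W_q, W_k, W_v)
print(y.shape)  # (5, 4)
```

Everything here is a batch of independent matrix multiplications plus one row-wise softmax, which is the parallel, low-sequential-depth structure the summary points to as the efficiency motivation for dot-product relevance.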