Beyond Orthogonality: How Language Models Pack Billions of Concepts into 12,000 Dimensions (nickyoder.com)

🤖 AI Summary
Grant Sanderson’s 3Blue1Brown prompt (how can GPT‑3’s ~12,288‑dimensional embedding space encode millions of distinct concepts?) spurred an experiment combining visualization, optimization, and high‑dimensional geometry. The author reproduced Grant’s attempt to pack ~10,000 near‑orthogonal unit vectors into a 100‑D sphere and found a critical optimization failure: the original loss, loss = sum relu(|dot|), falls into a “gradient trap” in which badly aligned vectors see almost zero gradient, so the optimizer settles for a “99% solution” where most pairs are nearly orthogonal while a small fraction stay nearly parallel. Replacing it with an exponential penalty, loss = sum exp(20·dot^2), fixed the optimization but exposed a stricter packing limit: the minimum pairwise angle topped out near ~76.5° in that setup, prompting a deeper look at vector‑packing limits and the Johnson‑Lindenstrauss (JL) lemma.

The JL lemma gives a formal guarantee: N points can be projected into k dimensions with pairwise‑distance distortion at most ε if k ≥ (C/ε^2)·log(N). Empirical GPU experiments (N up to 30k, k up to 10k) show that engineered projections can push the constant C well below conservative values (4 → ~1 → as low as 0.2), meaning embedding spaces are far more capacious than naive intuition suggests. A simple capacity estimate, Vectors ≈ 10^(k·F^2/1500) with F the allowed deviation from 90° in degrees, implies GPT‑3’s 12,288‑D space can represent astronomically many quasi‑orthogonal concepts (e.g., ~10^32 at 88°, far exceeding any physical count even at modest angles).

Practical takeaways: random/Hadamard projections remain efficient for dimensionality reduction, and modern embedding sizes (1k–20k) are likely adequate; the real challenge is learning the optimal geometric arrangement of concepts, not raw capacity.
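A minimal sketch of the packing experiment described above (not the author’s code; the sizes, learning rate, and step count are illustrative assumptions), showing the exponential penalty on pairwise cosines:

```python
# Pack N unit vectors into k dimensions by penalizing pairwise dot products
# with sum(exp(20 * dot^2)). N, k, lr, and step count are illustrative
# assumptions, not the article's exact values.
import torch

N, k = 1_000, 100                     # the article packed ~10,000 vectors into 100-D
vecs = torch.randn(N, k, requires_grad=True)
opt = torch.optim.Adam([vecs], lr=0.01)

for step in range(2_000):
    opt.zero_grad()
    unit = vecs / vecs.norm(dim=1, keepdim=True)   # constrain to the unit sphere
    dots = unit @ unit.T                           # pairwise cosines
    off_diag = dots - torch.eye(N)                 # zero out self-similarity
    # Exponential penalty: unlike relu(|dot|), poorly separated pairs get a large
    # gradient instead of a near-zero one, avoiding the "gradient trap".
    loss = torch.exp(20.0 * off_diag ** 2).sum()
    loss.backward()
    opt.step()

with torch.no_grad():
    unit = vecs / vecs.norm(dim=1, keepdim=True)
    worst_cos = (unit @ unit.T - torch.eye(N)).abs().max()
    min_angle = torch.rad2deg(torch.acos(worst_cos)).item()
    print(f"minimum pairwise angle: {min_angle:.1f} degrees")
```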
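A rough sketch of the JL dimension bound k ≥ (C/ε^2)·log(N) plus an empirical distortion check with a plain Gaussian random projection; the constant C and the use of natural log are assumptions following the figures above, and the data sizes are illustrative:

```python
import numpy as np

def jl_dims(n_points: int, eps: float, C: float = 4.0) -> int:
    """Target dimension so pairwise distances distort by at most ~eps."""
    return int(np.ceil(C / eps**2 * np.log(n_points)))

print(jl_dims(30_000, eps=0.1, C=4.0))   # conservative constant
print(jl_dims(30_000, eps=0.1, C=0.2))   # aggressive constant reported in the article

# Empirical check on synthetic data.
rng = np.random.default_rng(0)
n, d = 1_000, 5_000
k = jl_dims(n, eps=0.2, C=4.0)
X = rng.normal(size=(n, d))
P = rng.normal(size=(d, k)) / np.sqrt(k)          # Gaussian projection, scaled to preserve norms
Y = X @ P

pairs = rng.integers(0, n, size=(5_000, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]         # drop self-pairs
orig = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1)
proj = np.linalg.norm(Y[pairs[:, 0]] - Y[pairs[:, 1]], axis=1)
# Typically comes in well under eps, illustrating how loose the conservative constant is.
print("max relative distortion:", float(np.abs(proj / orig - 1).max()))
```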
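And the back-of-the-envelope capacity formula, plugged in for GPT‑3’s embedding width (the constant 1500 and the interpretation of F follow the summary above):

```python
# Capacity estimate: Vectors ~ 10^(k * F^2 / 1500), F = allowed deviation from 90 degrees.
k = 12_288          # GPT-3 embedding dimension
F = 2               # tolerate pairs as close as 88 degrees
exponent = k * F**2 / 1500
print(f"~10^{exponent:.1f} quasi-orthogonal vectors")   # ~10^32.8, on the order of the 10^32 figure above
```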