A History of Large Language Models (gregorygundersen.com)

🤖 AI Summary
A History of Large Language Models is a deep, readable synthesis tracing the academic lineage behind today's LLMs, with attention as the throughline. The author starts from the limitations of rule-based and n-gram (Markov) approaches, where context is lost and the curse of dimensionality prevents reliable probability estimates, and shows how the key ideas evolved: distributed representations (word embeddings), neural probabilistic language modeling (Bengio et al., 2003) that jointly learns embeddings and a probability function, autoregressive next-word training, attention mechanisms, and finally the transformer (Vaswani et al., 2017), which relies entirely on attention.

The post is significant because it connects the conceptual dots explaining why modern LLMs generalize: embeddings convert discrete words into smooth vector spaces so that similar contexts share statistical strength, and attention-based transformer architectures let models condition flexibly on long-range context without restrictive Markov assumptions. The technical implications emphasized include the shift from handcrafted features to learned distributed representations, the centrality of next-token (autoregressive) objectives, and the "bitter lesson" that simple, scalable methods (e.g., attention plus scale) often outperform clever but unscalable tricks. The piece frames LLMs not as magic but as the emergent product of a few scalable ideas refined over decades.
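To make the attention idea concrete, below is a minimal sketch of scaled dot-product attention in the spirit of Vaswani et al. (2017), written in plain NumPy. The function name, array shapes, and toy usage are illustrative assumptions for this summary, not code from the article itself.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (after Vaswani et al., 2017).

    Q: (seq_len_q, d_k) query vectors
    K: (seq_len_k, d_k) key vectors
    V: (seq_len_k, d_v) value vectors
    Returns: (seq_len_q, d_v) context-dependent mixtures of the values.
    """
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled so the softmax stays well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over key positions: each query gets a distribution over the whole context.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted combination of value vectors, so the model can
    # condition on any position in the context rather than a fixed Markov window.
    return weights @ V

# Toy self-attention: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Because the weights are computed over all positions at once, nothing restricts the model to the previous n-1 tokens, which is the flexibility the summary contrasts with Markov-style language models.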