🤖 AI Summary
The author revisits perplexity while working through Sebastian Raschka’s LLM training chapter, correcting an earlier conflation with Shannon entropy. Perplexity is simply the cross-entropy loss exponentiated in the base of its logarithm (with PyTorch’s natural-log loss, that means torch.exp), so perplexity = exp(cross_entropy). For a one-hot training target the per-token perplexity reduces to 1 / p_model(target): if the model is certain (p = 1), perplexity = 1; if it is uniform over a vocabulary of size V, perplexity = V. That’s why perplexity is often called the “effective vocabulary size” the model is uncertain over.
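A minimal PyTorch sketch of that relationship (my illustration with made-up logits and targets, not code from the post): it computes perplexity as torch.exp of the standard cross-entropy loss and checks the two one-hot limiting cases, a fully confident model landing near perplexity 1 and uniform logits landing at the vocabulary size V.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits/targets (not from the post): a batch of 4 next-token
# predictions over a vocabulary of size V = 10.
V = 10
logits = torch.randn(4, V)
targets = torch.randint(0, V, (4,))

# PyTorch's cross_entropy uses the natural log, so perplexity = exp(loss).
loss = F.cross_entropy(logits, targets)
print(torch.exp(loss))

# One-hot sanity checks:
# A model that puts (almost) all mass on the target has perplexity ~= 1.
certain = torch.full((1, V), -1e4)
certain[0, 3] = 1e4
print(torch.exp(F.cross_entropy(certain, torch.tensor([3]))))  # ~1.0

# A uniform model (equal logits) has perplexity equal to the vocab size V.
uniform = torch.zeros(1, V)
print(torch.exp(F.cross_entropy(uniform, torch.tensor([0]))))  # ~10.0
```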
Technically, for a true target distribution p_real (not one-hot), perplexity = exp( sum_x p_real(x) * -log p_model(x) ) = product_x (1 / p_model(x))^{p_real(x)}, i.e. a weighted geometric mean of per-token perplexities. This shows perplexity depends on both the model’s assigned probabilities and the real-world/label distribution, so it is not the same as entropy (which measures the spread of a single distribution) but rather a measure of how well the model matches the actual next-token distribution. Practical implications: perplexity is more interpretable than raw cross-entropy, it is sensitive to one-hot training (label smoothing would change the per-token contributions), and across a batch it becomes the geometric mean of per-token perplexities, emphasizing multiplicative rather than additive errors.
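To make that equivalence concrete, here is a small numeric check with illustrative values (chosen by me, not taken from the post): exponentiating the cross-entropy against a non-one-hot p_real gives the same number as the weighted geometric mean of the per-token perplexities 1 / p_model(x), weighted by p_real(x).

```python
import torch

# A non-one-hot target distribution (e.g. a label-smoothed target) and an
# arbitrary model distribution over a 3-token vocabulary.
p_real = torch.tensor([0.7, 0.2, 0.1])
p_model = torch.softmax(torch.tensor([2.0, 0.5, -1.0]), dim=0)

# Cross-entropy of the model against the true distribution, then exponentiate.
cross_entropy = -(p_real * torch.log(p_model)).sum()
perplexity_from_ce = torch.exp(cross_entropy)

# Equivalent product form: prod_x (1 / p_model(x)) ** p_real(x),
# i.e. the p_real-weighted geometric mean of per-token perplexities.
perplexity_product = torch.prod((1.0 / p_model) ** p_real)

print(perplexity_from_ce, perplexity_product)  # the two values agree
```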