Sampling in Large Language Models (www.aiunpacked.net)

🤖 AI Summary
An engineer who interviewed at an LLM company published a compact primer on "sampling": the set of techniques that make large language models generate varied, creative outputs rather than the same deterministic reply every time. The piece explains why sampling is needed (always taking the highest-scoring, argmax token yields the same output every time), how logits are converted into probabilities via softmax, and how choices like greedy decoding, temperature, Top‑K, Top‑P (nucleus) and constrained sampling trade off predictability, diversity, cost and correctness. It also covers stopping criteria (a max-token limit vs. a special end-of-sequence token) and points to practical tooling (e.g., vLLM) and a hands-on Python notebook for experimentation.

Key technical takeaways: at each step the model scores every token in the vocabulary with a logit, and softmax turns those scores into a probability distribution; temperature scales the logits to flatten (>1) or sharpen (<1) that distribution; greedy decoding picks the highest-probability token every time, while sampling draws from the distribution to increase diversity. Top‑K limits sampling to the k highest logits, reducing compute; Top‑P selects the smallest set of tokens whose cumulative probability exceeds a threshold, giving more adaptive diversity. Constrained sampling enforces grammar or format validity (useful for JSON/SQL) and can speed generation, but may hurt performance on some reasoning tasks.
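To make those mechanics concrete, here is a minimal, self-contained sketch of greedy decoding, temperature scaling, Top‑K, and Top‑P sampling over a toy vocabulary. It uses plain NumPy; the vocabulary, logit values, parameter settings, and helper names are illustrative assumptions, not taken from the original post or any particular library.

```python
# A minimal sketch of the decoding strategies described above.
# All values below (vocabulary, logits, k, p, temperature) are made up.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "mat", "dog"]       # toy vocabulary (assumption)
logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])     # toy next-token logits (assumption)

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def greedy(logits):
    # Greedy decoding: always pick the single highest-scoring token.
    return int(np.argmax(logits))

def sample_with_temperature(logits, temperature=1.0):
    # Temperature scales the logits: T > 1 flattens the distribution,
    # T < 1 sharpens it toward the argmax token.
    probs = softmax(logits / temperature)
    return int(rng.choice(len(logits), p=probs))

def sample_top_k(logits, k=3, temperature=1.0):
    # Top-K: keep only the k highest logits, renormalize, then sample.
    top = np.argsort(logits)[-k:]
    probs = softmax(logits[top] / temperature)
    return int(top[rng.choice(len(top), p=probs)])

def sample_top_p(logits, p=0.9, temperature=1.0):
    # Top-P (nucleus): keep the smallest set of tokens whose cumulative
    # probability exceeds p, renormalize, then sample.
    probs = softmax(logits / temperature)
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest nucleus covering p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(nucleus[rng.choice(len(nucleus), p=nucleus_probs)])

print("greedy:      ", vocab[greedy(logits)])
print("temperature: ", vocab[sample_with_temperature(logits, temperature=1.5)])
print("top-k:       ", vocab[sample_top_k(logits, k=3)])
print("top-p:       ", vocab[sample_top_p(logits, p=0.9)])
```

Note that the nucleus in Top‑P grows or shrinks with the shape of the distribution, which is what makes it more adaptive than a fixed k; greedy decoding in the sketch is fully deterministic, while the other three draw from a (possibly truncated) distribution.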