Writing an LLM from scratch, part 25 – instruction fine-tuning (www.gilesthomas.com)

🤖 AI Summary
In this installment of “Writing an LLM from scratch”, the author completes the first half of instruction fine-tuning: actually fine-tuning a small GPT-2 model (1024-token context) using Raschka’s recipes and an Alpaca-style prompt template (Instruction / Input / Response) rather than a multi-turn chat format; a sketch of the template appears below. The post explains why Alpaca’s one-shot structure made sense historically, given the short context windows of the era, and previews the next post, which will use a stronger model to evaluate the resulting instruction-following behavior.

The write-up highlights several practical engineering points relevant to anyone fine-tuning or deploying LLMs. A custom collator that pads only to the longest sequence within each batch saves compute compared with padding every example to a global maximum; the author also notes that production inference services likely route similarly sized requests into the same batch to reduce wasted padding. For loss computation, padded target positions are set to -100 so that PyTorch’s cross_entropy ignores them, a small but crucial detail. A sketch of the collator follows the template example below.

Other notes: differences in RNG seeding and data order produced slightly different runs; training on an RTX 3090 used ~9 GiB of VRAM and ran in ~48 s, with an accidental five-epoch run showing overfitting setting in after epoch 2; and the book’s code needed a couple of minor fixes (a missing urllib.request import, collate padding logic). The post is a useful, hands-on look at the nitty-gritty of making small models behave as instruction followers, and at the production-minded choices that matter.
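For concreteness, here is a minimal sketch of the Alpaca-style one-shot template the post describes. The preamble wording follows the standard Alpaca format; the helper name `format_prompt` is mine, not taken from the post.

```python
def format_prompt(entry: dict) -> str:
    """Build an Alpaca-style Instruction / Input / Response prompt
    from one dataset entry (keys: 'instruction', optional 'input')."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # The Input section is included only when the example actually has one.
    if entry.get("input"):
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt + "\n\n### Response:\n"


if __name__ == "__main__":
    print(format_prompt({
        "instruction": "Rewrite the sentence in the passive voice.",
        "input": "The dog chased the ball.",
    }))
```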
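And a hedged sketch of the batch-level padding collator together with the -100 masking trick. The specifics here are assumptions based on common practice rather than the post’s exact code: GPT-2’s `<|endoftext|>` token (id 50256) doubles as the pad token, and the first pad in each target sequence is kept as a real target so the model learns to emit end-of-text, while the remaining pads are masked out.

```python
import torch

PAD_TOKEN_ID = 50256   # GPT-2 reuses <|endoftext|> as the pad token (assumption)
IGNORE_INDEX = -100    # cross_entropy's default ignore_index

def collate(batch: list[list[int]]) -> tuple[torch.Tensor, torch.Tensor]:
    """Pad each token sequence only to the longest one in *this* batch."""
    # +1 so that after the input/target shift every position has a target.
    max_len = max(len(ids) for ids in batch) + 1
    inputs, targets = [], []
    for ids in batch:
        padded = ids + [PAD_TOKEN_ID] * (max_len - len(ids))
        x = torch.tensor(padded[:-1])   # inputs: all but the last token
        y = torch.tensor(padded[1:])    # targets: shifted left by one
        # Mask every padded target position except the first pad, which
        # teaches the model to emit end-of-text; later pads are ignored.
        pad_positions = (y == PAD_TOKEN_ID).nonzero().squeeze(-1)
        if pad_positions.numel() > 1:
            y[pad_positions[1:]] = IGNORE_INDEX
        inputs.append(x)
        targets.append(y)
    return torch.stack(inputs), torch.stack(targets)
```

Because `torch.nn.functional.cross_entropy` defaults to `ignore_index=-100`, a loss computed as `F.cross_entropy(logits.flatten(0, 1), targets.flatten())` skips the masked positions automatically, so the padding contributes nothing to the gradient.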