Distillation Can Make AI Models Smaller and Cheaper (www.wired.com)

🤖 AI Summary
Chinese startup DeepSeek’s R1 chatbot, touted as matching top-tier models while using far less compute, stirred headlines and a market shock earlier this year and fueled accusations that the company used “distillation” to siphon knowledge from OpenAI’s closed o1 model. That reaction missed a key point: knowledge distillation is not a novel hack but a mature, widely used compression technique that makes the capabilities of large “teacher” models practical to deploy by training smaller “student” models to mimic their behavior. The technique’s sudden visibility in this episode underscores how model compression can reshape competitive dynamics, cloud costs, and demand for AI hardware.

Distillation traces to a 2015 paper by Geoffrey Hinton and colleagues that introduced the idea of “dark knowledge”: using a teacher’s soft probability outputs, not just hard labels, to teach a smaller model which wrong answers are less wrong than others. That insight made it practical to compress huge models into leaner versions (e.g., BERT → DistilBERT), and distillation is now offered as a service by major cloud vendors. Newer work shows the technique can transfer chain-of-thought reasoning (Berkeley’s Sky-T1 reportedly trained for under $450), and even black-box approaches, which simply prompt a teacher and train on its outputs, can yield surprising gains. The takeaway for AI practitioners: distillation remains a fundamental, cost-saving tool that lowers barriers to deployment and raises questions about IP boundaries and how capabilities diffuse across the ecosystem.
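For context on the mechanics the summary describes, here is a minimal sketch of the classic distillation objective from the 2015 Hinton et al. paper, written in PyTorch; the function name, temperature, and mixing weight below are illustrative choices, not details from the article:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the teacher's probability distribution, softened by temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions, scaled by T^2 as in Hinton et al. (2015)
    kd_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    # Standard cross-entropy against the hard labels
    ce_loss = F.cross_entropy(student_logits, labels)
    # Blend imitation of the teacher with fitting the ground-truth labels
    return alpha * kd_loss + (1 - alpha) * ce_loss
```

In this sketch, the temperature T softens both distributions so the student can learn the teacher’s relative rankings of wrong answers (the “dark knowledge”), while alpha balances mimicking the teacher against fitting the hard labels.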