Tiny LLM – LLM Serving in a Week (skyzh.github.io)

🤖 AI Summary
The Tiny LLM course offers systems engineers a hands-on deep dive into building and serving large language models (LLMs) from scratch using only fundamental matrix manipulation APIs. Unlike complex open-source LLM serving projects that lean heavily on CUDA and low-level optimizations, the course demystifies the mechanics of loading model parameters and running inference by gradually constructing a serving system for the Qwen2-7B-Instruct model over three weeks: a pure Python implementation first, then C++/Metal kernel optimizations, and finally batching techniques to boost throughput.

Notably, the course is accessible without high-end NVIDIA GPUs: it builds on MLX, a machine learning library optimized for Apple Silicon, making detailed LLM inference engineering feasible for a broader audience of AI/ML practitioners and systems engineers. It balances technical rigor with clarity through a unified notation system for tensor dimensions and by integrating community resources, making it a practical guide rather than a traditional textbook.

Created by engineers passionate about understanding LLM internals, Tiny LLM fosters a collaborative learning environment via Discord and GitHub, empowering participants to build performant LLM serving systems grounded in core software engineering principles.
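To give a flavor of what "inference from fundamental matrix APIs" means, here is a minimal sketch of single-head scaled dot-product attention, the core operation of transformer inference. This is an illustrative example, not code from the course: it uses NumPy in place of MLX (whose array API is deliberately NumPy-like), and the function names are this sketch's own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) matrices for a single attention head.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ v                   # weighted combination of value vectors

# Tiny smoke test with random inputs.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Everything above is plain matrix multiplication, transposition, and element-wise math; the course's point is that an entire serving system can be assembled from primitives at this level before any kernel-level optimization is attempted.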