Show HN: Running Gemma-4 26B at 124 tokens/sec on a CPU, no GPU (apeg.dev)

🤖 AI Summary
A developer demonstrated that Gemma-4, a 26-billion-parameter mixture-of-experts model, can run on a consumer CPU (an Intel Core i9-13900K) with no GPU. The setup reached approximately 40 tokens per second for single requests and up to 124 tokens per second with batch processing. The result matters because it widens access to large language models: developers and researchers who cannot afford expensive GPU setups can run them on consumer hardware.

The model's efficiency relies heavily on its architecture. Because of the mixture-of-experts design, only about 3.8 billion of the 26 billion parameters are activated per token, and this sparsity is what lets the model fit within CPU limits. The developer also found that shrinking the output head, rather than the experts themselves, yielded the larger performance gain, even though the output head is crucial for token generation. These findings point to ways of optimizing large models and make a case for further work on CPU-bound inference. The methodology is shared openly on GitHub.
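The figures quoted in the summary can be checked with a few lines of arithmetic. The sketch below computes the fraction of parameters active per token and the batch-to-single throughput ratio from the numbers above. It also includes a rough memory-bandwidth ceiling, which is not from the post: the helper `bandwidth_bound_tokens_per_sec` is hypothetical, and its `bytes_per_param` and bandwidth arguments are placeholder assumptions to swap for measured values. The post does not state the quantization or memory configuration used.

```python
# Figures taken from the summary above.
TOTAL_PARAMS = 26e9        # total parameters in the model
ACTIVE_PARAMS = 3.8e9      # parameters activated per token (approximate)
SINGLE_TOK_PER_SEC = 40    # single-request throughput (approximate)
BATCH_TOK_PER_SEC = 124    # batched throughput


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the model's parameters that are touched for each token."""
    return active_params / total_params


def bandwidth_bound_tokens_per_sec(
    active_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_sec: float,
) -> float:
    """Rough upper bound on single-stream decode speed.

    Assumes every active parameter is read from memory once per token and
    that memory bandwidth is the only bottleneck. Both are simplifications.
    """
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / bytes_per_token


frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
batch_ratio = BATCH_TOK_PER_SEC / SINGLE_TOK_PER_SEC

# ASSUMPTION: 0.5 bytes per parameter (4-bit weights) and 80 GB/s of memory
# bandwidth are illustrative placeholders, not values reported in the post.
ceiling = bandwidth_bound_tokens_per_sec(ACTIVE_PARAMS, 0.5, 80e9)

print(f"active fraction per token: {frac:.1%}")
print(f"batch vs single throughput: {batch_ratio:.2f}x")
print(f"bandwidth ceiling under assumed inputs: {ceiling:.1f} tokens/sec")
```

Since only the active parameters need to be read for each token, the per-token memory traffic scales with the 3.8 billion figure rather than the full 26 billion, which is why the sparsity matters so much on a bandwidth-limited CPU.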