Running 26B and 35B LLMs at Full Speed on €990 of Used Hardware – No Cloud (medium.com)

0 points | 4 hours ago

🤖 AI Summary

A recent analysis reports that 26B and 35B language models can run locally on a secondhand gaming PC costing €990, with no cloud services involved. The author benchmarks a setup pairing an RTX 4070 with a 2070 SUPER and measures 82.6 tokens per second (tok/s) for Gemma 4 26B and 73 tok/s for Alibaba’s Qwen3.6 35B. That puts local inference within reach of users who cannot justify the cost of a high-end workstation.

The findings point to two sources of savings: used hardware and the power efficiency of mixture-of-experts (MoE) models. Despite their larger parameter counts, these models draw significantly less power than dense models, and the reported cost is approximately €0.22 per million tokens generated. For AI practitioners, the summary argues, a local setup lowers the barrier to entry, keeps data private, responds quickly, and leaves room for later hardware upgrades.
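The summary does not say how the €0.22 per million tokens figure was derived or which model it applies to. As a rough sketch, assuming it covers electricity only, the arithmetic might look like the Python below. The function names are hypothetical, and the power draw and electricity price in the usage lines are placeholder values, not numbers from the article.

```python
def hours_per_million_tokens(tokens_per_second: float) -> float:
    """Wall-clock hours needed to generate one million tokens."""
    return 1_000_000 / tokens_per_second / 3600


def cost_per_million_tokens(power_watts: float,
                            price_eur_per_kwh: float,
                            tokens_per_second: float) -> float:
    """Electricity cost in EUR for one million generated tokens."""
    energy_kwh = (power_watts / 1000) * hours_per_million_tokens(tokens_per_second)
    return energy_kwh * price_eur_per_kwh


def implied_hourly_cost(cost_per_million_eur: float,
                        tokens_per_second: float) -> float:
    """Running cost in EUR per hour implied by a per-million-token cost."""
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_million_eur * tokens_per_hour / 1_000_000


if __name__ == "__main__":
    # Throughput figures come from the summary; 200 W and 0.30 EUR/kWh are
    # placeholder assumptions chosen only to exercise the function.
    for name, tok_s in [("Gemma 4 26B", 82.6), ("Qwen3.6 35B", 73.0)]:
        cost = cost_per_million_tokens(200, 0.30, tok_s)
        print(f"{name}: {cost:.3f} EUR per million tokens (placeholder inputs)")

    # Working backwards from the summary's 0.22 EUR figure at 82.6 tok/s.
    print(f"Implied hourly cost: {implied_hourly_cost(0.22, 82.6):.4f} EUR")
```

Working backwards with `implied_hourly_cost` is a quick way to check whether a quoted per-token cost is plausible for a given power draw and local electricity tariff.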
