A 35B MoE on a 16 GB GPU, without the offload tax (www.lucebox.com)

0 points 2 hours ago | visit original

🤖 AI Summary

Luce Spark runs 33-35 billion parameter mixture-of-experts (MoE) models on consumer-grade 16 GB GPUs without the usual performance penalty of offloading. It pins only the active experts to the GPU and offloads the rest to system RAM, which cuts memory usage: Qwen3.6 now requires 13.3 GiB and Laguna XS.2 needs 14.6 GiB, so both fit where they previously could not. A built-in self-tuning mechanism learns from live traffic and keeps the most relevant experts on the GPU, which improves processing speed. Decoding runs in a single fused graph, keeping throughput close to full-residency performance and nearly matching the speeds of setups that require a 24 GB GPU. For the AI/ML community, this means more researchers and developers can run high-capacity models on hardware they already own instead of buying more expensive GPUs.
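To make the pinning idea concrete, here is a minimal sketch of a frequency-based expert cache that re-tunes itself from observed routing decisions. It is an illustration only and not Spark's implementation: the class `HotExpertCache`, its methods, and the "pin the most frequently routed experts" policy are all assumptions chosen for the example, and no real GPU memory is touched.

```python
from collections import Counter


class HotExpertCache:
    """Toy model of expert pinning (hypothetical, not Spark's actual code).

    It only tracks WHICH expert ids would be GPU-resident; the remaining
    experts are assumed to live in system RAM.
    """

    def __init__(self, num_experts: int, gpu_slots: int) -> None:
        if not 0 < gpu_slots <= num_experts:
            raise ValueError("gpu_slots must be in 1..num_experts")
        self.num_experts = num_experts
        self.gpu_slots = gpu_slots
        self.hits = Counter()                 # routing counts per expert id
        self.pinned = set(range(gpu_slots))   # arbitrary starting choice

    def record(self, expert_ids) -> None:
        """Record the experts the router selected for one token."""
        self.hits.update(expert_ids)

    def retune(self) -> None:
        """Re-pin the most frequently routed experts (ties: lower id first)."""
        ranked = sorted(range(self.num_experts),
                        key=lambda e: (-self.hits[e], e))
        self.pinned = set(ranked[: self.gpu_slots])

    def hit_rate(self, expert_ids) -> float:
        """Fraction of the given expert lookups served from pinned experts."""
        ids = list(expert_ids)
        if not ids:
            return 0.0
        return sum(e in self.pinned for e in ids) / len(ids)


# Simulated "live traffic": each inner list is one token's routed experts.
cache = HotExpertCache(num_experts=8, gpu_slots=2)
for routed in [[5, 7], [5, 7], [5, 1]]:
    cache.record(routed)
cache.retune()
print(sorted(cache.pinned))
```

In this toy run, experts 5 and 7 end up pinned because they were routed most often. A real system would also have to weigh the cost of copying expert weights between RAM and GPU memory, which this sketch ignores.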
