Shimmy v1.7.0: Running 42B MoE Models on Consumer GPUs with 99.9% VRAM Reduction (github.com)

🤖 AI Summary
Shimmy v1.7.0 introduces CPU-offloaded Mixture-of-Experts (MoE) support that lets developers run massive expert models on consumer GPUs by automatically moving MoE expert layers to CPU memory. The headline: 42B-class models (e.g., Phi-3.5-MoE) can be served on an 8GB GPU, and in many quantized configurations on far less, by cutting VRAM use by orders of magnitude (reported up to 99.9% theoretical reduction; real-world example: GPT-OSS 20B dropped from ~15GB to 4.3GB, a 71.5% reduction). This unlocks on-prem and laptop experimentation, lowers infrastructure costs, and democratizes access to large-scale MoE architectures for research and production.

Key technical details: Shimmy exposes --cpu-moe and --n-cpu-moe flags to offload all MoE layers automatically or to tune how many layers run on CPU, trading 10–100× memory savings for roughly 2–7× slower inference depending on setup. It ships as a tiny (<5MB) Rust binary with enhanced llama.cpp bindings; cross-platform GPU support spanning CUDA, Metal, and MLX (Linux x86_64/ARM64, macOS Apple Silicon/Intel, Windows); SafeTensors-ready models on HuggingFace; and a fully passing test suite (295/295).

Recommended starting points include Phi-3.5-MoE Q4_K_M (~2.5GB VRAM) and DeepSeek-MoE 16B variants. Deployment is simple: install via cargo install shimmy or a platform binary, then run ./shimmy serve --cpu-moe --model-path your-model.gguf, enabling practical MoE serving on everyday hardware.
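As a minimal sketch of the workflow described above: the commands below install Shimmy and start a CPU-offloaded MoE server. The flags (--cpu-moe, --n-cpu-moe, --model-path) come from the summary itself; the model filename and the layer count passed to --n-cpu-moe are placeholders for illustration, so check shimmy --help for the exact accepted forms.

    # Install the Shimmy binary (platform binaries are also available per the summary)
    cargo install shimmy

    # Serve a local GGUF MoE model with expert layers offloaded to CPU memory
    ./shimmy serve --cpu-moe --model-path phi-3.5-moe-q4_k_m.gguf

    # Or tune how many MoE layers stay on CPU to balance VRAM savings against speed
    # (the layer count here is an illustrative guess, not a recommendation)
    ./shimmy serve --n-cpu-moe 16 --model-path phi-3.5-moe-q4_k_m.gguf

Offloading more layers to CPU reduces VRAM use further, at the cost of the roughly 2–7× inference slowdown noted above.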