Collection of LLMs that run well in 32 GB VRAM (huggingface.co)

🤖 AI Summary
A curated list of models that run comfortably on a 32 GB RTX 5090 has been published, highlighting options across parameter scales and modalities: a 27B LLM that "runs well on vLLM" (but won't fit under standard sglang), multiple 5B text models (one noted for extreme speed in sglang, another strong at tool calling and instruction following but prone to hallucination), and image-text-to-text models at 6B and 12B. One shared example is easiest-ai-shawn/Phi-4-EAGLE3-sharegpt-unfiltered; the notes also suggest the smaller image-text models for easy local use and the larger ones for heavier tasks. For practitioners, the list is useful shorthand for picking models that balance capability, latency, and memory on a single 32 GB GPU.

Key technical takeaways: inference framework choice matters. vLLM can fit 27B weights that sglang cannot, while sglang and draft-model techniques (speculative decoding, as in EAGLE3) can dramatically speed up smaller 5B models. Expect trade-offs between throughput and reliability (the fast 5B variants may hallucinate more), and multimodal workloads are covered by the 6B-12B image-text-to-text models. The collection helps developers choose the right model-plus-runtime combination for local or edge inference without needing more than 32 GB of VRAM.
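To see why 27B is roughly the ceiling for a 32 GB card, a back-of-envelope check helps: weight memory alone is parameters times bytes per parameter, before counting KV cache, activations, and runtime overhead. The numbers below are generic arithmetic, not figures from the collection.

```python
# Rough VRAM math for model weights alone (ignores KV cache, activations,
# and CUDA context, which together add several more GiB in practice).

def weight_vram_gib(params_billions: float, bits_per_param: int) -> float:
    """Approximate GiB needed just to store the weights."""
    return params_billions * 1e9 * bits_per_param / 8 / 2**30

for label, params, bits in [
    ("27B @ BF16 ", 27, 16),  # ~50 GiB: cannot fit in 32 GB unquantized
    ("27B @ 4-bit", 27, 4),   # ~12.6 GiB: fits, with room left for KV cache
    ("12B @ BF16 ", 12, 16),  # ~22 GiB: workable for the larger image-text model
    ("5B  @ BF16 ", 5, 16),   # ~9.3 GiB: comfortable, lots of cache headroom
]:
    print(f"{label}: ~{weight_vram_gib(params, bits):.1f} GiB")
```

The practical upshot is that a 27B model fits a 32 GB GPU only when quantized, and how much of the remaining memory a runtime can turn into KV cache is where engines like vLLM and sglang differ.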
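The collection's Phi-4-EAGLE3 entry reads like an EAGLE-style draft model for speculative decoding. A hypothetical vLLM setup along those lines is sketched below; the target model choice, the speculative_config keys, and all parameter values are assumptions (vLLM's speculative-decoding API has changed across versions), not details taken from the collection.

```python
# Hypothetical sketch: serving a target model with an EAGLE3 draft head for
# speculative decoding on a single 32 GB GPU. Assumes a recent vLLM build
# that accepts a speculative_config dict; check your version's docs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/phi-4",          # assumed target; ~14B is tight at BF16,
                                      # so a quantized variant may be safer
    gpu_memory_utilization=0.90,      # leave headroom for the CUDA context
    max_model_len=4096,               # cap context length to bound KV-cache use
    speculative_config={
        "method": "eagle3",           # assumed key for EAGLE3-style drafting
        "model": "easiest-ai-shawn/Phi-4-EAGLE3-sharegpt-unfiltered",
        "num_speculative_tokens": 4,  # draft tokens verified per decode step
    },
)

outputs = llm.generate(
    ["Summarize speculative decoding in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```

Speculative decoding trades a little extra weight memory (the draft head) for fewer sequential forward passes of the large model, which is why draft techniques show up here as a speed lever rather than a capability one.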