Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput (github.com)

🤖 AI Summary
oLLM v0.4.0 is a lightweight Python inference library that makes it possible to run very large models locally on modest GPUs, most notably Qwen3-Next-80B (160 GB in bf16) at roughly 1 token per 2 seconds on an 8 GB consumer GPU. The project achieves large-context inference (up to 100k tokens) without quantization by streaming layer weights and key-value caches to and from SSD, replacing in-memory KV caches with a DiskCache, and offloading layers to CPU as needed. Models that normally require hundreds of GB of VRAM can therefore be served with ~5–8 GB of GPU memory plus large SSD storage (e.g., Qwen3-Next-80B: ~170 GB VRAM baseline vs. ~5.4 GB GPU + 162 GB disk with oLLM).

Key technical levers include FlashAttention-2 with an online softmax that never materializes the full attention matrix, chunked MLPs that bound intermediate activation size, layer-by-layer SSD→GPU weight loading, and KV-cache offload/load to disk (illustrative sketches of each follow below).

The trade-off is throughput and heavy SSD/NVMe I/O rather than memory: per-token latency is much higher, but the approach enables local, private, long-context workflows (legal, medical, log and chat-history analysis) on affordable hardware. Supported GPUs include the Ampere, Ada, and Hopper families; Qwen3-Next requires a recent dev build of transformers. Overall, this broadens access to large-model inference at the cost of throughput and a heavier dependence on fast SSDs.
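To make the streaming design concrete, here is a minimal sketch in plain PyTorch of the general technique: one transformer layer's weights are loaded from SSD at a time, and each layer's KV cache lives on disk rather than in VRAM. The class names, file layout, and layer-call signature are illustrative assumptions, not oLLM's actual API.

```python
# Minimal sketch of layer-by-layer weight streaming with a disk-backed KV cache.
# All names (DiskKVCache, decode_step, the layer-call signature, the file
# layout) are hypothetical; this shows the general technique, not oLLM's code.
import torch


class DiskKVCache:
    """Per-layer key/value tensors live on SSD and are loaded only while
    that layer is being computed, so VRAM holds at most one layer's cache."""

    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir

    def _path(self, layer_idx: int) -> str:
        return f"{self.cache_dir}/kv_{layer_idx}.pt"

    def load(self, layer_idx: int, device: str):
        try:
            return torch.load(self._path(layer_idx), map_location=device)
        except FileNotFoundError:
            return None  # first decode step: no past KV yet

    def store(self, layer_idx: int, k: torch.Tensor, v: torch.Tensor):
        # Serialize from CPU so the GPU copies can be freed immediately.
        torch.save((k.cpu(), v.cpu()), self._path(layer_idx))


def decode_step(hidden, layer_files, kv_cache: DiskKVCache, device="cuda"):
    """One token of decoding: stream each layer's weights SSD -> GPU, run it
    against its disk-resident KV cache, write the updated cache back, and
    free the weights before touching the next layer."""
    for idx, weight_file in enumerate(layer_files):
        layer = torch.load(weight_file, map_location=device)  # SSD -> GPU
        past_kv = kv_cache.load(idx, device)                   # SSD -> GPU
        # Hypothetical layer signature: returns new hidden states plus the
        # key/value tensors produced for this position.
        hidden, new_k, new_v = layer(hidden, past_kv=past_kv)
        kv_cache.store(idx, new_k, new_v)                      # GPU -> SSD
        del layer, past_kv                                     # drop weights
        torch.cuda.empty_cache()
    return hidden
```

GPU memory then scales with a single layer's weights plus one layer's KV slice, while total model size is bounded only by disk capacity; every token pays the SSD round-trip, which is where the ~2 s/token figure comes from.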
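The "never materializes the full attention matrix" property comes from the online softmax recurrence that FlashAttention-2 is built on. Below is a plain-PyTorch, single-query version of that recurrence as a sketch; the real implementation is a fused CUDA kernel, and the block size and shapes here are arbitrary.

```python
import torch


def online_softmax_attention(q, K, V, block=1024):
    """Single-query attention computed block by block with a running max and
    running denominator, so no full softmax row over all keys is ever stored."""
    scale = q.shape[0] ** -0.5
    m = torch.full((), float("-inf"), device=q.device)  # running max of logits
    l = torch.zeros((), device=q.device)                # running denominator
    acc = torch.zeros_like(V[0])                        # running sum of p * v
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        logits = (k_blk @ q) * scale                    # (block,)
        m_new = torch.maximum(m, logits.max())
        rescale = torch.exp(m - m_new)                  # fix up earlier blocks
        p = torch.exp(logits - m_new)
        l = l * rescale + p.sum()
        acc = acc * rescale + p @ v_blk
        m = m_new
    return acc / l


# Agrees with torch.softmax((K @ q) * scale, dim=0) @ V while only ever holding
# one block of logits at a time.
```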
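Chunking the MLP limits memory the same way: the sequence is processed in slices so the seq_len × intermediate_size activation is never built in full. Here is a sketch of the idea using a SwiGLU-style MLP with arbitrary sizes; the actual layer dimensions and chunking logic in oLLM may differ.

```python
# Sketch of a chunked MLP forward pass (assumed mechanism, not oLLM's code):
# only one slice of the large intermediate activation exists at a time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChunkedMLP(nn.Module):
    def __init__(self, hidden_size=2048, intermediate_size=8192, chunk_size=1024):
        super().__init__()
        self.gate = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.chunk_size = chunk_size

    def forward(self, x):  # x: (seq_len, hidden_size)
        outputs = []
        for chunk in torch.split(x, self.chunk_size, dim=0):
            # Only (chunk_size, intermediate_size) is materialized here.
            outputs.append(self.down(F.silu(self.gate(chunk)) * self.up(chunk)))
        return torch.cat(outputs, dim=0)


# With these example sizes, a 100k-token context would otherwise need a
# 100_000 x 8_192 intermediate per MLP; chunking caps the live activation at
# chunk_size x 8_192 regardless of sequence length.
mlp = ChunkedMLP()
y = mlp(torch.randn(4096, 2048))
```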