Out-of-core LLM inference engine written from scratch in Rust (github.com)

0 points 1 hour ago

🤖 AI Summary

Kortex is a new out-of-core LLM inference engine written entirely in Rust. It runs models larger than a GPU's available VRAM by treating inference as a streaming problem across the memory hierarchy: NVMe, RAM, and VRAM. Kortex is reported to run Llama-3.3-70B, which has 42.5 GB of weights, on a consumer GPU with only 20 GB of VRAM at approximately 2 tokens per second, outperforming llama.cpp on models that exceed VRAM capacity.

Kortex is designed to overcome the limitations of partial offloading, which is common in other tools such as llama.cpp. It keeps all compute on the GPU and streams weights in parallel from RAM and NVMe drives, for a reported speedup of up to nine times on larger models. Technical features include a residency planner, which allocates weights across the memory tiers, and speculative decoding, which produces token-accurate output more efficiently. Kortex is currently available only for Windows; a Linux port is planned.
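
To make the residency-planner idea concrete, here is a minimal sketch of how weights might be assigned to memory tiers. It is not Kortex's actual code or API: the names `Tier`, `WeightBlock`, `Budget`, `plan_residency`, and `streamed_bytes` are hypothetical, and the greedy "fastest tier with room" policy is an assumption chosen for illustration. A real planner would presumably also account for bandwidth, KV cache, and activation memory.

```rust
/// Memory tiers, fastest first. Hypothetical names for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Vram,
    Ram,
    Nvme,
}

/// One block of weights, for example a single transformer layer (assumed granularity).
#[derive(Debug, Clone)]
pub struct WeightBlock {
    pub name: &'static str,
    pub bytes: u64,
}

/// Capacity budgets in bytes for the two fast tiers.
/// NVMe is treated as large enough to hold whatever is left over.
pub struct Budget {
    pub vram: u64,
    pub ram: u64,
}

/// Greedy plan (an assumed policy): walk the blocks in order and place each
/// one in the fastest tier that still has room for it.
pub fn plan_residency(blocks: &[WeightBlock], budget: &Budget) -> Vec<Tier> {
    let mut vram_left = budget.vram;
    let mut ram_left = budget.ram;
    blocks
        .iter()
        .map(|block| {
            if block.bytes <= vram_left {
                vram_left -= block.bytes;
                Tier::Vram
            } else if block.bytes <= ram_left {
                ram_left -= block.bytes;
                Tier::Ram
            } else {
                Tier::Nvme
            }
        })
        .collect()
}

/// Bytes that are not resident in VRAM and therefore have to be streamed
/// to the GPU on every forward pass.
pub fn streamed_bytes(blocks: &[WeightBlock], plan: &[Tier]) -> u64 {
    blocks
        .iter()
        .zip(plan.iter())
        .filter(|(_, tier)| **tier != Tier::Vram)
        .map(|(block, _)| block.bytes)
        .sum()
}
```

Calling `plan_residency` with a list of per-layer blocks and a `Budget` returns one `Tier` per block, and `streamed_bytes` then gives the volume that must cross the bus for each pass. With the figures quoted above (42.5 GB of weights, 20 GB of VRAM), at least 22.5 GB would fall outside VRAM under any plan, which is why streaming throughput dominates performance in this setting.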
