768GB Intel Optane DIMMs to run 1T-parameter LLM with single GPU at 4tps (www.tomshardware.com)

🤖 AI Summary
A Reddit user, APFrisco, has configured a workstation to run Kimi K2.5, a 1-trillion-parameter large language model (LLM), from Intel Optane Persistent Memory (PMem) DIMMs at approximately 4 tokens per second. The build uses six 128GB Optane modules, for 768GB of memory in total, and takes advantage of second-hand Optane being cheaper than DRAM at current prices. The Xeon-powered workstation runs hybrid GPU/CPU inference with a single GPU, a creative workaround given Optane's recent discontinuation. The result matters to the AI/ML community because it shows that alternative memory technologies can serve large-scale models at a time when DRAM prices are surging. Optane is slower than DRAM but still has relatively low latency, which makes it usable for LLM inference. The surrounding discussion also points to the need for memory products that bridge the gap between DRAM and SSDs, and to the Compute Express Link (CXL) standard, which could enable affordable, byte-addressable memory for AI workloads.
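The capacity figures in the summary can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate, not a description of APFrisco's actual configuration: it assumes decimal gigabytes, treats "1 trillion" as a round number, and pretends that all 768GB holds nothing but model weights (it ignores GPU memory, the KV cache, and the operating system). The function names are hypothetical and exist only for this illustration.

```python
# Back-of-the-envelope capacity check using only figures from the summary.
# Assumptions (not from the source): 1 GB = 10**9 bytes, and every byte of
# Optane capacity is available for model weights.

MODULES = 6                   # six Optane PMem DIMMs
MODULE_GB = 128               # 128GB per module
PARAMS = 1_000_000_000_000    # "1-trillion-parameter", taken as a round number


def total_capacity_gb(modules: int = MODULES, module_gb: int = MODULE_GB) -> int:
    """Total memory across all modules, in GB."""
    return modules * module_gb


def max_bits_per_param(capacity_gb: int, params: int = PARAMS) -> float:
    """Largest average weight width (in bits) that fits in the given capacity."""
    capacity_bits = capacity_gb * 10**9 * 8
    return capacity_bits / params


def fits(bits_per_param: float, capacity_gb: int, params: int = PARAMS) -> bool:
    """True if weights stored at this width fit within the capacity."""
    return bits_per_param <= max_bits_per_param(capacity_gb, params)


capacity = total_capacity_gb()
print(capacity)                      # 768
print(max_bits_per_param(capacity))  # 6.144
print(fits(16, capacity))            # False: 16-bit weights need about 2000 GB
print(fits(4, capacity))             # True: 4-bit weights need about 500 GB
```

Under these assumptions, a 1-trillion-parameter model fits in 768GB only if its weights average roughly 6 bits or fewer each. The summary does not say which weight format was actually used.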