EdgeSync-LLM – KV cache fragment engine for on-device LLM inference (Go/Android) (github.com)

🤖 AI Summary
EdgeSync-LLM is a newly announced system for speeding up on-device large language model (LLM) inference on ARM64 Android devices. It is an engine-agnostic key-value (KV) cache fragment store: applications save slices of attention tensors (Keys and Values) and retrieve them later with a fast approximate nearest-neighbor search. Matching fragments are injected directly into the LLM engine's KV cache, so the engine can skip the costly prefill step it would otherwise repeat on every request. According to the announcement, this cuts inference time on a cache hit from hundreds of milliseconds to just a few, with a cache hit rate of up to 70%. The approach matters for resource-constrained devices, where prefill latency is a major obstacle to responsive LLM features in mobile applications. The system supports common inference engines, including llama.cpp, MLC-LLM, and ONNX Runtime, and is designed for cross-engine compatibility so that it can sit in front of different engines and models. Key technical features include a two-tier storage system and a caching strategy aimed at keeping hit rates high.
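The lookup path described in the summary can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not EdgeSync-LLM's actual API: the `Fragment` and `Store` types, the `Lookup` method, and the 0.95 similarity threshold are all hypothetical, and a brute-force cosine scan stands in for the approximate nearest-neighbor index. Injecting the returned tensors into a real engine's KV cache is engine-specific and is left out.

```go
package main

import (
	"fmt"
	"math"
)

// Fragment is a hypothetical cache entry: an embedding that identifies the
// prompt prefix, plus the K/V tensor slices computed for it.
type Fragment struct {
	Embedding []float32
	Keys      []float32 // flattened attention keys (layout would be engine-specific)
	Values    []float32 // flattened attention values
	Tokens    int       // number of prompt tokens the fragment covers
}

// Store holds fragments in memory. A real system would use an ANN index;
// this sketch scans linearly to stay short.
type Store struct {
	frags     []Fragment
	threshold float64 // minimum cosine similarity that counts as a hit
}

func NewStore(threshold float64) *Store { return &Store{threshold: threshold} }

func (s *Store) Put(f Fragment) { s.frags = append(s.frags, f) }

// cosine returns the cosine similarity of a and b, or 0 for mismatched
// lengths or zero-length vectors.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Lookup returns the most similar fragment whose similarity to query meets
// the threshold. ok is false on a miss, in which case the caller would fall
// back to a normal prefill.
func (s *Store) Lookup(query []float32) (frag *Fragment, ok bool) {
	best, bestSim := -1, s.threshold
	for i := range s.frags {
		if sim := cosine(query, s.frags[i].Embedding); sim >= bestSim {
			best, bestSim = i, sim
		}
	}
	if best < 0 {
		return nil, false
	}
	return &s.frags[best], true
}

func main() {
	s := NewStore(0.95)
	s.Put(Fragment{Embedding: []float32{1, 0, 0}, Tokens: 128})
	s.Put(Fragment{Embedding: []float32{0, 1, 0}, Tokens: 64})

	if f, ok := s.Lookup([]float32{0.99, 0.05, 0}); ok {
		fmt.Println("hit: inject fragment covering", f.Tokens, "tokens")
	}
	if _, ok := s.Lookup([]float32{0.5, 0.5, 0.7}); !ok {
		fmt.Println("miss: fall back to full prefill")
	}
}
```

The threshold is the main tuning knob in a design like this: a lower value raises the hit rate but risks injecting K/V state that does not match the actual prompt, while a higher value is safer but sends more requests through prefill.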