🤖 AI Summary
Google AI Edge announced LiteRT-LM, a portable C++ runtime built on LiteRT that makes it easy to run multi-component language-model pipelines across phones, laptops and embedded devices. An early preview (v0.6.1, June 10) adds CPU and Android GPU support; a follow-up (v0.7.0, June 24) introduces Neural Processing Unit (NPU) acceleration for Qualcomm and MediaTek chips via an Early Access Program. LiteRT-LM exposes a C++ API, uses .litertlm model files, supports 4-bit per-channel quantized Gemma models (Gemma3-1B, Gemma3n-E2B, Gemma3n-E4B) with 4k context windows, and ships prebuilt binaries and build instructions (Bazel 7.6.1, optional Android NDK r28b).
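The "4-bit per-channel quantized" scheme mentioned above can be illustrated with a small self-contained sketch. This is not LiteRT-LM's actual code, and the function names are hypothetical; it just shows the idea: each weight channel gets its own scale, weights are rounded into the signed 4-bit integer range [-8, 7], and dequantization multiplies the integers back by the channel's scale.

```python
def quantize_per_channel_int4(weights):
    """Symmetric per-channel 4-bit quantization (illustrative only).

    weights: list of channels, each a list of floats.
    Returns (quantized ints in [-8, 7], per-channel scales).
    """
    QMAX = 7  # positive end of the signed 4-bit range [-8, 7]
    quantized, scales = [], []
    for channel in weights:
        # One scale per channel; fall back to 1.0 for an all-zero channel.
        scale = max(abs(w) for w in channel) / QMAX or 1.0
        q = [max(-8, min(7, round(w / scale))) for w in channel]
        quantized.append(q)
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Recover approximate float weights from int4 values and scales."""
    return [[v * s for v in ch] for ch, s in zip(quantized, scales)]
```

Per-channel scaling keeps the rounding error proportional to each channel's own dynamic range, so channels with small weights are not swamped by the largest weight in the whole tensor, which is why this scheme preserves model quality better than a single per-tensor scale.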
Technically significant: the runtime is cross-platform (Android, macOS, Windows, Linux, embedded) and emphasizes hardware-aware optimization plus caching to speed repeated model loads. Published benchmarks show major throughput gains: Gemma3-1B prefill rates jump from ~1.9k tokens/s on an Android GPU to ~5.8k tokens/s on a Samsung S25 NPU (measured with a 1,024-token prefill / 256-token decode workload), demonstrating that NPU acceleration can dramatically reduce latency for on-device LLMs. By combining 4-bit quantization, a portable C++ API, and backend flexibility, LiteRT-LM lowers the barrier to deploying efficient, low-latency LLMs at the edge while giving developers more visibility into, and control over, the inference stack.
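The throughput figures above translate directly into user-visible prompt-processing latency. A quick back-of-the-envelope check, using the summary's 1,024-token prefill size and the reported token rates (the helper function here is hypothetical, not part of LiteRT-LM):

```python
def prefill_latency_ms(prefill_tokens, tokens_per_sec):
    # Time to process the prompt (prefill phase) at a given throughput.
    return prefill_tokens / tokens_per_sec * 1000.0


gpu_ms = prefill_latency_ms(1024, 1900)  # Android GPU: ~539 ms
npu_ms = prefill_latency_ms(1024, 5800)  # Samsung S25 NPU: ~177 ms
speedup = gpu_ms / npu_ms                # ~3x faster time-to-first-token
```

So at these reported rates, the NPU backend cuts prefill time for a 1,024-token prompt from roughly half a second to under 200 ms, about a 3x improvement in time-to-first-token.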