🤖 AI Summary
Nexa AI engineers showed that a 20B-parameter GPT-OSS model can run fully offline on a modern phone: GPT-OSS-20B was compressed and executed on a Snapdragon 8 Elite Gen 5 device, achieving ~17 tokens/sec and a 2.7 s time-to-first-token (after warm-up), using ~6 GB of runtime memory for a 5.2 GB on-device model. That performance, claimed to be comparable to an o4-mini baseline, was reached without cloud fallback and without model surgery that destroys reasoning, demonstrating that mobile edge devices can host capable LLMs for low-latency, private inference.
Two systems made it possible. NexaQuant applies importance-weighted, layer-sensitive quantization (a mix of IQ4_NL, Q5_K, and MXFP4 formats) to shrink GPT-OSS from ~40 GB in FP16 to 5.2 GB (87% smaller than FP16, ~50% smaller than uniform 4-bit) while preserving reasoning quality.

NexaML is a native C++ inference engine, and NexaSDK exposes it to Android via JNI with several critical optimizations: zero-copy DirectByteBuffer native pools that eliminate JNI copy overhead, an asynchronous token-streaming pipeline that keeps UIs responsive, hardware tuning (6-thread affinity on big.LITTLE cores and ARM NEON SIMD kernels that accelerate quantized matrix multiplications), and warm-up routines that cut TTFT from ~4.2 s to 2.7 s. The work highlights that model compression plus a rebuilt, hardware-aware inference stack, not just faster silicon, is what enables practical, private LLMs on phones, with clear implications for edge AI, latency-sensitive apps, and on-device privacy.
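The article doesn't publish NexaQuant's selection rule, but the idea behind importance-weighted, layer-sensitive quantization can be sketched: score each layer's sensitivity (e.g., how much output error grows when that layer is quantized harder) and assign it the most compact format it can tolerate. A minimal C++ sketch; the thresholds, layer names, and sensitivity scores below are all hypothetical:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// The three formats named in the article's quantization mix.
enum class QuantFormat { IQ4_NL, Q5_K, MXFP4 };

// Assumed per-layer record: a name plus an importance/sensitivity score.
struct LayerInfo {
    std::string name;
    double sensitivity;  // higher = more damage from aggressive quantization
};

// Importance-weighted format selection: the most sensitive layers keep the
// widest format (Q5_K), mid-range layers get IQ4_NL, and the most robust
// layers take the most compact MXFP4. Thresholds are illustrative only.
QuantFormat choose_format(const LayerInfo& layer,
                          double hi_cut = 0.8, double lo_cut = 0.3) {
    if (layer.sensitivity >= hi_cut) return QuantFormat::Q5_K;
    if (layer.sensitivity >= lo_cut) return QuantFormat::IQ4_NL;
    return QuantFormat::MXFP4;
}

int main() {
    std::vector<LayerInfo> layers = {
        {"attn.qkv.0", 0.92}, {"mlp.up.0", 0.45}, {"mlp.down.17", 0.12}};
    const char* names[] = {"IQ4_NL", "Q5_K", "MXFP4"};
    for (const auto& l : layers)
        std::printf("%-12s -> %s\n", l.name.c_str(),
                    names[static_cast<int>(choose_format(l))]);
}
```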
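The zero-copy DirectByteBuffer pattern itself is standard JNI: the Kotlin/Java side allocates with ByteBuffer.allocateDirect(), and the native side reads the same memory in place via GetDirectBufferAddress, avoiding the copy that Get/ReleaseByteArrayElements can incur. A minimal sketch; the class and method names (com.example.nexa.NativeBridge.feedTokens) are illustrative, not NexaSDK's actual surface:

```cpp
#include <jni.h>
#include <cstdint>

// Hypothetical JNI entry point. The Kotlin side allocates the buffer with
// ByteBuffer.allocateDirect(...) and passes it here on every call.
extern "C" JNIEXPORT jlong JNICALL
Java_com_example_nexa_NativeBridge_feedTokens(JNIEnv* env, jobject /*this*/,
                                              jobject direct_buffer) {
    // GetDirectBufferAddress returns a pointer into the same memory the JVM
    // sees: no GetByteArrayElements copy in, no Release copy back out.
    void* base = env->GetDirectBufferAddress(direct_buffer);
    jlong len = env->GetDirectBufferCapacity(direct_buffer);
    if (base == nullptr || len <= 0) return -1;  // not a direct buffer

    // The engine can now read/write token IDs in place, e.g.:
    auto* tokens = static_cast<int32_t*>(base);
    jlong n_tokens = len / static_cast<jlong>(sizeof(int32_t));
    // ... hand {tokens, n_tokens} to the native inference loop ...
    (void)tokens;
    return n_tokens;
}
```

Keeping such buffers in a pool (allocate once, reuse across calls) also removes repeated allocation cost, which is presumably what the "native pools" in the summary refer to.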
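The asynchronous token-streaming pipeline is, at its core, a producer/consumer handoff: the decode thread pushes tokens into a queue and the UI thread drains it, so the UI never blocks on the model. A generic sketch (not NexaSDK's API), with std::nullopt marking end of generation:

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>

// Minimal token stream: the inference thread pushes decoded tokens; the
// consumer drains them without ever blocking on the model itself.
class TokenStream {
public:
    void push(std::optional<std::string> tok) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(tok)); }
        cv_.notify_one();
    }
    std::optional<std::string> pop() {  // blocks until a token arrives
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        auto tok = std::move(q_.front());
        q_.pop();
        return tok;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::optional<std::string>> q_;
};

int main() {
    TokenStream stream;
    // The producer thread stands in for the model's decode loop.
    std::thread producer([&] {
        for (const char* t : {"Hello", ", ", "world", "!"}) stream.push(t);
        stream.push(std::nullopt);  // end of generation
    });
    while (auto tok = stream.pop()) std::cout << *tok << std::flush;
    std::cout << "\n";
    producer.join();
}
```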
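Pinning worker threads to the performance cores of a big.LITTLE SoC is typically done on Android with sched_setaffinity. A minimal sketch; the core IDs are an assumption, since the actual big/prime core numbering has to be read from the device topology (e.g., cpufreq data):

```cpp
#include <sched.h>    // sched_setaffinity; available on bionic (Android);
                      // on glibc, build with _GNU_SOURCE defined
#include <cstdio>

// Pin the calling thread to a set of CPU cores; pid 0 means "this thread".
// The engine would call this from each of its worker threads at startup.
bool pin_to_cores(const int* cores, int n_cores) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < n_cores; ++i) CPU_SET(cores[i], &set);
    return sched_setaffinity(0 /* current thread */, sizeof(set), &set) == 0;
}

int main() {
    // Illustrative: pin to assumed performance cores 2..7, matching the
    // 6-thread configuration described in the article.
    const int big_cores[] = {2, 3, 4, 5, 6, 7};
    if (!pin_to_cores(big_cores, 6)) std::perror("sched_setaffinity");
}
```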
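The NEON-accelerated quantized matmul ultimately reduces to wide int8 dot products with per-block scales. A simplified AArch64 sketch of one inner loop; real K-quant-style kernels add block layouts, zero points, and tiling, all omitted here:

```cpp
#include <arm_neon.h>
#include <cstdint>

// NEON int8 dot product: the core of a quantized matmul inner loop.
// w and x are int8 values with one float scale each, a simplified stand-in
// for the per-block scales used by real quantization formats.
float dot_q8(const int8_t* w, const int8_t* x, int n,
             float w_scale, float x_scale) {
    int32x4_t acc = vdupq_n_s32(0);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        int8x16_t a = vld1q_s8(w + i);
        int8x16_t b = vld1q_s8(x + i);
        // Widening multiply 8 lanes at a time, then accumulate into 32-bit.
        int16x8_t lo = vmull_s8(vget_low_s8(a), vget_low_s8(b));
        int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));
        acc = vpadalq_s16(acc, lo);
        acc = vpadalq_s16(acc, hi);
    }
    int32_t sum = vaddvq_s32(acc);          // horizontal add (AArch64)
    for (; i < n; ++i) sum += w[i] * x[i];  // scalar tail
    return static_cast<float>(sum) * w_scale * x_scale;
}
```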