Running LlamaBarn GPT-OSS 20B on my iPhone. Super fast (unedited video) (twitter.com)

🤖 AI Summary
The linked post is an unedited video of LlamaBarn running GPT-OSS 20B natively on an iPhone with very low latency, demonstrating real-time interaction that looks "super fast." The clip highlights that a 20-billion-parameter open-source model can now run on modern mobile hardware rather than only in the cloud, enabling offline, private, low-latency AI experiences on consumer devices.

Technically, this is possible because the model is heavily optimized for edge inference: weights are quantized to low-bit formats (e.g., 4–8 bit) and served with memory-mapped/streaming I/O, and inference runs on highly tuned kernels (ggml/llama.cpp-style runtimes or Core ML/Metal-backed execution) that exploit the phone's CPU, GPU, and Neural Engine. The tradeoffs are reduced numerical fidelity and possible accuracy drops versus full-precision models, along with context-window limits, thermal throttling, and battery drain.

For the AI/ML community the takeaway is practical: sophisticated models are becoming portable, lowering barriers for private, offline apps and new user experiences, while also raising questions about model provenance, safety, and regulatory oversight as powerful LLMs become easier to run on-device.
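The memory arithmetic is what makes this plausible: at fp16, 20B parameters need roughly 37 GiB of weights, far beyond any phone's RAM, while 4-bit storage cuts that to roughly 10 GiB. Below is a minimal sketch in Python/NumPy (not the actual ggml or Core ML kernels, and the block size and int range are illustrative assumptions) of the block-wise low-bit quantization idea the summary alludes to: weights are grouped into small blocks, each stored as 4-bit integers plus one per-block scale.

```python
import numpy as np

def quantize_q4(weights: np.ndarray, block_size: int = 32):
    """Toy block-wise 4-bit quantization (a simplified ggml Q4-style scheme).

    Each block of `block_size` weights is stored as signed 4-bit integers
    in [-8, 7] plus one float16 scale -- ~4.5 bits/weight instead of 16.
    """
    w = weights.reshape(-1, block_size).astype(np.float32)
    # One scale per block: map the max-magnitude weight to the int range.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights at inference time."""
    return (q.astype(np.float32) * scale.astype(np.float32)).ravel()

# Rough weight-only memory footprint of a 20B-parameter model
# (ignores activations, KV cache, and runtime overhead).
n_params = 20e9
for bits, label in [(16, "fp16"), (8, "int8"), (4.5, "q4 + scales")]:
    print(f"{label:>12}: ~{n_params * bits / 8 / 2**30:.0f} GiB")

# Quantization round-trip error on random weights.
w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4(w)
err = np.abs(w - dequantize_q4(q, s)).mean()
print(f"mean abs round-trip error: {err:.4f}")
```

The per-block scale is the design choice that keeps the accuracy loss tolerable: an outlier weight only distorts the resolution of its own 32-value block rather than the whole tensor, which is why schemes like this trade a small storage overhead (the scales) for much better fidelity than naive whole-tensor quantization.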