WebGPU feature detection was not enough to run small LLMs on phones (ludion.ai)

🤖 AI Summary
Recent tests of running small LLMs in mobile browsers with WebGPU found that feature detection is a poor predictor of whether inference actually works. Devices reported suitable GPU capabilities, yet real inference runs often failed. Running Llama-3.2-1B-Instruct on an iPhone 11 Pro Max caused the page to reload during generation, and the same model on a Pixel 8a in an in-app browser stalled indefinitely. The Pixel 8a did complete a run in Chrome, but the long-prompt run took 76 seconds to return the first token. The takeaway is that WebGPU being available, or an adapter reporting large buffer sizes, does not guarantee that a device can run an LLM. Developers targeting on-device inference on phones need to verify real runs rather than trust reported capabilities, and mobile runtime environments need to become more reliable before they can support these models.
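
One way to act on that gap is to follow the static capability check with a small timed generation and gate the feature on the result. The TypeScript below is a minimal sketch under stated assumptions, not the method used in the tests: `GpuLike`, `AdapterLike`, `ProbeResult`, `detectWebGPU`, and `probeInference` are hypothetical names, `generateFirstToken` stands in for whatever inference runtime is in use, and the time budget is something the caller has to choose. Only `requestAdapter()` and `limits.maxBufferSize` mirror the real WebGPU API.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
interface AdapterLike {
  limits: { maxBufferSize: number };
}
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}

type ProbeResult =
  | { ok: true; firstTokenMs: number }
  | { ok: false; reason: "no-webgpu" | "no-adapter" | "timeout" | "error" };

// Static check: confirms an adapter exists, not that inference will succeed.
async function detectWebGPU(gpu: GpuLike | undefined): Promise<AdapterLike | null> {
  if (!gpu) return null;
  try {
    return await gpu.requestAdapter();
  } catch {
    return null;
  }
}

// Runtime check: require a first token within `budgetMs`.
// `generateFirstToken` is a hypothetical callback provided by the caller.
async function probeInference(
  gpu: GpuLike | undefined,
  generateFirstToken: () => Promise<unknown>,
  budgetMs: number,
): Promise<ProbeResult> {
  if (!gpu) return { ok: false, reason: "no-webgpu" };
  const adapter = await detectWebGPU(gpu);
  if (!adapter) return { ok: false, reason: "no-adapter" };

  const start = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), budgetMs);
  });
  try {
    const winner = await Promise.race([
      generateFirstToken().then(() => "done" as const),
      timeout,
    ]);
    if (winner === "timeout") return { ok: false, reason: "timeout" };
    return { ok: true, firstTokenMs: Date.now() - start };
  } catch {
    return { ok: false, reason: "error" };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

In a browser the first argument would be `navigator.gpu`. A timeout like this can catch a stall or a very slow first token, but not a page reload, since the reload ends the script before it can report anything; detecting that case would need state that survives the reload, which this sketch does not attempt.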