Super fast and accurate image classification on edge devices (github.com)

🤖 AI Summary
This repo is a hands-on guide to building and deploying fast, accurate image classifiers that run locally on phones using Visual Language Models (VLMs). It walks through progressively harder tasks, from a Cats vs Dogs classifier (demoed) to upcoming Human Action Recognition and car brand/model/year detectors, and shows how to package the final model as an artifact callable from an iOS app (via LeapSDK; Android support is coming soon). The author demonstrates that modern open-weight VLMs (Liquid AI's LFM2 family) can deliver strong accuracy on edge devices: LFM2‑VL‑450M reached ~97% on a 100-sample Cats vs Dogs run, and upgrading to LFM2‑VL‑1.6B pushed accuracy to ~99%.

Technically, a VLM is treated as a function that maps (image + text prompt) → text (or structured text), which enables lightweight, offline-first agentic workflows on phones, drones, and embedded systems. The repo includes reusable YAML configs (model, dataset, prompt, seed, label mapping), an image-to-JSON evaluation pipeline (evaluate.py, which runs on GPUs via Modal), and Jupyter notebooks that visualize per-sample errors from a results CSV (base64-encoded images, predictions, labels).

Practical takeaways: fix dataset issues or add an “other” class, try stronger models if the device permits, improve prompts (or use DSPy/MIPROv2 for automated prompt tuning), or fine-tune for the hardest tasks. A notable caveat: VLM outputs can be free-form (e.g., “pug” despite a two-class prompt), so output-format enforcement is an important next step.
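
The "VLM as a function" framing is easy to picture in code. Below is a minimal sketch of (image + prompt) → text → label, with a crude output-format guard that maps anything outside the expected label set to "other". It assumes the LFM2‑VL checkpoint is available on Hugging Face under the id shown and loads through transformers' image-text-to-text interface; the repo's actual inference path (LeapSDK on device, Modal for evaluation) may differ.

```python
# Minimal sketch: a VLM as a function (image + prompt) -> text -> class label.
# MODEL_ID and the exact loading code are assumptions, not the repo's own setup.
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "LiquidAI/LFM2-VL-450M"  # assumed Hugging Face model id
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID)

LABELS = {"cat", "dog"}

def classify(image: Image.Image, prompt: str) -> str:
    """Run (image + text prompt) -> text, then map free-form output to a label."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    )
    output_ids = model.generate(**inputs, max_new_tokens=10)
    text = processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0].strip().lower()
    # Output-format enforcement: the model may answer "pug" instead of "dog",
    # so fall back to "other" when the reply is not one of the expected labels.
    return text if text in LABELS else "other"

label = classify(Image.open("example.jpg"),
                 "Is this a cat or a dog? Answer with one word.")
print(label)
```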
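The config-driven evaluation could look roughly like the sketch below: a YAML file with the fields the summary lists (model, dataset, prompt, seed, label mapping) drives a GPU job on Modal. This is written in the spirit of the repo's evaluate.py, not copied from it; the field names, dataset id, and Modal settings are illustrative assumptions, and the per-sample prediction is left as a placeholder.

```python
# Sketch of a config-driven evaluation entry point on Modal (illustrative only).
#
# Expected YAML shape (assumed):
#   model: LiquidAI/LFM2-VL-450M
#   dataset: microsoft/cats_vs_dogs
#   prompt: "Is this a cat or a dog? Answer with one word."
#   seed: 42
#   num_samples: 100
#   labels: {0: cat, 1: dog}
import modal
import yaml

app = modal.App("vlm-image-classification-eval")
gpu_image = modal.Image.debian_slim().pip_install(
    "transformers", "datasets", "torch", "pyyaml"
)

@app.function(gpu="A10G", image=gpu_image, timeout=3600)
def evaluate(config: dict) -> float:
    """Score a VLM on an image-classification dataset and return accuracy."""
    from datasets import load_dataset

    ds = load_dataset(config["dataset"], split=f"train[:{config['num_samples']}]")
    correct = 0
    for sample in ds:
        # Placeholder: call the VLM with (sample["image"], config["prompt"]) here.
        pred = "dog"
        correct += int(pred == config["labels"][sample["label"]])
    return correct / len(ds)

@app.local_entrypoint()
def main(config_path: str = "configs/cats_vs_dogs.yaml"):
    with open(config_path) as f:
        config = yaml.safe_load(f)
    accuracy = evaluate.remote(config)
    print(f"accuracy: {accuracy:.2%}")
```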
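The per-sample error review described for the notebooks can be reproduced with a few lines of pandas and PIL, assuming a results CSV whose columns hold a base64-encoded image, the prediction, and the ground-truth label (column names here are assumptions, not the repo's actual schema).

```python
# Sketch of notebook-style error review: read a results CSV with base64 images,
# predictions, and labels, then inspect the mismatches.
import base64
import io

import pandas as pd
from PIL import Image

df = pd.read_csv("results.csv")
errors = df[df["prediction"] != df["label"]]
print(f"{len(errors)} errors out of {len(df)} samples")

for _, row in errors.iterrows():
    img = Image.open(io.BytesIO(base64.b64decode(row["image_base64"])))
    img.thumbnail((256, 256))
    # In a Jupyter notebook, display(img) would render the thumbnail inline.
    print(f"predicted {row['prediction']!r}, expected {row['label']!r}, size {img.size}")
```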