🤖 AI Summary
A new how-to demonstrates running a Retrieval Augmented Generation (RAG) pipeline entirely on Android using MediaPipe, with an on-device stack that combines a local LLM (Google's Gemma3-1B int4), a Gecko embedder, and an SQLite vector store. The post walks through placing the model and tokenizer files on the device (e.g., /data/local/tmp/slm/gemma3-1B-it-int4.task, Gecko_256_f32.tflite, sentencepiece.model), creating embeddings from chunked text, and wiring everything into a RetrievalAndInferenceChain so the model can fetch relevant passages at inference time. This is significant because it enables low-latency, privacy-friendly, up-to-date responses on mobile (grounding outputs to reduce hallucinations) and gives mobile developers a concrete RAG example rather than a conceptual talk.
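The post's own code isn't reproduced here; the sketch below is only a rough reconstruction of the wiring it describes, modeled on Google's public AI Edge RAG sample for Android. The class names (GeckoEmbeddingModel, MediaPipeLlmBackend, DefaultSemanticTextMemory, ChainConfig, PromptBuilder), package paths, and the prompt template are assumptions taken from that sample and may differ from the post's exact code; the file paths are the ones listed above.

```kotlin
// Rough sketch of the on-device RAG wiring described above, modeled on the public
// AI Edge RAG sample (Gradle artifacts: com.google.mediapipe:tasks-genai and
// com.google.ai.edge.localagents:localagents-rag). Package paths and class names
// are assumptions; verify against the SDK version the post uses.
import android.content.Context
import com.google.ai.edge.localagents.rag.chains.ChainConfig
import com.google.ai.edge.localagents.rag.chains.RetrievalAndInferenceChain
import com.google.ai.edge.localagents.rag.memory.DefaultSemanticTextMemory
import com.google.ai.edge.localagents.rag.memory.SqliteVectorStore
import com.google.ai.edge.localagents.rag.models.GeckoEmbeddingModel
import com.google.ai.edge.localagents.rag.models.MediaPipeLlmBackend
import com.google.ai.edge.localagents.rag.prompt.PromptBuilder
import com.google.common.collect.ImmutableList
import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession.LlmInferenceSessionOptions
import java.util.Optional
import kotlinx.coroutines.guava.await

// Hypothetical prompt template: {0} is filled with retrieved chunks, {1} with the user query.
private const val PROMPT_TEMPLATE =
    "Answer the question using only the following context: {0}\n\nQuestion: {1}"

class RagPipeline(
    context: Context,
    llmOptions: LlmInferenceOptions,            // model path, preferred backend, maxTokens (see knobs below)
    sessionOptions: LlmInferenceSessionOptions, // temperature, topK, topP (see knobs below)
) {
    // Files placed on the device beforehand (paths from the post).
    private val geckoPath = "/data/local/tmp/slm/Gecko_256_f32.tflite"
    private val tokenizerPath = "/data/local/tmp/slm/sentencepiece.model"

    // Gecko embedder; the last flag toggles GPU acceleration for embedding computation.
    private val embedder = GeckoEmbeddingModel(geckoPath, Optional.of(tokenizerPath), /* useGpu= */ true)

    // Local Gemma3-1B model served through MediaPipe tasks-genai.
    private val languageModel = MediaPipeLlmBackend(context, llmOptions, sessionOptions)

    // Semantic memory: Gecko embeddings stored as 768-dim vectors in SQLite.
    private val memory = DefaultSemanticTextMemory(SqliteVectorStore(768), embedder)

    // Chain that retrieves relevant chunks and feeds them to the LLM at inference time.
    val chain = RetrievalAndInferenceChain(
        ChainConfig.create(languageModel, PromptBuilder(PROMPT_TEMPLATE), memory)
    )

    // Index pre-chunked text: each chunk is embedded and written to the vector store.
    suspend fun memorizeChunks(chunks: List<String>) {
        memory.recordBatchedMemoryItems(ImmutableList.copyOf(chunks)).await()
    }
}
```

In the flow the post describes, something like memorizeChunks would run once at setup over the chunked documents, which is where the embedding-creation time mentioned below comes in.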
Technically, the sample uses MediaPipe tasks-genai + localagents-rag, a GeckoEmbedder (optionally GPU‑accelerated), and a SqliteVectorStore configured for 768‑dim vectors. Key runtime knobs shown: LlmInferenceOptions (model path, preferred backend, maxTokens), session options (temperature 0.6, topK 5000, topP 1), and RetrievalConfig (topK 50, minSimilarityScore 0.1). The guide emphasizes chunking strategy, storage tradeoffs (models + embedders are large), embedding creation time, and iterative tuning of retrieval/inference parameters to reduce unexpected outputs.
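As a hedged illustration of how those knobs plug into the builder APIs (continuing the sketch above): the temperature, topK, topP, retrieval topK, and minSimilarityScore values are the ones reported from the post, while the maxTokens value and names such as Backend.GPU and TaskType.QUESTION_ANSWERING are assumptions based on the public sample.

```kotlin
// Tuning values from the post plugged into the builder APIs. Continues the RagPipeline
// sketch above; maxTokens and the enum/constant names are assumptions.
import com.google.ai.edge.localagents.rag.chains.RetrievalAndInferenceChain
import com.google.ai.edge.localagents.rag.retrieval.RetrievalConfig
import com.google.ai.edge.localagents.rag.retrieval.RetrievalRequest
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession.LlmInferenceSessionOptions
import kotlinx.coroutines.guava.await

// Model-level options: which .task file to load, which backend, and the token budget.
val llmOptions: LlmInferenceOptions = LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/slm/gemma3-1B-it-int4.task")
    .setPreferredBackend(LlmInference.Backend.GPU) // or CPU
    .setMaxTokens(1024)                            // placeholder; the post tunes this but no value is given here
    .build()

// Per-session sampling options (values reported in the post).
val sessionOptions: LlmInferenceSessionOptions = LlmInferenceSessionOptions.builder()
    .setTemperature(0.6f)
    .setTopK(5000)
    .setTopP(1.0f)
    .build()

// Retrieval: fetch up to 50 chunks whose similarity score is at least 0.1.
val retrievalConfig: RetrievalConfig = RetrievalConfig.create(
    /* topK= */ 50,
    /* minSimilarityScore= */ 0.1f,
    RetrievalConfig.TaskType.QUESTION_ANSWERING,
)

// Querying the chain built in the previous sketch: retrieve, build the prompt, run inference.
suspend fun answer(chain: RetrievalAndInferenceChain, question: String): String {
    val request = RetrievalRequest.create(question, retrievalConfig)
    return chain.invoke(request, /* progressListener= */ null).await().text
}
```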