🤖 AI Summary
WebPizza is a proof-of-concept that runs a full retrieval-augmented-generation (RAG) document chat entirely in the browser using WebGPU: no servers, no APIs, no data uploads. You upload PDFs, create embeddings locally (all-MiniLM-L6-v2 via Transformers.js), store the vectors in IndexedDB, run cosine-similarity search over them, and chat against your documents with popular quantized LLMs (Phi-3, Llama 3, Mistral 7B, Qwen, Gemma). The project emphasizes privacy (documents never leave the device) and demonstrates that modern browsers with WebGPU/WebAssembly can host production-like LLM pipelines client-side.
Technically, WebPizza ships two inference engines: WebLLM (standard) and WeInfer, an optimized WebLLM fork whose buffer reuse, asynchronous pipeline, and GPU-side sampling claim a ~3.76× speedup. Models are MLC q4f16-quantized (1–4 GB each) and are downloaded and cached on first use; reported throughput ranges from ~2 to ~12 tokens/sec depending on model size. It uses PDF.js for parsing and an Angular 20 frontend, and requires a WebGPU-enabled browser (Chrome/Edge 113+, Safari 18+ partial), ~4 GB+ RAM, and a modern GPU. Caveats: it is an experimental POC, performance and memory are limited on consumer hardware, and setup requires enabling the Unsafe WebGPU flag and serving proper COOP/COEP headers for deployment.
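For the generation side, a hedged sketch of what loading one of these cached models through WebLLM's public `CreateMLCEngine` API looks like; the model id, progress callback, and RAG prompt layout are assumptions for illustration, and WeInfer's optimized path is not shown:

```ts
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// WebGPU is a hard requirement; fail early on unsupported browsers.
if (!('gpu' in navigator)) {
  throw new Error('WebGPU not available: use Chrome/Edge 113+ or enable the flag.');
}

// First call downloads the q4f16 weights (1-4 GB) and caches them;
// subsequent loads come from the browser cache.
const engine = await CreateMLCEngine('Phi-3-mini-4k-instruct-q4f16_1-MLC', {
  initProgressCallback: (report) => console.log(report.text),
});

// Standard RAG step: retrieved chunks go in as context for the question.
async function answer(question: string, context: string[]): Promise<string> {
  const reply = await engine.chat.completions.create({
    messages: [
      {
        role: 'system',
        content: `Answer using only this context:\n${context.join('\n---\n')}`,
      },
      { role: 'user', content: question },
    ],
  });
  return reply.choices[0].message.content ?? '';
}
```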