🤖 AI Summary
MyClone replaced OpenAI’s text-embedding-3-small (1536‑dim) with Voyage‑3.5‑lite at 512 dimensions in the RAG pipeline behind its personal digital personas. The switch cut the vector DB storage footprint by ~66%, roughly halved retrieval latency, and reduced end‑to‑end voice latency by 15–20%, with first‑token latency down ~15%. Smaller vectors also mean less network I/O and less per‑query compute in ANN/cosine search, which directly improves perceived responsiveness in chat and voice interfaces where every millisecond matters.
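To make the storage and compute arithmetic concrete, here is a minimal sketch (not MyClone’s actual code) of brute-force cosine search over normalized embeddings at both dimensions; the corpus size is hypothetical, but the ~66% memory reduction falls straight out of the dimension ratio:

```python
# Illustrative sketch: brute-force cosine search over L2-normalized
# embeddings, comparing 1536-dim vs 512-dim indexes. Corpus size is
# hypothetical; only the relative footprint/FLOP ratio matters here.
import numpy as np

def cosine_top_k(corpus: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest vectors by cosine similarity.

    Assumes rows of `corpus` and `query` are already unit-norm, so
    cosine similarity reduces to a dot product: one multiply-add per
    dimension per document.
    """
    scores = corpus @ query
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
n_docs = 100_000  # hypothetical size of one persona's knowledge base

for dim in (1536, 512):
    corpus = rng.standard_normal((n_docs, dim)).astype(np.float32)
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
    query = corpus[0]  # reuse a normalized row as a stand-in query
    _ = cosine_top_k(corpus, query)
    print(f"{dim:>4}-dim float32 index: {corpus.nbytes / 2**20:,.0f} MiB")

# 1536-dim float32 index: 586 MiB
#  512-dim float32 index: 195 MiB  -> ~66% smaller, 3x fewer FLOPs/query
```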
Technically, the win comes from Voyage’s Matryoshka Representation Learning and quantization‑aware training: the first 256–512 dimensions are trained to carry most of the semantic signal, rather than being a naive truncation of a larger embedding. That makes 512‑dim vectors competitive with, or better than, larger fixed‑dim models for retrieval, while enabling int8/binary quantization and flexible dimension/precision tradeoffs. For product and infra teams, this means lower storage and compute cost per persona, headroom to add reranking or multi‑step reasoning within the same latency budget, and simpler scaling. The change underscores that embedding choice is a core product decision, especially for latency‑sensitive, knowledge‑heavy RAG systems, where model architecture, dimensionality, and quantization materially affect UX and unit economics.
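The mechanics of using an MRL-trained embedding are simple enough to sketch: keep a prefix of the vector and re-normalize, then optionally quantize. The sketch below is illustrative only; the random vector stands in for an MRL-trained embedding whose leading dimensions carry most of the signal, and the symmetric int8 scheme is a generic example, not Voyage’s implementation:

```python
# Illustrative sketch: Matryoshka-style truncation plus symmetric int8
# quantization. All names and values are assumptions for illustration.
import numpy as np

def truncate_and_renorm(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length,
    the standard way to shorten an MRL-trained embedding."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, np.float32]:
    """Symmetric int8 quantization: int8 codes plus one float scale."""
    scale = np.abs(vec).max() / 127.0
    return np.round(vec / scale).astype(np.int8), np.float32(scale)

full = np.random.default_rng(0).standard_normal(1536).astype(np.float32)
full /= np.linalg.norm(full)  # stand-in for a full-dim embedding

short = truncate_and_renorm(full, 512)  # 512-dim, still unit norm
codes, scale = quantize_int8(short)     # 512 bytes + one float scale

approx = codes.astype(np.float32) * scale  # dequantize to check fidelity
print("bytes: float32", short.nbytes, "-> int8", codes.nbytes)
print("cosine(short, approx) =", float(short @ approx / np.linalg.norm(approx)))
```

Combining both steps, a 1536‑dim float32 vector (6,144 bytes) becomes a 512‑dim int8 vector (512 bytes plus a scale), a ~12x reduction, which is where the storage and compute headroom for reranking or multi‑step reasoning comes from.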