Do embedding spaces behave like metric spaces? (www.testingbranch.com)

🤖 AI Summary
A recent analysis tested whether common embedding spaces actually behave like metric spaces and found they often don't, with real consequences for nearest-neighbor retrieval and RAG systems. The author reminds us that cosine "distance" is not a true metric (it can violate the triangle inequality) and that many vector indexes, and especially metric-index pruning, rely on triangle-like behavior; even graph-based methods such as HNSW (as implemented in FAISS) depend on neighborhood consistency, i.e. closer points should lead to even closer neighbors.

The experiments cover two corpora (noisy short "food" snippets and cleaner "medical" abstracts) and three embedding variants: DistilBERT (768d), MiniLM (384d), and an aggressively compressed MiniLM → PCA to 64 dims → 4-bit quantized version. For each anchor i, the author searches for violating triplets (i, j, k) where d(i,k) > d(i,j) + d(j,k) + τ, reporting clean_frac (the fraction of anchors with no violations) and using the SMT solver Z3 to prove or find violations efficiently instead of brute force.

Results: raw embeddings produced coherent clusters (the medical corpus being especially stable), but PCA plus quantization catastrophically collapsed the geometry: clusters overlapped, clean_frac plummeted even with a generous τ (0.1), and neighborhoods disintegrated by k ≈ 10. Noisy domain data also weakened the geometry. The implication is that nearest neighbors aren't guaranteed to be meaningful after compression or under domain mismatch, so retrieval/rerank pipelines and vector-DB assumptions can break. Practical takeaway: monitor embedding geometry (not just dimensionality reduction or throughput), choose models tuned to your domain, and treat aggressive compression as a risky operational tradeoff for retrieval accuracy.
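The triplet check described above is easy to prototype without Z3: for tolerance τ, a triplet (i, j, k) violates the triangle inequality when d(i,k) > d(i,j) + d(j,k) + τ, and clean_frac is the fraction of anchors with no such violation. The sketch below is a brute-force, sampled approximation in Python/NumPy; the function names, sample sizes, and model name in the comments are assumptions for illustration, not the post's actual implementation (which uses Z3 to prove or refute violations rather than enumerate candidates).

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance d = 1 - cosine_similarity for row vectors of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

def clean_frac(X, tau=0.1, n_candidates=50, seed=0):
    """Fraction of anchors i with no sampled triplet (i, j, k) violating
    d(i, k) <= d(i, j) + d(j, k) + tau."""
    rng = np.random.default_rng(seed)
    D = cosine_distance_matrix(np.asarray(X, dtype=np.float64))
    n = D.shape[0]
    m = min(n_candidates, n)
    clean = 0
    for i in range(n):
        js = rng.choice(n, size=m, replace=False)    # candidate midpoints j
        ks = rng.choice(n, size=m, replace=False)    # candidate endpoints k
        lhs = D[i, ks][None, :]                      # d(i, k)
        rhs = D[i, js][:, None] + D[np.ix_(js, ks)]  # d(i, j) + d(j, k)
        if not np.any(lhs > rhs + tau):              # any violation for anchor i?
            clean += 1
    return clean / n

# Hypothetical usage: compare raw vs. compressed embeddings.
# emb = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)  # assumed model
# print(clean_frac(emb, tau=0.1))
```

A rough stand-in for the compressed variant, continuing from the sketch above and assuming uniform per-dimension quantization (the summary does not spell out the exact quantization scheme), could look like this:

```python
from sklearn.decomposition import PCA

def pca_quantize(X, n_components=64, n_bits=4):
    """Approximate the 'MiniLM -> PCA 64 -> 4-bit quantized' variant:
    project to n_components dims, then uniformly quantize each dimension
    to 2**n_bits levels over its observed range (and dequantize back)."""
    Z = PCA(n_components=n_components).fit_transform(X)
    lo, hi = Z.min(axis=0), Z.max(axis=0)
    levels = (2 ** n_bits) - 1
    q = np.round((Z - lo) / (hi - lo + 1e-12) * levels)
    return lo + (q / levels) * (hi - lo)
```

Running clean_frac on the raw embeddings and on pca_quantize(emb) gives a quick way to see the geometric collapse the post describes, without any index or retrieval machinery.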