Show HN: Beyond geometric similarity in vector databases (www.tuned.org.uk)

🤖 AI Summary
An experimental run of ArrowSpace's spectral search ("taumode") on a 1999–2025 CVE corpus (310,841 items encoded to 384-dimensional vectors with a domain-adapted model) shows that spectral similarity can preserve top-rank agreement with cosine while materially improving long-tail retrieval quality. Using ArrowSpaceBuilder to produce a Graph Laplacian-derived index, with identical preprocessing and candidate pools, the authors compared cosine, a hybrid mode, and taumode (τ = 0.62; lower values such as τ = 0.22 also look promising).

Metrics: average NDCG@10 of taumode vs. cosine was 0.9886 (per-query range 0.9754–0.9953) and of hybrid vs. cosine 0.9988, while tail/head ratios rose from 0.9114 (cosine) → 0.9394 (hybrid) → 0.9593 (taumode). Build time was ~2,225 s on a 12-core CPU; the test used three analyst-style queries (e.g., SQL injection in a login endpoint).

Technically, taumode encodes edgewise graph topology into a synthetic linear index (via the Laplacian) and acts like a "temperature" knob that reweights, rather than disrupts, the geometry, keeping head fidelity while reducing tail decay. Normalized scoring and identical k-selection ensured fair NDCG comparisons. Illustrative sketches of the spectral signal and the NDCG comparison appear below.

Implications: spectral indexes can increase robustness to language drift and domain shift in long-running corpora (useful for cyber-threat triage and LLM in-context retrieval), and the persisted ArrowSpace artifact supports apples-to-apples τ sweeps. Next steps include larger slices, automated regression checks, Parquet storage, and GPU acceleration.
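To make the spectral machinery more concrete: the post does not include ArrowSpace's implementation, but the general idea of a Graph Laplacian-derived signal can be sketched. The Python snippet below (not the authors' code; the k-NN construction and function names are assumptions) builds a k-NN graph Laplacian over item vectors and computes a Rayleigh-quotient smoothness score, the kind of edgewise topology information a taumode-style index could fold into ranking.

```python
import numpy as np

def knn_graph_laplacian(X, k=10):
    """Build a symmetric k-NN adjacency from the row vectors of X and
    return the combinatorial graph Laplacian L = D - W. Illustrative only."""
    n = X.shape[0]
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise rows
    S = Xn @ Xn.T                                      # cosine similarities
    np.fill_diagonal(S, -np.inf)                       # exclude self-edges
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argpartition(-S[i], k)[:k]           # k most similar items
        W[i, nbrs] = S[i, nbrs]
    W = np.maximum(W, W.T)                             # symmetrise
    W[W < 0] = 0.0                                     # keep non-negative weights
    D = np.diag(W.sum(axis=1))
    return D - W

def spectral_smoothness(L, x):
    """Rayleigh quotient x^T L x / x^T x: low values mean x varies smoothly
    over the graph; high values mean it is 'rough' with respect to the topology."""
    return float(x @ L @ x) / float(x @ x)
```

How such a smoothness term is combined with cosine (the exact taumode scoring formula and the role of τ) is not specified in the post; the snippet only shows where the Laplacian-based signal comes from.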
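The agreement figures compare one ranking against another with NDCG@10. The post does not spell out how relevance grades were derived, so this sketch assumes the baseline (cosine) scores serve as graded relevance for judging the taumode ordering; all names and data are hypothetical.

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

def ndcg_between_rankings(reference_scores, candidate_order, k=10):
    """NDCG@k of candidate_order (list of item ids) judged against
    reference_scores (dict: item id -> relevance grade from the baseline)."""
    gains = [reference_scores.get(item, 0.0) for item in candidate_order]
    ideal = sorted(reference_scores.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Hypothetical toy example: cosine scores act as the reference grades.
reference = {"CVE-2021-0001": 0.91, "CVE-2019-4242": 0.88, "CVE-2024-1111": 0.63}
taumode_order = ["CVE-2021-0001", "CVE-2024-1111", "CVE-2019-4242"]
print(round(ndcg_between_rankings(reference, taumode_order, k=10), 4))
```

An NDCG@10 near 1.0, as reported for taumode vs. cosine, means the two methods largely agree on the head of the ranking even when tail ordering differs.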