🤖 AI Summary
PyNIFE (Nearly Inference Free Embeddings) introduces tiny, static “student” embedding models distilled to be fully aligned with larger teacher sentence-transformers, so they can act as drop-in replacements for query embedding. The project claims 400–900× faster CPU query embedding (in one test, 90.4 µs per query vs ~68 ms for the teacher, roughly 750×); benchmarks on an Apple M3 Pro show ~71,400 QPS for NIFE vs ~237 QPS for the teacher, with a corresponding NDCG@10 drop from ~66.3 to ~59.2 on NanoBEIR/msmarco. The models load in milliseconds (41 ms in the demo) and reuse existing large-model document indexes, enabling large cost and latency reductions for retrieval: search fast/slow paths, RAG agent loops, and edge or Lambda deployments.
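In practice that fast/slow split might look like the following minimal sketch; the model ids are placeholders for illustration, not real NIFE checkpoints:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Hypothetical model ids; substitute the actual NIFE student and the
# teacher it was distilled from.
student = SentenceTransformer("nife/student-query-model")    # tiny static model
teacher = SentenceTransformer("teacher/large-embedding-model")

# Slow path: documents are embedded once, offline, with the large teacher.
docs = ["NIFE distills static embeddings.",
        "Attention models contextualize every token."]
doc_embs = teacher.encode(docs, normalize_embeddings=True)

# Fast path: because the student is aligned with the teacher in cosine
# space, its query vectors can be scored against the same teacher-built index.
query_emb = student.encode("what is NIFE?", normalize_embeddings=True)
scores = doc_embs @ query_emb  # cosine similarity on normalized vectors
print(docs[int(np.argmax(scores))])
```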
Technically, NIFE initializes a static student by running every token in the tokenizer's vocabulary through the teacher, then distills in cosine space (rather than MSE or KL), using a custom ~100k-token tokenizer based on bert-base-uncased and two-stage training (documents first, then queries). This yields high cosine-similarity alignment, but a static model has no token interaction, so NIFE cannot capture contextual attenuation, negation, or instruction-conditioned embeddings; these limitations explain the accuracy gap. PyNIFE is available on PyPI and the models load directly as SentenceTransformers. The tradeoff is clear: large, nearly inference-free speedups at the cost of some retrieval quality.
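A hedged sketch of that recipe under stated assumptions (a stand-in teacher, simple mean pooling, one toy optimization step, and none of PyNIFE's custom tokenizer or two-stage schedule); this is not the project's actual training code:

```python
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # stand-in

# Step 1: initialize the static student by embedding every tokenizer token
# with the teacher (expensive, but done once).
vocab = list(teacher.tokenizer.get_vocab().keys())
with torch.no_grad():
    init = torch.tensor(teacher.encode(vocab))  # (vocab_size, dim)
table = torch.nn.Parameter(init.clone())

def student_embed(texts):
    """Static student: mean-pool rows of the token table (no attention)."""
    embs = []
    for text in texts:
        ids = teacher.tokenizer(text, add_special_tokens=False)["input_ids"]
        embs.append(table[ids].mean(dim=0))
    return torch.stack(embs)

# Step 2: distill in cosine space, minimizing 1 - cos(student, teacher).
# Stage one would iterate over documents, stage two over queries.
opt = torch.optim.AdamW([table], lr=1e-3)
batch = ["a training document", "another training sentence"]
with torch.no_grad():
    target = torch.tensor(teacher.encode(batch))
opt.zero_grad()
loss = 1 - F.cosine_similarity(student_embed(batch), target).mean()
loss.backward()
opt.step()
```

Optimizing cosine similarity rather than MSE matters here because retrieval scores only depend on vector direction, not magnitude, so the student spends its limited capacity matching exactly what the index comparison uses.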