BM25 Search in Postgres (www.tigerdata.com)

0 points 5 days ago | visit original

🤖 AI Summary

pg_textsearch is a new PostgreSQL extension, currently a preview release, that brings BM25 ranking and hybrid retrieval to Postgres. It targets RAG systems, chat agents, and other AI-native workflows where retrieval quality directly affects LLM outputs. It addresses long-standing weaknesses of Postgres’ ts_rank and Boolean @@ matching by adding IDF weighting, term-frequency saturation, and length normalization, so rare but important terms aren’t drowned out by verbose documents.

The extension plugs into Postgres’ tokenizer/tsvector pipeline (29+ languages) and offers a bm25 index (CREATE INDEX ... USING bm25(...)), a bm25vector type, and SQL-friendly query constructs (<@> and to_bm25query), so you can compute BM25 scores, filter, join, and aggregate entirely inside the database. The index is fully transactional and automatically maintained. The preview uses in-memory segments, which give fast writes and queries for datasets that fit the configured memory (default ~64MB per index). Planned work includes disk-based segments, compression, and Block-Max WAND, which enables top-k retrieval without scoring every match.

pg_textsearch is intentionally focused: it aims to provide Elasticsearch-quality relevance without an external search stack, and it is designed to pair with pgvector/hnsw for hybrid semantic+keyword search. For AI/ML teams this means simpler architectures, fewer synchronization headaches, and higher-quality retrieval for LLM prompts, with a roadmap toward production-scale performance optimizations.
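To make the three ingredients named in the summary concrete (IDF weighting, term-frequency saturation, length normalization), here is a minimal Python sketch of the textbook Okapi BM25 formula. It is not pg_textsearch's implementation: the function name `bm25_scores`, the defaults `k1=1.2` and `b=0.75`, the non-negative IDF variant, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook Okapi BM25.

    k1 and b are common defaults, assumed here; they are not taken from pg_textsearch.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: documents longer than average are penalized.
        norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF weighting: terms found in few documents count for more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # TF saturation: repeated occurrences add less and less to the score.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "postgres full text search".split(),
    "postgres postgres postgres database database tuning guide for postgres administrators".split(),
    "bm25 ranking for postgres search".split(),
]
scores = bm25_scores(["bm25", "postgres"], docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

In this toy corpus the third document ranks first: it is the only one containing the rare term "bm25", and the second document gains little from repeating the common term "postgres" four times because of saturation and its above-average length.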
