Show HN: Qdrant Vector Aggregator (github.com)

0 points 5 hours ago | visit original

🤖 AI Summary

Qdrant Vector Aggregator is an open-source Python library that collapses chunk-level embeddings in a Qdrant collection into one vector per document while preserving the full document text. It groups chunks by any metadata field (document name, ID, category), concatenates the chunk text in order when it detects an ordering field, and writes a single embedding and payload per document. According to the project, this substantially reduces storage and index size without sacrificing semantic-search quality: queries against the aggregated collection return complete documents with chunk_count, concatenated page_content, and ordering metadata for verification.

Technically, the tool offers 14 aggregation strategies (average, weighted_average, PCA, centroid, attentive_pooling, max/min/median, trimmed_mean, soft_dtw, procrustes, etc.), supports Qdrant Cloud and self-hosted instances via qdrant-client, and exposes a single aggregate_embeddings API with options for distance metric, custom weights, and batching (default 100 points per batch). It includes production features (error handling, logging, verification, debug scripts) and a verifier that reports the detected ordering fields, how many documents were concatenated, and the average content length.

It is aimed at teams that want cheaper, faster document-level retrieval and easier downstream processing, and it requires Python 3.7+ plus common ML dependencies (numpy, scikit-learn, python-dotenv).
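To make the grouping and averaging step concrete, here is a minimal sketch of the simplest strategy (a plain average) on in-memory data. It is not the library's aggregate_embeddings API: the function name aggregate_chunks, its parameters, the point layout (a dict with "vector" and "payload"), and the "doc" and "idx" field names are all assumptions chosen for illustration. Only page_content and chunk_count are taken from the summary above.

```python
from collections import defaultdict

import numpy as np


def aggregate_chunks(points, group_field, order_field=None, text_field="page_content"):
    """Hypothetical helper: merge chunk points into one record per document.

    Each point is assumed to look like {"vector": [...], "payload": {...}}.
    """
    # Bucket the chunks by the chosen metadata field.
    groups = defaultdict(list)
    for point in points:
        groups[point["payload"][group_field]].append(point)

    docs = {}
    for key, chunks in groups.items():
        # Restore the original chunk order when an ordering field is given.
        if order_field is not None:
            chunks = sorted(chunks, key=lambda c: c["payload"][order_field])

        # Plain average of the chunk vectors (the "average" strategy).
        vectors = np.array([c["vector"] for c in chunks], dtype=float)

        docs[key] = {
            "vector": vectors.mean(axis=0),
            "payload": {
                group_field: key,
                text_field: " ".join(c["payload"][text_field] for c in chunks),
                "chunk_count": len(chunks),
            },
        }
    return docs


# Toy data: two chunks of document "a" (stored out of order) and one of "b".
points = [
    {"vector": [0.0, 2.0], "payload": {"doc": "a", "idx": 1, "page_content": "world"}},
    {"vector": [2.0, 0.0], "payload": {"doc": "a", "idx": 0, "page_content": "hello"}},
    {"vector": [1.0, 1.0], "payload": {"doc": "b", "idx": 0, "page_content": "solo"}},
]
docs = aggregate_chunks(points, "doc", order_field="idx")
print(docs["a"]["payload"])
```

In a real pipeline the points would presumably be read from a Qdrant collection in batches and the merged records written to a new collection; that I/O is left out here so the sketch stays self-contained.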
