🤖 AI Summary
ClickHouse has published a large-scale public Hacker News dataset: 28.74 million posts with precomputed 384-dimensional vector embeddings (generated with the SentenceTransformers all-MiniLM-L6-v2 model), packaged as a single Parquet file on S3. The release includes SQL workflows to create a table, load the rows, build an HNSW vector index, and run approximate nearest neighbor (ANN) searches, making it a turnkey benchmark for designing, sizing, and evaluating production vector search pipelines on real user-generated text.
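The create/load/index/search workflow can be sketched as SQL statements assembled in Python. The table name, column names, and `bf16` quantization below are illustrative assumptions (the article does not specify a schema), the S3 location is left as a placeholder, and the exact `vector_similarity` DDL may vary by ClickHouse version; only the dimension (384) and the HNSW hyperparameters (M=64, ef_construction=512) come from the article.

```python
# Hypothetical schema and placeholder URL -- adapt to the real dataset docs.
PARQUET_URL = "<s3-parquet-url>"  # placeholder, not the real location

CREATE_TABLE = """
CREATE TABLE hackernews (
    id     UInt32,
    text   String,
    vector Array(Float32)  -- 384-dim all-MiniLM-L6-v2 embedding
) ENGINE = MergeTree ORDER BY id
"""

LOAD_DATA = f"""
INSERT INTO hackernews
SELECT id, text, vector
FROM s3('{PARQUET_URL}', 'Parquet')
"""

# HNSW index using the hyperparameters called out in the article
# (M=64 max connections per layer, ef_construction=512).
ADD_INDEX = """
ALTER TABLE hackernews
    ADD INDEX vec_idx vector
    TYPE vector_similarity('hnsw', 'cosineDistance', 384, 'bf16', 64, 512)
"""

# ANN search: order rows by cosine distance to a bound query embedding.
ANN_QUERY = """
SELECT id, text
FROM hackernews
ORDER BY cosineDistance(vector, {query_vector:Array(Float32)})
LIMIT 10
"""

if __name__ == "__main__":
    for stmt in (CREATE_TABLE, LOAD_DATA, ADD_INDEX, ANN_QUERY):
        print(stmt.strip(), end="\n\n")
```

In practice these statements would be sent through a client such as clickhouse-connect; they are shown here as strings so the shape of the workflow is visible without a running server.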
Technically, the dataset is designed to exercise storage, memory, and index-performance tradeoffs: the ClickHouse documentation walks through sizing exercises and calls out example HNSW hyperparameters (M=64, ef_construction=512), noting that index build and load times depend heavily on CPU cores and storage bandwidth. Search ranks rows by cosineDistance, and example Python snippets show how to generate query embeddings locally with sentence_transformers. The release also includes a retrieval-augmented generation (RAG) demo (LangChain + OpenAI gpt-3.5-turbo) that retrieves relevant posts and summarizes them, illustrating practical RAG pipelines for domains such as sentiment analysis, support automation, legal and medical records, and meeting transcripts. The dataset is a practical resource for researchers and engineers to benchmark ANN settings, test scaling behavior, and prototype end-to-end retrieval plus generation applications.
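For readers sanity-checking ANN results client-side, the metric the search orders by is cosine distance, i.e. 1 minus cosine similarity. A minimal pure-Python sketch of that formula (the helper name is ours, not a ClickHouse API):

```python
import math

def cosine_distance(a, b):
    """1 - (a . b) / (|a| * |b|), matching the definition of
    cosine distance used for ranking; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Parallel vectors give distance 0; orthogonal vectors give 1.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # → 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # → 1.0
```

The same function applied to a locally generated 384-dim query embedding and a stored post embedding reproduces the score the database sorts on, which is useful when comparing exact search against the HNSW index's approximate results.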