Grepctl: Semantic Search for Your Data Lake (github.com)

🤖 AI Summary
Grepctl is a new CLI and programmatic utility that turns heterogeneous data lakes into a single, semantically searchable index by wiring Google Cloud’s AI stack to BigQuery’s vector search. It ingests nine modalities—text/Markdown, PDFs, Office files, images, audio, video, JSON/CSV—using Document AI, Vision API, Speech-to-Text, Video Intelligence and Vertex AI. Processed content is chunked (1000-character chunks with 100-character overlap, paragraph-aware for docs), enriched with metadata (page numbers, timestamps, slide order), embedded with Vertex AI’s text-embedding-004 into 768-dimensional vectors, and made queryable via BigQuery VECTOR_SEARCH for sub-second semantic retrieval. The system exposes multiple interfaces—CLI, web UI, Python API (SearchClient), and ready-made BigQuery SQL functions (search, semantic_search, search_by_source, search_by_date, etc.)—so developers can call simple SQL or programmatic methods while grepctl handles embedding, vector search, and BigQuery wiring (including EXTERNAL_QUERY from GCS). Notable technical details: OCR on scanned PDFs, speaker diarization and long-form audio support (up to 480 minutes), video frame analysis with temporal alignment, and optional LLM reranking. For AI/ML teams, grepctl streamlines multimodal indexing and retrieval at scale, enabling fast, relevance-aware exploration of an organization’s entire data corpus without custom ETL.
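The chunking step described above (1000-character chunks with 100-character overlap) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not grepctl's actual code: the function name `chunk_text` and its parameters are hypothetical, and the sketch uses plain sliding character windows without the paragraph-aware splitting the summary mentions for documents.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into character windows where each window repeats the
    last `overlap` characters of the previous one.

    Hypothetical helper for illustration; not part of grepctl's API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made up purely of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(2500))
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.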