🤖 AI Summary
Graphite needed low-latency code search for its agentic Chat tool, not just on main branches but at arbitrary commits. Simple approaches, such as running git grep on mounted volumes or serving from EBS/EFS-backed servers, collapsed on very large repos: cold disk reads left performance at the mercy of the OS page cache. Indexing every commit in a document store (e.g., Elasticsearch) was fast for a single snapshot but impractical at scale, since thousands of repos × thousands of commits explodes document counts and cost.
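To see why the counts explode, here is a hedged back-of-envelope comparison; all figures below are illustrative assumptions, not numbers from the article, except the ~3× blob multiple it reports:

```typescript
// Illustrative back-of-envelope math (repo/commit/file counts are hypothetical).
const repos = 1_000;
const commitsPerRepo = 5_000;
const filesPerCommit = 10_000;

// Naive approach: one indexed document per (commit, file) pair.
const naiveDocs = repos * commitsPerRepo * filesPerCommit; // 5e10 documents

// Git-style approach: blobs are deduplicated across commits (~3x the
// working-tree file count, per the article), plus one tree doc per commit.
const dedupedDocs = repos * (filesPerCommit * 3 + commitsPerRepo); // 3.5e7

console.log(Math.round(naiveDocs / dedupedDocs)); // ~1,400x fewer documents
```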
Instead, the team borrowed Git's content model: blobs (file contents) and trees (per-commit file lists) are stored in a document database (Turbopuffer). A search runs two queries in parallel, one fetching the commit's tree and one fetching all blobs matching the query, then applies an in-memory filter that streams back only blobs whose IDs appear in that tree. Because blobs are deduplicated across commits, this keeps the indexed document count to a small multiple of the working-tree file count (empirically ~3×), lets responses stream immediately, and avoids fully materializing every commit. The system is live in production (tens of millions of files across thousands of repositories) with median latency consistently under 100 ms, supports arbitrary commits without hitting GitHub API limits, and opens the door to embeddings/semantic search and further optimizations such as compact precomputed blob-ID sets.
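A minimal sketch of the two-query flow, using an in-memory stand-in for the document store; the helper names and store shape here are invented for illustration, and Turbopuffer's actual API is not reproduced:

```typescript
type BlobHit = { blobId: string; snippet: string };

// Toy store: trees map a commit to the set of blob IDs in its file list;
// blobs map a blob ID to file content (deduplicated across commits).
const trees = new Map<string, Set<string>>([
  ["commit-a", new Set(["blob-1", "blob-2"])],
]);
const blobStore = new Map<string, string>([
  ["blob-1", "function parse() {}"],
  ["blob-2", "const x = 1;"],
  ["blob-3", "function parse() { /* newer version */ }"], // other commit only
]);

async function fetchTree(commit: string): Promise<Set<string>> {
  return trees.get(commit) ?? new Set();
}

// Content search over every blob version in the repo, regardless of commit.
async function* searchBlobs(query: string): AsyncGenerator<BlobHit> {
  for (const [blobId, content] of blobStore) {
    if (content.includes(query)) yield { blobId, snippet: content };
  }
}

// Kick off both queries in parallel, then stream only hits whose blob ID
// appears in the requested commit's tree.
async function* searchAtCommit(
  commit: string,
  query: string,
): AsyncGenerator<BlobHit> {
  const treePromise = fetchTree(commit); // query 1: the commit's file list
  const hits = searchBlobs(query);       // query 2: all matching blobs
  const tree = await treePromise;
  for await (const hit of hits) {
    if (tree.has(hit.blobId)) yield hit; // drop matches from other commits
  }
}

// Usage: finds blob-1 but not blob-3, which also matches the query but
// belongs only to a different commit's tree.
(async () => {
  for await (const hit of searchAtCommit("commit-a", "parse")) {
    console.log(hit.blobId, hit.snippet);
  }
})();
```

The key property is that the blob query is commit-agnostic, so the per-commit work reduces to one tree lookup plus a set-membership filter over the streamed results.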