The Lost Nuance of Grep vs. Semantic Search (www.nuss-and-bolts.com)

🤖 AI Summary
The debate between “grep” (agentic, vector-less search) and semantic RAG isn’t binary: it depends on the task and data. A toy experiment on Natural Questions showed that plain grep over keyword-matched files is simple but slow (latency scaled linearly with index size and was worse than a numpy vector search on a Mac) and brittle — it only finds exact matches.

Adding a cheap LLM (gpt-5-mini) to generate relevant keywords boosted grep-style retrieval nearly 10x, showing that grep excels for known, easily derived tokens but fails when queries reference oblique or renamed concepts. Agentic search trades indexing and embedding compute for LLM-driven flexibility and simpler engineering: no index, no chunking, no extra embedding security surface.

Cursor’s real contribution is hybrid nuance: their embedding model is trained on agent traces (file reads, grep steps) where an LLM ranks which content was actually helpful, aligning similarity scores with how agents solve coding tasks. That implicitly encodes query-expansion and reranking behavior (similar to ReasonIR), improving retrieval for code scenarios where names aren’t obvious.

The takeaway for practitioners: use grep/agentic approaches for fast, exact-match workflows and prototypes; use embeddings — ideally trained on agent traces or paired with explicit expansion/rerankers — when you need semantic flexibility, robustness to vocabulary drift, and continuous-domain retrieval. Hybrid pipelines that combine both are the most pragmatic path forward.
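The two baselines in the experiment can be sketched in a few lines. This is a minimal illustration, not the post's actual benchmark code: the toy corpus and the random embedding matrix are stand-ins (the real experiment used Natural Questions and a real embedding model), but it shows why grep's latency scales linearly with corpus size and why it misses anything that isn't an exact token match, while the numpy baseline is a single matrix-vector product over a precomputed index.

```python
import numpy as np

# Hypothetical three-document corpus for illustration only.
docs = [
    "The Eiffel Tower is in Paris.",
    "Grep scans text for exact string matches.",
    "Embeddings map text into a vector space.",
]

def grep_search(query_terms, docs):
    """Exact-match retrieval: return every doc containing any query term.
    Cost is O(corpus size) per query, and documents that phrase the same
    concept with different words are simply missed."""
    terms = [t.lower() for t in query_terms]
    return [d for d in docs if any(t in d.lower() for t in terms)]

def vector_search(query_vec, doc_vecs, k=1):
    """Brute-force cosine-similarity search over a precomputed embedding
    matrix -- the numpy baseline the summary compares grep against."""
    doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ q          # one matvec scores the whole corpus
    return np.argsort(scores)[::-1][:k]

# Stand-in embeddings (random, for the sketch); a real system would
# embed docs once at index time and reuse the matrix per query.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(len(docs), 8))

print(grep_search(["grep"], docs))            # hits only the literal match
print(vector_search(doc_vecs[1], doc_vecs))   # nearest neighbor by cosine
```

The engineering trade-off the summary describes is visible here: `grep_search` needs no index at all, while `vector_search` needs the `doc_vecs` matrix built and maintained up front.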
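The keyword-expansion step that gave the near-10x boost can also be sketched. A hypothetical hard-coded synonym table stands in below for the gpt-5-mini call the post actually made; the point is only the shape of the pipeline: expand the query into model-suggested alternative tokens, then grep for any of them.

```python
# Hypothetical stand-in for an LLM's keyword suggestions (the original
# experiment called gpt-5-mini here).
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase", "acquire"],
}

def expand_keywords(query):
    """Return the query's own terms plus model-suggested alternatives,
    so grep can hit documents that use different vocabulary."""
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

def grep_with_expansion(query, docs):
    terms = expand_keywords(query)
    return [d for d in docs if any(t in d.lower() for t in terms)]

docs = ["Where to purchase an automobile", "How to grow tomatoes"]
print(grep_with_expansion("buy car", docs))  # matched via "purchase"/"automobile"
```

Plain grep on "buy car" finds nothing in this corpus; the expanded query does — the same vocabulary-drift failure mode the summary says breaks grep on oblique or renamed concepts.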