The RAG Obituary: Killed by agents, buried by context windows (www.nicolasbustamante.com)

🤖 AI Summary
The author, builder of Fintool, argues that Retrieval-Augmented Generation (RAG), the long-standing pattern of chunking documents, embedding the chunks, and feeding the top hits to LLMs, is becoming obsolete as context windows expand and agent-based systems mature. After years of engineering around RAG, the author concludes that even sophisticated chunking (hierarchy preservation, table integrity, cross-reference links, rich metadata) cannot overcome the fundamental problem: fragmented context. When early LLMs were constrained to 4k–8k tokens (a single SEC 10-K runs roughly 51k tokens), RAG emerged as a pragmatic hack, but growing context windows and agentic architectures (e.g., Anthropic's Claude Code agents) threaten to remove the need for chunking- and embedding-based retrieval altogether.

Technically, the author details why RAG pipelines are brittle: naive 400–1,000 token chunks break tables and cross-references (see the first sketch below); 1,536-dimension embeddings misrepresent numeric data and domain jargon; semantic search confuses "revenue recognition" with "revenue growth"; and hybrid BM25-plus-embedding systems require complex weighting and reciprocal-rank fusion (see the second sketch below). Rerankers add 300–2,000 ms of latency and extra cost (the piece cites Cohere's rerank pricing) and impose their own token limits (often 4k), creating cascading failure points across chunking → embedding → BM25 → fusion → rerank.

The implication for AI/ML teams is strategic: prepare for pipelines that rely less on brittle vector search and heavy indexing infrastructure and more on larger-context models, agentic planners, and end-to-end reasoning over full documents, changing tooling, evaluation, and operational costs across the stack.
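To make the fragmentation complaint concrete, here is a minimal sketch of the kind of naive fixed-size chunker the summary criticizes. The function name, the whitespace-word approximation of tokens, and the 400-token default are illustrative assumptions, not the author's code; production pipelines would use a real tokenizer such as tiktoken.

```python
def chunk_fixed(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split `text` into fixed windows of ~chunk_size words with `overlap` words of carry-over.

    This is the failure mode the article describes: the window boundary is
    blind to structure, so a financial table can be cut in half and a row of
    numbers lands in a chunk that no longer contains its header row.
    """
    words = text.split()  # crude stand-in for tokenization
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
```

Any chunk containing "1,234" but not the "Revenue (in thousands)" header above it is exactly the fragmented context that no amount of downstream reranking can repair.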
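The hybrid-retrieval step combines a lexical (BM25) ranking with an embedding ranking via reciprocal-rank fusion. Below is a minimal sketch of standard RRF under the assumption that each retriever returns an ordered list of document IDs; `k=60` is the constant from Cormack et al.'s 2009 formulation, and the example IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings: score(d) = sum over rankings of 1 / (k + rank_of_d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_7", "doc_2", "doc_9"]    # lexical hits (hypothetical)
vector_ranking = ["doc_2", "doc_5", "doc_7"]  # semantic hits (hypothetical)
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
# doc_2 and doc_7 rise to the top because both retrievers found them
```

The simplicity is deceptive: final quality now depends on two independent retrieval pipelines agreeing, plus the fusion constant, plus whatever reranker sits downstream, which is the cascade of failure points the author is objecting to.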