🤖 AI Summary
Context windows have ballooned from ~8K to 200K–1M tokens, but agentic tasks (coding, deep research, iterative web agents) consume tokens even faster than windows grow. Real sessions routinely hit 50–200K tokens just from current files, chat history, edit logs, and project context; OpenAI’s Deep Research can burn through 2M tokens and generate multi-page reports, at $30+ per query. That creates both cost and capability failures: agents hit context limits mid-task, lose continuity, and can’t sustain multi-step workflows. Traditional RAG (a single retrieval from external storage) is simple and fast for one-shot lookups, but it fails when the first retrieval misses, caps multi-hop reasoning, and cannot iterate or refine queries.
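To make the "single retrieval" limitation concrete, here is a minimal Python sketch of a traditional RAG step; `embed`, `search`, and `llm` are hypothetical callables standing in for an embedding model, a vector index, and a completion call, not any specific library's API.

```python
from typing import Callable, List

def rag_answer(
    question: str,
    embed: Callable[[str], List[float]],              # hypothetical embedding function
    search: Callable[[List[float], int], List[str]],  # hypothetical vector-index lookup
    llm: Callable[[str], str],                        # hypothetical completion call
    top_k: int = 5,
) -> str:
    """Single-shot RAG: one retrieval, one generation.

    If the first retrieval misses the relevant chunks, there is no second
    attempt, no query refinement, and no multi-hop follow-up.
    """
    chunks = search(embed(question), top_k)
    context = "\n---\n".join(chunks)
    prompt = f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```

Everything hinges on that one `search` call: whatever it returns is all the model will ever see for this query.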
Memory-based architectures (exemplified by MemGPT) treat the LLM’s context like RAM and external storage like disk: the main 200K-token context is managed as fast “RAM,” while a hierarchical external store holds long-term data. When context usage hits ~70%, the LLM writes important items to permanent storage; at 100%, old messages are flushed to recall storage and retrieved on demand. This enables multiple retrieval attempts, query refinement, paging, and true multi-hop chaining. Empirically, memory systems outperform RAG: document QA (~65% vs <50%), multi-hop chains (100% vs 0%), and long-conversation recall (92% vs 32%). Practical guidance: use RAG for short, speed-critical queries; use memory management for long-running agents, iterative reasoning, and complex research, often combining both with summarization and context optimization.
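As a rough illustration of that hierarchy (not MemGPT's actual implementation), the sketch below uses invented names (`MemoryManager`, `archival`, `recall`) and the thresholds from the summary: important items are copied to permanent storage at ~70% usage, and the oldest messages are flushed to recall storage when the window fills, where they remain available for repeated, refinable retrieval.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List

@dataclass
class Item:
    text: str
    tokens: int
    important: bool = False
    archived: bool = False

@dataclass
class MemoryManager:
    context_limit: int = 200_000                        # in-context "RAM" budget, in tokens
    archive_threshold: float = 0.70                     # ~70% trigger for archiving
    context: deque = field(default_factory=deque)       # fast "RAM": the live context window
    archival: List[str] = field(default_factory=list)   # permanent long-term storage
    recall: List[str] = field(default_factory=list)     # evicted messages, searchable later

    def used(self) -> int:
        return sum(item.tokens for item in self.context)

    def add(self, text: str, tokens: int, important: bool = False) -> None:
        self.context.append(Item(text, tokens, important))
        if self.used() >= self.archive_threshold * self.context_limit:
            # ~70% full: copy important items to permanent (archival) storage.
            for item in self.context:
                if item.important and not item.archived:
                    self.archival.append(item.text)
                    item.archived = True
        while self.used() > self.context_limit:
            # 100% full: flush the oldest messages to recall storage.
            self.recall.append(self.context.popleft().text)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        # Naive keyword match standing in for real retrieval; the agent can call
        # this repeatedly, refining the query across multiple hops.
        pool = self.recall + self.archival
        return [t for t in pool if query.lower() in t.lower()][:k]
```

Unlike the single-shot RAG call, nothing here is final: the agent can keep paging evicted material back in and re-query with refined terms as a multi-step task unfolds.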