Evaluate Your Own RAG, Why Best Practices Failed Us (huggingface.co)

šŸ¤– AI Summary
A nuclear engineering team building an SMR ran a production-grade benchmark of Retrieval-Augmented Generation (RAG) on their multilingual, equation- and diagram-heavy technical corpus (156 queries in English, French, and Japanese; interrogative and affirmative forms) to see which "best practices" actually work. They tested chunking strategies (naive vs. context-aware), chunk sizes (2K to 40K characters), embedding models (AWS Titan V2, Qwen 8B, Mistral), and retrieval modes (dense, sparse, hybrid), using Mistral OCR for PDF extraction and Qdrant as the vector DB.

Key results: AWS Titan V2 gave the best document-level hit rate (69.2% top-10) vs. Qwen at 57.7% and Mistral at 39.1%; naive chunking beat context-aware chunking (70.5% vs. 63.8% on average); chunk size had negligible effect (2K ā‰ˆ 40K); and dense-only retrieval beat hybrid in their tests (69.2% vs. 63.5%). They measured Top-10 recall, MRR, and Top-1 recall under a document-retrieval goal (finding the right document, not the exact paragraph).

The takeaway for AI/ML teams: off-the-shelf "best practices" and public benchmarks (e.g., MTEB) can be misleading; evaluate on your actual documents, languages, and query styles. Prioritize embedding-model robustness and retrieval mode over elaborate chunking or micro-tuning chunk sizes. Practically, Mistral's OCR was worth the cost for hard PDFs, Qdrant worked well operationally, and AWS OpenSearch was cost-prohibitive. In short: measure with realistic, multilingual tests; pick embeddings that are consistent across conditions; keep chunking simple; and optimize for cost and reliability.
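The evaluation described above (document-level Top-1 recall, Top-10 recall, and MRR over the 156-query set) reduces to a small scoring loop. Below is a minimal sketch assuming chunk hits have already been collapsed to their parent document IDs; the QueryResult structure and field names are hypothetical illustrations, not taken from the original post or its codebase.

from dataclasses import dataclass


@dataclass
class QueryResult:
    gold_doc_id: str               # document that should be retrieved for this query
    retrieved_doc_ids: list[str]   # ranked documents, best first (chunks collapsed to docs)


def evaluate(results: list[QueryResult], k: int = 10) -> dict[str, float]:
    """Compute document-level Top-1 recall, Top-k recall, and MRR."""
    top1 = topk = rr_sum = 0.0
    for r in results:
        ranked = r.retrieved_doc_ids[:k]
        if ranked and ranked[0] == r.gold_doc_id:
            top1 += 1
        if r.gold_doc_id in ranked:
            topk += 1
            # Reciprocal rank: 1 / (1-indexed position of the gold document)
            rr_sum += 1.0 / (ranked.index(r.gold_doc_id) + 1)
    n = len(results)
    return {
        "top1_recall": top1 / n,
        f"top{k}_recall": topk / n,
        "mrr": rr_sum / n,
    }


if __name__ == "__main__":
    # Toy example with three queries (e.g., one per benchmark language).
    demo = [
        QueryResult("doc_reactor_physics", ["doc_reactor_physics", "doc_thermo"]),
        QueryResult("doc_thermo", ["doc_materials", "doc_thermo"]),
        QueryResult("doc_materials", ["doc_reactor_physics", "doc_thermo"]),
    ]
    print(evaluate(demo, k=10))

Because the goal is document retrieval rather than paragraph retrieval, collapsing chunk hits to unique document IDs before scoring is what makes the reported hit rates comparable across chunk sizes and chunking strategies.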