🤖 AI Summary
A nuclear engineering team building an SMR ran a production-grade benchmark of Retrieval-Augmented Generation (RAG) on their multilingual, equation- and diagram-heavy technical corpus (156 queries in English, French, and Japanese; interrogative and affirmative forms) to see which "best practices" actually work. They tested chunking strategies (naive vs. context-aware), chunk sizes (2K–40K chars), embedding models (AWS Titan V2, Qwen 8B, Mistral), and retrieval modes (dense, sparse, hybrid), using Mistral OCR for PDF extraction and Qdrant as the vector DB. Key results: AWS Titan V2 gave the best document-level hit rate (69.2% top-10) vs. Qwen at 57.7% and Mistral at 39.1%; naive chunking beat context-aware (70.5% vs. 63.8% on average); chunk size had a negligible effect (2K ≈ 40K); and dense-only retrieval beat hybrid in their tests (69.2% vs. 63.5%). They measured Top-10 recall, MRR, and Top-1 recall under a document-retrieval goal (finding the right document, not the exact paragraph).
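For reference, document-level metrics of this kind can be computed as below. This is a minimal sketch, not the team's actual harness: the `QueryResult` type and field names are illustrative assumptions, and it presumes each query has exactly one known relevant document and the retriever returns a ranked list of chunk hits tagged with their source document IDs.

```python
from dataclasses import dataclass

@dataclass
class QueryResult:
    expected_doc: str        # ground-truth document ID for the query (assumed single-label)
    ranked_docs: list[str]   # source document IDs of retrieved chunks, best first

def evaluate(results: list[QueryResult], k: int = 10) -> dict[str, float]:
    """Compute document-level Top-1 recall, Top-k hit rate, and MRR."""
    top1 = topk = rr_sum = 0.0
    for r in results:
        # Collapse repeated chunk hits from the same document, keeping rank order.
        docs = list(dict.fromkeys(r.ranked_docs))
        if docs and docs[0] == r.expected_doc:
            top1 += 1
        if r.expected_doc in docs[:k]:
            topk += 1
        if r.expected_doc in docs:
            rr_sum += 1.0 / (docs.index(r.expected_doc) + 1)  # reciprocal rank
    n = len(results)
    return {"top1": top1 / n, f"top{k}": topk / n, "mrr": rr_sum / n}

# Toy example: one query hit at rank 1, one at rank 3.
print(evaluate([
    QueryResult("doc-A", ["doc-A", "doc-B", "doc-C"]),
    QueryResult("doc-B", ["doc-C", "doc-A", "doc-B"]),
]))
```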
The takeaway for AI/ML teams: off-the-shelf "best practices" and public benchmarks (e.g., MTEB) can be misleading; evaluate on your actual documents, languages, and query styles. Prioritize embedding-model robustness and retrieval mode over elaborate chunking or micro-tuning chunk sizes. Practically, Mistral's OCR was worth the cost for hard PDFs, Qdrant worked well operationally, and AWS OpenSearch was cost-prohibitive. In short: measure with realistic, multilingual tests; pick embeddings that are consistent across conditions; keep chunking simple, and optimize for cost and reliability.
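To make the "keep chunking simple" point concrete, naive chunking here just means fixed-size character splitting with no awareness of headings, equations, or tables. A minimal sketch follows; the 2,000-character default mirrors the smallest size tested, while the overlap value is an arbitrary assumption rather than a setting from the article.

```python
def naive_chunks(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with a small overlap.

    No structure awareness: section headings, equations, and tables are cut
    wherever the character count happens to land.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Each chunk would then be embedded (e.g., with the winning Titan V2 model)
# and upserted into the vector store alongside its source document ID.
```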