🤖 AI Summary
The MTEB team has launched RTEB (Retrieval Embedding Benchmark) in beta: a new, retrieval-first benchmark designed to measure how well embedding models actually generalize in real-world search settings. RTEB responds to widespread problems with current evaluations, including "teaching to the test," overlap between benchmark datasets and model training data, and benchmarks that do not reflect enterprise retrieval needs, by combining fully open datasets with private datasets evaluated centrally by the maintainers. This hybrid design aims to expose overfitting (a large drop from open to private scores) and to give an unbiased view of in-the-wild retrieval performance for RAG, agents, recommendation systems, and other search-dependent applications.
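The open-vs-private comparison the summary alludes to amounts to checking how much a model's score falls on the hidden sets. A minimal sketch of that check follows; the `scores` values and the 10% threshold are illustrative assumptions, not anything defined by RTEB.

```python
# Hypothetical aggregate retrieval scores for one model on the two splits.
scores = {"open": 0.72, "private": 0.58}

# Relative drop from open to private; a large gap hints at benchmark overfitting.
drop = (scores["open"] - scores["private"]) / scores["open"]
if drop > 0.10:  # threshold chosen for illustration only, not an RTEB rule
    print(f"Possible overfitting: {drop:.0%} relative drop on the private sets")
```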
Technically, RTEB uses established ranking metrics, with NDCG@10 as the default leaderboard metric; it includes domain-specific datasets (law, healthcare, finance, code) across 20 languages and enforces minimum dataset sizes that keep evaluation efficient yet meaningful (at least 1k documents and roughly 50 queries). Private sets ship with descriptive statistics and sample triplets for transparency while avoiding direct contamination. A simple grouping system (datasets can belong to multiple groups) and continual updates invite community contributions. Implication: RTEB should push model developers to optimize true retrieval generalization rather than benchmark memorization, giving practitioners a fairer, enterprise-aligned standard for comparing embedding models.
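For context, here is a minimal, self-contained sketch of NDCG@10, the leaderboard's default metric, under standard graded-relevance assumptions. It is not RTEB's or MTEB's implementation, and the example relevance list is made up.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked documents."""
    return sum(rel / math.log2(rank + 2)   # rank is 0-based, so the discount is log2(rank+2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the model's ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the top documents returned for one query (illustrative values).
print(round(ndcg_at_k([3, 2, 0, 1], k=10), 2))  # ~0.99, a near-ideal ordering
```

Per-query NDCG@10 values like this are averaged over a dataset's queries, which is why the benchmark insists on a minimum number of queries per dataset to keep the average meaningful.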