🤖 AI Summary
A new benchmark argues that purpose-built rerankers beat general LLMs for production reranking: Voyage AI's rerank-2.5 and rerank-2.5-lite are up to 60x cheaper, 48x faster, and deliver as much as 15% higher NDCG@10 than state-of-the-art LLMs. The study covered 13 real-world datasets across eight domains (code, law, finance, medical, docs, conversations, reviews, etc.) and three first-stage retrievers (BM25 lexical search, voyage-3-large, and the lightweight voyage-3-lite), comparing the rerank-2.5 family against LLMs including GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, and Qwen 3 32B. Averaged across setups, rerank-2.5 outperformed the top LLMs by ~12–15% in NDCG@10; rerank-2.5-lite hit a strong cost/accuracy point ($0.02 per 1M tokens at NDCG@10 ≈ 83.1%), while the LLMs cost ~$1.25–$3 per 1M tokens.
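For context on the headline metric: NDCG@10 scores the top 10 results by graded relevance, discounted by rank and normalized against the ideal ordering. A minimal Python sketch follows; the relevance grades are illustrative, not from the benchmark, and a full evaluation would compute the ideal DCG over all judged documents rather than just the ranked list.

```python
import math

def dcg_at_k(rels, k=10):
    # DCG@k: graded relevance discounted by log2(rank + 1), ranks 1-indexed.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, k=10):
    # Normalize by the ideal DCG: the same relevance grades sorted best-first.
    ideal = dcg_at_k(sorted(ranked_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal else 0.0

# Relevance grades of a system's top-ranked results (illustrative values).
print(round(ndcg_at_k([3, 2, 3, 0, 1, 2]), 3))
```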
Key technical takeaways: two-stage retrieval still matters. Pairing a strong first-stage retriever with a specialized reranker gives the best ceiling (voyage-3-large + rerank-2.5 reached NDCG@10 ≈ 84.3%). LLM rerankers can help when first-stage retrieval is weak, but they often degrade already-strong rankings and show diminishing returns beyond ~50–100 candidate documents. Single-pass reranking over a long context (Gemini 2.0 Flash with its 1M-token window) underperformed sliding-window reranking by ~22–27%, undercutting the case for huge context windows in this task. Bottom line: for production RAG and search pipelines where cost, latency, and consistent accuracy matter, small cross-encoder rerankers remain the practical choice.
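The sliding-window approach referenced above can be sketched as follows. This is a generic illustration, not the benchmark's code; `reorder` is a hypothetical stand-in for whatever reorders a small window (a cross-encoder scorer or an LLM prompted to sort ~20 candidates).

```python
def sliding_window_rerank(query, docs, reorder, window=20, stride=10):
    """Rerank candidates by sliding a window from the tail of the list
    toward the head; each pass reorders one window, so relevant documents
    bubble upward across overlapping windows."""
    docs = list(docs)
    if not docs:
        return docs
    end = len(docs)
    while True:
        start = max(0, end - window)
        docs[start:end] = reorder(query, docs[start:end])
        if start == 0:   # the window has reached the head of the list
            return docs
        end -= stride    # slide back, overlapping the previous window

def toy_reorder(query, window_docs):
    # Toy stand-in scorer: sort by query-term overlap. A real system would
    # call a cross-encoder reranker or an LLM here instead.
    terms = set(query.lower().split())
    return sorted(window_docs,
                  key=lambda d: len(terms & set(d.lower().split())),
                  reverse=True)

candidates = ["pricing for rerank models", "unrelated text",
              "reranker latency and cost", "more filler"] * 25  # ~100 docs
top10 = sliding_window_rerank("reranker cost", candidates, toy_reorder)[:10]
```

Overlapping windows let strong documents bubble from the tail toward the head, and capping the first-stage list near 50–100 candidates matches the diminishing-returns observation above.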
        