Scalable In-Context Ranking with Generative Models (research.google)

🤖 AI Summary
Researchers introduce BlockRank, a method that makes In-Context Ranking (ICR) with generative LLMs both scalable and accurate. ICR asks an LLM to rank candidate documents by including the task, query, and documents in the prompt, but self-attention cost grows quadratically with context length. The paper identifies two exploitable attention structures in fine-tuned LLMs: dense attention inside each document block but sparse attention across documents (inter-document block sparsity), and strong correlation between certain query-to-block attention scores and actual relevance (query-document block relevance).

Leveraging these, BlockRank enforces blockwise sparsity in the model's attention (reducing complexity from quadratic to linear in the number of documents) and adds an auxiliary contrastive fine-tuning objective that amplifies attention toward truly relevant blocks.

Empirically, with Mistral-7B on BEIR, MS MARCO, and Natural Questions, BlockRank matches or outperforms state-of-the-art listwise rankers and controlled fine-tuned baselines while dramatically cutting inference cost (about 4.7× faster for 100 MS MARCO documents). It also scales to long shortlists of around 500 documents (≈100K tokens) in context, serving rankings in roughly a second. The result is a practical, architecture-aware approach that preserves LLM retrieval quality while enabling real-time, large-shortlist ICR by reshaping attention and directly training relevance-aware block interactions.
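To make the attention structure concrete, here is a minimal sketch of a blockwise-sparse attention mask of the kind the summary describes, assuming a prompt layout of [shared prefix] [doc 1] … [doc N] [query]; the segment-id scheme and the exact attention rules are illustrative assumptions, not the paper's implementation.

```python
import torch

def block_sparse_mask(seg: torch.Tensor, n_docs: int) -> torch.Tensor:
    """Boolean (T, T) mask: True where position i may attend to position j.

    Assumed rules (one reading of inter-document block sparsity):
      * every token attends to the shared prefix (segment 0);
      * document tokens attend densely within their own block only;
      * query tokens (segment n_docs + 1) attend to everything;
      * causal order is enforced on top (decoder-only LLM).
    """
    T = seg.shape[0]
    same_block = seg[:, None] == seg[None, :]        # dense within a block
    to_prefix = seg[None, :] == 0                    # all tokens see the prefix
    from_query = seg[:, None] == n_docs + 1          # query attends globally
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    return (same_block | to_prefix | from_query) & causal

# Toy layout: 2-token prefix, three 3-token documents, 2-token query.
seg = torch.tensor([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4])
mask = block_sparse_mask(seg, n_docs=3)
```

Because document blocks never attend to each other, per-layer attention cost scales with the number of documents times the block size squared, rather than with the square of the total context length, which is where the quadratic-to-linear claim comes from.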
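The auxiliary objective can be sketched in the same spirit: pool the attention mass a designated query token assigns to each document block, then train those pooled scores contrastively against the labeled relevant document. The choice of query token, the sum pooling, and the temperature below are hypothetical illustrations, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def block_relevance_loss(attn_row: torch.Tensor, seg: torch.Tensor,
                         n_docs: int, pos_doc: int,
                         tau: float = 0.05) -> torch.Tensor:
    """InfoNCE-style auxiliary loss over query-to-block attention.

    attn_row: (T,) attention weights from a chosen query token to all
              positions (which token and layer to read is an assumption).
    pos_doc:  1-based index of the labeled relevant document.
    """
    # Pool attention mass per document block (segments 1..n_docs).
    block_scores = torch.stack(
        [attn_row[seg == d].sum() for d in range(1, n_docs + 1)]
    )
    logits = block_scores / tau            # temperature-scaled block logits
    target = torch.tensor([pos_doc - 1])
    return F.cross_entropy(logits.unsqueeze(0), target)

# At inference, the same pooled scores can order the shortlist directly:
#   ranking = block_scores.argsort(descending=True)
```

Reading the ranking off pooled attention scores, rather than decoding a ranked list token by token, is one plausible way such a method could serve 500-document shortlists in about a second.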