🤖 AI Summary
Researchers studied the limitations of dual-encoder (DE) retrieval models for hierarchical retrieval (HR), the setting where a query's relevant documents are its ancestors in a taxonomy, and proved a geometric capacity constraint: DEs can represent HR exactly only if the embedding dimension scales linearly with hierarchy depth and logarithmically with the number of documents. In practice, even when DEs are trained in the usual way (on matching query–document pairs), the Euclidean embedding geometry still falls short, producing a "lost-in-the-long-distance" phenomenon in which retrieval accuracy drops for ancestors that are farther away in the hierarchy.
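For concreteness, here is a minimal sketch of the HR setup the summary describes: documents form a taxonomy, a query's relevant set is its chain of ancestors, and a DE ranks candidates by inner product in a d-dimensional space. The toy taxonomy, names, and random (untrained) embeddings below are illustrative assumptions, not the paper's construction.

```python
# Minimal sketch of hierarchical retrieval (HR) with a dual encoder (DE).
# Everything here is a toy illustration, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

# Toy taxonomy: child -> parent; None marks the root.
parent = {"root": None, "animal": "root", "dog": "animal", "poodle": "dog"}

def ancestors(node):
    """Relevant set for HR: the query node's ancestors (including itself)."""
    out = []
    while node is not None:
        out.append(node)
        node = parent[node]
    return out

d = 8  # embedding dimension; per the summary, d must grow with depth
doc_emb = {n: rng.normal(size=d) for n in parent}    # document tower output
query_emb = {n: rng.normal(size=d) for n in parent}  # query tower output

def retrieve(query, k=3):
    """DE retrieval: rank all documents by inner product with the query."""
    q = query_emb[query]
    scores = {n: float(q @ e) for n, e in doc_emb.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print("relevant:", ancestors("poodle"))
print("retrieved:", retrieve("poodle"))  # embeddings are untrained, so likely wrong
```

With trained embeddings, the lost-in-the-long-distance failure would show up here as "animal" and "root" (the distant ancestors of "poodle") ranking below nearer nodes.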
To address this, the authors propose a simple pretrain→finetune recipe that reshapes the embedding space to preserve long-range hierarchical relationships without hurting performance on close ancestors. Empirically, on a realistic WordNet hierarchy, their method raises recall on long-distance ancestor pairs from 19% to 76%, and it also improves retrieval of relevant products on a shopping-queries dataset. The work establishes both a theoretical capacity requirement for DEs on hierarchical tasks and a practical training strategy that overcomes geometry-induced failure modes, with direct implications for taxonomy-aware search, product recommendation, and multi-granularity retrieval systems.
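The summary does not spell out the recipe's details, so the sketch below is one plausible instantiation under stated assumptions: pretrain the two towers on (query, ancestor) pairs spanning all hierarchy distances, then finetune on the usual close-level matching pairs, both with an in-batch InfoNCE-style contrastive loss. The linear encoders, data stand-ins, and hyperparameters are hypothetical.

```python
# Hedged sketch of a pretrain -> finetune recipe for HR. The two-stage idea
# follows the summary; the loss, architectures, and hyperparameters are my
# assumptions, not the paper's exact procedure.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_emb = 16, 8
enc_q = nn.Linear(d_in, d_emb)  # query tower (toy)
enc_d = nn.Linear(d_in, d_emb)  # document tower (toy)

def info_nce(q, d, temperature=0.05):
    """In-batch contrastive loss: row i's positive is document i."""
    logits = (q @ d.T) / temperature
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels)

def train_stage(pairs, epochs, lr):
    opt = torch.optim.Adam([*enc_q.parameters(), *enc_d.parameters()], lr=lr)
    for _ in range(epochs):
        for q_feats, d_feats in pairs:
            loss = info_nce(enc_q(q_feats), enc_d(d_feats))
            opt.zero_grad()
            loss.backward()
            opt.step()

# Toy stand-ins for the two training sets (hypothetical): stage 1 uses
# (query, ancestor) pairs at *all* distances so the geometry learns
# long-range structure; stage 2 uses the usual close-level pairs.
long_range_pairs = [(torch.randn(4, d_in), torch.randn(4, d_in))]
close_level_pairs = [(torch.randn(4, d_in), torch.randn(4, d_in))]

train_stage(long_range_pairs, epochs=3, lr=1e-3)   # pretrain
train_stage(close_level_pairs, epochs=1, lr=1e-4)  # finetune
```

The design intent in this sketch is that the pretraining stage stretches the geometry to encode long-range ancestor relationships before finetuning locks in close-level accuracy, matching the summary's claim that the recipe preserves both.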