Space-Efficient Data Structures for Top-K Completion (2013) [pdf] (groups.di.unipi.it)

🤖 AI Summary
Hsu and Ottaviano (2013) present three space-efficient, trie-based data structures for top-k auto-completion over very large scored string sets — a common need for web search, social networks and mobile keyboards where memory is limited and latency must be microsecond-level. The paper formalizes top-k completion (return the k highest-scored strings with a given prefix) and shows how to compress both strings and their scores to fit in main memory while still returning completions extremely fast. The methods exploit skewed score distributions, heavy prefix sharing, and score-aware node ordering to prioritize high-score branches. Technically, the three solutions are: (1) Completion Trie — a compressed compacted trie that stores the maximum descendant score at each node and orders children by that score (about 120 bits/query and ~3.7 µs per top-10 on a 10M-query AOL dataset); (2) Score‑Decomposed Trie — a path-decomposed trie using max-descendant-based decomposition to further shrink space (≈62 bits/query, ~8.0 µs); and (3) RMQ Trie — map strings to consecutive lexicographic IDs and use Range Minimum/Maximum Query structures over scores to extract top-k (≈65 bits/query, ~33.9 µs). All approaches are competitive with gzip compression (within ~11% in best case) while supporting practical trade-offs between space and retrieval time, enabling memory-resident, low-latency autocomplete and paving the way for extensions like fuzzy completion on compressed indices.
Loading comments...
loading comments...