🤖 AI Summary
Researchers formalize the problem of choosing reading material to maximize vocabulary gains as a weighted maximum-coverage task: given m books of average length n, score each book by the weighted sum of the unique words it contains, with weights taken from corpus-wide word frequencies. Building the global vocabulary over all books takes O(mn) time with hashing, and scoring any single book is then linear in its length. Selecting the best pair of books requires quadratic time, and selecting the best k books is NP-hard (exact solutions for general k grow exponentially), because this is an instance of the maximum coverage/subset selection family.
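A minimal sketch of this scoring step, assuming each book is represented as a list of word tokens and that weights are simply the corpus-wide word counts (the function name and data layout here are illustrative, not from the original article):

```python
from collections import Counter

def score_books(books: list[list[str]]) -> list[float]:
    # Build the global vocabulary and corpus-wide word frequencies
    # in O(mn) total time using hashing.
    freq = Counter(word for book in books for word in book)

    # Score each book by the weighted sum of its *unique* words.
    scores = []
    for book in books:
        unique_words = set(book)  # O(n) per book via hashing
        scores.append(float(sum(freq[w] for w in unique_words)))
    return scores

books = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran", "far"],
    ["a", "cat", "and", "a", "dog"],
]
print(score_books(books))
```

Any other weighting scheme (e.g., inverse frequency to favor rare words) slots in by replacing `freq[w]` with the chosen weight function.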
Crucially, the objective is a monotone submodular function, so greedy algorithms give strong, provable approximations: the classic result is that greedy achieves at least a (1 − 1/e) ≈ 0.63 fraction of the optimal objective. Practically this means fast, simple greedy implementations (e.g., via the Python submodlib package) work well at scale. Quality can be improved by spending more computation, such as block selection (optimizing pairs or triples at each step), look-ahead strategies, or pruning dominated books, but these increase cost and do not beat the submodularity bound in the worst case. This formulation and its scalable approximations are directly useful across ML tasks, including dataset curation, active sampling, summarization, and sensor placement, where near-optimal subset selection under budget constraints is required.
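A hedged sketch of the plain greedy (1 − 1/e) approximation for this weighted-coverage objective, written from scratch rather than against the submodlib API; the representation of books as sets of words and a per-word weight dictionary is an assumption for illustration:

```python
def greedy_select(books: list[set[str]], weights: dict[str, float], k: int) -> list[int]:
    """Greedily pick k book indices maximizing covered word weight."""
    covered: set[str] = set()
    chosen: list[int] = []
    for _ in range(k):
        best_idx, best_gain = None, 0.0
        for i, words in enumerate(books):
            if i in chosen:
                continue
            # Marginal gain = total weight of words this book adds
            # beyond what is already covered (submodular objective).
            gain = sum(weights[w] for w in words - covered)
            if gain > best_gain:
                best_idx, best_gain = i, gain
        if best_idx is None:  # no remaining book adds new weight; stop early
            break
        chosen.append(best_idx)
        covered |= books[best_idx]
    return chosen

books = [{"the", "cat", "sat"}, {"the", "dog"}, {"cat", "dog", "bird"}]
weights = {"the": 3.0, "cat": 2.0, "sat": 1.0, "dog": 2.0, "bird": 1.0}
print(greedy_select(books, weights, k=2))
```

Each round costs O(m) marginal-gain evaluations, so k rounds run in roughly O(kmn) time; lazy-greedy bookkeeping or the block/look-ahead variants mentioned above trade extra computation for better empirical quality within the same worst-case guarantee.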