🤖 AI Summary
Researchers have introduced OPUS (Optimizer-induced Projected Utility Selection), a data selection framework aimed at making large language model (LLM) pre-training more efficient. As high-quality public text sources dwindle (a challenge referred to as the "Data Wall"), the focus has shifted from collecting more tokens to choosing better ones. OPUS scores training data by its projected impact on the optimizer's updates, enabling dynamic, training-aware selection rather than static heuristics. To keep scoring cheap, it combines the Ghost technique with CountSketch projections, adding only about 4.7% compute overhead.
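The summary doesn't spell out the exact scoring rule, but the general shape of the idea can be sketched: estimate each candidate example's utility as the alignment between its gradient and the optimizer's update direction, with both vectors compressed via CountSketch so the comparison is cheap. Below is a minimal NumPy illustration of that pattern; the dimensions, the random stand-in gradients, and the `update_direction` variable are hypothetical placeholders, not the paper's implementation (which recovers per-example gradients via the Ghost technique rather than materializing them).

```python
import numpy as np

rng = np.random.default_rng(0)

D, M = 10_000, 256                       # full gradient dim, sketch dim (M << D)
bucket = rng.integers(0, M, size=D)      # CountSketch hash: coordinate -> bucket
sign = rng.choice([-1.0, 1.0], size=D)   # CountSketch random sign flips

def count_sketch(g):
    """Project a length-D vector into an M-dim sketch.

    CountSketch preserves inner products in expectation, so sketched
    dot products approximate full-dimensional ones at a fraction of the cost.
    """
    s = np.zeros(M)
    np.add.at(s, bucket, sign * g)
    return s

# Hypothetical per-example gradients and optimizer update direction.
# In OPUS these would come from the LLM's backward pass and the actual
# optimizer state (e.g., Adam); random vectors stand in here.
n_candidates = 1_000
example_grads = rng.standard_normal((n_candidates, D))
update_direction = rng.standard_normal(D)

sketched_update = count_sketch(update_direction)

# Utility of each candidate: alignment between its sketched gradient and
# the sketched optimizer update, a cheap proxy for training impact.
utilities = np.array([count_sketch(g) @ sketched_update for g in example_grads])

# Keep the highest-utility examples for the next training batch.
k = 100
selected = np.argsort(utilities)[-k:]
print(f"selected {k} of {n_candidates} candidates; "
      f"top utility = {utilities[selected[-1]]:.2f}")
```

Because the sketch dimension M is fixed and small, the per-example cost of scoring is independent of the model's full parameter count, which is consistent with the reported low overhead.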
OPUS demonstrated superior performance across a range of settings, including GPT-2-scale pre-training on diverse datasets, where its selected data outperformed even standard 200B-token training. In continued pre-training, OPUS matched results that typically require 3B tokens using only 0.5B, a roughly sixfold data-efficiency gain that is especially relevant for specialized domains. These results position quality-driven data selection as a practical response to the shrinking supply of high-quality pre-training data.