🤖 AI Summary
Researchers introduced LongFilter, a data-curation framework designed to make long-context LLM pretraining more efficient by identifying training examples that actually require long-range dependencies. The paper observes that a large share of long-text corpora is predictable from local context alone, so naively training on it wastes compute and dilutes the signal needed to learn extended-span reasoning, summarization, and code tasks. LongFilter flags samples where extended context provides a measurable information gain, prioritizing material that meaningfully benefits long-context learning.
Technically, LongFilter compares model predictions under short-context and long-context conditions and quantifies the information gain (e.g., differences in predictive probabilities or loss) to rank or filter samples; examples with large gains indicate essential long-range dependencies. Applied in experiments that extend LLaMA-3-8B’s context window from 8K to 64K, LongFilter-selected data produced substantial improvements on long-context benchmarks including HELMET, LongBench, and RULER, demonstrating better downstream performance and more efficient use of pretraining compute. The approach has practical implications for dataset curation and cost-effective scaling of long-context models: by focusing on high-information spans, practitioners can accelerate learning of cross-document reasoning and reduce wasted training on locally predictable text.
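The scoring idea described above can be sketched in a few lines: score each document by the drop in negative log-likelihood when the model sees the full long context versus a truncated short context, then rank by that gain. This is a minimal illustration, not the paper's implementation; the sample IDs and loss values are hypothetical, and in practice the NLLs would come from running the same model twice per document with different context lengths.

```python
# Sketch of information-gain ranking for long-context data curation.
# Assumes per-token average NLLs (in nats) have already been computed
# for each document under a short-context and a long-context condition.

def information_gain(nll_short: float, nll_long: float) -> float:
    # How much the extended context improves prediction: a large
    # positive gain suggests genuine long-range dependencies, while
    # a near-zero gain suggests locally predictable text.
    return nll_short - nll_long

def rank_samples(samples):
    # samples: iterable of (sample_id, nll_short, nll_long)
    # Returns (sample_id, gain) pairs, highest gain first.
    scored = [(sid, information_gain(s, l)) for sid, s, l in samples]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical measurements for two documents.
samples = [
    ("doc_a", 3.10, 2.20),  # large gain: long context helps a lot
    ("doc_b", 2.80, 2.75),  # small gain: short context suffices
]
ranked = rank_samples(samples)
# A curation pipeline would then keep or upweight the top-ranked
# documents (e.g. "doc_a") when extending the context window.
```

A real pipeline would compute the two NLLs with the same pretrained model (e.g. one forward pass over the document with full context, one with each window truncated), and the threshold for keeping a sample would be a tuning choice.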