🤖 AI Summary
Researchers released "Stratified LLM Subsets": curated, diverse subsets (50k, 100k, 250k, 500k, and 1M samples) drawn from five high‑quality open corpora spanning pre‑training (FineWeb‑Edu, Proof‑Pile‑2), instruction‑following (Tulu‑3, Orca AgentInstruct), and reasoning distillation (Llama‑Nemotron). Rather than sampling at random, the project runs deterministic k‑means over embedding vectors (Snowflake Arctic‑embed‑xs) with k = M clusters for M required samples and 100 iterations, then selects the example nearest each centroid as that cluster's representative. Each subset inherits its source licenses and is published on Hugging Face for reproducible experiments.
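A minimal sketch of this centroid‑based selection, assuming scikit‑learn and precomputed Arctic‑embed‑xs embeddings (the function name, seed handling, and distance metric are illustrative assumptions, not the released pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def select_diverse_subset(embeddings: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """Pick m diverse examples: cluster the embeddings into k = m clusters,
    then take the example closest to each centroid as its representative.
    A fixed seed and iteration cap make the selection deterministic."""
    km = KMeans(n_clusters=m, max_iter=100, n_init=1, random_state=seed)
    km.fit(embeddings)
    # For each of the m centroids, the index of the nearest embedding.
    return pairwise_distances_argmin(km.cluster_centers_, embeddings)
```

At the released scale (up to 1M clusters) a plain scikit‑learn run would be impractical; a mini‑batch or GPU k‑means would presumably stand in, but the fixed seed and iteration budget are what make the result reproducible either way.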
The key novelty is combining embedding‑based stratification with square‑root rebalancing of categorical counts so that no single category dominates. Applied to Llama‑Nemotron, this reduced STEM/math dominance (the math proportion fell ~22%) while dramatically boosting underrepresented classes (science up ~330%; chat and safety grew by orders of magnitude). Practically, these subsets give ML practitioners compact, diverse training pools for pre‑training, SFT, and reasoning distillation: useful for fast iteration, ablation studies, domain‑balanced fine‑tuning, and reducing the bias introduced by skewed large corpora. The deterministic clustering makes replication and controlled comparisons straightforward for AI/ML research.
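A hedged sketch of square‑root rebalancing: only the sqrt weighting itself comes from the summary; the quota allocation and rounding below are assumptions, and the category counts are made up for illustration.

```python
import numpy as np

def sqrt_rebalance(counts: dict[str, int], total: int) -> dict[str, int]:
    """Allocate a sample budget across categories in proportion to the
    square root of each raw count, damping dominant categories and
    boosting rare ones."""
    cats = list(counts)
    weights = np.sqrt([counts[c] for c in cats])
    exact = total * weights / weights.sum()
    quotas = np.floor(exact).astype(int)
    # Distribute the remainder from flooring to the largest fractional parts.
    for i in np.argsort(exact - quotas)[::-1][: total - quotas.sum()]:
        quotas[i] += 1
    return dict(zip(cats, quotas.tolist()))

# Illustrative, made-up counts for a math-heavy corpus: sqrt weighting
# shrinks math's share of the budget and grows the tiny safety slice.
print(sqrt_rebalance(
    {"math": 400_000, "code": 90_000, "science": 8_000, "safety": 500},
    total=100_000,
))
```

With these toy numbers, math drops from ~80% of the raw pool to ~61% of the budget while safety rises from ~0.1% to ~2%, the same qualitative effect the summary reports for Llama‑Nemotron.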