🤖 AI Summary
LLM-Deflate introduces a practical "decompression" technique that systematically extracts structured datasets from trained language models by reversing the models’ lossy compression of training data. Using a hierarchical topic-exploration engine that expands broad domains into detailed subtopics, prompts that explicitly request step‑by‑step reasoning, and JSON-parsable outputs, the author generated 10,000+ curated examples per run from three open-source models (Qwen3‑Coder, GPT‑OSS, Llama 3). The method emphasizes capturing not only factual knowledge but also internal reasoning patterns, producing reusable training items structured as example/response/reasoning triples; sample datasets are published on HuggingFace.
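As a rough illustration of the hierarchical expansion loop described above (a minimal sketch, not the author's implementation: `call_model` is a hypothetical wrapper around whatever inference backend is used, and the prompt wording and branching factor are invented), the recursion from broad domains down to question/reasoning/answer records might look like:

```python
import json

def call_model(prompt: str) -> str:
    """Hypothetical wrapper around an inference backend; returns the raw completion text."""
    raise NotImplementedError("plug in your inference client here")

def expand_topic(topic: str, depth: int, branching: int = 5) -> list[str]:
    """Recursively ask the model to split a broad topic into finer subtopics."""
    if depth == 0:
        return [topic]
    prompt = (
        f"List {branching} specific subtopics of '{topic}' "
        "as a JSON array of strings."
    )
    subtopics = json.loads(call_model(prompt))
    leaves = []
    for sub in subtopics:
        leaves.extend(expand_topic(sub, depth - 1, branching))
    return leaves

def generate_example(topic: str) -> dict:
    """Ask for a question, explicit step-by-step reasoning, and an answer as JSON."""
    prompt = (
        f"Write one challenging question about '{topic}'. Answer it with "
        "explicit step-by-step reasoning. Respond as JSON with keys "
        "'question', 'reasoning', and 'answer'."
    )
    return json.loads(call_model(prompt))

# Usage sketch: expand a root domain two levels deep, then generate one
# question/reasoning/answer triple per leaf topic.
# leaves = expand_topic("distributed systems", depth=2)
# dataset = [generate_example(t) for t in leaves]
```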
This approach builds on synthetic data and distillation work (Alpaca, Nemotron, Orca) but focuses on systematic coverage of a model’s knowledge space rather than ad‑hoc generation. Key technical enablers are careful prompt engineering, hierarchical topic balancing, parsing and quality filtering, and scalable inference (notably via scalarlm) to keep the cost of thousands of model calls manageable. Practical implications include fine‑tuning other models via transferred knowledge, richer model analysis and debugging, training‑data augmentation for scarce domains, and tracking knowledge evolution across model versions. Limitations remain (costly inference, the need for prompt and quality tuning, and coverage bias), but initial results show decompression is a viable tool for dataset creation and interpretability.
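To give a concrete sense of the parsing and quality-filtering step (again a sketch under assumptions: the field names and heuristics below are invented for illustration, not the author's actual filters), raw completions might be screened like this:

```python
import json
from typing import Optional

REQUIRED_KEYS = {"question", "reasoning", "answer"}

def parse_record(raw: str) -> Optional[dict]:
    """Extract the first JSON object embedded in a raw completion, or None on failure."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None

def passes_quality(record: dict, min_reasoning_chars: int = 80) -> bool:
    """Simple heuristics: all required fields present and non-trivial reasoning text."""
    if not REQUIRED_KEYS.issubset(record):
        return False
    reasoning = record["reasoning"]
    return isinstance(reasoning, str) and len(reasoning) >= min_reasoning_chars

def filter_dataset(raw_outputs: list[str]) -> list[dict]:
    """Keep only completions that parse cleanly and pass the quality heuristics."""
    parsed = (parse_record(r) for r in raw_outputs)
    return [rec for rec in parsed if rec and passes_quality(rec)]
```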