🤖 AI Summary
Researchers propose a memory-augmented pretraining strategy that separates “common” knowledge (stored in a small base language model) from “long-tail” world knowledge (stored in large, hierarchical parametric memory banks). During both pretraining and inference, the model fetches a compact, context-dependent memory block from these banks and injects it into the transformer, so the lightweight LM acts as an anchor for general reasoning while the banks hold sparse, rarely used facts. The paper introduces hierarchical feed-forward memory layers, trains at trillion-token scale, and shows the approach can be applied during pretraining or grafted post-hoc onto existing transformers.
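To make the fetch-and-inject idea concrete, here is a minimal sketch (not the authors' implementation, and it flattens the hierarchical structure of their memory layers): a context summary from the small anchor LM addresses a large bank of key/value rows, a small top-k block is fetched, and the result is added alongside an ordinary feed-forward path. All names and sizes (MemoryBank, fetch_k, bank_rows, the mean-pooled query) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryBank(nn.Module):
    """Large bank of key/value rows; only a small top-k block is fetched per context."""

    def __init__(self, bank_rows: int, d_model: int, fetch_k: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(bank_rows, d_model) * 0.02)    # row addresses
        self.values = nn.Parameter(torch.randn(bank_rows, d_model) * 0.02)  # row contents
        self.fetch_k = fetch_k

    def fetch(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, d_model) summary of the current context
        scores = query @ self.keys.t()                        # (batch, bank_rows)
        top_scores, top_idx = scores.topk(self.fetch_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)               # sparse mixture over fetched rows
        fetched = self.values[top_idx]                        # (batch, fetch_k, d_model)
        return (weights.unsqueeze(-1) * fetched).sum(dim=1)   # (batch, d_model) memory block


class MemoryAugmentedBlock(nn.Module):
    """A transformer layer whose feed-forward path is augmented with a fetched memory block."""

    def __init__(self, d_model: int, n_heads: int, bank: MemoryBank):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.bank = bank
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        # A mean-pooled context summary addresses the bank; the fetched block is
        # broadcast over the sequence and added alongside the feed-forward output.
        memory = self.bank.fetch(h.mean(dim=1)).unsqueeze(1)  # (batch, 1, d_model)
        return x + self.ffn(h) + memory


if __name__ == "__main__":
    bank = MemoryBank(bank_rows=65536, d_model=256, fetch_k=32)
    block = MemoryAugmentedBlock(d_model=256, n_heads=4, bank=bank)
    tokens = torch.randn(2, 128, 256)                         # (batch, seq, d_model)
    print(block(tokens).shape)                                # torch.Size([2, 128, 256])
```

Only the fetched top-k rows participate in the forward pass, which is why the bank can grow to billions of parameters while inference-time compute stays close to that of the small anchor model.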
The results show marked parameter efficiency and hardware-aligned benefits: a 160M-parameter LM augmented with an 18M-parameter fetched memory (sourced from a 4.6B-parameter memory bank) matches the performance of a conventional model with more than twice as many parameters. The authors also scale parametric memories to over 21B parameters and explore optimal memory types and sizes, finding the approach robust across transformer variants. Implications include much lower inference-time compute and memory for edge deployment, modular updates to world knowledge without retraining the whole model, and a practical path to decoupling rare factual storage from core reasoning capacity.