🤖 AI Summary
Transformers has introduced a new loading design that dramatically reduces memory usage when loading large models. Traditionally, loading a large model such as a 70-billion-parameter fp16 model could require approximately 280 GB of memory, because two copies of the weights (about 140 GB each) sit in memory at once during loading. With the integration of the PyTorch meta device, Transformers can now create a model skeleton that holds only metadata, then load parameters dynamically without ever holding a second full copy. This lets models be built regardless of the available device memory and drastically reduces peak memory consumption during the loading phase.
The technical implementation loads tensors lazily through lightweight safetensors slices that carry only metadata until the actual tensor data is needed. This lets Transformers handle mismatches between expected and actual parameter layouts via conversion methods that rename or reshape tensors as they are collected. The system also supports disk offloading, managing tensor placement across GPU, CPU, and disk as needed. This new design not only streamlines large-model loading but also lets users work with models that were previously blocked by physical memory limits, ultimately accelerating research and deployment in the AI/ML community.