🤖 AI Summary
peftee is a new lightweight Python library, built on Hugging Face Transformers and PyTorch, that enables efficient fine-tuning of large language models on low-VRAM hardware. It claims roughly 65% VRAM savings (e.g., 7.6 GB vs. 21.8 GB), enough to fine-tune models like Llama 3 8B on an 8 GB GPU without quantization. The author reports only a modest speed penalty (about 9 s per 200 samples at 2k context length in one example) while still supporting practical workflows: LoRA adapters, gradient checkpointing, experimental optimizer-state offloading, and end-to-end inference via the companion oLLM inference library.
Technically, peftee combines parameter-efficient fine-tuning (updating only the last ~4–8 transformer layers with LoRA) with aggressive offloading strategies: SSD/CPU offload for model weights and optimizer states, gradient checkpointing, and FlashAttention-2's online softmax, which avoids ever materializing the full attention matrix. This mix reduces GPU memory pressure without resorting to quantization and works across NVIDIA, AMD, and Apple Silicon. For practitioners, it makes style/behavior adaptation (as opposed to adding factual knowledge) far more accessible on consumer hardware, lowering the barrier for experimentation, rapid iteration, and small-scale production tuning.
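To make the layer-restricted LoRA part of that recipe concrete, here is a minimal sketch using the standard Hugging Face Transformers + PEFT APIs rather than peftee's own interface (which this summary does not document). The layer indices, LoRA hyperparameters, and target module names are illustrative assumptions:

```python
# Sketch of the general recipe: LoRA limited to the last few decoder layers,
# with gradient checkpointing and FlashAttention-2 enabled. Not peftee's API.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # online softmax: the full attention matrix is never materialized
)

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Llama 3 8B has 32 decoder layers; adapt only the last 4 (indices 28-31).
    layers_to_transform=list(range(28, 32)),
)
model = get_peft_model(model, lora_config)

# Trade compute for activation memory during the backward pass.
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

The SSD/CPU offloading of frozen weights and optimizer states is peftee's own contribution and is not shown here; the sketch covers only the parameter-efficient and activation-memory pieces of the approach.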