🤖 AI Summary
llm.q is a new pure CUDA/C++ implementation of quantized LLM training aimed at single‑node, multi‑GPU setups. Written in C++20 and targeting CUDA 12+, it uses NCCL for inter‑GPU communication and cuDNN for fast attention, and supports both multi‑process (OpenMPI) and multi‑thread execution. The project demonstrates practical, low‑cost training: examples include fine‑tuning Qwen2.5‑0.5B with bf16 master weights and e4m3 matrix‑multiply precision, and training a 1.5B Qwen model on 10B Climb tokens on 4× RTX 4090s in roughly 40 hours, at an estimated cost of under $50 on spot GPU rentals. Models are saved as transformers‑compatible safetensors for evaluation with lm-eval, and logs can be exported to JSON or Weights & Biases.
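Because checkpoints land in transformers‑compatible safetensors, they can be loaded directly with the Hugging Face stack. A minimal sketch, assuming a hypothetical output directory `out/qwen2.5-0.5b-ft` (not a path from the project):

```python
# Sketch: load a checkpoint that llm.q saved in transformers-compatible
# safetensors format. The directory name is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("out/qwen2.5-0.5b-ft")
tokenizer = AutoTokenizer.from_pretrained("out/qwen2.5-0.5b-ft")

# Quick smoke test: generate a few tokens from the fine-tuned model.
inputs = tokenizer("The speed of light is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```

The same directory can then be passed to lm-eval for benchmark evaluation, which is the workflow the project targets.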
Technically, llm.q exposes fine‑grained control over dtypes (model, matmul, and optimizer m/v states), recompute strategies (swiglu, norm, ffn, qkv) that trade compute for activation memory, weight sharding, persistent or offloaded quantized weights, and many CLI flags for learning‑rate scheduling, checkpointing, and batching. The build uses CMake and downloads required headers automatically; tokenization utilities produce binary token files for training. Runtime metrics include tokens/sec and a GPU "speed‑of‑light" (SOL) efficiency measure, and the allocator report lets users tune batch size and recompute settings to maximize throughput on memory‑constrained GPUs. The result makes quantized training accessible on workstations while preserving production‑style tooling and evaluation workflows.
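To make the throughput and SOL numbers concrete, here is a back‑of‑the‑envelope sketch of the reported 1.5B run. The 6N‑FLOPs‑per‑token heuristic and the RTX 4090 peak figure are illustrative assumptions, not values taken from llm.q's logs:

```python
# Back-of-the-envelope check of the reported run, and the "speed-of-light"
# (SOL) idea: achieved FLOP/s divided by the hardware's theoretical peak.
params = 1.5e9   # Qwen 1.5B parameters
tokens = 10e9    # 10B Climb tokens
hours = 40
gpus = 4

tok_per_s = tokens / (hours * 3600)        # ~69k tokens/s across 4 GPUs
flops_per_token = 6 * params               # common fwd+bwd training estimate
achieved_tflops = tok_per_s * flops_per_token / gpus / 1e12

peak_tflops = 330  # rough RTX 4090 tensor-core peak; assumed, not measured
print(f"{tok_per_s:,.0f} tok/s, ~{achieved_tflops:.0f} TFLOP/s per GPU, "
      f"SOL ~ {achieved_tflops / peak_tflops:.0%}")
```

An SOL ratio in this spirit indicates how close the run comes to the GPU's theoretical matmul throughput; tuning batch size and recompute against the allocator report is how one pushes it higher on memory‑constrained cards.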