Introduction to Parallelism in PyTorch (ggrigorev.me)

🤖 AI Summary
This post gives a practical, hands-on tour of parallelism in PyTorch: why it matters and how to implement it in real training setups, focusing on Distributed Data Parallel (DDP) and introducing sharded approaches such as FSDP and tensor parallelism (TP). The author stresses that serious workloads need parallelism to get near-linear speedups, recommends torch.compile as a baseline, and shows how to run distributed code with torchrun plus dist.init_process_group("nccl") (or "gloo" for CPU).

The write-up is driven by real implementation patterns: broadcast the initial weights so every rank starts identical, shard each batch across ranks, scale the global batch size and the learning rate (≈√n), and either scale the loss by 1/world_size or keep summed gradients.

The technical core is communication-efficient DDP: the mechanics of the all-reduce collective, the per-rank communication cost (each rank sends and receives 2(P−1)/P · N bytes for a tensor of N bytes across P ranks), the effective bandwidth ≈ b·P/(2(P−1)), and concrete latency examples (a 1 GB tensor across 8 GPUs: ~7 ms on H100, ~3.5 ms on B100, ~375 ms on PCIe 4.0 ×16).

The post then explains how to overlap gradient synchronization with backpropagation: register backward hooks (register_post_accumulate_grad_hook), launch dist.all_reduce(async_op=True), collect the handles, and wait on them (handle.wait() or get_future().then()) so communication hides behind the remaining backward work. Bucketing (flattening gradients per dtype with torch._utils._flatten_dense_tensors and unflattening afterwards) reduces small-message overhead, while a no_sync context plus gradient accumulation trades correctness against doing fewer syncs.

Finally, FSDP is presented as the next step: shard the model weights, gradients and optimizer state to save memory and enable larger models. Minimal sketches of these patterns follow below.
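
For concreteness, here is a minimal sketch (not the post's code) of the manual DDP pattern the summary describes: broadcast initial weights, give each rank its own batch shard, and average gradients with an all-reduce. The placeholder model, batch shape and learning rate are assumptions.

```python
# Launch with: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def main():
    dist.init_process_group("nccl")              # use "gloo" for CPU-only runs
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    world_size = dist.get_world_size()

    model = nn.Linear(1024, 1024).cuda()         # placeholder model

    # Every rank starts from identical weights.
    for p in model.parameters():
        dist.broadcast(p.data, src=0)

    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # Each rank sees only its shard of the global batch (random data here).
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).square().mean()
    loss.backward()

    # Sum gradients across ranks, then divide by world_size: equivalent to
    # scaling the loss by 1/world_size and keeping the summed gradients.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size

    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```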
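
The per-rank cost formula turns into a quick back-of-the-envelope calculator; the `allreduce_seconds` helper and the ~250 GB/s effective bandwidth used in the example are assumptions chosen so the 1 GB / 8-GPU case lands near the ~7 ms quoted above.

```python
# Back-of-the-envelope helper for the ring all-reduce cost quoted above.
def allreduce_seconds(tensor_bytes: float, world_size: int, bandwidth_bytes_per_s: float) -> float:
    """Per-rank traffic 2(P-1)/P * N divided by link bandwidth (message latency ignored)."""
    per_rank_traffic = 2 * (world_size - 1) / world_size * tensor_bytes
    return per_rank_traffic / bandwidth_bytes_per_s

# 1 GB tensor across 8 ranks: 1.75 GB of traffic per rank.
# At an assumed effective ~250 GB/s per rank this comes out to ~7 ms,
# in the same ballpark as the H100 figure quoted above.
print(allreduce_seconds(1e9, 8, 250e9))   # 0.007
```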
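
The async overlap idea might look roughly like this (PyTorch 2.1+ for register_post_accumulate_grad_hook); the `attach_async_allreduce` helper name and the pre-division by world size are illustrative choices, not the author's exact implementation.

```python
# Sketch of overlapping gradient all-reduce with backward via per-parameter hooks.
# Assumes the process group is already initialized.
import torch
import torch.distributed as dist

def attach_async_allreduce(model: torch.nn.Module):
    handles = []

    def hook(param):
        # Runs as soon as param.grad has been accumulated during backward,
        # so the all-reduce overlaps with the rest of backprop.
        param.grad /= dist.get_world_size()                 # pre-divide -> all-reduce sum = average
        handles.append(dist.all_reduce(param.grad, async_op=True))

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)
    return handles

# Training loop usage (hooks are registered once, handles cleared every step):
#   handles = attach_async_allreduce(model)
#   ...
#   loss.backward()                 # all-reduces launch as grads become ready
#   for h in handles:
#       h.wait()                    # or h.get_future().then(...) for a callback
#   handles.clear()
#   optimizer.step()
```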
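
Bucketing can be sketched with the private flatten/unflatten helpers the post mentions; `allreduce_bucketed` is a hypothetical name, and grouping by dtype is one reasonable way to form buckets.

```python
# Flatten many small gradients into one large tensor per dtype, all-reduce once,
# then copy the averaged values back. torch._utils helpers are private, so treat
# this as illustrative rather than a stable API.
import torch
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def allreduce_bucketed(params, world_size: int):
    buckets = {}                                   # dtype -> list of grads
    for p in params:
        if p.grad is not None:
            buckets.setdefault(p.grad.dtype, []).append(p.grad)

    for grads in buckets.values():
        flat = _flatten_dense_tensors(grads)       # one big message instead of many tiny ones
        dist.all_reduce(flat)
        flat /= world_size
        for g, synced in zip(grads, _unflatten_dense_tensors(flat, grads)):
            g.copy_(synced)
```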
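
The no_sync / gradient-accumulation trade-off looks roughly like this with the stock DDP wrapper; `train_step`, `micro_batches` and `loss_fn` are placeholder names.

```python
# Skip the all-reduce on all but the last micro-batch, then synchronize once.
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(ddp_model: DDP, optimizer, micro_batches, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    for i, (x, y) in enumerate(micro_batches):
        if i < len(micro_batches) - 1:
            with ddp_model.no_sync():              # accumulate locally, no communication
                loss_fn(ddp_model(x), y).backward()
        else:
            loss_fn(ddp_model(x), y).backward()    # gradients are all-reduced here
    optimizer.step()
```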
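
And the FSDP step the post points to can be as small as wrapping the model; the toy model and hyperparameters here are assumptions.

```python
# Minimal FSDP sketch: wrapping the model shards parameters, gradients and
# optimizer state across ranks. Assumes the process group is already initialized.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
model = FSDP(model)                                           # each rank holds only a shard
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)    # create after wrapping
```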