🤖 AI Summary
AI practitioners use two broad multi-device paradigms to scale Transformers: data (replica) parallelism and model parallelism. Data parallelism runs a full copy of the model on each GPU, so throughput scales roughly linearly with the number of devices; this is useful for inference serving and for training with larger effective batch sizes by splitting each batch across replicas. It shortens wall-clock time (e.g., 10 GPUs can turn a 1,000 GPU-hour job into roughly 100 hours of wall-clock time) but requires aggregating gradients and synchronizing weights across devices. The key constraint is that the entire model (weights, activations, optimizer state) must fit on each GPU.
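A minimal sketch of that gradient-aggregation step, simulated on one machine with NumPy: each "device" is just a batch shard, every replica holds the same weights, and the per-shard gradients are averaged (standing in for an All-Reduce) before the shared update. The model, learning rate, and shapes are illustrative choices, not values from the article.

```python
import numpy as np

# Toy data-parallel training step for a linear model y = X @ w with MSE loss.
rng = np.random.default_rng(0)
n_devices, batch, features = 4, 32, 8

X = rng.normal(size=(batch, features))
true_w = rng.normal(size=features)
y = X @ true_w

w = np.zeros(features)  # identical weights on every replica
lr = 0.1

for step in range(100):
    # Split the global batch into one shard per device.
    shards = zip(np.array_split(X, n_devices), np.array_split(y, n_devices))

    # Each replica computes the MSE gradient on its own shard.
    grads = []
    for Xi, yi in shards:
        err = Xi @ w - yi
        grads.append(2.0 * Xi.T @ err / len(yi))

    # "All-reduce": average the per-replica gradients so every copy of the
    # model applies the same update and the weights stay synchronized.
    g = np.mean(grads, axis=0)
    w -= lr * g

print("max weight error:", np.abs(w - true_w).max())
```

With equal-sized shards, the averaged gradient is identical to the gradient over the whole batch, which is why the replicas never drift apart.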
When a model won't fit on one device, you split the computation itself across GPUs. Pipeline parallelism slices the model by depth (e.g., layers 1–16 on GPU0, layers 17–32 on GPU1) and passes intermediate activations between devices; it reduces memory per device but introduces "bubbles" (idle periods) that must be mitigated with microbatching and queueing. Tensor (width) parallelism splits the linear algebra itself: in the column-parallel layout, the weight matrix B is split by columns and A is replicated, so each device produces a slice of the output that is concatenated; in the row-parallel layout, A is split by columns and B by rows, so each device computes a partial C and an All-Reduce sums the partials. Column-parallel avoids distributed reductions but requires concatenating outputs; row-parallel needs an All-Reduce but splits the summation work. These trade-offs (memory footprint, communication patterns such as activation transfers or All-Reduce, latency, and utilization) drive engineering choices for scaling large-context Transformers and maximizing hardware efficiency.
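The two tensor-parallel layouts can be checked with a small NumPy sketch, where the two "devices" are just two array slices and the All-Reduce is a plain sum; the shapes are illustrative, not taken from the article.

```python
import numpy as np

# Column- vs row-parallel matrix multiply for C = A @ B.
rng = np.random.default_rng(0)
A = rng.normal(size=(16, 64))   # activations
B = rng.normal(size=(64, 32))   # weight matrix
C_ref = A @ B

# Column-parallel: B is split by columns, A is replicated on both devices.
# Each device produces a distinct slice of C's columns; gathering is a
# simple concatenation, with no distributed reduction.
B_cols = np.split(B, 2, axis=1)
C_col = np.concatenate([A @ Bi for Bi in B_cols], axis=1)

# Row-parallel: A is split by columns and B by rows, so each device holds a
# matching slice of the inner dimension. Each partial product has C's full
# shape, and an All-Reduce (here, a plain sum) combines the partials.
A_cols = np.split(A, 2, axis=1)
B_rows = np.split(B, 2, axis=0)
C_row = sum(Ai @ Bi for Ai, Bi in zip(A_cols, B_rows))

print(np.allclose(C_col, C_ref), np.allclose(C_row, C_ref))  # True True
```

Both layouts reproduce the full product; they differ only in what each device stores and in whether the gather step is a concatenation or a reduction, which is exactly the communication trade-off described above.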