Demystifying Tensor Parallelism (robotchinwag.com)

🤖 AI Summary
Tensor Parallelism (TP) is a critical deep learning execution strategy that splits individual model layers across multiple devices (sharding) to enable training of larger models and faster runtimes. Unlike Data Parallelism (DP), which replicates the entire model on each device, or Pipeline Parallelism (PP), which divides the model's layers into stages, TP distributes the matrix operations within a layer itself while preserving mathematical equivalence, so the model architecture is unchanged. This makes it possible to scale beyond the memory limits of a single device, but it introduces challenges in balancing computation and minimizing communication overhead, which can consume 50-70% of runtime if poorly optimized.

A key technical insight is how tensor parallelism handles matrix multiplications, such as those in Transformer feed-forward layers. The weights and inputs can be sharded row-wise or column-wise, and each choice requires careful synchronization via collective communication operations such as All-Reduce or All-Gather to preserve output correctness. The article highlights a "pairwise sharding" scheme, sharding one matmul column-wise and the next row-wise, which effectively halves communication cost by aligning the sharding pattern with activation functions like GeLU, which operate elementwise. Additionally, backpropagation under tensor parallelism requires "flipping" the sharding scheme for gradient computations (row-wise in the forward pass becomes column-wise in the backward pass, and vice versa) so that gradients are correctly aggregated across devices.

This detailed exposition demystifies the nuanced trade-offs among tensor parallelism strategies, offering the AI/ML community a deeper understanding for designing scalable and efficient distributed training systems. The focus on concrete sharding schemes, communication patterns, and their impact on the forward and backward passes provides practical guidance for optimizing large-scale model training.
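To make the row-wise versus column-wise distinction concrete, here is a minimal NumPy sketch (not from the article; the two-way split, shapes, and variable names are illustrative). Per-device shards are simulated as lists of arrays, with `np.concatenate` standing in for All-Gather and a plain sum standing in for All-Reduce:

```python
# Simulate sharding a single matmul Y = X @ W across two "devices"
# and show which collective reconstructs the full result.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activations: (batch, d_in)
W = rng.standard_normal((8, 6))   # weights:     (d_in, d_out)
Y_ref = X @ W                     # single-device reference

# Column-wise sharding: each device holds half of W's output columns.
# Each device computes a slice of Y's columns; an All-Gather
# (concatenation along the output dimension) recovers the full output.
W_cols = np.split(W, 2, axis=1)
Y_col = np.concatenate([X @ w for w in W_cols], axis=1)   # "All-Gather"

# Row-wise sharding: each device holds half of W's input rows plus the
# matching slice of X's columns. Each device produces a full-shaped
# partial sum; an All-Reduce (elementwise sum) recovers the full output.
W_rows = np.split(W, 2, axis=0)
X_cols = np.split(X, 2, axis=1)
Y_row = sum(x @ w for x, w in zip(X_cols, W_rows))        # "All-Reduce"

assert np.allclose(Y_ref, Y_col)
assert np.allclose(Y_ref, Y_row)
```

The same idea extends to the pairwise scheme for a Transformer feed-forward block. The sketch below (again with simulated shards rather than real devices; the `gelu` helper and the two-way split are assumptions for illustration) shards the up-projection column-wise and the down-projection row-wise. Because GeLU is elementwise, each shard applies it locally, so the forward pass needs only a single All-Reduce after the second matmul:

```python
# Pairwise sharding of an MLP block Y = GeLU(X @ W1) @ W2:
# W1 is sharded column-wise, W2 row-wise, with one All-Reduce at the end.
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(1)
X  = rng.standard_normal((4, 8))    # (batch, d_model)
W1 = rng.standard_normal((8, 32))   # up-projection
W2 = rng.standard_normal((32, 8))   # down-projection
Y_ref = gelu(X @ W1) @ W2           # single-device reference

W1_shards = np.split(W1, 2, axis=1)  # column-wise: half the hidden units per device
W2_shards = np.split(W2, 2, axis=0)  # row-wise: matching halves of the hidden dim

# Each device computes a full-shaped partial output with no communication;
# a single All-Reduce (the sum) produces the final result.
partials = [gelu(X @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
Y_tp = sum(partials)                 # "All-Reduce"

assert np.allclose(Y_ref, Y_tp)
```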
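If the two matmuls were instead sharded independently (for example, both column-wise), the intermediate activations would have to be gathered before the second matmul, adding a second collective per block; skipping that gather is what halves the communication in the pairwise scheme.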