🤖 AI Summary
Transformer attention’s quadratic compute and memory cost in sequence length remains the main barrier to training truly long-context LLMs. This paper presents a unified benchmark that brings together representative attention kernels (both dense and sparse operator-level optimizations) and module-level context-parallel or distributed-attention strategies under a single, modular interface. It addresses gaps in prior work, namely piecemeal operator comparisons and framework-specific distributed solutions, by providing reproducible, extensible evaluations in which researchers can plug in different kernels and parallel schemes and measure real-world behavior.
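To make the plug-in idea concrete, here is a minimal sketch of how such a modular interface could look. This is not the paper's actual API; the names `ATTENTION_KERNELS`, `register_kernel`, `naive_dense_attention`, and `BenchmarkCase` are hypothetical and only illustrate one way to make dense and sparse kernels interchangeable behind a single signature.

```python
# Minimal sketch of a plug-in attention benchmark interface (hypothetical;
# names and signatures are illustrative, not the paper's actual API).
from dataclasses import dataclass
from typing import Callable, Dict, Optional

import torch

# Registry mapping kernel names to implementations that share one signature,
# (query, key, value, mask) -> output, so different kernels are swappable.
ATTENTION_KERNELS: Dict[str, Callable[..., torch.Tensor]] = {}


def register_kernel(name: str):
    """Decorator that adds an attention implementation to the registry."""
    def wrap(fn):
        ATTENTION_KERNELS[name] = fn
        return fn
    return wrap


@register_kernel("naive_dense")
def naive_dense_attention(q, k, v, mask: Optional[torch.Tensor] = None):
    # Reference O(n^2) implementation, useful as a correctness and latency baseline.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


@dataclass
class BenchmarkCase:
    kernel: str          # key into ATTENTION_KERNELS
    seq_len: int         # context length under test
    mask_pattern: str    # e.g. "causal", "sliding_window", "block_sparse"
    num_gpus: int = 1    # device count for context-parallel runs
```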
The benchmark evaluates methods along two critical axes: attention mask pattern (which strongly affects operator performance, memory use, and usability) and sequence length combined with distributed scale (how methods behave as context grows and as work spreads across many devices). Extensive experiments on clusters of up to 96 GPUs reveal method-specific trade-offs: for example, kernel choices that win on short sequences with structured masks may lose on very long or dense contexts, while context-parallel approaches can enable extreme lengths but incur communication and memory-rebalancing costs. By quantifying these interactions, the work gives practical guidance for selecting attention operators and distributed strategies when designing or deploying long-context LLM training pipelines, helping practitioners balance compute, memory, and cross-device communication.
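As a rough illustration of sweeping those two axes, the snippet below builds on the hypothetical registry above (so again, not the paper's code) and times the baseline kernel across mask patterns and sequence lengths on a single device; a real harness would additionally record peak memory and scale `num_gpus` for context-parallel runs.

```python
# Hypothetical sweep over the two evaluation axes: mask pattern and sequence
# length. Reuses ATTENTION_KERNELS and BenchmarkCase from the sketch above.
import itertools
import time

import torch


def run_case(case: BenchmarkCase) -> float:
    # Small synthetic workload shaped (batch, heads, seq, head_dim).
    q = k = v = torch.randn(1, 8, case.seq_len, 64)
    mask = None
    if case.mask_pattern == "causal":
        mask = torch.ones(case.seq_len, case.seq_len, dtype=torch.bool).tril()
    kernel = ATTENTION_KERNELS[case.kernel]
    start = time.perf_counter()
    kernel(q, k, v, mask)
    return time.perf_counter() - start


for pattern, seq_len in itertools.product(["dense", "causal"], [512, 1024, 2048]):
    case = BenchmarkCase(kernel="naive_dense", seq_len=seq_len, mask_pattern=pattern)
    print(f"{pattern:>6} len={seq_len:>5}: {run_case(case) * 1e3:.1f} ms")
```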