🤖 AI Summary
Researchers analyzed scaling laws for xLSTM, an LSTM-based architecture whose time complexity is linear in context length, and compared its compute- and data-scaling behavior to Transformers across model sizes (80M–7B parameters) and training budgets (2B–2T tokens). Using both IsoFLOP and parametric fit methods, they evaluated compute-optimal and over-training regimes and explicitly studied how optimal model size depends on context length, a factor often overlooked. They also measured inference-time scaling to capture deployment costs.
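For intuition on how parametric fits and IsoFLOP slices yield compute-optimal sizes, here is a minimal sketch assuming a Chinchilla-style loss surface L(N, D) = E + A/N^α + B/D^β and the common training-compute approximation C ≈ 6ND; the coefficients are hypothetical placeholders, not the paper's fitted values for xLSTM or Transformers.

```python
from scipy.optimize import minimize_scalar

# Hypothetical Chinchilla-style fit: L(N, D) = E + A/N**alpha + B/D**beta.
# These numbers are placeholders for illustration, not results from the paper.
E, A, B, alpha, beta = 1.7, 400.0, 2000.0, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

def compute_optimal_size(flop_budget):
    """Model size minimizing loss along one IsoFLOP slice (C ~ 6*N*D assumed)."""
    def objective(log10_n):
        n = 10.0 ** log10_n
        d = flop_budget / (6.0 * n)   # tokens implied by the fixed budget
        return loss(n, d)
    res = minimize_scalar(objective, bounds=(6.0, 12.0), method="bounded")
    return 10.0 ** res.x

for c in (1e20, 1e21, 1e22):
    n_opt = compute_optimal_size(c)
    print(f"C={c:.0e}: N*≈{n_opt:.2e} params, D*≈{c / (6 * n_opt):.2e} tokens")
```

In the paper's context-dependent analysis, the effective loss surface (and hence N*) would also shift with training context length; the sketch above only shows the fixed-context case.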
The key finding is that xLSTM is competitive with Transformers in the billion-parameter regime and scales more favorably as both training and inference contexts grow: because xLSTM’s time complexity is linear in sequence length (vs. Transformers’ quadratic attention), its performance advantage widens for long-context workloads. Practically, this implies better compute-performance trade-offs for long-context LLMs, potentially lower inference latency and cost at large context windows, and a different optimal compute allocation (model size vs. tokens) when context length is a primary constraint. The study’s robust methodology and focus on context-dependent optimal sizing provide actionable guidance for designing and deploying models where long context and inference efficiency matter.
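To see why the linear-versus-quadratic gap matters at deployment time, a back-of-envelope per-token FLOP count is enough. The sketch below compares a quadratic attention term against a context-independent recurrent update; the constants (projection cost ≈ 8d², state-update cost ≈ 12d²) are illustrative assumptions, not measurements of xLSTM or any particular Transformer.

```python
# Rough per-token sequence-mixing cost for one layer, counting a
# multiply-add as 2 FLOPs. Constants are illustrative assumptions.
def attention_flops_per_token(d_model, context_len):
    proj = 8 * d_model**2            # Q, K, V, output projections
    mix = 4 * context_len * d_model  # scores (Q·K^T) + value aggregation over the cache
    return proj + mix

def linear_recurrent_flops_per_token(d_model):
    return 12 * d_model**2           # assumed context-independent state update

d = 4096
for t in (2_048, 16_384, 131_072, 1_048_576):
    ratio = attention_flops_per_token(d, t) / linear_recurrent_flops_per_token(d)
    print(f"context {t:>9,}: attention ≈ {ratio:.1f}x the linear-time cost")
```

Under these assumptions the two are comparable at a few thousand tokens of context, while attention's per-token cost grows roughly linearly with the cache and ends up an order of magnitude or more higher at very long contexts, which is the regime where the summary says xLSTM's advantage widens.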