Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism (mlsys.wuklab.io)

0 points 1 hour ago | visit original

🤖 AI Summary

Nitsum is an LLM serving system that uses adaptive tensor parallelism (TP) to serve workloads with different latency requirements under a fixed GPU budget. Where existing systems treat the TP degree as a static deployment choice, Nitsum adjusts it at runtime to maximize the number of requests that meet both their Time To First Token (TTFT) and Time Per Output Token (TPOT) service-level objectives (SLOs), improving goodput by up to 5.3 times over current state-of-the-art systems. This lets a single model deployment handle both latency-sensitive interactive requests and slower background tasks without a dedicated cluster for each workload type. To keep reconfiguration cheap, Nitsum combines low-overhead TP switching, KV cache migration, and a scheduling policy that moves between configurations as demand changes while keeping throughput high. The result is more efficient use of GPUs and lower serving cost.
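To make the goodput metric in the summary concrete, here is a minimal Python sketch that counts only requests meeting both their TTFT and TPOT targets. It is not Nitsum's code: the `Request` fields, the function names, and the choice to express goodput as SLO-compliant requests per second are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Request:
    """Hypothetical per-request measurements and per-tier targets (seconds)."""
    ttft_s: float      # measured time to first token
    tpot_s: float      # measured mean time per output token
    ttft_slo_s: float  # TTFT target for this request's tier
    tpot_slo_s: float  # TPOT target for this request's tier


def meets_slo(r: Request) -> bool:
    # A request counts only if BOTH latency targets are met.
    return r.ttft_s <= r.ttft_slo_s and r.tpot_s <= r.tpot_slo_s


def goodput(requests: Iterable[Request], window_s: float) -> float:
    """SLO-compliant requests per second over a measurement window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return sum(meets_slo(r) for r in requests) / window_s
```

Because each request carries its own targets, an interactive tier and a background tier can be scored together: a background request with a slow first token still counts as long as it stays within its looser limits.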
