DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark (forums.developer.nvidia.com)

🤖 AI Summary
DeepSeek-V4-Flash (official FP8) has been deployed across two DGX Spark nodes with a context size of 200,000 tokens, showing that a high-capacity model can run when split across multiple nodes. The nodes were linked by direct QSFP56 200G connections, and the setup relied on pinned vLLM commits and tuned runtime flags for performance and stability. Reported decode speed is about 44 tokens per second on a single stream and about 45 tokens per second with concurrency. Cold start times and long-context handling remain open problems that could limit practical use. The post shares these details to help others running similar configurations and to encourage further work on performance and concurrency at large context sizes.
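To put the reported decode rates in perspective, the short Python sketch below converts a tokens-per-second figure into wall-clock generation time and computes the relative change between the two reported rates. The function names and the 1,000-token response length are illustrative assumptions, not taken from the original post. The summary also does not say whether the concurrent figure is aggregate or per-stream, so the sketch simply compares the two numbers as given.

```python
# Rates as reported in the summary (approximate figures).
SINGLE_STREAM_TPS = 44.0   # tokens/second, single stream
CONCURRENT_TPS = 45.0      # tokens/second, with concurrency


def decode_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Return the time in seconds to decode num_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def relative_gain(baseline: float, improved: float) -> float:
    """Return the fractional change from baseline to improved."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical response length, chosen only for illustration.
response_tokens = 1000

print(f"{decode_seconds(response_tokens, SINGLE_STREAM_TPS):.1f} s at 44 tok/s")
print(f"{decode_seconds(response_tokens, CONCURRENT_TPS):.1f} s at 45 tok/s")
print(f"relative gain: {relative_gain(SINGLE_STREAM_TPS, CONCURRENT_TPS):.1%}")
```

This covers decode time only. It ignores prompt processing and the cold start cost that the summary flags as a remaining challenge.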