4× RTX Pro 6000 Blackwell on Water, and the One Card That Wouldn't Behave (sabareesh.com)

🤖 AI Summary
A new training rig built around four RTX PRO 6000 Blackwell GPUs has been converted to custom water cooling. Each card draws up to 600 W, so the four together produce about 2.4 kW of heat, which makes air cooling impractical for training runs that last from hours to days. The builder fitted each GPU with a waterblock and plumbed them into a dedicated cooling loop. Assembly went smoothly, but one GPU kept "falling off the bus" under load. The cause turned out to be a power inductor that had been accidentally knocked off the board during installation. After the missing inductor was resoldered, all four GPUs ran stably under heavy workloads at low temperatures. The rig now reaches an aggregate of 840 TFLOPS during training and also serves as an inference endpoint. The incident is a reminder to handle components carefully when modifying GPUs, since a single small part dislodged from the board was enough to make a card unstable under load.
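The arithmetic behind the summary's figures can be sketched in a few lines. This is a rough illustration, not the builder's own calculation: it assumes that all electrical power drawn by the cards ends up as heat in the loop and that the 840 TFLOPS aggregate is split evenly across the four cards. The function names are hypothetical.

```python
# Figures taken from the summary above.
GPU_COUNT = 4
WATTS_PER_GPU = 600        # per-card power draw under load
AGGREGATE_TFLOPS = 840     # reported aggregate during training


def total_heat_kw(gpu_count: int, watts_per_gpu: float) -> float:
    """Total heat load in kW, assuming all drawn power becomes heat."""
    return gpu_count * watts_per_gpu / 1000


def per_gpu_tflops(aggregate_tflops: float, gpu_count: int) -> float:
    """Per-card throughput, assuming an even split (an assumption)."""
    return aggregate_tflops / gpu_count


if __name__ == "__main__":
    print(total_heat_kw(GPU_COUNT, WATTS_PER_GPU))      # 2.4
    print(per_gpu_tflops(AGGREGATE_TFLOPS, GPU_COUNT))  # 210.0
```

Under those assumptions the loop has to carry away roughly 2.4 kW continuously, and each card contributes about 210 TFLOPS to the aggregate.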