Infinite scale: The architecture behind the Azure AI superfactory (blogs.microsoft.com)

🤖 AI Summary
Microsoft today unveiled a new Fairwater Azure AI datacenter in Atlanta, linked to its Wisconsin Fairwater site and the broader Azure footprint, creating what it calls the world's first planet-scale "AI superfactory." The design abandons traditional cloud-datacenter assumptions in favor of a single flat cluster fabric that can integrate hundreds of thousands of NVIDIA Blackwell GPUs (GB200/GB300 family), serving frontier-model training and a range of AI workloads (pretraining, fine-tuning, reinforcement learning, synthetic data generation) with dynamic allocation and higher utilization.

Fairwater achieves its compute density and low latency through closed-loop liquid cooling (≈140 kW per rack, 1,360 kW per row), two-story rack placement that shortens cable runs, and racks of up to 72 GPUs linked by NVLink (≈1.8 TB/s of GPU-to-GPU bandwidth, with more than 14 TB of pooled memory accessible to each GPU). Scale-out comes from a two-tier Ethernet backend (800 Gbps GPU-to-GPU connectivity), a SONiC-based switch stack, and a custom Multi-Path Reliable Connected (MRC) protocol for advanced congestion control and rapid retransmission.

Microsoft also built a dedicated AI WAN (120,000+ new fiber miles) to stitch sites together into an elastic supercomputer, while relying on resilient grid power plus software, hardware, and energy-storage techniques to smooth demand. The result is a more cost-efficient, sustainable, and latency-optimized platform for training ever-larger models, one that gives developers more fit-for-purpose networking and compute at hyperscale.
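The quoted figures imply some useful back-of-envelope ratios. A minimal sketch, using only the per-rack power, per-row power, and GPUs-per-rack numbers from the summary (the derived values are simple arithmetic, not figures Microsoft published):

```python
# Back-of-envelope density figures from the Fairwater specs quoted above.
# Inputs come from the summary; derived values are illustrative arithmetic only.

RACK_POWER_KW = 140      # liquid-cooled rack power envelope
ROW_POWER_KW = 1_360     # per-row power envelope
GPUS_PER_RACK = 72       # NVL72-style rack of Blackwell GPUs

racks_per_row = ROW_POWER_KW / RACK_POWER_KW    # implied racks per row
kw_per_gpu = RACK_POWER_KW / GPUS_PER_RACK      # implied power budget per GPU
gpus_per_row = GPUS_PER_RACK * racks_per_row    # implied GPUs per row

print(f"racks per row ≈ {racks_per_row:.1f}")   # ≈ 9.7
print(f"kW per GPU   ≈ {kw_per_gpu:.2f}")       # ≈ 1.94
print(f"GPUs per row ≈ {gpus_per_row:.0f}")     # ≈ 699
```

At roughly 2 kW per GPU, the appeal of closed-loop liquid cooling over air cooling at this rack density is clear.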