Google Aims to Flip the Script on AI Inference with New Ironwood TPUs (www.hpcwire.com)

🤖 AI Summary
Google Cloud this month is rolling out its seventh‑generation TPU, Ironwood, a purpose‑built inference accelerator designed to capture more of the growing market for serving foundation models. Each Ironwood chip delivers 4.6 petaFLOPS of FP8 throughput (outpacing Nvidia's B200 at 4.5 PFLOPS, though slightly under the GB200's 5 PFLOPS) and packs 192 GB of HBM3E with 7.2 TB/s of chip I/O, both big jumps over Google's prior Trillium TPU. Google is offering 256‑chip and 9,216‑chip pods; the largest pod exposes 1.77 PB of shared HBM as a single high‑speed fabric and can peak at about 42.5 FP8 ExaFLOPS (the back‑of‑envelope math below shows how those pod numbers follow from the per‑chip specs).

Chips are linked with Google's Inter‑Chip Interconnect at 1.2 TB/s bidirectional, and Optical Circuit Switching plus the Jupiter datacenter network can stitch hundreds of pods into unified clusters spanning hundreds of thousands of TPUs. The practical payoff is lower latency and massive, unified memory for real‑time model serving: Google already uses Ironwood to run Gemini across services like Search, YouTube and Gmail, and claims nine of the top ten AI labs run on Google Cloud.

On the infrastructure side, Ironwood brings third‑generation liquid cooling (roughly 1 kW per chip, so a fully loaded 9,216‑chip pod draws on the order of 10 MW), a Cloud Storage "Anywhere Cache" that can cut read latency by up to 96%, and software contributions (LLM‑d for Kubernetes‑native distributed inference, plus vLLM integration, sketched below) to ease scaling and portability between TPUs and GPUs. For teams deploying large‑scale inference, Ironwood signals tighter integration of high‑bandwidth memory, fabric‑level sharing and network optics to push latency and throughput beyond conventional GPU clusters.
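To make the pod‑level figures concrete, here is a quick back‑of‑envelope check. This is a sketch, not official sizing: the per‑chip numbers are the ones quoted above, and the ~1 kW/chip power figure is the rough number cited for the liquid‑cooling design.

```python
# Back-of-envelope check of the Ironwood pod figures quoted in the summary.
# Per-chip numbers come from the article; treat the results as rough estimates.

CHIPS_PER_POD = 9_216          # largest pod size
FP8_PFLOPS_PER_CHIP = 4.6      # peak FP8 petaFLOPS per chip
HBM_GB_PER_CHIP = 192          # HBM3E capacity per chip, in GB
KW_PER_CHIP = 1.0              # assumed rough power per liquid-cooled chip ("~1 kW")

pod_exaflops = CHIPS_PER_POD * FP8_PFLOPS_PER_CHIP / 1_000   # PFLOPS -> EFLOPS
pod_hbm_pb = CHIPS_PER_POD * HBM_GB_PER_CHIP / 1_000_000     # GB -> PB (decimal)
pod_power_mw = CHIPS_PER_POD * KW_PER_CHIP / 1_000           # kW -> MW

print(f"Peak FP8 compute:    {pod_exaflops:.1f} EFLOPS")   # ~42.4, in line with the ~42.5 EF claim
print(f"Shared HBM:          {pod_hbm_pb:.2f} PB")         # ~1.77 PB
print(f"Power at ~1 kW/chip: {pod_power_mw:.1f} MW")       # ~9.2 MW, i.e. on the order of 10 MW
```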
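The vLLM integration mentioned above is the portability story from the user's side. As a rough illustration only, here is a minimal vLLM offline‑inference sketch; the model name, prompt and parallelism setting are illustrative assumptions rather than anything from the announcement, and the point is that the same script targets whichever accelerator backend (GPU or TPU) the vLLM installation supports.

```python
# Minimal vLLM offline-inference sketch. Model name and tensor_parallel_size are
# illustrative assumptions; vLLM runs against the accelerator backend it was
# installed for (GPU or TPU builds exist), which is what makes scripts like this
# portable between the two.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the key specs of an inference accelerator in one sentence.",
]
sampling = SamplingParams(temperature=0.7, max_tokens=128)

# Hypothetical checkpoint; swap in whatever model you actually serve.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=8)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```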