🤖 AI Summary
Google announced Ironwood, its next-generation TPU and tightly co-designed software stack built to accelerate training and inference of very large foundation models. Ironwood treats a TPU pod as a single supercomputer: each chip packs eight HBM3E stacks (192 GiB per chip) with 7.4 TB/s of peak HBM bandwidth, and a full superpod exposes 1.77 PB of pooled HBM. Chips tile into cubes (64 chips), pods (e.g., 256 chips) and superpods (e.g., 9,216 chips), with a full superpod delivering 42.5 ExaFLOPS of FP8 compute, all connected via a dense 3D torus inter-chip interconnect (ICI) and a reconfigurable Optical Circuit Switch (OCS) fabric. The network can be dynamically rerouted around failures and provisioned into slices for mixed workloads, while liquid cooling and design improvements yield roughly 2× perf/W over the prior TPU generation and about 30× versus the original 2018 Cloud TPU.
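To make the "slice" idea concrete, here is a minimal sketch of how a provisioned slice looks from software: the chips appear to JAX as a flat list of devices that can be arranged into a logical mesh. The device count, the 2D mesh split, and the axis names are illustrative assumptions, not details from the announcement.

```python
# Minimal sketch (mesh shape and axis names are illustrative assumptions):
# a provisioned TPU slice appears to JAX as a flat list of devices that can be
# arranged into a logical mesh for data- and model-parallel axes.
import jax
import numpy as np
from jax.sharding import Mesh

devices = jax.devices()                      # every chip in the attached slice
n = len(devices)
model_axis = 4 if n % 4 == 0 else 1          # pick a split that divides the slice evenly
mesh = Mesh(np.array(devices).reshape(n // model_axis, model_axis),
            axis_names=("data", "model"))
print(f"{n} devices arranged as logical mesh {dict(mesh.shape)}")
```

The same mesh axes are what sharding annotations and shard_map map program parallelism onto, which is how the software stack described next exploits a slice.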
On the software side, Ironwood relies on XLA plus both JAX and a native PyTorch/XLA path to map high-level code to fused kernels that saturate its MXU (matrix) and VPU (vector) units. JAX primitives (jit, grad, shard_map) and libraries (Optax, Orbax, Qwix, Metrax, Tunix, Goodput) enable large-scale training, checkpointing, quantization and monitoring, while the PyTorch path supports eager execution and torch.compile lowering to XLA for easier porting. High-level frameworks like MaxText (JAX) and vLLM (inference) are optimized for pretraining, RL/finetuning workflows (actor rollouts + learners), and low-latency, high-throughput serving. The result is a highly resilient, energy-efficient platform that lowers friction for scaling models from research to multi-exaflop production runs.
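As a rough illustration of the JAX path described above, below is a minimal single-host sketch of a jit-compiled training step using jax.value_and_grad and Optax. The toy linear model, shapes, and hyperparameters are placeholders rather than anything from MaxText or Google's stack, and sharding via shard_map or mesh annotations is omitted for brevity.

```python
# Hedged sketch of a JAX training step: jit-compiled via XLA, gradients from
# jax.value_and_grad, optimizer state from Optax. Model and data are toy
# placeholders, not taken from MaxText or any Ironwood example.
import jax
import jax.numpy as jnp
import optax

def loss_fn(params, batch):
    # Toy linear model: predictions = x @ w, mean-squared error against y.
    preds = batch["x"] @ params["w"]
    return jnp.mean((preds - batch["y"]) ** 2)

optimizer = optax.adamw(learning_rate=1e-3)

@jax.jit  # XLA compiles forward, backward, and update into one fused program
def train_step(params, opt_state, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, loss

# Example usage with random parameters and dummy data.
key = jax.random.PRNGKey(0)
params = {"w": jax.random.normal(key, (128, 16))}
opt_state = optimizer.init(params)
batch = {"x": jnp.ones((32, 128)), "y": jnp.zeros((32, 16))}
params, opt_state, loss = train_step(params, opt_state, batch)
```

Checkpointing (Orbax), quantization (Qwix), and metrics (Metrax, Goodput) would wrap a step like this in a production run; they are not shown here.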