At Google Cloud Next ’25, Google unveiled Ironwood, its seventh-generation Tensor Processing Unit (TPU), the company’s most powerful, scalable, and energy-efficient custom AI accelerator to date, and the first designed specifically for inference.
Ironwood marks a pivotal evolution in the trajectory of AI and the infrastructure that supports it. It reflects a shift from reactive AI systems that deliver real-time data for human interpretation to proactive models capable of independently generating insights and context. This leap defines the dawn of what Google calls the “age of inference,” a new era in which intelligent agents do more than respond: they anticipate, analyze, and collaborate to produce actionable knowledge.
Ironwood is built to support this next phase of generative AI and its tremendous computational and communication requirements. It scales up to 9,216 liquid-cooled chips linked with breakthrough Inter-Chip Interconnect (ICI) networking spanning nearly 10 MW. Ironwood is one of several new components of Google Cloud's AI Hypercomputer architecture, which optimizes hardware and software together for the most demanding AI workloads. With Ironwood, developers can also leverage Google’s Pathways software stack to reliably and easily harness the combined computing power of tens of thousands of Ironwood TPUs.
Powering the age of inference with Ironwood
Ironwood is designed to gracefully manage the complex computation and communication demands of “thinking models,” which encompass Large Language Models (LLMs), Mixture of Experts (MoE) models, and advanced reasoning tasks. These models require massive parallel processing and efficient memory access. In particular, Ironwood is designed to minimize data movement and latency on-chip while carrying out massive tensor manipulations. At the frontier, the computation demands of thinking models extend well beyond the capacity of any single chip. We designed Ironwood TPUs with a low-latency, high-bandwidth ICI network to support coordinated, synchronous communication at full TPU pod scale, as illustrated in the sketch below.
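To make this concrete, here is a minimal JAX sketch of the communication pattern involved: a matrix multiplication sharded over a one-dimensional device mesh, with partial products combined by a synchronous all-reduce. On a TPU pod, that collective runs over the ICI fabric; the axis name and array shapes here are illustrative assumptions, not Ironwood specifics.

```python
import functools
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

# One mesh axis spanning every available chip (falls back to 1 device on CPU).
mesh = Mesh(np.array(jax.devices()), axis_names=("x",))

@functools.partial(shard_map, mesh=mesh,
                   in_specs=(P(None, "x"), P("x", None)),
                   out_specs=P(None, None))
def matmul_allreduce(a_block, b_block):
    # Each chip contracts its local slice of the shared K dimension, then a
    # synchronous all-reduce (psum) across the "x" axis combines the partial
    # products. On a TPU pod, this collective runs over the ICI links.
    return jax.lax.psum(a_block @ b_block, axis_name="x")

a = jnp.ones((128, 512))   # K = 512 must divide evenly across the mesh
b = jnp.ones((512, 256))
out = matmul_allreduce(a, b)   # shape (128, 256); every entry equals 512.0
```

The psum call is the point where every chip in the mesh must synchronize, which is why low-latency, high-bandwidth interconnect matters so much at pod scale.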
For Google Cloud customers, Ironwood comes in two sizes based on AI workload demands: a 256-chip configuration and a 9,216-chip configuration.
- When scaled to 9,216 chips per pod, Ironwood delivers a total of 42.5 Exaflops (9,216 chips × 4,614 TFLOPs peak compute per chip), more than 24x the compute power of the world’s largest supercomputer, El Capitan, which offers 1.7 Exaflops. Ironwood provides the massive parallel processing power necessary for the most demanding AI workloads, such as very large dense LLMs or MoE models with thinking capabilities, for both training and inference. This represents a monumental leap in AI capability, and Ironwood’s memory and network architecture ensures that the right data is always available to sustain peak performance at this massive scale.
- Ironwood also features an enhanced SparseCore, a specialized accelerator for processing the ultra-large embeddings common in advanced ranking and recommendation workloads. Expanded SparseCore support in Ironwood accelerates a wider range of workloads, extending beyond traditional AI into financial and scientific domains.
- Pathways, Google’s ML runtime developed by Google DeepMind, enables efficient distributed computing across multiple TPU chips. Pathways on Google Cloud makes moving beyond a single Ironwood pod straightforward, enabling hundreds of thousands of Ironwood chips to be composed together to rapidly advance the frontiers of gen AI computation (see the sketch after this list).
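As a hedged sketch of what scaling past a single pod looks like from a developer's perspective, the snippet below uses standard multi-host JAX, the open pattern that Pathways on Google Cloud builds on; it is not the Pathways API itself, and the "data"/"model" mesh axes are illustrative assumptions.

```python
import numpy as np
import jax
from jax.sharding import Mesh

# Connect this host to every other host in the deployment. On Cloud TPU,
# the coordinator address and process counts are auto-detected.
jax.distributed.initialize()

# One global mesh over every chip visible across all hosts, split into
# assumed "data" and "model" axes for data- and model-parallelism.
all_devices = np.array(jax.devices()).reshape(jax.process_count(), -1)
mesh = Mesh(all_devices, axis_names=("data", "model"))

print(f"process {jax.process_index()}/{jax.process_count()}: "
      f"{jax.local_device_count()} local chips, "
      f"{jax.device_count()} chips in the global mesh")
```

The same program runs unmodified on one host or hundreds; the runtime is what changes how many chips the global mesh spans.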
Ironwood’s key features
Google Cloud is the only hyperscaler with more than a decade of experience delivering AI compute to support cutting-edge research, seamlessly integrated into planetary-scale services that serve billions of users every day through Gmail, Search, and more. All of this expertise is at the heart of Ironwood’s capabilities. Key features include:
- Significant performance gains paired with a focus on power efficiency, allowing AI workloads to run more cost-effectively. Ironwood delivers 2x the performance per watt of Trillium, Google’s sixth-generation TPU announced last year. At a time when available power is one of the key constraints on delivering AI capabilities, Ironwood provides significantly more capacity per watt for customer workloads. Google’s advanced liquid cooling solutions and optimized chip design can reliably sustain up to twice the performance of standard air cooling, even under continuous, heavy AI workloads. In fact, Ironwood is nearly 30x more power efficient than the first Cloud TPU from 2018.
- Substantial increase in High Bandwidth Memory (HBM) capacity. Ironwood offers 192 GB per chip, 6x that of Trillium, which enables processing of larger models and datasets, reducing the need for frequent data transfers and improving performance.
- Dramatically improved HBM bandwidth, reaching 7.2 TBps per chip, 4.5x Trillium’s. This high bandwidth ensures rapid data access, crucial for the memory-intensive workloads common in modern AI.
- Enhanced Inter-Chip Interconnect (ICI) bandwidth, increased to 1.2 Tbps bidirectional, 1.5x Trillium’s, enabling faster communication between chips and facilitating efficient distributed training and inference at scale.
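To put the 192 GB per-chip HBM figure in context, here is a hedged back-of-envelope sizing helper; the function name, example model size, and bytes-per-parameter are illustrative assumptions, and real deployments also need headroom for activations, KV cache, and runtime overhead.

```python
import math

def chips_for_weights(n_params: float,
                      bytes_per_param: int = 2,      # bfloat16 weights
                      hbm_per_chip_gb: int = 192) -> int:
    """Minimum chips needed just to hold the model weights in HBM.

    Ignores activations, KV cache, and optimizer state, which add
    significant overhead on top of the raw weight footprint.
    """
    weight_gb = n_params * bytes_per_param / 1e9
    return max(1, math.ceil(weight_gb / hbm_per_chip_gb))

# e.g. a hypothetical 700B-parameter model in bfloat16 is ~1,400 GB of
# weights alone, so it needs at least 8 chips before any runtime overhead.
print(chips_for_weights(700e9))  # -> 8
```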
Ironwood represents a unique breakthrough in the age of inference, combining increased computation power, memory capacity, ICI networking advancements, and reliability. These advances, coupled with a nearly 2x improvement in power efficiency, mean that the most demanding customers can take on training and serving workloads with the highest performance and lowest latency, all while meeting the exponential rise in computing demand.
