Spokesperson: Maurizio Frizziero – Vice President, Chilled Water Systems at Vertiv
Q: Why has thermal chain efficiency become such a critical issue in AI infrastructure?
AI infrastructure has fundamentally altered the thermal equation inside data centers. Unlike traditional enterprise workloads, AI training and inference concentrate extreme compute density into smaller physical footprints, often exceeding 40–100 kW per rack and in some cases reaching hundreds of kW or beyond. This shifts cooling from a background operational concern to a primary constraint on performance, scalability, and energy efficiency.
Thermal chain efficiency matters because every inefficiency is compounded across the system – from chip to rack, row, and entire facility. If waste heat is not removed effectively and predictably at each stage, operators face reduced equipment lifespan, higher power consumption, and limitations on how quickly AI capacity can be deployed. In today’s AI factories, rapid load swings from training and inference cycles create thermal spikes that ripple through the entire ecosystem, making synchronized thermal-power management essential to avoid instability, overprovisioning, or downtime.
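The compounding effect described above can be sketched with purely illustrative numbers. The per-stage overhead fractions below are assumptions for the example, not Vertiv specifications; the point is that each stage's parasitic power (pumps, fans, compressors) becomes additional heat for the stages downstream.

```python
# Illustrative sketch: how small per-stage losses compound across the
# thermal chain. Overhead fractions are hypothetical, not vendor data.
stages = {
    "cold plate / rack loop": 0.03,
    "row distribution (CDU)": 0.05,
    "facility water loop":    0.08,
    "heat rejection plant":   0.20,
}

it_load_kw = 1000.0  # 1 MW of IT heat to remove

heat = it_load_kw
cooling_power = 0.0
for stage, overhead in stages.items():
    stage_power = heat * overhead   # parasitic power to move this heat
    cooling_power += stage_power
    heat += stage_power             # that power is itself heat downstream

print(f"Total cooling power: {cooling_power:.0f} kW "
      f"({cooling_power / it_load_kw:.1%} of IT load)")
```

With these numbers the chained total (~402 kW) exceeds the naive sum of the individual overheads (360 kW), which is why optimizing each link in isolation understates the system-level gain.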
Q: How does Vertiv define the “thermal chain” in modern AI data centers?
The thermal chain encompasses the complete path heat takes from the silicon to the external environment. This includes on-chip thermal interfaces, cold plates or heat sinks, rack-level cooling systems (such as direct-to-chip or immersion), fluid or air distribution, waste heat rejection infrastructure (like chillers and dry coolers), and the control software that orchestrates these elements. Through its comprehensive services portfolio, Vertiv covers the entire thermal chain end to end – from initial design and commissioning through ongoing optimization – enabling continuous reliability via expert deployment and proactive maintenance.
In AI environments, the thermal chain should be treated as an integrated system rather than a collection of discrete components. Optimizing one link in isolation could expose inefficiencies elsewhere, which is why Vertiv emphasizes end-to-end thermal design. This should incorporate a connected thermal-power ecosystem that couples thermal responses directly to real-time IT power draw for coherent, fault-tolerant operation.
Q: What has changed about cooling strategies as AI workloads scale?
Air cooling alone is no longer sufficient for many AI deployments. As rack densities rise to 100+ kW and beyond, liquid cooling has shifted from niche to necessity. Options include direct-to-chip, immersion for extreme densities, rear-door heat exchangers (RDHx) for retrofits and upgrades to meet higher demands, or hybrid architectures combining multiple approaches.
Vertiv supports scalable solutions like in-rack and in-row cooling distribution units (CDUs) up to 2,300 kW (e.g., Vertiv™ CoolChip CDU), immersion systems (e.g., Vertiv™ CoolCenter Immersion at 25–240 kW+), and modular hybrid units (e.g., Vertiv™ CoolPhase Flex, combining refrigerant-based air and direct-to-chip liquid cooling). The shift requires rethinking facility design, redundancy models, serviceability, and operational workflows to balance performance with uptime, safety, long-term scalability, and heat reuse potential. Standardized, well-engineered solutions enable phased expansion without disruptive retrofits.
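As a rough illustration of how a per-unit CDU capacity translates into plant sizing, a back-of-envelope count might look like the sketch below. The redundancy model, rack count, and per-rack load are assumptions chosen for the example; only the 2,300 kW capacity figure comes from the text above.

```python
import math

# Back-of-envelope CDU count for a liquid-cooled hall.
# Redundancy model and load figures are illustrative assumptions.
def cdus_required(hall_load_kw: float, cdu_capacity_kw: float = 2300.0,
                  redundancy: int = 1) -> int:
    """Return N + redundancy CDUs needed to carry the hall load."""
    n = math.ceil(hall_load_kw / cdu_capacity_kw)
    return n + redundancy

# Example: 80 racks at 100 kW each, N+1 redundancy
print(cdus_required(80 * 100.0))  # → 5 (4 duty + 1 standby)
```

A phased-expansion plan would rerun this kind of calculation per build stage, which is where standardized unit capacities simplify the math.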
Q: How does thermal efficiency intersect with power availability and sustainability goals?
Thermal inefficiency directly translates into wasted power. In AI data centers, where energy demand strains regional grids, reducing cooling overhead – one of the fastest ways to improve PUE (power usage effectiveness) – becomes critical. Efficient thermal chains enable higher inlet temperatures (e.g., up to 40–50°C supply water), greater waste heat reuse (e.g., for district heating at 35–45°C), and smoother integration with alternative energy sources. High-temperature-capable free-cooling technologies (e.g., Vertiv™ CoolLoop Trim Cooler) can deliver up to 70% annual cooling energy savings in favorable climates, reduce footprint by roughly 40%, and provide flexibility to address unpredictable conditions.
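The PUE relationship referenced here can be made concrete with a minimal sketch. The power figures are hypothetical, chosen only to mirror the scale of the cooling-energy-savings claim above; PUE itself is simply total facility power divided by IT power.

```python
# Minimal PUE sketch with hypothetical numbers (not measured figures).
def pue(it_kw: float, cooling_kw: float, other_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT power."""
    return (it_kw + cooling_kw + other_kw) / it_kw

it_kw = 1000.0
other_kw = 80.0  # lighting, power conversion losses, etc.

baseline = pue(it_kw, cooling_kw=400.0, other_kw=other_kw)      # mechanical cooling
free_cooling = pue(it_kw, cooling_kw=120.0, other_kw=other_kw)  # ~70% less cooling power

print(f"Baseline PUE:     {baseline:.2f}")      # 1.48
print(f"Free-cooling PUE: {free_cooling:.2f}")  # 1.20
```

Because cooling is typically the largest non-IT consumer, cutting it is usually the fastest single lever on PUE, which is the point the answer is making.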
Centrifugal technology becomes the right choice when the application demands maximum cooling efficiency regardless of ambient conditions, at very high loads, or where lower supply water temperatures are still required, preserving more power for AI loads. It provides a stable, scalable backbone for heat rejection when free-cooling opportunities are limited or when operational flexibility and resilience take precedence over the efficiency gains of compressor-less operation.
Such solutions can also help reduce carbon footprints by using low-GWP (global warming potential) refrigerants and supporting regulatory paths like the EU F-Gas Regulation, turning waste heat into a resource while optimizing overall power usage and emissions.
Q: What role does intelligence and automation play in managing thermal chains?
AI infrastructure requires cooling systems as dynamic as the workloads they support. Static designs struggle with fluctuating utilization, bursty loads, and evolving hardware. Vertiv’s unified controls architecture – spanning unit-level intelligence (e.g., Vertiv™ Liebert® iCOM™ with self-healing and protective logic), supervisory orchestration (harmonizing setpoints across zones), and chilled-water plant management (e.g., Vertiv™ Liebert® iCOM™ CWM using machine learning (ML) and digital twins for predictive staging, anomaly detection, and optimization) – enables real-time, load-aware fine-tuning. This reduces energy waste, improves resilience, and responds faster to anomalies.
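A toy version of load-aware setpoint adjustment can illustrate the idea. The linear control law, limits, and function name below are invented for illustration and are not the actual Liebert® iCOM™ logic: as IT load drops, the supply temperature is allowed to rise, extending free-cooling hours while guarding against spikes.

```python
# Toy sketch of load-aware setpoint adjustment; the control law and
# temperature limits are illustrative assumptions, not a vendor algorithm.
def supply_setpoint_c(rack_load_fraction: float,
                      min_c: float = 18.0, max_c: float = 32.0) -> float:
    """Raise chilled-water supply temperature as load drops, within limits."""
    load = min(max(rack_load_fraction, 0.0), 1.0)  # clamp telemetry to [0, 1]
    return max_c - load * (max_c - min_c)

print(supply_setpoint_c(1.0))   # 18.0 at full load
print(supply_setpoint_c(0.25))  # 28.5 at light load
```

A real supervisory layer would add rate limiting, zone harmonization, and predictive staging, but even this sketch shows why static setpoints waste energy under bursty AI loads.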
Platforms like Vertiv™ Unify integrate thermal and electrical data for a single pane of glass, while AI-powered services (e.g., the recently launched Vertiv™ Next Predict) provide predictive maintenance across power and cooling.
Q: What are the most common misconceptions operators have about cooling AI environments?
One common misconception is that higher capacity alone solves thermal challenges. In reality, oversizing without addressing distribution efficiency, control integration, and workload dynamics leads to higher costs and lower performance. Another is that liquid cooling automatically introduces complexity. When designed holistically – with modular CDUs, hybrid heat rejection systems, and unified controls – it actually simplifies management, reduces airflow constraints, improves predictability, and outperforms air cooling pushed beyond its limits.
Operators sometimes overlook the need for connected power-thermal orchestration, assuming silos suffice, yet unsynchronized responses to AI load swings can cause instability or inefficiency.
Q: How should organizations approach thermal planning when designing AI-ready data centers?
Thermal planning should begin alongside compute and power planning, not after. Model future workload density (including GPU roadmaps and burst patterns), understand hardware evolution, and design flexible facilities that evolve without major retrofits.
Organizations should implement architectures that support phased expansion: hybrid air/liquid mixes, modular prefabricated solutions (e.g., Vertiv™ MegaMod HDX or Vertiv™ SmartRun for faster deployment), vendor-compatible components, and monitoring across the full thermal chain. Prioritize adaptability for high-temperature operations, free-cooling maximization, and integration with power systems. Flexibility is critical, as AI infrastructure evolves faster than traditional lifecycles, with gigawatt-scale reference designs (e.g., for NVIDIA platforms) compressing deployment timelines by up to 50%.
Q: Looking ahead, how will thermal chain efficiency shape the future of AI infrastructure?
Thermal efficiency will increasingly determine where AI can be deployed, how quickly it scales, and how economically it operates. As compute density rises and AI factories push toward gigawatt scale, the organizations that succeed will treat thermal management – and its tight coupling with power – as a core discipline, not a supporting function. Vertiv’s connected ecosystem spans the entire thermal chain, from chip-level heat capture (direct-to-chip, immersion) to plant-side intelligence (screw, inverter screw, centrifugal, and trim cooler technologies, plus modular hybrid heat rejection solutions). Unified unit- and system-level controls bring these elements together to unlock performance, efficiency gains, and fault tolerance.
Cooling could be a limiting factor in digital growth. Effective planning with integrated, intelligent, hybrid and modular strategies enables AI innovation while addressing energy, grid, and efficiency constraints.
