The global AI race is being fought on multiple fronts: in semiconductor fabs, in hyperscaler boardrooms, and in the algorithmic depths of foundation model laboratories. But increasingly, it is being decided in a far less glamorous place: the cooling infrastructure beneath the racks. As GPU thermal design points vault from 700 watts to 1,400 watts to a projected 2,300 watts per processor in the span of just a few years, the question of how to remove that heat reliably, efficiently, and at scale has become one of the defining engineering challenges of the decade.

Into this white-hot arena steps Refroid, an India-headquartered deep-tech startup that is doing something the global data center industry has largely overlooked: building liquid cooling infrastructure that is not just capable of rejecting heat, but intelligent enough to understand the workloads generating it. Sanjana Mandavia, CEO at Compute Forecast, sat down with Satya Bhavaraju, Co-founder and CEO of Refroid Technologies, for a wide-ranging conversation spanning the state of the data center paradigm, the engineering complexity behind Refroid's flagship products (the ThermIon Hybrid Load Bank and the SentraFLO CDU), India's moment on the global infrastructure map, and where the thermal management industry is headed over the next decade.
Here is the conversation in full.
Sanjana Mandavia: Before we get into the products, how broken is today’s data center paradigm in the AI era?
Satya: Data centers are absolutely critical. Stating the obvious, but demand from AI, whether it's training, inference, or agentic workloads, is driving massive demand for compute, and that compute needs to be hosted somewhere. Data centers, with their power, cooling, and network infrastructure, are the places where this compute is hosted. So we are going through an unprecedented, once-in-a-generation boom cycle for data center infrastructure, and the AI hosted in these data centers will transform society, businesses, learning, healthcare, and life in general as we have known it for the last 50 to 60 years.
Sanjana Mandavia: Are we fundamentally underestimating the cooling challenges as compute scales?
Satya: The reason I say that is the velocity of the compute. If you look at what has been happening, particularly on the GPU side, consider the velocity with which Nvidia, for example, and others too, have been launching their new GPUs. Starting, let's say, with the H100s and H200s that went up to 700 watts per processor, then moving to the Blackwell generation where some configurations go up to 1,400 watts per processor, all in the space of two and a half years. And now, even as Blackwell is shipping mainstream, two weeks ago we saw the official announcement of the Rubin architecture that could take the power densities and thermal design points, as they call them, all the way to 2,300 watts per processor.
Traditionally, the cooling industry has never encountered such dramatic step-function increases in processor energy consumption, nor had to adopt the appropriate cooling technologies at this pace. This acceleration in cooling demands is happening far too rapidly, and I believe it will continue for the next five to six years as new process nodes and new architectures come out. Naturally, we've moved, or are rapidly moving, away from air cooling. In the next two years, data centers focused on AI may not have any air-cooled sections at all; it will be completely liquid cooled, perhaps by the 2028 or 2029 time frame. But even within liquid, the changing TDPs are making things extremely challenging, and those challenges are still underappreciated unless you're in the liquid cooling industry trying to build solutions that chase these ever-growing TDPs.
Sanjana Mandavia: What gap did you see in the market that the hyperscalers and traditional OEMs were not solving?
Satya: You're broadly referring to two segments of the market, though they're a little more nuanced than that. The hyperscalers and neo-clouds are one breed of customer, mostly in North America, mostly building their own data center capacity themselves. But as AI capacity moves out of North America and people strike deals with colocation providers in other parts of the world, specifically in Asia Pacific and the Middle East, the expertise that sits with the hyperscalers and the neo-clouds such as Crusoe or CoreWeave is very difficult to communicate and translate to the hosts, the colocation providers, in the rest of the world. That's the transition the industry is going through right now. Many countries, including India, are building liquid-cooled data centers for the first time. I cannot cite instances, but there have been known cases of extraordinary delays, to the tune of six or seven months, in actually going live with all the infrastructure in place, simply because the cooling has not turned out the way it was supposed to. Those delays mean revenue loss, which means an opportunity cost for the hyperscaler and revenue loss for the colocation provider.
So those challenges exist. But when it comes to enterprise, when you're talking about compute vendors like HP, Dell, Lenovo, and Super Micro, they don't have in-house liquid cooling solutions, but they have been slightly more perceptive in disseminating the technologies and the different designs that have been built elsewhere. They're able to deploy them with a little more ease in the rest of the world, similar to what they have done in North America.
Sanjana Mandavia: Most of the global players are already investing in liquid cooling. What exactly was missing that required Refroid to build something entirely new?
Satya: Fundamentally, we're at the very beginning of the curve. Think of it as when the first car was being put together, the first automobile, the Model T from Ford or similar attempts. People wanted an internal combustion engine that would take them from point A to point B. People didn't think of passenger comfort to the extent we do today. More importantly, people never took into account fuel efficiency, mileage, and those sorts of things. That's the stage the liquid cooling industry is at. We are looking at it as: there is this amount of heat that needs to be removed, so let me put together a system that can take this heat away. The actual energy efficiencies, which come from the power of the algorithms, which come from observing the workloads over a period of time, these things have still not penetrated people's thinking, and we have not seen any offerings yet that tackle them.
If you go back to the very well-publicized example after Google acquired DeepMind: in 2017 they published a white paper that spoke about driving 40% energy efficiencies in their air-cooled data centers using DeepMind. What they did was feed DeepMind two years of data across 500 parameters and ask it to work out which knobs to turn and what to tweak so that they could deliver greater energy efficiency. Now, that is one extreme example, where it's essentially the data center infrastructure management software that is able to do these things. DCIM software today offers simple dashboards, but even in cooling, if you're able to understand rack densities and specific server densities and respond accordingly from the CDU side, I believe there is a huge amount of savings to be had, and we have not, as an industry, entered that era. We're putting in a plain-vanilla solution: this much heat, I need this much flow rate and this much pressure, and I cool that heat and take it away to the primary side for the ultimate heat rejection.
Sanjana Mandavia: The ThermIon Hybrid Load Bank simulates both the air-side and liquid-side loads simultaneously, something that traditional systems cannot do. How difficult was it to engineer this product?

Satya: As an industry, we have evolved rapidly, and all the innovation has been focused on the infrastructure products: the CDU, the secondary fluid network, the innovation in chillers, and so on. One extremely important aspect that has been glossed over by the industry is: how do we commission these liquid-cooled AI data centers? Traditionally, in the enterprise era, we had something called air load banks, which would simply mimic the air-side thermal loads, and that was it. Testing done, your data center is ready to be released to customers or to go live. But when it comes to AI data centers, where part of the workload is cooled with air and a large chunk, sometimes 50% or more, is cooled with liquid, and where you have a very long secondary fluid network, precision plumbing that carries the coolant at a particular flow rate to every single rack, this entire old style of commissioning has somehow just been carried forward.
Today, to commission an AI data center, the status quo is still: you bring in an air-side load bank and test the air side separately; then you bring in something called a bulk load bank to test the CDU separately; then you bring in TAB kits, Test and Balancing kits, to test the secondary fluid network. All of these happen in silos, so you really don't know how the pieces will work together. You're still speculating by drawing curves: this is my air-side thermal dissipation curve and this is my liquid-side curve. But how do they actually work together?
So we decided this is not the way AI data centers should be tested and commissioned, because the compute in each NVL72 rack based on Blackwell 300 costs about $3 million. If you have 100 such racks, you're talking about $300 million in a single data hall. That's too enormous a compute investment for you to take the readiness of the infrastructure for granted. Just because you've designed it a certain way doesn't mean it's going to work that way. So we built these rack emulators, what we call ThermIon Hybrid Load Banks, that can simultaneously mimic the air-side thermal load, the liquid-side thermal load, and the pressure drops in the secondary fluid network, its hydrodynamic characteristics, to use the right technical term. All three simultaneously. This ensures, number one, that you're actually doing an L5 IST, an Integrated Systems Test, as per ASHRAE's guidelines; number two, that you're testing the entire infrastructure from rack all the way to rooftop in one go; and number three, that you're shrinking the time it takes to commission, as opposed to testing air, liquid, and SFN individually and sequentially.
We didn't stop there, because we wanted to advance the commissioning paradigm further. In the enterprise world, you usually have databases, ERPs, or mail servers all possibly running at around 27% steady-state workloads, with no sudden spikes or troughs. In the AI world, it's extremely different. Training workloads are different from inference workloads, which are different from agentic workloads. There are too many thermal transients, and if you don't stress-test your infrastructure to account for the nature of the workloads, then the entire commissioning process is just a perfunctory exercise, not a reflection of how the data center will need to behave once it goes live. So we built a powerful software platform called Calori that can mimic all types of workloads and is highly configurable. For example, if you were building your data hall, or a pod inside the data hall, to handle a specific workload, you can configure Calori to test that pod for those exact workloads. If you're building it for a Claude workload, you could test for exactly that workload, or combine all these types of workloads. This, we believe, is a true representation of the kind of demands that will be placed on the thermal management infrastructure inside a data center, and therefore the absolute right way for the industry to test the infrastructure before the very expensive compute is brought in and turned on.
Sanjana Mandavia: Refroid is bridging that gap by simulating both environments. How accurately can it replicate real AI workloads, especially when GPU clusters are pushing extreme thermal loads?
Satya: From a technology and engineering standpoint, these are digitally controlled loads, so you can go as granular as one watt in what you mimic. What needs to be tested is a discussion between the tenant in a hyperscaler environment, the commissioning agent responsible for the commissioning testing of the data center, and Refroid, who will actually do the configuration for them. The software platform we have engineered is capable of mimicking any type of workload, or any combination of workload types, without limitation. It comes down to the specific requirements of the tenant, the colo, and the commissioning agent; they tell us what they want, and we can script all of it. And there's no programming involved, it's just configuration. Once the scripting is done, you add the tests to a playlist, click the play button, and they start executing automatically.
You could run the test at an entire data hall level, at a pod level, or within a rack at a server level, because of the way we have architected these ThermIon Hybrid Load Banks: every single server-mimicking element, which we call a Cartridge, is addressable. It has an IP address of its own, so you can control things as granularly as you want. This, I believe, presents true capability for someone who knows what kind of workloads are going to be run, and equally for someone who doesn't know and wants to test for all possible scenarios, so that you're ready when the compute comes in and the workloads start running on it.
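To make that workflow concrete, here is a minimal, purely illustrative sketch of what scripting such a test could look like: a workload profile expressed as ramp-and-hold steps, sequenced in a playlist and pushed to individually addressable cartridges. The HTTP endpoint, the /api/load path, and the cartridge addresses are assumptions made for this example, not Refroid's actual Calori interface.

```python
# Illustrative sketch only -- not the actual Calori API. It assumes each
# load-bank cartridge exposes an HTTP endpoint for setting a target thermal
# load, which is an assumption made purely for this example.
import time
import requests

# A workload "profile" is an ordered list of (load fraction, seconds) steps,
# e.g. an idle-ramp-hold-excursion cycle of the kind described in this interview.
TRANSIENT_PROFILE = [
    (0.00, 60),     # idle
    (0.70, 900),    # ramp to 70% and hold for 15 minutes
    (0.90, 7200),   # step to 90% and hold for 2 hours
    (1.05, 300),    # brief excursion above rated load
    (0.00, 60),     # back to idle
]

def run_profile(cartridge_ips, rated_watts, profile):
    """Drive every addressable cartridge through one workload profile."""
    for load_fraction, duration_s in profile:
        target_w = int(rated_watts * load_fraction)
        for ip in cartridge_ips:
            # Hypothetical endpoint: set this cartridge's dissipation target.
            requests.post(f"http://{ip}/api/load", json={"watts": target_w})
        time.sleep(duration_s)

# A "playlist" is simply a sequence of named profiles run back to back.
playlist = [("transient cycle", TRANSIENT_PROFILE)]
cartridges = ["10.0.1.11", "10.0.1.12"]  # placeholder cartridge addresses

for name, profile in playlist:
    print(f"Running profile: {name}")
    run_profile(cartridges, rated_watts=2000, profile=profile)
```

The point is not the API but the shape of the workflow: profiles are data, playlists sequence them, and because every cartridge is individually addressable, the same script can target a single server position, a rack, a pod, or the full hall.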
Sanjana Mandavia: From a hyperscaler’s lens, how does this reduce commissioning risk and development timeline?
Satya: Usually, commissioning a data center takes anywhere between two and three months, in some cases longer if you find problems. Deploying traditional air-side load banks and bulk load banks to test the CDUs itself requires some very specialized plumbing and takes a long time to set up. With the rack emulator approach, where we simply bring in a rack, I don't need any special plumbing for these ThermIon Hybrid Load Banks. I use the existing connectors from the secondary fluid network. I use the same power whips that are already set up by the data center. The only additional thing I'm doing is plugging in a network cable and connecting it to my command-and-control center via a switch. So the test setup, which used to take weeks, can now be completed in a matter of two to three days. And the duration of the test itself, which is a long lead item in the entire commissioning process, we can compress to seven to ten days, or maybe two to three weeks depending on the extent of testing the hyperscaler wants to do.
So you're essentially saving at least a couple of months in the data center go-live timeline, which means additional revenue for the data center and earlier availability of services for the people who use that hyperscaler. We are cutting down the cost, cutting down the time, and most importantly, ensuring the integrity of the infrastructure being created, that it is capable of hosting these workloads. It's almost like insurance you're taking out when you test with Refroid, as opposed to the status quo of older testing methods.
Sanjana Mandavia: And is this a risk-free deployment?
Satya: Correct. How quickly can the data center go live? You build all the infrastructure, then you want to validate it, and that validation time is what we are compressing, by more than 50%.
Sanjana Mandavia: Certainly, because this is a risk-free deployment. But in reality, even hyperscalers are struggling with unexpected thermal hotspots post-deployment. So is this simulation enough, or are we overestimating how predictable AI workloads are?
Satya: We will not be able to predict with certainty what type of AI workloads will ultimately run on a specific deployment, and allow me a 60-second detour here. If you just look at the new announcements between Amazon and Cerebras, and the new architectural announcement of Rubin two weeks ago, what are they indicating? That the inference paradigm is not the same as what we had envisioned all along. Prefill and decode require very different kinds of architectures. Prefill is extremely compute-intensive and decode is extremely latency-sensitive. Prefill is the input you're providing, the prompt; it could be 100,000 lines of code that you're feeding in. Decode is sequential, token by token; that's the output coming back. So with all these new types of hardware and semiconductor architectures coming out, you want your infrastructure to be capable of hosting any of these types of workloads. And just last week, we saw an AGI CPU announcement from ARM, purpose-built to handle agentic workloads, along with 200 kW agentic CPU racks that will sit next to GPU racks.
So the infrastructure needs to be prepared, and there is no better way than to test every corner case in terms of spikes: are you going to go from 0% workload to 70%, stay there for 15 minutes, then suddenly jump to 90%, stay there for two hours, suddenly jump to 105% because you've had a thermal excursion, then drop all the way to zero? And if this kind of cycle occurs over and over again, 24/7, 365, is your thermal infrastructure capable of managing these rapid transitions? That's what comes out in the Calori command-and-control center test reports. If you have 5,000 test cases, how many have passed? Which have failed? There is traceability: it can take you back to the case you created, so you can decide with your design engineers what specifically should be addressed, whether in your secondary fluid network or in the CDU algorithm itself, to ensure these failures don't occur as fatal failures at actual runtime.
The infrastructure needs to be tested. For such a high-stakes game, when people are pouring billions of dollars into compute and into land, buildings, power, network, and cooling, commissioning is an extremely critical part that is being ignored. And allow me to go a step further and say this: the way all these large global vendors, without naming anybody, because that's the industry practice, test their CDUs, the coolant distribution units at the heart of the entire thermal management system, is in their factory. The final testing that a CDU undergoes, what they call a Factory Acceptance Test, is simply connecting it to a bulk load bank. If you're building a 1-megawatt CDU, you connect it to a 1-megawatt load bank and see whether that 1 megawatt of heat exchange is happening. If it is, the CDU is good to ship.
This is akin to the automotive example I was giving: you're just putting together an internal combustion engine, attaching some wheels, some brakes, and a few other things, oversimplifying, getting from A to B, and calling the job done. The way we are building our CDUs is different. We are testing them with our own load banks and characterizing them for all the extreme load conditions I've spoken about, so we have ample test data before going to our customers and telling them: our CDUs are validated for extreme AI workloads and every type of scenario you can think of. We believe this will usher in a new era of AI-proven CDUs, not just CDUs that can reject 1 megawatt of heat. That's a transformational shift we hope to lead and pioneer with our SentraFLO line of CDUs, which ranges from 200 kilowatts all the way to 2 megawatts. These will be the industry's first tested, AI-proven coolant distribution units.
Sanjana Mandavia: Moving next to the SentraFLO CDU, India's first, and arguably even more significant from a global standpoint. You said this is India's first indigenously developed liquid-to-liquid CDU. How big a milestone is this, not just for Refroid, but for India?

Satya: If you look at the entire data center ecosystem, we don't do the GPUs. We may assemble the servers, but mostly it's just assembly, even if they're manufactured in India, so there is no real intellectual contribution. With due respect to the many people contributing to the data center ecosystem on the hardware and infrastructure side, India has been a laggard. We import everything that goes into a data center, all the key technologies; there may be some passives and similar components still manufactured in India. One of the drivers for us is that this industry is transforming societies. The scale of investment this industry is attracting across the globe, and even in India, is tremendous. If we are going to build 7 gigawatts of capacity in the next five to six years and we have to import all the key technologies that go into building this infrastructure, it's a shame on us, because we don't lack talent and we don't lack expertise. There are phenomenal software engineers and hardware engineers here. India is an engineering factory that contributes a significant chunk of global talent.
So our goal was to build digitally sovereign infrastructure and actually take these products global, to put India on the global data center infrastructure roadmap. We did it first with liquid immersion cooling. In fact, we've just received our first export order, though I cannot name the customer, for our liquid immersion cooling solutions, which are the first of their kind in India and the first to be exported from India. We're hoping the ThermIon Hybrid Load Banks will be the second piece of purpose-built data center hardware to go global, hopefully deployed next quarter with some North American customers, followed by customers in the Middle East and Europe, besides India. And SentraFLO, we're in the process of qualifying it with some global customers and hyperscalers. It is truly a proud moment for India that we're able to make such complex products for global markets and put a Made in India stamp on them.
Sanjana Mandavia: What fundamentally differentiates SentraFLO from global CDU players in terms of design intelligence and cost?
Satya: Let's talk about those two things separately. On design intelligence, as I explained in response to one of your previous questions, SentraFLO CDUs will be the first CDUs in the industry specifically validated for all types of AI workloads, not just for a heat exchange capability of 1 or 2 megawatts, which is what all other CDUs are currently designed and tested for. Where does this intelligence come from? From the algorithm: as you witness various types of workloads, various racks behaving differently, specific servers inside the racks behaving differently, how does the CDU respond? By having actual test data. Thankfully, because of our load bank product, we are able to mimic these various scenarios, including extreme ones, and see how to fine-tune the algorithm so that we keep delivering better energy efficiency, or partial PUE, at the CDU level.
In fact, our testing also includes connecting multiple CDUs in a ring topology. What happens if one CDU fails? How quickly are the other CDUs able to pick up the slack and recover gracefully so there is no impact on the functioning of the data center and the compute? This design intelligence and ownership of the algorithm have taken almost one and a half to two years to build, and we are now ready to take it to prime time and to a global audience.
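As a rough illustration of the failure scenario being tested, the sketch below checks whether the surviving units in a ring have enough capacity when one CDU drops out. The ratings, the hall load, and the simple even redistribution are assumed numbers for illustration; SentraFLO's actual failover behaviour is governed by its control algorithm, not by this arithmetic.

```python
# Illustrative N+1 headroom check for a ring of CDUs (assumed numbers).
def surviving_headroom_kw(cdu_ratings_kw, hall_load_kw, failed_index):
    """Return the capacity margin (kW) left after one CDU in the ring fails."""
    survivors = [r for i, r in enumerate(cdu_ratings_kw) if i != failed_index]
    return sum(survivors) - hall_load_kw

# Example: four 600 kW CDUs in a ring serving a 1.5 MW pod.
ratings = [600, 600, 600, 600]
pod_load = 1500
for i in range(len(ratings)):
    margin = surviving_headroom_kw(ratings, pod_load, i)
    status = "OK" if margin >= 0 else "OVERLOADED"
    print(f"CDU {i} fails -> remaining margin {margin} kW ({status})")
```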
For the current generation of hyperscale clusters, at least from a couple of hyperscalers where capacity is scaling, the CDU expectation in India seems to be around 600 kilowatts as the sweet spot. Of course, that doesn't mean it's globally uniform or standardized. I'm talking about two of the top four hyperscalers, who both seem to be gravitating toward 600 kilowatts for their India deployments. We are seeing a slightly different specification in RFPs coming out of the Middle East, where a different hyperscaler, with a different design philosophy, is looking at 900-kilowatt CDUs.
As an industry, we still don't know what the right size of a pod or a cluster is, and this will become even more complicated as we move to the next-generation rack architectures from Nvidia that could push rack densities to one megawatt. If you have a one-megawatt rack, then your CDU obviously needs to be several times larger than that; it doesn't make sense to map one CDU to one rack on a one-to-one basis. The other thing to think about is the surprise thrown by Jensen Huang a couple of days ago: for every Rubin rack, you're going to have five companion decode racks. I'm yet to see the detailed specifications; we know about Rubin rack densities, but we don't yet know about the LP30, I think that's what it's called. But if six racks are working together to deliver inference performance, what does that mean for the size of the CDU? In India, we may not see these deployments immediately, because Nvidia will begin shipping only in Q3 of this year, and only in the US. But all of this will change the CDU sizes we are visualizing. Would they need to be six-megawatt CDUs at some point in the next two years? Perhaps. Would they become much larger than they are today? Definitely yes, because you need larger heat exchangers, and the algorithms need to continuously evolve at these extreme rack densities and extreme workloads. You're no longer talking about a 15-kilowatt server; you're possibly talking about a 200-kilowatt server all by itself.
Sanjana Mandavia: As you’re integrating AI into cooling optimization, are we moving towards self-optimizing thermal infrastructure?
Satya: It's moving toward self-optimizing infrastructure, not just thermal infrastructure but the entire infrastructure, which includes power and thermal. Power includes your grid power, your gensets, and your UPSes, because all of them are fallback mechanisms, and on the thermal side it covers both the primary and secondary loops. To give you a simple example: if my workloads, for whatever reason, are only running at 10%, the chiller doesn't need to supply me with extremely cold water; it can take it easy, to put it in common parlance. Today there is no communication between the CDU and the chiller to indicate that things are changing on one side and that behavior should change on the other.
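The kind of coordination being described can be sketched as a trivial control rule: let the chiller's supply setpoint relax when the load observed by the CDU is low. The mapping below is purely illustrative, and the 18 to 28 degree range is an assumed example rather than a recommendation; real facility-water setpoints are governed by the chiller plant design and guidelines such as ASHRAE's water classes.

```python
# Illustrative only: relax the facility-water setpoint when IT load is low.
def chiller_setpoint_c(it_load_fraction, min_c=18.0, max_c=28.0):
    """Map CDU-observed IT load (0..1) to a requested supply temperature (C).

    At full load, ask for the coldest water; at light load, let the chiller
    "take it easy" with warmer supply water and save compressor energy.
    """
    load = min(max(it_load_fraction, 0.0), 1.0)
    return max_c - (max_c - min_c) * load

for load in (0.1, 0.5, 1.0):
    print(f"IT load {load:.0%} -> request {chiller_setpoint_c(load):.1f} C supply water")
```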
So we want this entire system to be integrated and behave like a unified whole, self-healing, self-improving, self-optimizing, to drive better PUE outcomes. This will require collaboration between the various entities in the data center ecosystem. It's an interesting conversation I've had with a leading semiconductor CPU vendor and a very large server OEM: they have so much telemetry coming out of CPUs and servers that goes completely to waste; only parts of it are displayed on the DCIM or BMS software. The thought process is: if we can bring together a large DCIM vendor, and we are working toward that goal and hopefully by the end of next quarter will start actual work on this, plus a server vendor, a thermal management vendor such as Refroid, and a chiller manufacturer, what kind of insights and optimizations can we deliver? Because as I'm sure you know, if we move the PUE from 1.2 to 1.19, it delivers several millions of dollars of savings per megawatt per year, and the data centers become that much more sustainable. So all this work is ahead of us, and it's not simple work; it's extremely challenging and complex. It may take us a couple of years to figure this puzzle out.
We are also working with other very large global companies to optimize digital twin products that they have, so that even while modeling the software, you could take in some data from the intelligent CDUs, you could take in data from the Hybrid Load Banks, so that at the design stage itself we are trying to reimagine the way data center simulation happens.
Sanjana Mandavia: AI-driven optimization is becoming a buzzword across infrastructure, and Refroid is targeting PUE and energy efficiency improvements. What tangible efficiency gains, PUE reduction, energy savings, or cost impact can you speak to?
Satya: Energy savings equal cost impact, and energy for the cooling infrastructure turns out to be somewhere between 40 and 50% of your total OPEX. So if you're able to bring that 40 to 50% of OPEX down even by 5% or 10%, you're delivering significant savings to the data center operator. You're actually changing the economics. Data centers today may be looking at about a six-year horizon to recover their investments; that could happen six months or a year sooner, which is huge for the entire industry. Of course, energy efficiency also means you're consuming less power, and if your power sources are non-renewable, you're reducing the carbon footprint. If the data center is running very efficiently, your water usage effectiveness also improves. So all in all, it's a net gain for data center operators and for society at large.
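The arithmetic behind that claim is straightforward to write down. The percentages below simply restate the ranges quoted above; the absolute OPEX figure is a placeholder.

```python
# Back-of-envelope restatement of the OPEX argument (placeholder numbers).
annual_opex = 10_000_000   # assumed total annual OPEX in dollars
cooling_share = 0.45       # cooling energy at roughly 40-50% of OPEX
cooling_saving = 0.10      # a 5-10% improvement on the cooling bill

saved = annual_opex * cooling_share * cooling_saving
print(f"Savings: ${saved:,.0f}/yr ({cooling_share * cooling_saving:.1%} of total OPEX)")
# A 10% cooling improvement works out to ~4.5% of total OPEX at these assumptions.
```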
Sanjana Mandavia: You have built redundant sensors, leak detection, and hot-swappable components. How critical is reliability engineering in liquid cooling adoption?
Satya: Super important, because nobody wants to shut a CDU down; if you do, you're shutting down $18 million worth of compute, assuming six racks are supported by that CDU. Look at how many tokens those racks generate and what those tokens are worth: the price of a premium token today is about $0.001 to $0.03 per 1,000 tokens, and on a mature deployment a GB300 rack could generate somewhere around 1.1 million tokens per second. The math, if you do it, is mind-boggling: that is the economic impact of shutting a CDU down.
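Since the reader is invited to do the math, here is one version of it using only the figures quoted above: the per-token price range, the roughly 1.1 million tokens per second per rack, and six racks hanging off one CDU. The result is only as good as those inputs.

```python
# Reproducing the "mind-boggling" math with the figures quoted above.
tokens_per_sec_per_rack = 1.1e6       # ~1.1M tokens/s on a mature GB300 rack
price_per_1k_tokens = (0.001, 0.03)   # quoted premium-token price range ($)
racks_per_cdu = 6                     # racks assumed to sit behind one CDU

for price in price_per_1k_tokens:
    per_rack_per_day = tokens_per_sec_per_rack / 1000 * price * 86_400
    per_cdu_per_day = per_rack_per_day * racks_per_cdu
    print(f"At ${price}/1k tokens: ${per_rack_per_day:,.0f} per rack per day, "
          f"${per_cdu_per_day:,.0f} per CDU per day")
# Even at the low end of the price range, an offline CDU idles away
# hundreds of thousands of dollars of token revenue per day.
```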
So what you do instead is build in as much redundancy and hot-swappability as possible, so that you never need to shut down a CDU, your compute, or your services. At every point, whether it is temperature, pressure, leak detection, or various other parameters, we have consciously built in redundancies so that if one sensor fails, operation continues and we can take the faulty sensor out and hot-swap it. The entire swap takes anywhere from a few minutes to an hour, without impacting the operation of the CDU. Beyond that, we have built in additional sensors for turbidity to monitor coolant health continuously, because contamination of the coolant is another thing that can result in poor performance.
Although there are 25-micron filters built into every CDU, we don't know what happens over a long period of time. We have not run liquid-cooled data centers at scale anywhere in the world for a decade; they've only been running for three or four years. The coolants themselves are in transition: new coolants are coming out, like PG50, and new vendors are coming in. Would a coolant that works flawlessly in North America last as long in India, given the different ambient conditions? You can do theoretical computations, but practically we still don't know. So it's important to monitor as many characteristics of every single element of the thermal management system as possible. We get a lot of data, and this data can be used in the next design cycle; it can be fed back to the likes of Dow or Castrol. And this is where the entire industry needs to come together, because what's coming is inevitable. The writing on the wall is clear: AI is going to grow even faster. We're just at the foot of the curve. We're somewhere in the middle of the training era, inference is just getting started, and with agentic there may be a few thousand or tens of thousands of people experimenting. We have not seen widespread adoption of inference and agentic workloads in enterprise and commerce. These are going to be secular drivers that continuously and dramatically expand data center capacity and the demands on thermal management.
One thing I will not speculate about is what comes next in terms of the core thermal technologies themselves: whether single-phase direct-contact liquid cooling will give way to single-phase liquid immersion cooling, or two-phase direct-contact liquid cooling, or a hybrid where the entire server is immersed in liquid but the cold plates use a two-phase change, turning the liquid into vapor and back. And then we're talking about things like microfluidics, where cooling can get down to the package level inside the semiconductor fabrication and packaging process. So it's a tremendously exciting space, and a huge responsibility. Ultimately, look at what thermal management is up against: at TSMC's 3nm process node, there are approximately 321 million transistors packed per square millimeter, and a Rubin die is estimated to be at the TSMC reticle limit of about 850 square millimeters. You can count the number of transistors that are there. Which means at any given point in time there are several trillions of electrons doing their delicate dance to deliver these very complex tensor computations and matrix multiplications, so that the applications running on top of this compute infrastructure run flawlessly. The densities are only going to grow, and the more transistors you have, the more heat is generated. At 3 nanometers, moving to 2nm by the end of the year or early next year, and with Intel moving to its 18A, or 1.8-nanometer, process, the thermal management problem only intensifies.
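Taking those figures literally, the transistor count being alluded to works out as follows; both inputs are the estimates quoted in the interview, not confirmed specifications.

```python
# "You can count the number of transistors" -- using the quoted estimates.
transistors_per_mm2 = 321e6   # ~321M transistors per sq. mm at TSMC 3nm, as quoted
die_area_mm2 = 850            # Rubin die assumed at the ~850 sq. mm reticle limit
total = transistors_per_mm2 * die_area_mm2
print(f"~{total / 1e9:.0f} billion transistors per reticle-limit die")
```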
Sanjana Mandavia: SentraFLO supports dark or autonomous data center operations. Do you see cooling systems becoming fully autonomous in the near future?
Satya: Ideally yes, but in the near future I don’t know. Given the costs and stakes involved, data center operators are highly risk-averse. It depends on how the ecosystem comes together, specifically the power and thermal management in conjunction with the DCIM and the ability of these systems, over a year or two, to demonstrate that dark data centers can indeed operate. Seeing is believing. It needs to be evidence-based and evidence-driven. I believe that’s going to happen. It’s a question of accumulating enough evidence.
On the other side, the challenge is that compute densities are becoming so extreme. For example, just as a thought experiment: if you had H200-based data centers today, several of which exist in the US, and you had a couple of years of data, you'd be able to simulate and say, yes, this data center can run autonomously in the next few years. But for the new data centers coming up, with new compute, new GPUs, and new rack densities, it's going to be a challenge. It's a race. You need enough data before you can confidently say, "Yes, I can be an autonomous data center."
Sanjana Mandavia: That's really interesting, because technically it sounds like most of the failure scenarios are already being addressed by Refroid. That said, can SentraFLO plus the thermal management system together be seen as a complete lifecycle solution, from validation through live cooling optimization?
Satya: Absolutely, that is what we are looking at, though not by ourselves. We can provide the infrastructure and the validation, but we are trying to forge relationships with, as I mentioned, people in the simulation space who build digital twins and can use our data so that data centers can be modeled before they're built. Then our infrastructure comes in. Then our thermal validation through the ThermIon Hybrid Load Banks comes in. Then the DCIM for continuous optimization. That is our overall roadmap and product vision. We build some pieces; much larger companies build the others. Together, in the next couple of years, we would like to reimagine the way data centers are designed, built, validated, and continuously optimized for energy efficiency.
Sanjana Mandavia: Most data centers are still hybrid or air-cooled. What's the realistic timeline for liquid cooling to become mainstream, whether that's five years or fifteen?
Satya: There are a couple of ways to look at it. In the AI world, by the time we get to Rubin Ultra, as they call it, which is likely to sample at the end of 2027 with mainstream availability in 2028, the industry grapevine suggests those racks will be 100% liquid cooled, with no air anymore. Whether it happens at that point or we need to wait for the Feynman architecture, we'll have to wait and watch. But what industry insiders are saying right now is that from Rubin Ultra onward, it's all going to be liquid cooled.
What's happening in the enterprise world is that in traditional data centers running enterprise and HPC types of workloads, the CPUs themselves are becoming more and more power-hungry. If you look at some of the publicly available Intel presentations, and some leaks here and there, the latest 7th-generation Xeon, Diamond Rapids, goes in some configurations all the way up to 700 watts per socket. That is again the tipping point between air and liquid. So I visualize that in the next two to three years, even the CPU-focused segments will gravitate toward liquid cooling. I believe that seven years from now, essentially all heat dissipation will have to be through liquid.
There is a favourite example I keep giving: even at the Rubin level, assuming the unofficial specs from the industry are true, a Rubin GPU (Max-P) generates about 2,300 watts from a single package. If you extrapolate that to one square meter, just as a thought experiment, you're talking about dissipating the equivalent of 1.3 megawatts per square meter. That makes it, in my estimation and based on the reading I've done, the third most thermally challenging problem that mankind has. What is the toughest? Rocket boosters, which can generate the equivalent of up to 17 to 18 megawatts per square meter. The second is the innermost wall of a nuclear reactor, which can see anywhere between 5 and 7 megawatts per square meter of equivalent heat. But those two problems are very different from the GPU problem. They are handled with material science, simply to maintain the structural integrity of the machinery, whether it's a reactor or a rocket booster. In the case of semiconductors manufactured with CMOS processes, you have to make sure they operate below 85 degrees Celsius, cooler than the cup of coffee on your table this morning. That makes it a truly incredible challenge, and we don't know where chip TDPs will go in the next few years; we have already heard of 4,000 watts per processor from information available in blogs and leaks. The thermal management industry is going to be, quite literally, hot for a very long time.
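One way to reproduce the 1.3 megawatts per square meter figure is shown below. The roughly 1,800 square millimeter thermal footprint is an assumption chosen to make the quoted numbers consistent; spreading the same 2,300 watts over a bare die at the ~850 square millimeter reticle limit would come out closer to 2.7 megawatts per square meter.

```python
# Reproducing the heat-flux thought experiment with assumed footprints.
package_power_w = 2_300  # rumoured Rubin (Max-P) package power, as quoted

for label, area_mm2 in (("~1,800 sq. mm package footprint", 1800),
                        ("~850 sq. mm reticle-limit die", 850)):
    flux_mw_per_m2 = package_power_w / (area_mm2 * 1e-6) / 1e6
    print(f"{label}: ~{flux_mw_per_m2:.1f} MW per square meter")
```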
Sanjana Mandavia: Refroid is building deep-tech hardware in India. What were the biggest challenges in developing such an advanced system locally?
Satya: There are positives and negatives. On the negative side, the supply chain is not very robust in India; that's something we've had to work through. Thankfully, for several of the Class A components, global vendors are available in India, so we get to use the same components and tap into the same engineering teams as our global competitors. On the other side, there is nobody to learn from. There were no liquid cooling operations in India, no data centers where we could even go and look at a CDU and see how it works. So we've had to rely a lot on simulation: build, validate, and scrap, going through that iterative process. Being a startup helped us shift our failures as far to the left as possible, as they say in software testing, so we could advance quickly. The CDU has taken us almost one and a half years to build. The ThermIon Hybrid Load Banks took about 13 months, partly because we had started with liquid immersion cooling and had first built a CDU for immersion, which took two years, and we could borrow several concepts from it. But being a hardware startup in India is different. If we need to test our CDU at one-megawatt capacity, for example, I need to hire a diesel generator and a chiller to do it. It's expensive. But we're doing it because we're committed. We believe the market opportunity is enormous, we believe we are building a differentiated offering, and we believe we'll put India on the global data center infrastructure map. That's what is driving us.
Sanjana Mandavia: The very last question: if you had to pitch Refroid to a hyperscaler in one line, what would you say?
Satya: This is a company driven by innovation and validated test data. We are pushing the envelope on how the future of thermal infrastructure should look.
Conclusion
From hybrid load simulation to AI-driven cooling systems, Refroid is clearly pushing the boundaries of how data centers are tested and scaled. If AI is the engine of the future, cooling may very well be the constraint that defines it; that, in essence, is what Refroid is working to solve. The company is not just participating in the evolution of data centers; it is redefining how cooling infrastructure is built, validated, and optimized, with a distinctly Indian engineering fingerprint and ambitions that are unmistakably global.
As AI models grow larger, as rack densities push toward and beyond one megawatt, and as the industry faces what Satya describes as the third most thermally challenging problem mankind has ever encountered, the companies that understand cooling not just as heat rejection but as an intelligent, adaptive, workload-aware system will define the economics and sustainability of the next era of compute. Refroid, quietly and rigorously, is positioning itself at the center of that challenge.
