The Inferencing Explosion: Why AI’s Biggest Infrastructure Buildout Hasn’t Started Yet

The Phase Shift Nobody Budgeted For

The narrative that dominated AI infrastructure investment in 2023 and 2024 was almost entirely about training. The story was clean and compelling: larger models required more compute, more compute required more GPUs, more GPUs required more data centres, and more data centres required more power, more cooling, and more capital than anyone had previously committed to building in a single technology cycle. Hyperscalers responded by committing hundreds of billions of dollars to training-oriented infrastructure. Nvidia became the most valuable company in the world by market capitalisation for a period. Data centre construction pipelines swelled to capacities that grid operators from Texas to Amsterdam described as unprecedented.

That story was accurate while it was running. It is now incomplete. Bain’s June 2026 analysis of the AI data centre market frames the current moment with deliberate directness: inference is now the centre of gravity. Deloitte’s forward-looking analysis put specific numbers behind the same observation, tracking inference’s share of total AI compute from a third of all workloads in 2023, to approximately half in 2025, to roughly two-thirds heading into 2026, with Brookfield’s longer-range projections suggesting inference will represent three-quarters of all AI compute by 2030. This is not a marginal rebalancing within an otherwise stable infrastructure picture. It is a structural shift in the dominant demand driver of the world’s largest technology infrastructure investment cycle, and it has infrastructure implications that the training-era buildout did not fully anticipate.

What makes this shift consequential beyond the raw numbers is a set of physical and operational characteristics that distinguish inference from training in ways that demand fundamentally different infrastructure responses. Training is a batch process. It happens once, or periodically, at a small number of centralised hyperscale facilities where massive GPU clusters can be co-located for the weeks-long runs that producing a frontier model requires. Inference is a continuous process. Every query submitted to every AI assistant, every document processed by every enterprise copilot, every step executed by every autonomous AI agent, every retrieval-augmented generation call triggered by every enterprise knowledge management system, generates inference demand in real time, continuously, distributed across the full geographic range of the users and systems generating it.

Deloitte’s conclusion that the world likely needs a lot of data centres captures the macro scale of what the inference shift implies: it tracks a trajectory from roughly three hundred to four hundred billion dollars in AI data centre capital expenditure in 2025 toward an estimated one trillion dollars in 2028. The more operationally precise observation is that the specific facilities the inference era demands differ from the training-era campuses already under construction in their location requirements, their hardware specifications, their networking architecture, their cooling profile, and their relationship to the regulatory and data sovereignty frameworks that govern where and how enterprise AI can actually be deployed. The training buildout is not over. The inference buildout has barely begun, and when it arrives at full scale, it will create a larger and more geographically distributed data centre footprint than the training era produced.

Why Training Gets the Attention and Inference Gets the Demand

The Asymmetry That Changes Everything

Training a frontier large language model is a visible, legible event. It requires a defined budget, a defined timeline, a defined cluster of hardware operating at a defined location, and it produces a defined output: a set of model weights that represents the completion of the training run. The economics of training are dramatic in absolute terms, but they are episodic. A company trains a model, spends what the training requires, and then the training run ends. The infrastructure it consumed is either repurposed for the next training run or sits partially idle between cycles.

Inference is none of these things. It is not episodic. It does not end. Every user session, every agentic task, every enterprise API call, every copilot interaction generates a stream of inference requests that continues for as long as users exist and models are deployed. The business logic of inference is fundamentally different from training: training is an investment with a defined cost and a defined deliverable, while inference is an ongoing operational expenditure that scales directly with usage, which means it scales with commercial success rather than with research ambition. A company that successfully deploys an AI product does not get to stop paying for inference when the model is trained. It pays more for inference the more successful the product becomes, and that payment funds a continuous, growing demand for the specific kind of infrastructure that inference requires.

The Credo SVP framing of 2026 as the year of inference, with AI shifting from training to decode-heavy applications requiring significantly more memory capacity and AI spending moving from traditional hyperscalers to neoclouds needing vendor support for rapid deployment, captures one dimension of this shift. The deeper structural point is that inference demand is generated by an entirely different set of actors than training demand. Training is dominated by a small number of frontier AI laboratories and hyperscalers with the capital and technical capability to manage multi-thousand-GPU training runs. Inference demand is generated by every enterprise that deploys an AI product, every developer that builds on a model API, every government institution running AI-assisted services, and every consumer using an AI assistant. The population generating inference demand is orders of magnitude larger than the population generating training demand, and it is growing faster, because enterprise AI adoption is accelerating precisely as the infrastructure conversation is shifting from training to inference.

The Continuous Demand Machine Behind Every AI Product

The specific AI products now generating inference demand at scale illustrate why the buildout implied by that demand has barely started. A Microsoft 365 Copilot deployment across a large enterprise does not generate a fixed number of inference calls per day. It generates inference calls every time any of the enterprise’s employees uses the product for any purpose, across every workday, continuously. A customer service AI assistant deployed by a retail enterprise runs inference on every customer interaction across every channel, around the clock. A RAG application connecting an enterprise’s internal knowledge base to an AI interface runs an inference call for every query any employee submits to any enterprise knowledge management system.

Each of these individual interactions generates a small number of inference calls. Summed across millions of enterprise deployments, billions of consumer users, and the expanding population of autonomous AI agents executing multi-step tasks that may each involve dozens of sequential inference calls per workflow, the resulting continuous inference demand is genuinely different in character from any prior compute demand wave. It resembles the demand profile of the internet itself more than it resembles the demand profile of a scientific computing workload: distributed, continuous, latency-sensitive, and growing in proportion to adoption rather than in proportion to research investment. Built In’s January 2026 framing of the shift from bloated multi-billion-dollar training deals toward inference factories closer to users is a directional description of this geography of demand, though the word factory implies a scale and permanence that the market is only beginning to put physical infrastructure behind.
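The multiplication described above can be sketched with a few lines of arithmetic. The user counts, interactions per user, and calls per interaction below are hypothetical figures chosen only to show how an agentic workflow changes the total; they are not drawn from any of the sources cited here.

```python
def daily_inference_calls(users, interactions_per_user, calls_per_interaction):
    """Total inference calls per day for one deployment."""
    return users * interactions_per_user * calls_per_interaction

# Hypothetical deployment: same user base, different calls per interaction.
copilot = daily_inference_calls(users=10_000, interactions_per_user=20,
                                calls_per_interaction=2)
agents = daily_inference_calls(users=10_000, interactions_per_user=20,
                               calls_per_interaction=30)

print(copilot)           # 400000
print(agents)            # 6000000
print(agents / copilot)  # 15.0
```

Under these assumed inputs, moving the same user base from a copilot pattern to an agentic pattern multiplies daily inference calls fifteenfold without adding a single user.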

The SuperX AI inference cloud hub launch in Denver in June 2026, specifically noting that AI-focused technology companies had reserved a significant portion of the facility’s initial inference capacity before the hub was even operational, illustrates how acute the supply shortage already is in regional inference markets that the training-era buildout did not specifically serve. Denver is not a primary hyperscale data centre market. It does not sit at the junction of major subsea cables. Its appeal, from an inference infrastructure perspective, is precisely its regional centrality for enterprise customers in the American interior who need low-latency inference access without routing through coastal data centre hubs. The FingerMotion modular edge data centre initiative announced in June 2026, framing localised AI inference compute as a strategic extension of telecommunications infrastructure, reflects the same geographic logic from a different market position.

The Infrastructure Requirements That Training Left Unsolved

Power: Dense, Distributed, and Continuously Drawn

Training clusters and inference deployments both consume power, but they consume it differently in ways that change the infrastructure calculus for each. A training cluster runs at near-constant maximum utilisation during an active training run, with power demand that is predictable, concentrated, and manageable as a single large load at a single facility. The power infrastructure required to support it, a single very large grid connection or a co-located generation asset, is a large but tractable engineering problem. Inference infrastructure at scale runs continuously at varying utilisation levels across multiple facilities in multiple geographies, with demand patterns driven by user behaviour rather than by a controlled compute schedule, and with latency requirements that limit how far any given query can travel before the response time becomes unacceptable.
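The latency limit on how far a query can travel can be roughed out from the speed of light in optical fibre, which is approximately 200 kilometres per millisecond. The sketch below is propagation-only and ignores queuing, routing hops, and model compute time; the `route_factor` parameter and the ten-millisecond budget are illustrative assumptions rather than figures from any cited source.

```python
FIBRE_KM_PER_MS = 200.0  # light in fibre covers roughly 200 km per millisecond

def round_trip_ms(distance_km, route_factor=1.0):
    """Propagation-only round trip; route_factor > 1 models indirect cable paths."""
    return 2 * distance_km * route_factor / FIBRE_KM_PER_MS

def max_distance_km(network_budget_ms, route_factor=1.0):
    """Farthest facility that fits inside a given network round-trip budget."""
    return network_budget_ms * FIBRE_KM_PER_MS / (2 * route_factor)

print(round_trip_ms(1500))   # 15.0 ms for a facility 1,500 km away
print(max_distance_km(10))   # 1000.0 km for a hypothetical 10 ms budget
```

Real fibre routes are rarely straight lines, so setting `route_factor` above one shrinks the serviceable radius further, which is the geographic pressure that pushes inference capacity toward regional sites.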

The consequence is that inference infrastructure cannot simply be concentrated at a single high-power campus the way training infrastructure can. It must be distributed across regions in proportion to user density, which means it must secure multiple grid connections at multiple locations, navigate multiple regulatory environments and grid operators, and maintain the power reliability standard that always-on inference services require across all of them simultaneously. The Gartner projection that global data centre electricity demand will exceed one thousand terawatt-hours by 2026, double the 2023 baseline, incorporates inference growth as a primary driver, and the Department of Energy projection that United States data centres could reach twelve percent of national electricity consumption by 2028 reflects the compound effect of training and inference demand growing simultaneously rather than sequentially.

Tokens per watt has emerged as the metric that captures the operational efficiency imperative that inference economics create in a way that power usage effectiveness alone does not. A facility with an excellent power usage effectiveness ratio but running inefficient inference workloads wastes more energy per useful output than a slightly less efficient facility running highly optimised inference at full utilisation. The inference era is shifting infrastructure optimisation from facility-level energy efficiency metrics toward workload-level output efficiency metrics, and the gap between these two ways of measuring performance is large enough that it changes which sites, which hardware configurations, and which cooling architectures make commercial sense for inference-optimised facilities compared to the configurations that made sense for training campuses.
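A minimal sketch of the comparison, assuming tokens per watt is computed as tokens per second divided by total facility power (IT power multiplied by power usage effectiveness), which is equivalent to tokens per joule. Both facilities and all of their figures are hypothetical.

```python
def tokens_per_joule(tokens_per_second, it_power_kw, pue):
    """Useful output per unit of total facility energy."""
    facility_watts = it_power_kw * 1000 * pue  # PUE scales IT power to facility power
    return tokens_per_second / facility_watts

# Hypothetical facility A: excellent PUE, poorly utilised inference stack.
a = tokens_per_joule(tokens_per_second=2_000_000, it_power_kw=1000, pue=1.1)
# Hypothetical facility B: weaker PUE, well-optimised inference at high utilisation.
b = tokens_per_joule(tokens_per_second=4_000_000, it_power_kw=1000, pue=1.3)

print(round(a, 2))  # 1.82
print(round(b, 2))  # 3.08
```

In this illustration facility B delivers more tokens per joule despite the worse facility-level ratio, because the workload-level throughput difference outweighs the overhead difference.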

Cooling: A Different Density Profile Demands a Different Response

The cooling requirements of inference infrastructure differ from training in a specific and operationally significant way that the industry has been slow to communicate clearly. Training clusters pack the highest-density GPU hardware into the smallest possible footprint to minimise the latency between GPUs during the all-to-all communication patterns that distributed training requires. Inference does not have this constraint. The GPU-to-GPU communication requirements of inference workloads are substantially lower than those of training, because inference operates server-by-server rather than requiring the tight synchronisation across thousands of GPUs that training demands. Inference facilities can therefore distribute their compute load across more, lower-density racks rather than concentrating it into the ultra-high-density configurations that training campuses favour.

The range Dell’Oro Group cited for inference rack densities, thirty to one hundred and fifty kilowatts, reflects this distribution. That range remains significantly above traditional data centre designs, which were built around five to fifteen kilowatts per rack, and requires liquid cooling in its upper portions. But all except its very top sits below the one hundred and fifty to two hundred-plus kilowatt densities that frontier training clusters now demand, meaning that inference facilities, particularly regional and near-edge deployments designed to serve enterprise customers rather than hyperscale training runs, can be built with cooling architectures that are less capital-intensive than the full direct-to-chip systems that Nvidia’s Vera Rubin requires. Rear-door heat exchangers, hybrid air-and-liquid architectures, and direct-to-chip cold plates running at moderate density all remain viable cooling approaches for a substantial portion of inference demand, which makes the retrofit economics for existing facilities meaningfully better for inference workloads than for training workloads at the same tier of hardware.
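To make the density ranges concrete, the sketch below counts the racks needed to host a fixed IT load at several densities. The twenty-megawatt load is a hypothetical figure; the per-rack densities are taken from the ranges quoted above.

```python
import math

def racks_needed(it_load_kw, rack_density_kw):
    """Whole racks required to host a given IT load at a given per-rack density."""
    return math.ceil(it_load_kw / rack_density_kw)

IT_LOAD_KW = 20_000  # hypothetical 20 MW inference facility

for density_kw in (10, 30, 150):
    print(density_kw, racks_needed(IT_LOAD_KW, density_kw))
# 10 2000
# 30 667
# 150 134
```

The same load that fills two thousand traditional racks fits in a few hundred at inference densities, which is why floor space matters less than cooling and power delivery per rack when existing facilities are retrofitted.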

The near-edge deployment pattern that Dell’Oro Group described, in which inference facilities favour smaller but highly dense accelerated clusters with strong requirements for high-speed networking, local storage, and redundancy, creates a distinct facility category that has not previously existed at meaningful scale in the data centre market. These facilities are not hyperscale campuses. They are not traditional enterprise co-location facilities. They are inference-optimised, networking-forward facilities at a size range of perhaps ten to one hundred megawatts, sited for proximity to enterprise user populations rather than for proximity to the grid-connected power abundance that training campuses require, and equipped with the liquid cooling, high-speed fabric, and local storage that production inference at these densities demands. The industry does not yet have a standardised term or a standardised reference architecture for this facility category. Building it out is the primary infrastructure task of the inference era.

The Networking Constraint That Is Already Binding

The Cisco Moment for AI Infrastructure

The Unified AI Hub’s characterisation of 2026’s defining infrastructure constraint as networking rather than GPU availability or power capacity is one of the most operationally significant observations in current AI infrastructure analysis. It has received considerably less attention than the power and cooling conversation despite being, in practice, the constraint that determines which inference deployments actually work in production rather than just in testing. The argument is direct: the companies winning in the current AI infrastructure environment are not necessarily those with the most GPUs. They are the ones who have figured out how to interconnect and orchestrate data across regions, clouds, and edge locations without creating massive bottlenecks. The analogy to the Cisco moment of the late 1990s internet boom, when the real long-term value went to networking plumbing rather than flashy websites, captures the structural parallel with unusual precision.

AI training workloads communicate through specialised high-speed interconnects, NVLink for within-rack GPU communication and InfiniBand for between-rack communication, both of which require dedicated networking hardware and tightly controlled fabric architectures that hyperscale AI campuses have been built to accommodate. Inference workloads operate differently. Individual inference queries do not require the sustained, high-bandwidth all-to-all communication patterns of training. However, production inference at scale involves continuous data movement between user-facing endpoints, model serving infrastructure, retrieval systems, vector databases, and orchestration layers, and that movement adds up to a networking demand profile that traditional enterprise data centre networking was never designed to handle at AI inference latency standards.

The IEEE P802.3dj standard, introducing 200 Gbps signaling for speeds up to 1.6 terabit Ethernet with products already shipping as of 2026, and the industry movement toward defining 400 Gbps signaling as the foundation for 3.2 terabit Ethernet and beyond, reflect the networking industry’s response to this demand. The Astera Labs prediction that tiered memory architectures for large-scale inference workloads would be front and centre in 2026 points to a parallel dimension of the same constraint: inference workloads require fast memory access to model weights and context windows, and the memory bandwidth available in current server architectures limits how efficiently inference can be served at low latency without buffering that degrades response time. The combined effect of networking and memory bandwidth constraints means that the infrastructure limiting inference quality for many enterprise deployments today is not the GPU itself, but the pathways connecting the GPU to the data it needs to process and the users it needs to serve.
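Two pieces of arithmetic sit behind this paragraph. The first is a common rule-of-thumb bound, not a vendor specification: if every model weight must be read from memory once per generated token, single-sequence decode speed cannot exceed memory bandwidth divided by the size of the weights. The model size, precision, and bandwidth below are hypothetical. The second is the lane count implied by the Ethernet figures quoted above.

```python
def decode_tokens_per_second_bound(param_count, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper bound on single-sequence decode speed when all weights are read once per token."""
    weight_bytes = param_count * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes

# Hypothetical: 70-billion-parameter model, 2 bytes per weight, 2.8 TB/s of memory bandwidth.
bound = decode_tokens_per_second_bound(70e9, 2, 2.8e12)
print(bound)  # 20.0

def lanes_needed(link_gbps, lane_gbps):
    """Number of electrical lanes implied by a link speed and a per-lane signaling rate."""
    return link_gbps // lane_gbps

print(lanes_needed(1600, 200), lanes_needed(3200, 400))  # 8 8
```

Batching, caching, and quantisation all change the picture in practice, so the first figure is a ceiling under stated assumptions rather than a throughput prediction. It still shows why memory bandwidth, rather than raw compute, often sets the floor on response latency.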

Why Inferencing Could Exceed Training’s Infrastructure Footprint

The Multiplication Effect of Enterprise AI Deployment

The single most underappreciated structural feature of the inference era is the multiplication effect that enterprise AI deployment creates relative to the consolidation dynamic that training infrastructure exhibits. Training naturally consolidates: the economics of running very large training runs favour concentration at a small number of hyperscale facilities where thousand-GPU clusters can be assembled, where power can be procured at the scale those clusters demand, and where the engineering expertise required to manage distributed training can be maintained in a single location. The total number of facilities running frontier training runs at any given time is relatively small, measured in dozens globally rather than thousands.

Inference naturally distributes. Every enterprise that deploys an AI product becomes a source of inference demand. Every cloud provider that hosts enterprise AI workloads builds inference infrastructure proportional to its customer base. Every national government with a data sovereignty requirement that prevents routing inference traffic through foreign-operated cloud regions builds or procures domestic inference infrastructure. Every telco with a strategy to offer AI-powered services needs inference infrastructure deployed near its network, at its scale, within its geographic footprint. The ABI Research projection that the United States AI data centre capacity will grow from roughly 8.2 gigawatts in 2026 to 21.4 gigawatts by 2031, more than doubling in five years, incorporates this multiplication effect, but it captures only the United States. The global inference buildout will involve comparable growth trajectories in European, Asian, and eventually African and Latin American markets, each requiring region-specific infrastructure serving region-specific user populations under region-specific regulatory frameworks.
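The ABI Research figures quoted above can be converted into an overall multiple and an implied compound annual growth rate. The sketch assumes smooth compounding over the five years, which is a simplification that the projection itself does not state.

```python
def growth_multiple(start, end):
    """How many times larger the end value is than the start value."""
    return end / start

def cagr(start, end, years):
    """Compound annual growth rate implied by two endpoints."""
    return (end / start) ** (1 / years) - 1

multiple = growth_multiple(8.2, 21.4)    # gigawatts, 2026 to 2031
annual = cagr(8.2, 21.4, years=5)

print(f"{multiple:.2f}x overall, {annual:.1%} per year")
```

The multiple comes out a little above 2.6, so "more than doubling" is a conservative description of the projection.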

BloombergNEF’s tracking of over twenty-three gigawatts of data centre capacity under construction globally at the end of September 2025, with roughly three-quarters of it in the United States, captures the current phase of infrastructure investment that is still predominantly training-oriented and United States-centred. The inference buildout that will follow as enterprise AI adoption scales globally will be geographically much more distributed than this figure suggests, because inference demand follows user populations and regulatory boundaries in ways that training demand, driven primarily by the resource requirements of a small number of frontier AI laboratories, does not. The aggregate data centre footprint of the inference era will likely exceed that of the training era precisely because of this distributional dynamic, even if the average power density of individual inference facilities remains somewhat lower than the highest-density training campuses.

The Regions Best Positioned for the Next Wave

Nordic and North American Markets With Infrastructure Ahead of Demand

The geography of inference infrastructure investment is being determined by the same variables that this analysis has identified as the primary constraints on inference deployment: latency to end users, grid access for continuous power, regulatory environment governing data residency, and proximity to enterprise customer concentrations. The regions that have invested in grid infrastructure, built regulatory frameworks that welcome data centre development, and established the connectivity required to serve as regional inference hubs are capturing investment flows ahead of markets that have not addressed these prerequisites.

The Nordic countries, which this analysis has noted in the context of their waste heat reuse leadership and renewable energy infrastructure, are also well-positioned for inference investment for more straightforwardly commercial reasons. Denmark’s Thylander Data Centers initiative is framed around offering data sovereignty-compliant inference infrastructure, not controlled by foreign hyperscalers, to Danish companies and to external companies that see the Danish market as valuable. It illustrates the sovereign inference infrastructure demand that is emerging across European markets alongside the purely commercial demand from enterprise AI adoption. Sweden, Finland, and Norway each offer the combination of renewable power abundance, grid stability, permitting environments that accommodate data centre development, and subsea cable connectivity that inference infrastructure serving European enterprise demand requires. Spain’s Madrid market, growing faster than any other major European hub, offers proximity to Southern European and Latin American enterprise customers alongside improving grid access.

In North America, the inference buildout is extending beyond the traditional hyperscale markets of Northern Virginia, Silicon Valley, and Seattle into regional markets including Denver, Dallas, Atlanta, Chicago, and Toronto, each of which serves enterprise customer concentrations that require lower inference latency than east-coast or west-coast facilities alone can provide. Canada’s position is strengthening specifically for inference serving American enterprises with data sovereignty requirements, with AWS, Google, and Microsoft all expanding Canadian capacity and Beacon AI Centers targeting up to 4.5 gigawatts of Canadian data centre power capacity on a 2027 to 2030 timeline. Alberta’s combination of natural resources, cool climate, and strong grid capacity gives it a specific advantage for inference workloads where cooling cost and energy availability are primary economic drivers. The eStruxure Data Centers facility in Alberta expected by late 2026 represents an early signal of this regional diversification.

Asia Pacific and the Inference Sovereignty Imperative

The Asia Pacific region presents both the largest inference demand opportunity of any geography and the most complex set of structural barriers to serving it from a unified infrastructure footprint. Japan’s documented ninety-four percent Microsoft 365 Copilot penetration among Nikkei 225 companies, combined with above-average consumer generative AI usage, represents a mature enterprise inference demand base that domestic infrastructure is being built to serve through the Microsoft, SoftBank, and Sakura Internet partnership. South Korea’s hyperconnected enterprise market and Australia’s combination of English-language enterprise demand and strict data sovereignty requirements both create national inference infrastructure investment cases that regional hyperscale campuses in Singapore cannot fully serve within the latency and data residency requirements that enterprise contracts increasingly specify.

India’s inference demand potential is extraordinary by any measure, with a population generating internet traffic at a scale that should correspond to a data centre footprint many times larger than what currently exists, while the grid constraints discussed elsewhere in this analysis continue to limit how quickly that potential translates into operational infrastructure. Southeast Asian markets including Indonesia, Vietnam, and the Philippines are developing inference demand bases large enough to justify dedicated regional infrastructure, and their governments are developing data localisation policies that will make cross-border routing increasingly difficult for enterprises seeking to serve local customers at scale. The Middle East, where sovereign AI investments led by Saudi Arabia and the UAE are building training-oriented capacity at significant scale, is simultaneously building the enterprise ecosystem that will generate inference demand from that capacity over the medium term.

The common thread connecting all of these regional inference infrastructure stories is a data sovereignty dimension that was largely absent from the training-era buildout. Training clusters can be located wherever power, land, and capital converge most favourably, because the model weights they produce can be distributed globally after training is complete. Inference infrastructure cannot be located with the same flexibility, because inference serves live users whose data must be processed under the legal frameworks governing the jurisdictions in which those users reside. Every new data localisation law, every new AI governance requirement, and every new national sovereign AI initiative creates an additional demand signal for domestic inference infrastructure: trained models can be imported from foreign data centres, but the inference traffic itself cannot be routed through those foreign facilities. The inference era’s infrastructure buildout is, in this sense, also a regulatory buildout: the infrastructure follows the regulation as much as it follows the demand, and the regulatory environment governing AI data processing is tightening in exactly the regions where inference demand is growing fastest.

Test-Time Compute and the Infrastructure Multiplier Nobody Priced In

When Thinking Harder Means Spending More

A dimension of the inference infrastructure demand picture that most forecasts have not yet fully integrated is the specific impact of test-time compute scaling, the architectural approach in which models are given substantially more computation during the inference process itself to generate more accurate responses. OpenAI’s o-series models, Anthropic’s extended thinking implementations, and Google’s Gemini 2.5 reasoning capabilities all reflect this paradigm, in which a single user query may trigger not one direct response but dozens or hundreds of sequential reasoning steps, each of which adds its own inference computation before the final response is returned.

The infrastructure consequence of this approach is straightforward and significant. A model that answers a query with extended reasoning consumes five to twenty times the compute resources of a model that answers the same query directly, depending on the length and complexity of the reasoning chain involved. At the level of any individual query, this is manageable. At the level of millions of concurrent enterprise users, each of whom may be running agentic workflows that involve multiple sequential reasoning steps across dozens of tool calls and context retrievals, the aggregate inference compute demand is substantially higher than earlier estimates anticipated, because those estimates were calibrated to single-pass inference architectures.
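The effect on capacity planning can be sketched by blending direct and reasoning queries. The five-to-twenty-times multiplier comes from the paragraph above; the baseline of one hundred compute units and the fifty percent reasoning share are hypothetical planning inputs.

```python
def scaled_compute(baseline_units, reasoning_share, reasoning_multiplier):
    """Compute needed when a share of queries costs a multiple of a direct answer."""
    direct = (1 - reasoning_share) * 1.0
    reasoning = reasoning_share * reasoning_multiplier
    return baseline_units * (direct + reasoning)

# Hypothetical facility sized at 100 units for direct-answer inference only.
low = scaled_compute(100, reasoning_share=0.5, reasoning_multiplier=5)
high = scaled_compute(100, reasoning_share=0.5, reasoning_multiplier=20)

print(low)   # 300.0
print(high)  # 1050.0
```

Under these assumptions, a facility sized for direct answers would need between three and roughly ten times its original compute once half of its queries use extended reasoning, with no growth in the user population.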

Bain’s identification of test-time compute as reshaping infrastructure strategy, economics, and architecture, with meaningful implications for data centre colocation versus self-build, silicon diversity, and power provisioning, reflects the specific way this architectural shift lands on infrastructure planners. A facility designed to serve enterprise inference demand at the compute density appropriate for single-pass inference models will find itself undersized for the same user population once extended reasoning becomes the default mode of operation for the AI products those users are accessing. The inference infrastructure that enterprises are commissioning today may need to be substantially expanded, or replaced with higher-density configurations, within three to five years of deployment if test-time compute scaling continues to increase the per-query compute burden at the pace the current trajectory implies. Planning for this trajectory, rather than planning for a static inference compute requirement, is the infrastructure design challenge that distinguishes the operators who will be positioned for the full inference era from those who will find themselves capacity-constrained precisely when enterprise AI adoption reaches the scale that justifies the investment they are currently making.

The Infrastructure Moment That Comes After Training

The most important reorientation the inference era demands from infrastructure planners is a shift in how they conceptualise the source of data centre demand. Training demand is a function of research ambition: a handful of frontier AI laboratories and hyperscalers deciding how often to run frontier model training campaigns, at what scale, determines a knowable and relatively bounded infrastructure requirement that can be planned for years in advance. Inference demand is a function of commercial adoption: every enterprise that successfully deploys AI into its workflows, every consumer product that reaches sustained daily active usage, every government service that converts to AI-assisted delivery, adds a permanent increment to the global inference load that does not go away when the product reaches maturity but instead grows as the product scales.

The implication for infrastructure investment, capacity planning, and regional development is that the training buildout, which has dominated the AI infrastructure narrative for three years and which the equity markets have been pricing for longer, is the smaller and shorter-duration phase of the two. It was the dramatic opening act, visible and legible as a defined investment cycle with defined beneficiaries and defined milestones. The inference buildout is the less dramatic but structurally larger second act, distributed across thousands of facilities in hundreds of markets, driven by enterprise adoption curves rather than by laboratory research decisions, and constrained by the combination of grid access, latency requirements, and data sovereignty regulations that no single site selection or infrastructure investment can fully resolve globally.

The companies and regions that are already building the networking fabric, the regional inference facility capacity, and the data sovereignty-compliant infrastructure that the inference era demands are positioning themselves ahead of a demand wave whose full scale has not yet materialised. The training GPU cluster is the image the industry reached for when it tried to visualise AI infrastructure. The inference factory, distributed, regional, continuously running, and growing with every new enterprise AI deployment, is the physical reality that the next decade of AI infrastructure investment will build. The buildout of training infrastructure was historic. The buildout of inference infrastructure will be larger.
