The AI infrastructure conversation has a hardware bias. GPUs, custom silicon, transformers, and power capacity dominate the analytical framework that most of the industry uses to understand where the scaling ceiling is and what will determine who hits it first. That bias is producing a blind spot. As GPU clusters scale from tens of thousands of chips toward hundreds of thousands, the constraint that is increasingly governing actual training performance is not compute capacity. It is network reliability. A single delayed data transfer in a tightly synchronised 100,000-GPU training cluster causes the entire cluster to wait, leaving millions of dollars of compute sitting idle while one straggler node catches up.
OpenAI calls this the straggler effect, and in a technical post published May 5 alongside the release of the Multipath Reliable Connection protocol, the company described network congestion, link failures, and device failures as the most common sources of delay and jitter in large-scale AI training. MRC is OpenAI’s answer to that problem, and the fact that it released it as an open standard through the Open Compute Project, developed collaboratively with AMD, Broadcom, Intel, Microsoft, and Nvidia, is the detail that makes it consequential beyond OpenAI’s own infrastructure.
Why MRC Changes Ethernet Networking Economics
MRC is a transport protocol that replaces the single-path model of conventional RDMA networking with multipath packet spraying across hundreds of independent network paths simultaneously. Where traditional RoCE networking assigns each data transfer to a single path, creating congestion when multiple transfers collide on the same link, MRC distributes a single transfer across as many paths as the network offers. When one path fails, only the fraction of packets on that path needs retransmitting rather than the entire transfer. Microsecond-level failure detection and rerouting eliminates the cascade of cluster stalls that conventional Ethernet fabrics experience during hardware failures.
OpenAI already deploys MRC across its largest Nvidia GB200 supercomputers, including the Stargate site with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater supercomputers. MRC is not a research concept. It is production infrastructure already shaping the training performance of some of the world’s most consequential AI systems.
Why the Open Compute Project Donation Changes Everything
The strategic significance of MRC is not its technical design, which is sophisticated and well-documented. It is the decision to contribute the MRC specification to the Open Compute Project rather than keeping it proprietary. OpenAI’s Stargate infrastructure requires Nvidia, AMD, Broadcom, and Intel all to implement MRC in their networking hardware for the protocol to deliver its full performance benefits across heterogeneous cluster environments. No proprietary standard could achieve that cross-vendor implementation breadth. By contributing MRC to the OCP, OpenAI created the governance structure needed for universal hardware support without requiring each vendor to trust a competitor’s proprietary specification.
The OCP contribution also signals something important about OpenAI’s infrastructure strategy more broadly. A company that contributes a protocol it developed and deployed in production to an open standards body is making a calculated bet that the value it captures from universal MRC adoption exceeds the value it would capture from MRC as a proprietary advantage. That calculation is consistent with how foundational infrastructure standards have historically emerged: when a technology is bottlenecking the entire market rather than just one player, the market leader who removes the bottleneck through open standardisation captures more value from the resulting acceleration than it would from maintaining a proprietary solution. Sameh Boujelbene, vice president at Dell’Oro Group, described the OCP donation as a strong signal that hyperscalers are leaning harder into Ethernet for AI fabrics, particularly as clusters push toward 100,000 to 500,000-plus GPUs.
The InfiniBand Displacement Implication
MRC’s emergence as an open standard has a specific and commercially significant implication for Nvidia’s InfiniBand networking business that the hardware-focused AI infrastructure analysis has not fully processed. InfiniBand has dominated large-scale AI training networking for a decade because its low latency and tightly integrated performance made it the reliable choice for the GPU cluster sizes that mattered until recently. The conventional wisdom was that Ethernet, despite its cost and vendor diversity advantages, could not match InfiniBand’s performance for tightly synchronised AI training workloads. MRC directly challenges that conventional wisdom by addressing the specific failure modes, congestion characteristics, and latency variability that made Ethernet less suitable than InfiniBand for large-scale training.
Boujelbene stated directly that MRC reinforces Ethernet’s growing position in hyperscale AI infrastructure, and that historically, ultra-large training clusters were dominated by Nvidia’s InfiniBand, but Ethernet is now rapidly evolving into a serious foundation for the largest AI supercomputers. That shift has direct revenue implications for Nvidia. InfiniBand is a significant and high-margin component of Nvidia’s networking business. A market that increasingly solves its AI networking challenges through MRC-enabled Ethernet rather than InfiniBand is a market that reduces its dependence on Nvidia networking hardware while potentially increasing its dependence on Nvidia compute hardware. As covered in our analysis of the custom silicon AI accelerator race entering its most consequential phase, the competitive dynamics around AI hardware are shifting in ways that compound over time. MRC is the networking dimension of that shift.
What Infrastructure Operators Need to Understand
For data center operators, colocation providers, and enterprise AI infrastructure teams, MRC has practical implications that extend beyond the protocol itself into how AI infrastructure gets designed, procured, and operated. The shift from single-path RDMA networking to multipath packet spraying changes the requirements for switches, network interface cards, and the optical connectivity infrastructure that carries AI training traffic. Facilities designed around conventional three-layer network topologies face the same architectural pressure that Google’s Virgo fabric addresses on the compute side: the general-purpose network design that served previous generations of workloads is not the optimal design for the specific traffic patterns that AI training generates.
MRC’s open standard status means that data center operators can plan around it with confidence that hardware support will be broadly available rather than dependent on a single vendor’s product roadmap. MRC extends RDMA over Converged Ethernet and draws on techniques from the Ultra Ethernet Consortium while adding SRv6-based source routing for large-scale AI networking fabrics. The technical specification is public. The OCP governance structure ensures the collective needs of the industry, rather than any single vendor’s competitive interests, will drive MRC’s evolution. Operators designing AI data center networking infrastructure today should incorporate MRC support into their hardware specifications rather than wait for commodity equipment vendors to standardise it, because the clusters that need MRC most are already under construction, not arriving two years from now.
As covered in our analysis of the AI infrastructure workforce crisis nobody is planning for, the skills required to design, implement, and operate next-generation AI infrastructure are in critically short supply. MRC adds a new networking design competency to that list that most infrastructure teams are not yet building.
