Image source: www.opencompute.org
Open Compute Project Accelerating Deployment of Next Gen AI Clusters
The Open Compute Project Foundation (OCP) is opening an AI portal on the OCP Marketplace. This site will become the single destination for AI cluster designers and builders to find the latest available AI infrastructure products, white papers covering upcoming innovations and standardization efforts, best-practice documents, and the reference material needed to successfully design and build AI clusters. At its launch, the OCP Marketplace AI Portal already features many vendors showcasing their AI offerings, making it a significant resource for AI cluster builders.
With hyperscale operators encountering unprecedented challenges in compute density, power distribution, interconnect, and cooling as they build AI clusters composed of racks consuming as much as 1 MW, OCP's collaborative community of more than 400 corporate members and 6,000 active engineers is developing open standards to address the bottlenecks that threaten to constrain AI infrastructure growth.
“Looking ahead, OCP aims to remain the premier organization for AI infrastructure by focusing on three pillars: (1) standardizing silicon, power, cooling, and interconnects; (2) supporting complete open system development; and (3) providing education through technical workshops, the OCP Marketplace and Academy. As AI and HPC continue to redefine computing requirements, OCP’s role in fostering development of open, sustainable, and scalable infrastructure appears increasingly vital to the industry’s ability to deliver on AI’s transformative potential while managing its environmental impact,” said George Tchaparian, CEO at the Open Compute Project Foundation.
The OCP Community is working on significant shared problems, including: standardizing rack architectures that support power envelopes of 250 kW to 1 MW; defining advanced cooling solutions (e.g., liquid cooling) for high-density nodes; building high-voltage, high-efficiency power delivery systems; accommodating multiple, evolving scale-up and scale-out interconnect fabrics for performance; and creating comprehensive management frameworks for near-autonomous operations. Through its Open Systems for AI strategic initiative, the OCP Community endeavors to meet these challenges, and it has recently published a Blueprint for Scalable AI Infrastructure and held a workshop on AI Physical Infrastructure.
Alongside the opening of the AI portal on the OCP Marketplace, Meta has completed its contribution of the specification for its Catalina AI Compute Shelf, which is configured to deliver a high-density AI system supporting NVIDIA GB200. Catalina is based on ORv3 and supports up to 140 kW, including the Meta Wedge fabric switches for the NVIDIA NVL72 architecture. This contribution by Meta complements NVIDIA's earlier contribution of its MGX-based GB200-NVL72 platform, covering (1) its reinforced OCP ORv3 rack architecture and (2) its 1RU liquid-cooled MGX compute and switch trays.
The OCP Open Systems for AI strategic initiative was launched in January 2024 in recognition that AI is today's most prominent data center use case driving innovation, followed by HPC and the emerging edge. OCP's greatest strength is its community-driven model. By uniting leaders, innovators, and experts from across the technology spectrum, OCP is tackling the multi-dimensional design challenges of AI infrastructure. This initiative brings together the work of the OCP Community to deliver the next generation of data centers and IT equipment to meet AI's scale and workload diversity.
“The AI-capable data center build-out is now in its third year, with first-generation systems being deployed and the next generation on the drawing board. Due to the speed with which the market had to move, the first-generation systems were mostly designed in silos, resulting in higher costs due to fragmentation. It is the right time for an organization like OCP to be facilitating a community to determine commonalities leading to standardizations that can help accelerate the market for future generations of AI cluster deployments,” said Ashish Nadkarni, Group Vice President and General Manager, Worldwide Infrastructure at IDC.
OCP and UALink™ Consortium announce a new collaboration
The Open Compute Project Foundation and the Ultra Accelerator Link™ (UALink™) Consortium will collaborate to enhance scale-up interconnect performance in AI clusters and High-Performance Computing (HPC). The UALink Consortium is developing an open industry standard for high-performance accelerated compute scale-up interconnects tailored for AI and HPC workloads, while the OCP Community is actively designing sustainable, large-scale data center infrastructure with a focus on Open Systems for AI. Together, OCP and UALink aim to integrate UALink’s scale-up AI interconnect technology into OCP Community-delivered AI clusters, providing the high-bandwidth, low-latency, low-power connectivity required for high-performance AI training and inference.
“The rapid adoption of AI across industries, from autonomous systems to enterprise analytics, is driving unprecedented demand for scalable, high-performance AI infrastructure. This has created a pivotal moment for data center investments, with hyperscale operators deploying large-scale AI clusters to meet these needs. By collaborating, the UALink Consortium and the OCP Community can shape system specifications to address critical challenges in interconnect bandwidth and scalability posed by advanced AI models,” said George Tchaparian, CEO at the OCP Foundation.
Key aspects of the collaboration will focus on aligning OCP’s community-led infrastructure development with UALink’s interconnect innovations, ensuring seamless integration and shared objectives. The alliance will leverage the expertise of both organizations to advance scale-up AI interconnect performance. Following the release of the UALink 1.0 Specification earlier this month, both organizations and their communities are preparing to collaborate across OCP’s Open Systems for AI Strategic Initiative and OCP’s Future Technologies Initiative Short-Reach Optical Interconnect workstream.
“AI and HPC workloads require ultra-low latency and massive bandwidth to handle the scale and complexity of accelerated compute data processing to meet LLM requirements. The UALink Consortium was formed to create an open standard for accelerated compute interconnects that meets these demands, enabling faster and more efficient data exchange. Partnering with the OCP Community will accelerate the adoption of UALink’s innovations into complete systems, delivering transformative performance for AI markets,” said Peter Onufryk, UALink Consortium President.
“The surge in generative AI and HPC applications is placing immense pressure on data center interconnects to deliver the bandwidth and responsiveness needed for training and inference. The alliance between OCP and UALink creates a powerful collaborative framework to develop and integrate advanced interconnect solutions, enhancing the performance of large-scale AI clusters. This alliance has the potential to redefine industry solutions for AI infrastructure,” said Sameh Boujelbene, VP at Dell’Oro Group.
