October 30, 2024

VP of Product Marketing

OCP Global Summit – AI, Cooling, and Cool Networking Innovations

The OCP Global Summit took place a couple of weeks ago in San Jose, California. As usual, it was a great place to catch up on market and technology trends and innovations. 

Not surprisingly, the main topic discussed at this event was AI. 

The interesting thing, though, is that the most notable AI-related topic was the effort to optimize the infrastructure supporting AI. On that front, two topics dominated: physical infrastructure (predominantly power and cooling) and networking.

AI infrastructure – the Meta perspective 

In his keynote speech, Omar Baldonado, Director of Engineering, Network Infra at Meta, highlighted the main elements of Meta’s vision of open systems for AI. As mentioned, the two main topics were physical infrastructure, with the introduction of the Catalina architecture for a 140kW(!) power rack, and networking.

In networking, Meta announced FBNIC, a Meta-designed network ASIC, a couple of 51T switches (with Broadcom and Cisco ASICs) and their next-generation AI fabric – DSF (Disaggregated Scheduled Fabric).

Source: Meta keynote at OCP Global Summit, 2024 

To learn more download the eGuide
AI Cluster Reference Design

A closer look at Meta’s DSF 

If the concept of DSF looks familiar to you, this is no coincidence. DSF is an implementation of the OCP DDC (Distributed Disaggregated Chassis) architecture that is used in multiple use cases in service providers’ and AI infrastructure networks. 

Meta, together with Broadcom, provided additional details about this architecture in a dedicated session titled “Evolving FBOSS for the Next-Gen AI Fabric.” In that session, the main principles of DSF were described:

  • Near-optimal load balancing 
  • Credit allocation providing smoother bandwidth delivery 
  • Fabric performing spraying/reassembly, giving flexibility/optionality for endpoints 
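To make the spraying/reassembly idea concrete, here is a minimal toy sketch in Python. It is not Meta's or Broadcom's implementation. The names `spray`, `reassemble` and `CELL_SIZE`, the round-robin policy, and the sequence-number scheme are all assumptions made for illustration. It only shows the general principle: the fabric cuts a packet into cells, spreads them evenly over all links, and restores the original order at the far end, so the endpoints never see the reordering.

```python
import random
from itertools import cycle

CELL_SIZE = 4  # bytes per cell; a toy value chosen for illustration only


def spray(packet: bytes, num_links: int) -> list:
    """Split a packet into fixed-size cells and spread them round-robin over links."""
    links = [[] for _ in range(num_links)]
    link_ids = cycle(range(num_links))
    for seq, offset in enumerate(range(0, len(packet), CELL_SIZE)):
        # Each cell carries a sequence number so the egress side can reorder it.
        links[next(link_ids)].append((seq, packet[offset:offset + CELL_SIZE]))
    return links


def reassemble(links: list) -> bytes:
    """Collect cells from all links in arbitrary arrival order and restore the packet."""
    arrived = [cell for link in links for cell in link]
    random.shuffle(arrived)  # model cells arriving out of order across links
    arrived.sort(key=lambda cell: cell[0])  # reorder by sequence number
    return b"".join(payload for _, payload in arrived)


packet = b"all-to-all collective traffic"
links = spray(packet, num_links=4)
restored = reassemble(links)
```

Because cells are distributed round-robin, the per-link cell counts differ by at most one, which is the intuition behind the near-optimal load balancing in the first bullet. A real scheduled fabric also gates transmission with credits, which this sketch leaves out.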

Meta also shared a comparison between DSF and non-scheduled Ethernet performance: 

  • DSF gains 10+% in all-to-all (BW-intensive collective) 
  • DSF is on par for non-BW-intensive collectives – allreduce, allgather and reduce-scatter 

DSF topology, as presented by Meta and Broadcom 

Additional DDC performance information 

Meta was not the only one to talk about the DDC/DSF architecture. Other sessions shared notes from field deployments and testing, highlighting performance and convergence figures.

Two notable ones were the “Insights From Production: Scheduled Ethernet Fabric in Large AI Training Clusters” session, presented by ByteDance and Broadcom, and the “Congestion Management in an Ethernet Based Network for AI Cluster Fabric” session, presented by Accton and… yours truly. 

DDC evolution – NCCM  

Speaking of DDC, it’s worth mentioning that this architecture does not stand still. In the service provider context, a major contribution was accepted by the OCP during the event. The OCP “25/100G NCCM Router Specifications” document, contributed by AT&T Labs, UfiSpace and DriveNets, introduces a converged control and management architecture. In this architecture, a new module, the NCCM, replaces the NCC and NCM modules of the “classic” DDC architecture, significantly simplifying the network architecture.

There were, of course, many other topics and discussions at this year’s Global Summit. You can view recorded sessions and their slides at this website: https://www.opencompute.org/events/past-events/2024-ocp-global-summit.
