The interesting thing, though, is that the most notable topic related to AI was the effort to optimize the infrastructure supporting AI. On that, two topics were dominant – physical infrastructure (predominantly power and cooling) and networking.
AI infrastructure – the Meta perspective
In his keynote speech, Omar Baldonado, Director of Engineering, Network Infra at Meta, highlighted the main elements of Meta’s open systems for AI vision. As mentioned, the two main topics were physical infrastructure, with the introduction of the Catalina architecture for a 140kW(!) power rack, and networking.
In networking, Meta announced FBNIC, a Meta-designed network ASIC, a couple of 51T switches (with Broadcom and Cisco ASICs) and their next-generation AI fabric – DSF (Disaggregated Scheduled Fabric).
Source: Meta keynote at OCP Global Summit, 2024
A closer look at Meta’s DSF
If the concept of DSF looks familiar to you, this is no coincidence. DSF is an implementation of the OCP DDC (Distributed Disaggregated Chassis) architecture, which is used in multiple use cases across service provider and AI infrastructure networks.
Meta, together with Broadcom, provided additional details regarding this architecture in a dedicated session entitled “Evolving FBOSS for the Next-Gen AI Fabric.” In that session, the main principles of DSF were described:
- Near-optimal load balancing
- Credit allocation providing smoother bandwidth delivery
- Fabric performing spraying/reassembly, giving flexibility/optionality for endpoints
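To make the spraying/reassembly idea more concrete, here is a minimal toy sketch in Python. It is not Meta's or Broadcom's implementation, and it does not model credit allocation. It only assumes that a packet is cut into fixed-size cells, that the cells are spread round-robin over all fabric links, and that the egress side restores the order using a sequence number. The names `spray`, `reassemble` and `CELL_SIZE` are hypothetical and chosen for illustration only.

```python
CELL_SIZE = 4  # bytes per cell; hypothetical value, kept small for readability


def spray(packet: bytes, num_links: int, cell_size: int = CELL_SIZE):
    """Cut a packet into cells and spread them round-robin over the fabric links."""
    links = [[] for _ in range(num_links)]
    cells = [packet[i:i + cell_size] for i in range(0, len(packet), cell_size)]
    for seq, cell in enumerate(cells):
        # Each cell carries its sequence number so the egress side can reorder.
        links[seq % num_links].append((seq, cell))
    return links


def reassemble(links):
    """Gather cells from all links and restore the original byte order."""
    cells = [cell for link in links for cell in link]
    cells.sort(key=lambda item: item[0])  # order by sequence number
    return b"".join(payload for _, payload in cells)


if __name__ == "__main__":
    packet = b"all-to-all collective traffic"
    links = spray(packet, num_links=3)
    for index, link in enumerate(links):
        print(f"link {index}: {len(link)} cells")
    print(reassemble(links) == packet)
```

Because cells are distributed evenly regardless of which flow they belong to, no single link carries more than one cell above any other in this toy model. That is the intuition behind near-optimal load balancing, and since the endpoints only ever see whole packets, the fabric stays transparent to them.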
Meta also shared a comparison between DSF and non-scheduled Ethernet performance:
- DSF gains 10+% in all-to-all (BW-intensive collective)
- DSF is on par for non-BW-intensive collectives – allreduce, allgather and reduce-scatter
DSF topology, as presented by Meta and Broadcom
Additional DDC performance information
Meta was not the only one to talk about the DDC/DSF architecture. Other sessions shared notes from field deployments and testing, highlighting performance and convergence figures.
Two notable ones were the “Insights From Production: Scheduled Ethernet Fabric in Large AI Training Clusters” session, presented by ByteDance and Broadcom, and the “Congestion Management in an Ethernet Based Network for AI Cluster Fabric” session, presented by Accton and… yours truly.
DDC evolution – NCCM
Speaking of DDC, it’s worth mentioning that this architecture does not stand still. On the service provider side, a major contribution was accepted by the OCP during the event. OCP’s “25/100G NCCM Router Specifications” document, contributed by AT&T Labs, UfiSpace and DriveNets, introduces a converged control and management architecture. In this architecture, a new module, the NCCM, replaces the NCC and NCM modules of the “classic” DDC architecture, thus significantly simplifying the network architecture.
There were, of course, many other topics and discussions at this year’s Global Summit. You can view recorded sessions and their slides at this website: https://www.opencompute.org/events/past-events/2024-ocp-global-summit.