July 30, 2024

VP of Product Marketing

A High-Performance Back-End Ethernet Fabric for AI GPU Clusters

There’s an exciting race going on right now – and I’m not referring to a sporting event at the Olympics. The race under discussion is all about developing a high-performance back-end Ethernet fabric for artificial intelligence (AI) GPU clusters.

The back-end networking fabric of an AI GPU cluster is different from most networking segments you typically encounter. That’s because this piece of the network has a dramatic effect on the performance of the GPU cluster, which, in turn, affects the business case and ROI of such a cluster.


As the back-end fabric carries remote direct memory access (RDMA) traffic, any network hiccup, such as packet loss, failover or even jitter, results in one of two negative outcomes. 

The first outcome is that a graphics processing unit (GPU), or a group of GPUs, sits idle, waiting for networking resources and for the data needed to continue processing the job. The second outcome is that the job fails and must roll back to the last checkpoint and redo part of the work.

Both outcomes increase the job completion time (JCT), which is the key parameter we want to minimize in order to best utilize the extremely expensive GPU resources.
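To make the two outcomes concrete, here is a minimal back-of-the-envelope sketch of how they add up in JCT. The model and all of its parameter names are assumptions made for illustration (they are not from any vendor tool): stalls add idle time directly, and each failure costs the work done since the last checkpoint plus a fixed restart overhead.

```python
def job_completion_time(ideal_s, stall_s, failure_points_s,
                        checkpoint_interval_s, restart_s):
    """Estimate JCT in seconds under a deliberately simple, assumed model.

    ideal_s               -- run time with a perfect network
    stall_s               -- total time GPUs sit idle waiting on the network
    failure_points_s      -- job progress (seconds of useful work) at each failure
    checkpoint_interval_s -- useful work between two checkpoints
    restart_s             -- fixed overhead to reload a checkpoint and resume
    """
    # Work done since the last checkpoint is lost and must be redone.
    rework_s = sum(p % checkpoint_interval_s for p in failure_points_s)
    return ideal_s + stall_s + rework_s + restart_s * len(failure_points_s)


# One-hour job, 2 minutes of stalls, one failure 1000 s in,
# checkpoints every 600 s, 30 s to restart:
print(job_completion_time(3600, 120, [1000], 600, 30))  # 3600 + 120 + 400 + 30 = 4150
```

Even in this toy model, a single failure shortly before a checkpoint costs far more than a few stalls, which is why both outcomes matter.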

In large clusters, which bundle hundreds or thousands of GPUs and are used predominantly for training, an Ethernet back-end fabric is prone to packet loss and jitter. This is due to the nature of RDMA traffic, which creates “elephant flows”: a small number of long-lived, high-bandwidth flows. Basic Ethernet load-balancing mechanisms, such as equal-cost multipath (ECMP), simply cannot handle them, because ECMP pins each flow to a single path based on a hash of its headers, so a few large flows can pile onto one link while other links sit idle.
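The elephant-flow problem can be illustrated with a small sketch. It assumes a toy model in which ECMP hashes each flow's 5-tuple to one link, and compares that with spraying individual packets round-robin across all links. The function names, the CRC32 hash, and the example flows are all illustrative assumptions, not how any particular switch implements ECMP.

```python
import zlib


def ecmp_link(five_tuple, n_links):
    """Pick one link per flow by hashing its 5-tuple (fixed for the flow's lifetime)."""
    key = "|".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % n_links


def ecmp_load(flows, n_links):
    """Packets per link when every flow is pinned to its hashed link."""
    load = [0] * n_links
    for five_tuple, packets in flows:
        load[ecmp_link(five_tuple, n_links)] += packets
    return load


def spray_load(flows, n_links):
    """Packets per link when each packet goes to the next link in turn."""
    load = [0] * n_links
    next_link = 0
    for _, packets in flows:
        for _ in range(packets):
            load[next_link] += 1
            next_link = (next_link + 1) % n_links
    return load


# Four hypothetical elephant flows of 1000 packets each, eight uplinks.
flows = [((f"10.0.0.{i}", f"10.0.1.{i}", 49152, 4791, "UDP"), 1000)
         for i in range(4)]
print("ECMP :", ecmp_load(flows, 8))
print("Spray:", spray_load(flows, 8))
```

With only four flows and eight links, ECMP can use at most four links no matter how the hash falls, and any collision doubles the load on one of them. Spraying spreads the same traffic evenly, at the cost of possible packet reordering that something in the system must then handle.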

The InfiniBand vs Ethernet dilemma

Traditionally, InfiniBand has been the technology of choice for such a fabric, as it provides excellent performance for these kinds of applications. InfiniBand, though, has its drawbacks. It is, in practice, a vendor-locked solution (controlled by Nvidia), it is relatively expensive, it calls for a specific skill set, and it requires fine-tuning for each type of workload run on the cluster fabric.

The obvious alternative is Ethernet. Yet Ethernet is, by nature, a lossy technology and, as mentioned, falls short on performance for this kind of traffic. The race is on, therefore, to create an Ethernet fabric that will provide performance on par with, or even better than, InfiniBand.

As in most races, there is more than one contender. While all of the alternatives aim to resolve congestion in the network, they do so with fundamentally different approaches.

The challenges of the network endpoint approach 

In order to overcome elephant flow-driven congestion, a new type of load balancing is required. This could be achieved with the network endpoints being aware of the current congestion heatmap, and then spraying packets into the fabric accordingly.  
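As a rough illustration of the idea, the sketch below assumes an endpoint that holds a per-path congestion estimate (here simply a queue depth per path) and sends each packet on the currently least-congested path. The function name and the queue-depth model are hypothetical; real implementations differ and, importantly, only see the heatmap after some feedback delay.

```python
def spray_by_congestion(n_packets, queue_depth):
    """Send each packet on the path with the smallest estimated queue depth.

    queue_depth -- the endpoint's current per-path congestion estimate
    Returns (packets sent per path, updated depth estimate).
    """
    depth = list(queue_depth)      # work on a copy of the heatmap
    sent = [0] * len(depth)
    for _ in range(n_packets):
        # Least-congested path; ties go to the lowest path index.
        path = min(range(len(depth)), key=depth.__getitem__)
        depth[path] += 1           # our own packet adds to that queue
        sent[path] += 1
    return sent, depth


# Path 0 is already congested, so new packets steer around it.
sent, depth = spray_by_congestion(9, [5, 0, 2, 0])
print(sent)   # [0, 4, 2, 3]
print(depth)  # [5, 4, 4, 3]
```

The sketch treats the heatmap as instantly accurate. In practice the estimate lags behind the real state of the fabric, so the endpoint is always reacting to congestion that has already formed.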

This is harder than it sounds, of course, but several initiatives have taken this approach. They include the Nvidia Spectrum-X solution, additional vendor-driven solutions, and, most notably, the standards-based Ultra Ethernet Consortium (UEC) effort.

There are two main challenges with the endpoint approach. 

First, it is reactive by nature: congestion occurs, and only then is it mitigated. This limits the performance gain. Very high performance can still be achieved, but the approach mitigates congestion rather than avoiding it altogether.

Second, this kind of mechanism requires very smart endpoints. In our case, the endpoints are the servers’ network interface cards (NICs). These smart(er) NICs are more expensive (compare the Nvidia BlueField-3 to the ConnectX-7 and you will see a cost increase of around 50% per endpoint). Perhaps even more importantly, the greater computing power required at the endpoints also means higher power consumption there, up to three times higher.

The new fabric approach 

This approach changes the fabric and does not rely on endpoints to mitigate congestion.  

The interfaces towards the endpoints are standard Ethernet interfaces without enhancements, while the fabric’s internal interfaces run a non-Ethernet mechanism (packet- or cell-based) that sprays traffic across the entire fabric to ensure perfectly equal load balancing. This, combined with a virtual output queue (VoQ) mechanism and grant-based flow control, yields a non-blocking, congestion-free fabric, just like the backplane of a chassis.
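The interplay of these mechanisms can be sketched in a few lines. This is a simplified model under stated assumptions, not the DDC specification: the class and function names, the 256-byte cell size, and the first-come grant policy are all hypothetical. Ingress devices hold packets in a queue per egress port (the VoQ), the egress hands out credits that never exceed its capacity, and only granted packets are cut into cells and sprayed round-robin over the fabric links.

```python
from collections import deque

CELL_BYTES = 256  # hypothetical cell size, for illustration only


def issue_grants(requests, egress_capacity_bytes):
    """Egress side: split this slot's capacity among requesting ingresses.

    requests -- {ingress_id: queued_bytes}; returns {ingress_id: credit_bytes}.
    Total credit never exceeds capacity, so the egress port cannot be overrun.
    """
    grants, remaining = {}, egress_capacity_bytes
    for ingress_id, queued in sorted(requests.items()):
        credit = min(queued, remaining)
        grants[ingress_id] = credit
        remaining -= credit
    return grants


class Ingress:
    def __init__(self, n_fabric_links):
        self.voq = {}                          # egress port -> queued packet sizes
        self.link_cells = [0] * n_fabric_links  # cells sent per fabric link
        self._next = 0                          # round-robin pointer

    def enqueue(self, egress, packet_bytes):
        """Hold the packet in the VoQ for its egress port until granted."""
        self.voq.setdefault(egress, deque()).append(packet_bytes)

    def on_grant(self, egress, credit_bytes):
        """Send whole packets that fit in the credit, sprayed as cells."""
        queue = self.voq.get(egress, deque())
        sent = 0
        while queue and queue[0] <= credit_bytes - sent:
            packet = queue.popleft()
            n_cells = -(-packet // CELL_BYTES)  # ceiling division
            for _ in range(n_cells):
                self.link_cells[self._next] += 1
                self._next = (self._next + 1) % len(self.link_cells)
            sent += packet
        return sent


ingress = Ingress(n_fabric_links=4)
for size in (1024, 1024, 512):
    ingress.enqueue("port7", size)
print(ingress.on_grant("port7", 2048))  # bytes actually sent under this grant
print(ingress.link_cells)               # cells per fabric link
```

Because nothing enters the fabric without a grant, congestion is avoided up front rather than mitigated afterwards, and because cells are sprayed evenly, flow size no longer determines link load.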

In fact, these are the same mechanisms that are used in a chassis fabric, hence the name Distributed Disaggregated Chassis (DDC). DDC is the Open Compute Project (OCP) standard that defines this architecture. Similar implementations include Arista’s Distributed Etherlink Switch (DES) and Cisco’s Disaggregated Scheduled Fabric (DSF).

While the challenge of this approach is the need for an entirely new fabric, its benefits are clear. Most significantly, it can lead to a performance improvement of more than 30%, in terms of JCT, compared to standard Ethernet. To learn more about DDC for AI fabric applications, see the white paper “Utilizing Distributed Disaggregated Chassis (DDC) for Back-End AI Networking Fabric.”
