August 14, 2024

VP of Product Marketing

Fabric for AI Training Clusters – Should InfiniBand Be Your Default?

Are you one of the lucky ones who get to build large-scale AI training clusters? If so, you’ve probably realized by now that networking, and more specifically the cluster fabric, is key to ensuring optimal job completion time (JCT) for such clusters and the workloads they run.

But how do you select the right fabric for your cluster?

Should you take some time to evaluate the alternatives, or should you simply go with the default choice?


Things you (should) care about when selecting AI cluster fabric

There are several key indicators you should consider when selecting an AI cluster fabric:

1. High performance at scale

  • JCT: When a cluster spans several hundred or thousands of GPUs, optimizing JCT for training jobs becomes a non-trivial task that is crucial for both performance and cost.
  • Resiliency: A high-availability, low-latency, jitter-free, and lossless environment is required to minimize GPU idle cycles.
  • Robustness: Seamless recovery from failures is more important than you might think; consider how many network-related faults Meta encountered during its Llama 3.1 training run.
  • Consistent fabric performance: All of the above call for consistent fabric performance (in terms of non-blocking bisection bandwidth, latency, and packet loss). This means using some kind of packet-spraying or cell-spraying mechanism, at the endpoints or in the fabric itself, to achieve even load distribution and avoid elephant-flow congestion (a simplified sketch of the difference follows this list).
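
To make the load-distribution point concrete, here is a minimal, illustrative Python sketch (flow sizes and link counts are made-up assumptions, not measurements) contrasting per-flow ECMP hashing, where an elephant flow pins all of its traffic to a single link, with per-packet/cell spraying, which spreads the same traffic evenly across all fabric links:

```python
from collections import defaultdict

LINKS = 4  # hypothetical number of parallel fabric links

# A couple of "elephant" flows plus many small "mice" flows (arbitrary units)
flows = [("elephant-1", 1000), ("elephant-2", 1000)] + [
    (f"mouse-{i}", 10) for i in range(20)
]

def per_flow_ecmp(flows):
    """Classic ECMP: every packet of a flow hashes to the same link."""
    load = defaultdict(int)
    for name, size in flows:
        link = sum(name.encode()) % LINKS   # stand-in for a 5-tuple hash
        load[link] += size                  # the whole flow lands on one link
    return dict(load)

def per_packet_spray(flows):
    """Packet/cell spraying: each unit of traffic can take a different link."""
    load = defaultdict(int)
    nxt = 0
    for _, size in flows:
        for _ in range(size):               # spray unit by unit, round-robin
            load[nxt] += 1
            nxt = (nxt + 1) % LINKS
    return dict(load)

print("per-flow ECMP   :", per_flow_ecmp(flows))
print("per-packet spray:", per_packet_spray(flows))
# With ECMP, any link that "wins" an elephant flow carries 1,000+ units while
# others carry only mice traffic; with spraying, all links stay within one
# unit of each other, so no single link becomes a congestion hot spot.
```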

2. Architectural flexibility/openness

To keep all options open, you should prefer an open ecosystem that allows you not only to choose any GPU or NIC, but also to run any workload on the cluster.

3. Trusted solution

To reduce risk, you should go with a trusted solution that has been implemented successfully.

And, as always, there are cost considerations.

Leading AI cluster fabric alternatives

Given the above considerations, the network fabric is clearly not something you want to choose by default.

Let’s look at the four main alternatives you can consider:

1. InfiniBand

Traditionally, InfiniBand (IB) has been the technology of choice for such a fabric, as it provides excellent performance for these kinds of applications. IB, though, has its drawbacks. In practice, it is a vendor-locked solution (controlled by Nvidia), it is relatively expensive, and it requires a specific skillset as well as fine-tuning for each type of workload run on the cluster.

2. Standard (“classic”) Ethernet Clos

The obvious alternative is Ethernet. Yet standard Ethernet is, by nature, a lossy, best-effort technology; it suffers from higher latency and packet loss and cannot provide adequate performance for large clusters.

3. Endpoint-based and/or telemetry-based congestion-controlled (CC) Ethernet

There are two flavors here:

  • UEC v1.0: The standards-based flavor is the Ultra Ethernet Consortium (UEC) v1.0 set of specifications, expected to be released in late 2024. The UEC aims to resolve Ethernet-related drawbacks by adding congestion-control and quality-of-service mechanisms to the Ethernet standards. The emerging Ultra Ethernet standard will allow hyperscalers and enterprises to use Ethernet with less performance compromise. This is done by utilizing a packet-spraying mechanism at the network interface card/controller (NIC) or data processing unit (DPU) endpoints.
  • Proprietary: The second flavor is similar but proprietary. This includes the Spectrum-X solution from Nvidia, as well as similar solutions from Cisco, Arista, and Juniper.

Both Ultra Ethernet and proprietary solutions, however, rely on algorithms running at the edges of the fabric, specifically on the smart NICs (SmartNICs) that reside in the GPU servers. This means a heavier compute burden on those SmartNICs, higher costs, and greater power consumption. Take, for instance, a move from the ConnectX-7 NIC (a relatively basic NIC, even though it is considered a SmartNIC) to the BlueField-3 SmartNIC (also called a DPU); this translates into roughly 50% higher cost per end device and a threefold increase in power consumption.
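
To illustrate the kind of work these endpoint algorithms do, below is a minimal, DCQCN-style sketch of ECN-driven rate adjustment at a NIC. The constants and the congestion-feedback pattern are assumptions chosen for illustration, not any vendor's actual parameters:

```python
# Illustrative sketch of endpoint-based congestion control on a SmartNIC/DPU:
# back off quickly when the fabric marks packets with ECN, recover slowly
# when the path looks clear. Values below are assumed, not vendor defaults.
LINE_RATE_GBPS = 400.0

def adjust_rate(current_rate, ecn_marked, alpha=0.5, additive_step=10.0):
    """React to congestion feedback from the fabric."""
    if ecn_marked:
        # Multiplicative decrease when switches signal congestion via ECN
        return max(current_rate * (1 - alpha / 2), 1.0)
    # Additive increase while the path appears uncongested
    return min(current_rate + additive_step, LINE_RATE_GBPS)

rate = LINE_RATE_GBPS
feedback = [False, False, True, True, False, False, False, True, False, False]
for i, marked in enumerate(feedback):
    rate = adjust_rate(rate, marked)
    print(f"interval {i}: ecn={marked}  send rate ~ {rate:.0f} Gbps")
# Note: the endpoint only *reacts* after congestion has already formed inside
# the fabric -- this is mitigation, not avoidance, and it consumes NIC/DPU
# compute cycles on every GPU server.
```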

4. Scheduled fabric-based congestion-avoidance Ethernet

The scheduled fabric solution goes by multiple names, such as DriveNets DDC, Arista DES, and Cisco DSF. All utilize an in-fabric spraying mechanism for even load balancing, without requiring any changes at the endpoints.
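
A simplified sketch of the two mechanisms at play, assuming an idealized model rather than any specific vendor's implementation: ingress-side virtual output queues (VOQs) that transmit only against credits granted by the egress, and slicing of granted traffic into cells that are sprayed across all fabric links:

```python
# Illustrative scheduled-fabric model (assumptions, not vendor code):
# (1) egress grants credit only for what it can drain, so fabric queues never
#     overflow (congestion avoidance rather than mitigation);
# (2) granted packets are sliced into cells and sprayed round-robin across
#     all fabric links for even utilization.
from collections import deque

FABRIC_LINKS = 4
CELL_SIZE = 256  # bytes per cell, an arbitrary illustrative value

class Egress:
    def __init__(self, drain_per_round):
        self.drain_per_round = drain_per_round
    def grant(self):
        # Credit reflects actual drain capacity at the destination
        return self.drain_per_round

class Ingress:
    def __init__(self):
        self.voq = deque()                  # one VOQ per destination in a real system
        self.link_load = [0] * FABRIC_LINKS
        self._next = 0
    def enqueue(self, nbytes):
        self.voq.append(nbytes)
    def send(self, credit_bytes):
        sent = 0
        while self.voq and sent + self.voq[0] <= credit_bytes:
            pkt = self.voq.popleft()
            sent += pkt
            # Slice the packet into cells and spray them round-robin
            for _ in range(-(-pkt // CELL_SIZE)):   # ceiling division
                self.link_load[self._next] += 1
                self._next = (self._next + 1) % FABRIC_LINKS
        return sent

ingress, egress = Ingress(), Egress(drain_per_round=4096)
for _ in range(8):
    ingress.enqueue(1500)                   # eight MTU-sized packets arrive
for rnd in range(3):
    sent = ingress.send(egress.grant())
    print(f"round {rnd}: sent {sent} B, cells per link = {ingress.link_load}")
```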

This is the top-performing solution: it avoids congestion altogether, while the alternatives (Ethernet-based as well as InfiniBand) only monitor and mitigate congestion. The performance difference between this solution and the others can be as large as a 30% improvement in JCT.
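
As a back-of-the-envelope illustration of what such a JCT gain means in GPU-hours, consider the sketch below; the cluster size, job duration, and GPU-hour cost are hypothetical assumptions chosen only to make the arithmetic concrete:

```python
# Hypothetical figures for illustration only
gpus = 1024                 # assumed cluster size
baseline_jct_hours = 100.0  # assumed training job duration
gpu_hour_cost = 2.0         # assumed $ per GPU-hour

improved_jct_hours = baseline_jct_hours * (1 - 0.30)   # 30% better JCT
saved_gpu_hours = gpus * (baseline_jct_hours - improved_jct_hours)
print(f"GPU-hours saved per job: {saved_gpu_hours:,.0f}")
print(f"Approx. cost saved per job: ${saved_gpu_hours * gpu_hour_cost:,.0f}")
# 1,024 GPUs x 30 hours saved = 30,720 GPU-hours per job, plus the freed
# capacity to start the next job sooner.
```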

Scheduled fabric: the best solution for AI training clusters

The best solution, in terms of both performance and cost, is one that is not vendor-locked and does not require heavy lifting by SmartNICs: namely, a scheduled fabric. This type of fabric makes the AI infrastructure lossless and predictable, without needing additional technologies to mitigate congestion.

As mentioned, there are several implementations of scheduled fabric. Yet when it comes to a field-proven, ready-for-deployment solution, you are better off going with the DriveNets Network Cloud-AI solution based on Distributed Disaggregated Chassis (DDC).

DriveNets Network Cloud-AI is based on the already massively deployed Network Cloud solution, which serves as the main routing platform for multiple global tier-1 operators.

Network Cloud-AI is already deployed in production environments at hyperscaler AI training clusters.

The pros and cons of the different AI cluster fabric options can be summarized as follows:

  • InfiniBand: excellent performance; vendor-locked, relatively expensive, and requires specialized skills and per-workload tuning
  • Standard Ethernet Clos: open and cost-effective; lossy by nature, with inadequate performance for large clusters
  • Congestion-controlled Ethernet (UEC or proprietary): better performance than classic Ethernet; relies on heavier SmartNIC/DPU endpoints, with higher cost and power consumption
  • Scheduled fabric: top JCT performance through congestion avoidance; open and endpoint-agnostic
