AI Fabric Overview
For large clusters bundling hundreds or thousands of GPUs, back-end networking (the fabric) is a crucial element, impacting the overall performance of the cluster and the efficient utilization of its compute resources. DriveNets’ AI fabric offers the highest-performance Ethernet-based DDC scheduled fabric as a strong alternative to InfiniBand for the back-end network of large-scale GPU clusters.

InfiniBand Drawbacks

Traditionally, InfiniBand has been the technology of choice for AI fabrics, as it provides excellent performance for these kinds of applications.

InfiniBand drawbacks:

  • Practically, a vendor-locked solution (controlled by Nvidia)
  • Relatively expensive
  • Requires a specific skillset and significant fine-tuning for each type of workload running on the cluster

Ethernet Challenges

The obvious alternative to InfiniBand is Ethernet. Yet Ethernet is, by nature, a lossy technology; the resulting packet loss and higher latency mean it cannot, on its own, provide adequate performance for large clusters.

  • The Ultra Ethernet Consortium (UEC) aims to resolve Ethernet’s drawbacks by adding congestion control and quality-of-service mechanisms to the Ethernet standards.
  • The emerging Ultra Ethernet standard, whose first version release is expected in late 2024, will allow hyperscalers and enterprises to use Ethernet with less performance compromise.
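
To give a rough sense of the kind of mechanism being standardized, below is a minimal, simplified sketch of ECN-driven sender rate control, in the spirit of schemes (such as DCQCN) already used on RoCE/Ethernet fabrics today. It is not the UEC algorithm; the class name, constants, and structure are illustrative assumptions meant only to show the feedback loop: marked packets trigger a multiplicative rate decrease, while unmarked rounds allow additive recovery.

```python
# Illustrative sketch of ECN-driven sender rate control (DCQCN-like), NOT the UEC
# algorithm. All constants are assumptions chosen to make the feedback loop visible.

class EcnRateController:
    def __init__(self, line_rate_gbps: float):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps      # current sending rate
        self.alpha = 0.0                # smoothed congestion estimate

    def on_feedback(self, ecn_marked_fraction: float) -> float:
        """Update the sending rate from one round of ECN feedback."""
        g = 0.0625                                          # smoothing gain (assumed)
        self.alpha = (1 - g) * self.alpha + g * ecn_marked_fraction
        if ecn_marked_fraction > 0:
            self.rate *= (1 - self.alpha / 2)               # multiplicative decrease
        else:
            self.rate = min(self.line_rate,
                            self.rate + 0.05 * self.line_rate)  # additive recovery
        return self.rate


ctrl = EcnRateController(line_rate_gbps=400)
for marked in [0.0, 0.3, 0.3, 0.0, 0.0]:                    # synthetic feedback rounds
    print(f"marked={marked:.1f} -> rate={ctrl.on_feedback(marked):.1f} Gbps")
```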

Fabric-Scheduled Ethernet

Chassis or Clos AI fabric? What about both?
Distributed Disaggregated Chassis (DDC-AI) is the most proven architecture for building an Ethernet-based, open and congestion-free fabric for high-scale AI clusters.

Fabric-Scheduled Ethernet offers:

  • Standard Ethernet
  • AI scheduled fabric – a predictable, lossless cluster back-end network
  • Proven performance on par with InfiniBand
  • Highest AI scale (up to 32K GPUs at 800Gbps)
  • Up to 30% JCT (job completion time) improvement compared to standard Ethernet Clos
  • Over $50M total cost of ownership (TCO) reduction for an 8K-GPU cluster
  • No vendor lock – supports any GPU, any NIC
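
For intuition on what a scheduled fabric means in practice, here is a minimal conceptual sketch of the cell-spraying idea behind DDC-style fabrics: the ingress side chops each packet into fixed-size cells and sprays them evenly across all fabric links so no single link becomes a hotspot, and the egress side reassembles them in order. The cell size, function names, and round-robin policy are illustrative assumptions, not DriveNets’ actual implementation.

```python
# Conceptual sketch of cell spraying in a scheduled fabric. Cell size and the
# round-robin policy are assumptions for illustration only.

from itertools import cycle

CELL_SIZE = 256  # bytes (assumed)

def spray(packet: bytes, num_fabric_links: int):
    """Ingress: split a packet into cells and assign each cell to a fabric link round-robin."""
    cells = [packet[i:i + CELL_SIZE] for i in range(0, len(packet), CELL_SIZE)]
    links = cycle(range(num_fabric_links))
    return [(next(links), seq, cell) for seq, cell in enumerate(cells)]

def reassemble(cells):
    """Egress: reorder cells by sequence number and rebuild the original packet."""
    return b"".join(cell for _, seq, cell in sorted(cells, key=lambda c: c[1]))

packet = bytes(1500)                        # a 1500-byte packet
sprayed = spray(packet, num_fabric_links=8)
assert reassemble(sprayed) == packet
print(f"{len(sprayed)} cells spread over 8 fabric links")
```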


DDC Scheduled Fabric – The Best-Performing Solution

The best solution, in terms of both performance and cost, is the Distributed Disaggregated Chassis (DDC) scheduled fabric:

  • Not vendor-locked
  • Does not require the heavy lifting of SmartNICs
  • Makes the AI infrastructure lossless and predictable without requiring additional technologies to mitigate congestion

The Costs of NIC-Based Solutions

Ultra Ethernet, however, relies on algorithms running at the edges of the fabric – specifically on the smart network interface cards/controllers (SmartNICs) that reside in the GPU servers.

This means:

  • A heavier compute burden on those SmartNICs, and therefore higher costs
  • Greater power consumption

For instance, consider a move from the ConnectX-7 NIC (a more basic NIC, even though it is considered a SmartNIC) to the BlueField-3 SmartNIC (also called a data processing unit, or DPU); this translates into a ~50% higher cost per end device and a threefold increase in power consumption.
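
As a back-of-the-envelope illustration of how those per-device deltas compound at cluster scale, the sketch below applies the ~50% cost uplift and ~3x power multiplier to an 8K-GPU cluster. The baseline cost and power figures are placeholder assumptions, not vendor list prices; only the multipliers come from the comparison above.

```python
# Back-of-the-envelope impact of moving every end device from a basic SmartNIC to a DPU.
# Baseline figures below are placeholder assumptions, not vendor pricing.

GPUS = 8_000                      # cluster size (one NIC per GPU assumed)
BASE_NIC_COST_USD = 2_000         # assumed ConnectX-7-class NIC cost
BASE_NIC_POWER_W = 25             # assumed ConnectX-7-class NIC power draw

dpu_cost = BASE_NIC_COST_USD * 1.5    # ~50% higher cost per end device
dpu_power = BASE_NIC_POWER_W * 3      # ~3x power consumption

extra_capex = (dpu_cost - BASE_NIC_COST_USD) * GPUS
extra_power_kw = (dpu_power - BASE_NIC_POWER_W) * GPUS / 1_000

print(f"Extra NIC capex for {GPUS} GPUs: ${extra_capex:,.0f}")
print(f"Extra NIC power draw: {extra_power_kw:,.0f} kW")
```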

This is also the case with another alternative to InfiniBand that comes from Nvidia: the Spectrum-X solution (based on the Spectrum-4 and future Spectrum-6 ASICs), which is:

  • Another Ethernet-based solution (like that of the UEC) that resolves congestion at the end devices
  • Also locked to Nvidia as a vendor

