AI Fabric Overview
For large clusters bundling hundreds or thousands of GPUs, the back-end network (or fabric) is a crucial element of the AI cluster, impacting overall performance and the efficient utilization of compute resources. DriveNets’ AI fabric offers the highest-performance Ethernet-based DDC scheduled fabric as a strong alternative to InfiniBand for the back-end network of large-scale GPU clusters.

InfiniBand Drawbacks

Traditionally, InfiniBand has been the technology of choice for AI fabrics, as it provides excellent performance for these kinds of applications.

However, InfiniBand has notable drawbacks:

  • Practically, a vendor-locked solution (controlled by Nvidia)
  • Relatively expensive
  • Requires a specific skillset, as well as fine-tuning for each type of workload running on the cluster

Ethernet Challenges

The obvious alternative to InfiniBand is Ethernet. Yet Ethernet is, by nature, a lossy technology; it suffers from higher latency and packet loss and, on its own, cannot provide adequate performance for large clusters.

  • The Ultra Ethernet Consortium (UEC) aims to resolve Ethernet’s drawbacks by adding congestion control and quality-of-service mechanisms to the Ethernet standards.
  • The emerging Ultra Ethernet standard, whose first version is expected to be released in late 2024, will allow hyperscalers and enterprises to use Ethernet with less of a performance compromise.

The Costs of NIC-Based Solutions

Ultra Ethernet, however, relies on algorithms running on the edges of the fabric, specifically on the smart network interface cards / controllers (SmartNICs) that reside in the GPU servers.

This means:

  • A heavier compute burden on those SmartNICs, driving higher costs
  • Greater power consumption

For instance, moving from the ConnectX-7 NIC (a more basic NIC, even though it is considered a SmartNIC) to the BlueField-3 SmartNIC (also called a data processing unit, or DPU) translates into roughly 50% higher cost per end device and a threefold increase in power consumption.
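
To put these per-device deltas in cluster-scale terms, below is a minimal back-of-the-envelope sketch in Python. Only the ~50% cost uplift and the threefold power increase come from the text above; the baseline per-NIC cost and power figures are hypothetical placeholders.

```python
# Back-of-the-envelope impact of moving every GPU server NIC from a basic
# SmartNIC to a DPU-class card. Only the ratios (+50% cost, 3x power) come
# from the text; the absolute baseline figures are hypothetical placeholders.

BASELINE_NIC_COST_USD = 2_000   # hypothetical per-NIC cost (placeholder)
BASELINE_NIC_POWER_W = 25       # hypothetical per-NIC power draw (placeholder)

COST_UPLIFT = 1.5               # ~50% higher cost per end device
POWER_UPLIFT = 3.0              # ~3x power consumption

def nic_upgrade_delta(num_nics: int) -> tuple[float, float]:
    """Return (extra cost in USD, extra power in W) for upgrading num_nics NICs."""
    extra_cost = num_nics * BASELINE_NIC_COST_USD * (COST_UPLIFT - 1.0)
    extra_power = num_nics * BASELINE_NIC_POWER_W * (POWER_UPLIFT - 1.0)
    return extra_cost, extra_power

if __name__ == "__main__":
    # Example: an 8K-GPU cluster with one NIC per GPU
    cost, power = nic_upgrade_delta(8_000)
    print(f"Extra cost:  ${cost:,.0f}")
    print(f"Extra power: {power / 1_000:,.1f} kW")
```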

This is also the case with another alternative to InfiniBand coming from Nvidia, the Spectrum-X solution (based on their Spectrum-4 and future Spectrum-6 ASICs).

  • Another Ethernet-based solution (like that of the UEC) that resolves congestion at the end devices
  • Also locked to Nvidia as a vendor

DDC Scheduled Fabric – The Best-Performing Solution

The best solution, in terms of both performance and cost, is the Distributed Disaggregated Chassis (DDC) scheduled fabric:

  • Not vendor-locked
  • Does not require heavy lifting from SmartNICs
  • Makes the AI infrastructure lossless and predictable, without requiring additional technologies to mitigate congestion

AI Fabric Building Blocks

Network Cloud Packet Forwarder (NCP)

Supplied by a variety of original design manufacturers (ODMs)

38 x 800G (30.4T): 18 x 800GE NIF + 20 x 800GE Fabric

  • High Scale: 30.4T @ 2RU, 18 x 800G NIF + 20 x 800G Fabric
  • Low Power: single ASIC per white box, 100G SerDes
  • Native 800G: 800G OSFP for NIF and Fabric (supports 2 x 400G OSFP)
Hardware Specifications

Interfaces
  • Network: 18 x 800G OSFP
  • Fabric: 20 x 800G OSFP
  • Inband Mgmt.: 2 x 25G SFP28
  • OOB Mgmt.: 2 x 10G SFP, 1 x 1G RJ45

Performance
  • Switching Capacity: 30.4 Tbps
  • HBM Deep Buffer: 16GB

Physical
  • ASIC: Broadcom Jericho3-AI (BCM88892)
  • Processor: 8 cores (Intel Xeon D-1734NT)
  • Memory: 64GB DDR4 SODIMM
  • Storage: 240GB (2 x 120GB)
  • Chassis: 2RU
  • Power, Typical / Max (with optics): 1350W / 1900W (14.5W per port)
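
As a quick consistency check on these figures, the short Python sketch below uses only the port counts and speeds from the spec above to confirm the stated switching capacity and compare network-facing versus fabric-facing bandwidth.

```python
# Consistency check for the NCP spec above: 18 network + 20 fabric ports at 800G.
NETWORK_PORTS = 18
FABRIC_PORTS = 20
PORT_SPEED_TBPS = 0.8  # 800G per port

network_bw = NETWORK_PORTS * PORT_SPEED_TBPS  # 14.4 Tbps toward GPU servers
fabric_bw = FABRIC_PORTS * PORT_SPEED_TBPS    # 16.0 Tbps toward the NCF spine
total_bw = network_bw + fabric_bw             # 30.4 Tbps, matching the spec

print(f"Network-facing: {network_bw:.1f} Tbps")
print(f"Fabric-facing:  {fabric_bw:.1f} Tbps (~{fabric_bw / network_bw:.2f}x the network side)")
print(f"Total:          {total_bw:.1f} Tbps")
```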

Network Cloud Fabric (NCF)

Supplied by a variety of original design manufacturers (ODMs)

128 x 800G (102.4T)

  • High Scale: 102.4T @ 6RU, 128 x 800G / 256 x 400G
  • Low Power: supports up to 256 leaf devices
  • Native 800G: 800G OSFP (supports 2 x 400G OSFP)
Hardware Specifications

Interfaces
  • Fabric: 128 x 800G OSFP
  • Inband Mgmt.: 2 x 25G SFP28
  • OOB Mgmt.: 2 x 10G SFP, 1 x 1G RJ45

Performance
  • Switching Capacity: 102.4 Tbps

Physical
  • ASIC: 2 x Broadcom Ramon3 (BCM88920)
  • Processor: 4 cores (Intel Xeon D-1713NT)
  • Memory: 32GB DDR4 SODIMM
  • Storage: 240GB (2 x 120GB)
  • Chassis: 6RU
  • Power, Typical / Max (with optics): 3150W / 4600W (14.5W per port)
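
The same kind of arithmetic applies to the NCF. The sketch below is based only on the figures above; the 256 x 400G option assumes each 800G OSFP port is broken out into 2 x 400G, as noted in the spec.

```python
# Consistency check for the NCF spec above: 128 fabric ports at 800G,
# each optionally broken out into 2 x 400G (hence 256 x 400G / up to 256 leaves).
FABRIC_PORTS_800G = 128
PORT_SPEED_TBPS = 0.8

total_capacity = FABRIC_PORTS_800G * PORT_SPEED_TBPS  # 102.4 Tbps, as stated
breakout_400g = FABRIC_PORTS_800G * 2                 # 256 x 400G connections

print(f"Switching capacity:        {total_capacity:.1f} Tbps")
print(f"400G breakout connections: {breakout_400g} (one per leaf, up to 256 leaves)")
```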

Distributed Disaggregated Chassis for AI (DDC-AI)


Chassis or Clos AI fabric? What about both?

Distributed Disaggregated Chassis for AI (DDC-AI) is the most proven architecture for building an Ethernet-based, open and congestion-free fabric for high-scale AI clusters.

DDC-AI offers:

  • Standard Ethernet
  • AI scheduled fabric – predictable lossless cluster back-end network
  • Proven performance at par with InfiniBand
  • Highest AI scale (up to 32K GPUs at 800Gbps)
  • Up to 30% job completion time (JCT) improvement compared to standard Ethernet Clos (see the illustrative calculation after this list)
  • Over $50M total cost of ownership (TCO) reduction for an 8K-GPU cluster
  • No vendor lock – supports any GPU, any NIC
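
To make the JCT claim above concrete, here is a small illustrative calculation. Only the 30% improvement figure comes from the list above; the cluster size, job duration, and per-GPU-hour cost are hypothetical placeholders chosen to show the arithmetic.

```python
# Illustrative (not measured) savings from a 30% job completion time (JCT)
# improvement. Only the 30% figure comes from the text; all other numbers
# are hypothetical placeholders.
JCT_IMPROVEMENT = 0.30

def gpu_hours_saved(num_gpus: int, baseline_job_hours: float) -> float:
    """GPU-hours saved per training job if JCT drops by JCT_IMPROVEMENT."""
    return num_gpus * baseline_job_hours * JCT_IMPROVEMENT

if __name__ == "__main__":
    gpus = 8_000               # hypothetical 8K-GPU cluster
    job_hours = 240.0          # hypothetical 10-day training job
    usd_per_gpu_hour = 2.0     # hypothetical fully loaded cost per GPU-hour
    saved = gpu_hours_saved(gpus, job_hours)
    print(f"GPU-hours saved per job: {saved:,.0f}")
    print(f"Approximate value:       ${saved * usd_per_gpu_hour:,.0f}")
```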
