DriveNets Network Cloud-AI Fabric Solution

DriveNets Network Cloud-AI offers the highest-performance lossless Ethernet solution for the AI networking back-end fabric. Its performance has been tested and shown to match that of InfiniBand, while using standard Ethernet.

DriveNets Network Cloud-AI:

  • is based on the Distributed Disaggregated Chassis (DDC) scheduled fabric architecture
  • improves job completion time (JCT) by up to 30% compared to standard Ethernet Clos
  • optimizes GPU resource utilization
  • ensures interoperability via standard Ethernet
  • provides the flexibility of vendor choice

ByteDance deployed the world’s first 1K GPGPU production cluster powered by DDC scheduled Ethernet fabric in July 2024. The cluster handles a mixture of inference and training traffic from various applications. ByteDance’s existing operational toolkits, designed for non-scheduled fabrics, were easily ported to this cluster. The cluster has demonstrated excellent performance, as expected, and provided a smooth user experience.

Network Cloud-AI diagram

AI Fabric Overview

AI Fabric and JCT

The AI fabric is the back-end network that interconnects the graphics processing units (GPUs) in a training or inference GPU cluster.

  • Needs to be predictable and lossless
  • Any hiccup in GPU-to-GPU connectivity significantly degrades the performance of the cluster and its workloads in terms of job completion time (JCT); a toy estimate of this effect follows below
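
The sensitivity of JCT to network behavior can be illustrated with a toy model. The sketch below (Python; every figure in it is hypothetical) assumes a synchronous training job in which each iteration ends with a collective operation that stalls whenever the fabric hiccups.

```python
# Toy model of JCT sensitivity to network stalls.
# All figures (iteration times, stall probability, stall length) are
# hypothetical and chosen only to illustrate the effect.
import random

def jct_seconds(iterations, compute_s, comm_s, stall_prob=0.0, stall_s=0.0):
    """JCT of a synchronous job: every iteration pays compute + communication,
    and an iteration hit by a network hiccup pays an extra stall."""
    total = 0.0
    for _ in range(iterations):
        total += compute_s + comm_s
        if random.random() < stall_prob:
            total += stall_s
    return total

random.seed(0)
baseline = jct_seconds(10_000, compute_s=0.20, comm_s=0.05)
degraded = jct_seconds(10_000, compute_s=0.20, comm_s=0.05,
                       stall_prob=0.02, stall_s=0.5)  # 2% of steps stall 0.5 s
print(f"baseline JCT: {baseline / 3600:.2f} h")
print(f"degraded JCT: {degraded / 3600:.2f} h (+{100 * (degraded / baseline - 1):.1f}%)")
```

In this toy run, stalling only 2% of iterations adds roughly 4% to total runtime; on a real cluster that overhead compounds with scale and with the cost of idle GPUs.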

InfiniBand Drawbacks

Traditionally, InfiniBand has been the technology of choice for AI fabric as it provides excellent performance for these kinds of applications.

InfiniBand drawbacks:

  • Practically, a vendor-locked solution (controlled by Nvidia)
  • Relatively expensive
  • Requires a specialized skillset and repeated fine-tuning for each type of workload running on the cluster

Ethernet Challenges

The obvious alternative to InfiniBand is Ethernet. Yet Ethernet is, by nature, a lossy technology; it introduces higher latency and packet loss and, on its own, cannot provide adequate performance for large clusters.
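
To see why loss matters at scale, consider a minimal Monte Carlo sketch (all numbers hypothetical): a collective operation spanning many parallel flows completes only when its slowest flow does, so even a tiny per-flow drop probability inflates completion time as the number of flows grows.

```python
# Toy Monte Carlo: a collective across N parallel flows finishes only when the
# slowest flow does, so a small per-flow drop probability on a lossy fabric
# inflates completion time as the cluster grows. All numbers are hypothetical.
import random

def collective_time_ms(flows, base_ms=1.0, drop_prob=0.001, rto_ms=10.0):
    """Completion time of one collective: the max over flows, where a dropped
    flow pays a retransmission timeout before it completes."""
    return max(base_ms + (rto_ms if random.random() < drop_prob else 0.0)
               for _ in range(flows))

def mean_time_ms(flows, trials=2_000):
    return sum(collective_time_ms(flows) for _ in range(trials)) / trials

random.seed(1)
for flows in (64, 512, 4096):
    print(f"{flows:5d} parallel flows -> mean collective time {mean_time_ms(flows):.2f} ms")
```

With a 0.1% drop rate and a 10 ms recovery penalty, the 64-flow case barely notices the loss while the 4,096-flow case is dominated by it; this straggler effect is what congestion control and scheduling are meant to prevent.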

  • The Ultra Ethernet Consortium (UEC) aims to resolve Ethernet’s drawbacks by adding congestion control and quality-of-service mechanisms to the Ethernet standards.
  • The emerging Ultra Ethernet standard, whose first version release is expected in late 2024, will allow hyperscalers and enterprises to use Ethernet with less performance compromise.

NIC-based solutions

Ultra Ethernet, however, relies on algorithms running on the edges of the fabric, specifically on the smart network interface cards / controllers (SmartNICs) that reside in the GPU servers.

This means:

  • A heavier compute burden on those SmartNICs
  • Higher costs
  • Greater power consumption

For instance, moving from the ConnectX-7 NIC (a more basic NIC, even though it is considered a SmartNIC) to the BlueField-3 SmartNIC (also called a data processing unit, or DPU) translates into roughly 50% higher cost per end device and a threefold increase in power consumption.
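
The fleet-level impact is easy to gauge with back-of-the-envelope arithmetic. The sketch below applies the ratios quoted above (~1.5x cost, ~3x power per end device) to an 8K-GPU cluster; the absolute baselines and the one-NIC-per-GPU attachment are assumptions for illustration, not vendor list prices or datasheet figures.

```python
# Back-of-the-envelope impact of moving every back-end NIC to a DPU-class card.
# COST_FACTOR and POWER_FACTOR reflect the ratios quoted in the text; the
# baseline cost/power figures and the one-NIC-per-GPU assumption are
# hypothetical, for illustration only.
NICS = 8_192            # assumed: one back-end NIC per GPU in an 8K-GPU cluster
BASE_COST_USD = 2_000   # assumed cost of the basic SmartNIC
BASE_POWER_W = 25       # assumed power draw of the basic SmartNIC
COST_FACTOR = 1.5       # ~50% higher cost per end device
POWER_FACTOR = 3.0      # ~3x power consumption

extra_cost_musd = NICS * BASE_COST_USD * (COST_FACTOR - 1.0) / 1e6
extra_power_kw = NICS * BASE_POWER_W * (POWER_FACTOR - 1.0) / 1e3
print(f"extra capex across the cluster: ${extra_cost_musd:.1f}M")
print(f"extra power draw:               {extra_power_kw:.0f} kW")
```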

This is also the case with another Nvidia alternative to InfiniBand, the Spectrum-X solution (based on its Spectrum-4 and future Spectrum-6 ASICs):

  • Another Ethernet-based solution (like that of the UEC) that resolves congestion at the end devices
  • Also locked to Nvidia as a vendor

DDC Scheduled Fabric – The Best-Performing Solution

The best solution, in terms of both performance and cost, is the Distributed Disaggregated Chassis (DDC) scheduled fabric:

  • Not vendor-locked
  • Does not require heavy lifting of SmartNICs
  • Makes the AI infrastructure lossless and predictable without requiring additional technologies to mitigate congestion (a toy comparison of scheduled vs. hashed load balancing follows this list)
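
The load-balancing idea behind a scheduled fabric can be shown with a small comparison, sketched below in Python. Per-flow ECMP hashing pins each whole flow to one link, so unlucky hashes stack elephant flows onto the same link; a cell-based scheduled fabric sprays every flow across all links, keeping per-link load flat. The link and flow counts are hypothetical, and this is a sketch of the concept, not of any vendor's scheduler.

```python
# Toy comparison: per-flow ECMP hashing vs. cell-based scheduled spraying.
# All counts and rates are hypothetical; this illustrates the concept only.
import random

LINKS = 16        # fabric links between a leaf and the fabric elements
FLOWS = 32        # long-lived "elephant" flows
FLOW_GBPS = 100   # each flow's rate

def ecmp_max_link_load():
    """Per-flow hashing: each whole flow lands on one randomly hashed link."""
    load = [0.0] * LINKS
    for _ in range(FLOWS):
        load[random.randrange(LINKS)] += FLOW_GBPS
    return max(load)

def scheduled_per_link_load():
    """Cell spraying: every flow is chopped into cells spread over all links."""
    return FLOWS * FLOW_GBPS / LINKS

random.seed(2)
worst_hashed = max(ecmp_max_link_load() for _ in range(1_000))
print(f"worst hashed link over 1,000 trials: {worst_hashed:.0f} Gbps")
print(f"scheduled fabric load on every link: {scheduled_per_link_load():.0f} Gbps")
```

In this toy run, hashing can push several hundred Gbps onto a single link while the scheduled fabric keeps every link at the same 200 Gbps; hot spots like these cause the drops and retransmissions that hurt JCT.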

Network Cloud-AI Solution Benefits


DriveNets Network Cloud-AI supports up to 32,000 GPUs (with 800Gbps connections) in a single cluster. Using DDC cell-based scheduled fabric technology, the solution delivers InfiniBand-level reliable connectivity, low latency, and practically zero jitter; it maximizes network utilization and improves JCT by up to 30% compared to other Ethernet solutions. Moreover, it does not require expensive and power-hungry DPUs.
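
The scale figures above translate into a very large amount of fabric capacity; a quick sanity check is sketched below, assuming one 800Gbps back-end port per GPU (an assumption of ours, for illustration only).

```python
# Quick scale check of the figures quoted above (32,000 GPUs, 800Gbps each).
# The one-800G-port-per-GPU attachment is an assumption for illustration.
GPUS = 32_000
PORT_GBPS = 800

aggregate_tbps = GPUS * PORT_GBPS / 1_000
print(f"aggregate GPU-facing fabric capacity: {aggregate_tbps:,.0f} Tbps "
      f"({aggregate_tbps / 1_000:.1f} Pbps)")
```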

DriveNets Network Cloud-AI offers an open architecture with high performance, adapting to changing models and network requirements. It ensures interoperability through Ethernet and remains vendor-agnostic across all hardware domains, allowing the use of any GPU and SmartNIC/DPU.

DriveNets is the sole vendor with a proven implementation of a large-scale scheduled fabric. Scheduled fabric is recognized as the highest-performance solution by both Arista (DES) and Cisco (DSF). While their solutions are at early stages of deployment, DriveNets Network Cloud has powered the world’s largest DDC network for more than five years. DriveNets has also demonstrated remarkable AI workload performance and scalability in production implementations and field trials conducted with top hyperscalers.

Distributed Disaggregated Chassis for AI (DDC-AI)


Distributed Disaggregated Chassis for AI (DDC-AI) is the most proven architecture for building an Ethernet-based, open, and congestion-free fabric for high-scale AI clusters.

DDC-AI offers:

  • Standard Ethernet
  • AI scheduled fabric – predictable lossless cluster back-end network
  • Proven performance at par with InfiniBand
  • Highest AI scale (up to 32K GPUs at 800Gbps)
  • Up to 30% JCT improvement compared to standard Ethernet Clos
  • Over $50M total cost of ownership (TCO) reduction for an 8K-GPU cluster
  • No vendor lock – supports any GPU, any NIC
