May 23, 2023

VP of Product Marketing

AI Back-End Network Bottleneck Resolved by Network Cloud-AI: The Next Network Bottleneck – Part 3

As established in our previous blog post, the AI back-end network has some unique requirements, derived from its huge scale (thousands of 400/800Gbps ports in a single cluster), its online nature (which benefits from using the same technology at the front end and the back end), and its goal of optimizing compute (GPU) resources to yield the fastest JCT (job completion time).


If we try to categorize those requirements, according to which we can evaluate different solutions, the following three categories emerge as essential:

Architectural flexibility

  • Multiple and diverse applications
  • Support for growth
  • Web connectivity (unlike isolated HPC)

High performance at scale

  • Support for growth
  • Huge-scale GPU deployment larger than chassis limit
  • Fastest JCT via:
    • Resilience
    • High availability
    • Minimal blast radius
    • Predictable, lossless, low-latency and low-jitter connectivity – reducing GPU idle cycles
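The link between GPU idle cycles and JCT can be made concrete with a simple back-of-the-envelope model (the numbers below are illustrative assumptions, not measurements from this post): if the GPUs spend a fraction of wall time waiting on the network, the same amount of compute takes proportionally longer.

```python
# Illustrative sketch: how network-induced GPU idle time inflates
# job completion time (JCT). Numbers are hypothetical.

def jct(compute_hours: float, comm_fraction: float) -> float:
    """Estimate wall-clock JCT when a fraction of total time is
    spent idle, waiting on network communication."""
    assert 0 <= comm_fraction < 1
    return compute_hours / (1 - comm_fraction)

baseline = jct(100, 0.10)  # lossless, predictable fabric: ~10% comm wait
lossy    = jct(100, 0.30)  # congested/lossy fabric: ~30% comm wait

print(f"baseline JCT: {baseline:.1f} h, lossy JCT: {lossy:.1f} h")
# The same 100 GPU-hours of compute stretch from ~111 h to ~143 h
# of wall time when idle cycles grow from 10% to 30%.
```

The model ignores overlap of compute and communication, so it overstates the penalty for well-pipelined workloads; the direction of the effect, however, is what motivates a lossless, low-jitter fabric.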

Trusted ecosystem

  • Standard interfaces allowing multi-vendor mix-and-match – avoiding HW/ASIC vendor lock-in
  • Field-proven interconnect solutions – reducing risk

Industry solutions for AI and their drawbacks

There are several notable industry solutions for AI back-end networking:

Non-Ethernet (e.g., Nvidia’s InfiniBand)

This semi-proprietary, non-Ethernet solution provides excellent performance as a lossless, predictable architecture, yielding good JCT performance. On the other hand, it effectively creates vendor lock-in, at both the networking level and the GPU level. It also lacks the flexibility to be promptly tuned for different applications, requires a unique skill set to operate, and results in an isolated design that cannot be reused in the adjacent front-end network.


Ethernet – Clos architecture

Ethernet is the de facto standard in networking, which makes it very easy to plan and deploy. When built in a Clos architecture (with ToR leaves and chassis-based spines), it is practically unlimited in size. On the other hand, its performance degrades as the scale grows, and its inherent latency, jitter and packet loss cause GPU idle cycles and reduce JCT performance. It is also complex to manage at high scale, as each node (leaf or spine) is managed separately.
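To make the scale-versus-management trade-off concrete, here is a rough non-blocking 2-tier leaf-spine sizing sketch. The radices (a 64-port leaf and a 512-port chassis spine) are hypothetical figures for illustration, not numbers from this post: the fabric reaches many thousands of GPU-facing ports, but every leaf and spine remains a separately managed node.

```python
# Back-of-the-envelope sizing for a non-blocking 2-tier leaf-spine Clos,
# with hypothetical switch radices (real designs add oversubscription,
# multi-plane spines, etc.).

def clos_capacity(leaf_radix: int, spine_radix: int):
    """Return (GPU-facing ports, separately managed nodes) for the fabric."""
    down = leaf_radix // 2        # leaf ports facing GPUs
    spines = leaf_radix - down    # one uplink from each leaf to each spine
    leaves = spine_radix          # each spine dedicates one port per leaf
    return leaves * down, leaves + spines

gpu_ports, nodes = clos_capacity(leaf_radix=64, spine_radix=512)
print(gpu_ports, nodes)  # 16384 GPU-facing ports, 544 separately managed nodes
```

Even in this modest example, 544 independently managed switches illustrate the operational burden the post attributes to Clos at scale.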


Ethernet – Clos architecture with enhanced telemetry

Enhanced telemetry can somewhat improve the performance of a Clos-architecture Ethernet solution by monitoring buffer and performance status across the network and proactively policing traffic. That said, such a solution still lacks the performance required for a large-scale AI network.


Ethernet – single chassis

A chassis resolves the inherent performance issues and complexity of the multi-hop Clos architecture, as it reduces the number of Ethernet hops between any two GPUs to one. However, it cannot scale to the required size, and it also poses a complex cabling-management challenge.


Ethernet – Distributed Disaggregated Chassis (DDC)

Finally, the DDC solution offers the best of all worlds. It creates a single-Ethernet-hop architecture that is non-proprietary, flexible and scalable (up to 32,000 ports of 800Gbps). This yields workload JCT efficiency, as it provides lossless network performance while maintaining the easy-to-build Clos physical architecture. In this architecture, the leaves and spines are all part of the same Ethernet entity, and the fabric connectivity between them is cell-based, scheduled and guaranteed.
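A quick arithmetic check puts the quoted scale figure in perspective:

```python
# Aggregate capacity implied by the quoted DDC scale:
# 32,000 ports at 800 Gbps each.
ports, gbps = 32_000, 800
total_tbps = ports * gbps / 1_000
print(f"{total_tbps:,.0f} Tbps aggregate")  # 25,600 Tbps (25.6 Pbps) GPU-facing
```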


DDC solution: the best fit for AI networking requirements

The following table summarizes the different solutions above according to the categories defined earlier:

[Table: the solutions above evaluated against architectural flexibility, high performance at scale, and trusted ecosystem]

It is easy to see that the DDC solution is the best fit for AI networking requirements. While there are several vendors that claim to have a DDC-based solution, DriveNets Network Cloud-AI is the only one that is available and field-proven.

DriveNets AI Networking Solution

DriveNets Network Cloud for AI is the most innovative networking solution available today for AI. It maximizes the utilization of AI infrastructures and substantially lowers their cost, in a standards-based implementation that does not sacrifice vendor interoperability.
