If we try to categorize those requirements, so that different solutions can be evaluated against them, three essential categories emerge:
Architectural flexibility
- Multiple and diverse applications
- Support of growth
- Web connectivity (unlike isolated HPC)
High performance at scale
- Support of growth
- Huge-scale GPU deployment larger than chassis limit
- Fastest JCT via:
  - Resilience (high availability, minimal blast radius, etc.)
  - Predictable lossless, low-latency and low-jitter connectivity – reducing GPU idle cycles
Trusted ecosystem
- Standard interfaces allowing multi-vendor mix-and-match – avoiding HW/ASIC vendor lock-in
- Field-proven interconnect solutions – reducing risk
Industry solutions for AI and their drawbacks
There are several notable industry solutions for AI back-end networking:
Non-Ethernet (e.g., Nvidia’s InfiniBand)
This semi-proprietary, non-Ethernet solution provides excellent performance as a lossless, predictable architecture, which leads to good JCT performance. On the other hand, it practically leads to vendor lock-in, both at the networking level and at the GPU level. It also lacks the flexibility to be promptly tuned for different applications, requires a unique skill set to operate, and creates an isolated design that cannot be reused in the adjacent front-end network.
Ethernet – Clos architecture
Ethernet is the de facto standard in networking, which makes it very easy to plan and deploy. When built in a Clos architecture (with ToR leaves and chassis-based spines), it is practically unlimited in size. On the other hand, its performance degrades as the scale grows, and its inherent latency, jitter and packet loss cause GPU idle cycles and reduce JCT performance. It is also complex to manage at high scale, as each node (leaf or spine) is managed separately.
Ethernet – Clos architecture with enhanced telemetry
Enhanced telemetry can somewhat improve the performance of a Clos-architecture Ethernet solution by monitoring buffer/performance status across the network and proactively policing traffic. That said, such a solution still lacks the performance required for a large-scale AI network.
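As a rough illustration of what telemetry-driven "proactive policing" can look like, here is a minimal watermark decision with hysteresis, similar in spirit to ECN-style congestion signaling. All names and thresholds are hypothetical, not any vendor's API.

```python
# Hypothetical sketch of telemetry-driven policing: rate-limit senders when
# a reported buffer fill crosses a high watermark, release below a low one.
# Thresholds are illustrative only.

HIGH_WATERMARK = 0.8  # start policing above 80% buffer occupancy
LOW_WATERMARK = 0.5   # stop policing below 50% buffer occupancy

def should_police(buffer_fill: float, currently_policed: bool) -> bool:
    """Hysteresis keeps the decision stable between the two watermarks."""
    if buffer_fill >= HIGH_WATERMARK:
        return True
    if buffer_fill <= LOW_WATERMARK:
        return False
    return currently_policed

print(should_police(0.9, False))  # True  - buffer nearly full, start policing
print(should_police(0.6, True))   # True  - between watermarks, hold state
print(should_police(0.4, True))   # False - buffer drained, release
```

The limitation noted above is inherent to this approach: the reaction comes after congestion is already visible in the buffers, so at large scale transient bursts can still cause loss and GPU idle cycles before policing takes effect.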
Ethernet – single chassis
A chassis resolves the inherent performance issues and complexity of the multi-hop Clos architecture, as it reduces the number of Ethernet hops from any GPU to any GPU to one. However, it cannot scale as required, and also poses a complex cabling management challenge.
Ethernet – Distributed Disaggregated Chassis (DDC)
Finally, the DDC solution offers the best of all worlds. It creates a single-Ethernet-hop architecture that is non-proprietary, flexible and scalable (up to 32,000 ports of 800Gbps). This yields workload JCT efficiency, as it provides lossless network performance while maintaining the easy-to-build Clos physical architecture. In this architecture, the leaves and spine are all the same Ethernet entity, and the fabric connectivity between them is cell-based, scheduled and guaranteed.
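The scale figure quoted above translates into a concrete aggregate capacity; the quick arithmetic below checks it (the 32,000-port and 800Gbps numbers come from the text, the unit conversion is mine).

```python
# Aggregate GPU-facing capacity of the quoted DDC scale figure.
ports = 32_000          # quoted maximum port count
port_speed_gbps = 800   # quoted port speed in Gbps

total_gbps = ports * port_speed_gbps
print(total_gbps)              # 25600000 Gbps
print(total_gbps / 1_000_000)  # 25.6 Pbps of single-hop fabric capacity
```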
DDC solution: the best fit for AI networking requirements
The following table summarizes the different solutions above according to the categories defined earlier:
It is easy to see that the DDC solution is the best fit for AI networking requirements. While there are several vendors that claim to have a DDC-based solution, DriveNets Network Cloud-AI is the only one that is available and field-proven.
DriveNets Network Cloud-AI is the most innovative networking solution available today for AI. It maximizes the utilization of AI infrastructure and substantially lowers its cost, in a standards-based implementation that does not give up vendor interoperability.