March 20, 2024

Founder & Chief Analyst, Futuriom

Why DriveNets Can Lead in Ethernet-based AI Networking

AI is dominating discussions in the cloud infrastructure markets, and for good reason: it may be the largest new technology cycle since the arrival of the smartphone.

This secular trend will have a large impact on technology infrastructure, because the demands of AI are different. AI workloads and data exchange increase the demand for lossless, low-jitter, low-latency, high-bandwidth networking. They also drive changes to networking architectures, as AI workloads change the dynamics of how data flows among servers and AI processors.


How do you support the high-performance needs of AI?

InfiniBand, the specialized high-bandwidth interconnect technology controlled by Nvidia, is frequently used with AI systems but is considered almost proprietary. Ethernet is emerging as the preferred choice for AI networking (even Nvidia has hedged its bets by building an Ethernet-based solution), but it will require some optimization to support the high-performance needs of AI. That was the goal behind the Ultra Ethernet Consortium (UEC), established by leading industry players such as AMD, Broadcom, Cisco, Intel, Meta, and others to pave a path toward an Ethernet-based, congestion-free fabric. But if you are building AI infrastructure today and you are looking for an Ethernet-based solution, the number of relevant options available right now is very small.

DriveNets offers one of those options. Its Ethernet-based solution works today in high-scale AI infrastructures and is based on the company's field-proven scheduled fabric, which is part of the DriveNets Network Cloud solution.

Ethernet has a big opportunity in AI

One of the biggest potential impacts of AI on networks is the need for new architectures. The tried-and-true leaf-and-spine architecture popularized by webscale providers will not be abandoned, but it is not optimized for AI.

Expect this debate on AI networking architectures to expand and draw demand for a wider variety of solutions. NVIDIA, the leader in AI GPUs and systems, has developed an advanced, vertically integrated system for Large Language Models (LLMs), and it favors integrating this with systems it controls, such as the NVLink interconnect and InfiniBand technology, for connecting AI processors and servers.

But these systems address just one application for AI, and the reality is we don’t know how the market will evolve. As additional use cases develop, some customers will look at alternative solutions with more open characteristics based on open standards and commercial off-the-shelf (COTS) technology. As Ethernet standards evolve for AI, there will be many opportunities to support specialized LLMs as well as distributed edge AI.

The history of networking and other markets has always presented see-saw battles between integrated, proprietary systems and standards-based solutions. Standards-based solutions often win because of their characteristics of driving better economics and creating larger markets.

Wall Street analyst Simon Leopold of financial firm Raymond James recently wrote a comprehensive research note on AI system evolution titled "AI Bandwidth Opportunity: Impact on the Optical Market." In that note, Leopold estimates that scale-out architectures are likely to play a greater role as Ethernet solutions mature, and that they could become the go-to architecture for AI inferencing. Inferencing, which is especially relevant to the edge and access markets, holds the potential to be a market equal to or even larger than the market for LLMs.

Ethernet has a long history of success, so it would probably be a mistake to discount it.

DriveNets' Ethernet-based AI scheduled fabric

Ethernet will need to be adapted to support AI, and this is where it gets interesting. DriveNets has already started addressing the market with Network Cloud-AI, initially announced in May 2023. Network Cloud-AI has the advantage of being based on open networking standards while being optimized for AI and, most importantly, being ready for field deployment today.

DriveNets says that using a traditional Ethernet Clos architecture, which aggregates switching ports into a networking backbone, results in performance hits. Instead, it uses a scheduled, cell-based fabric on its infrastructure backbone (i.e., the connectivity between leaf and spine nodes). This architecture is based on the Distributed Disaggregated Chassis (DDC) specification supported by the Open Compute Project (OCP), on which DriveNets has built its Network Cloud solution. The OCP DDC offers an open architecture for a massively scalable software router or switch that isn't dependent on a single proprietary hardware chassis. The DriveNets Network Cloud-AI series runs on white boxes based on Broadcom's Jericho2C+ Ethernet switch ASIC, creating a distributed, cloud-based network for demanding generative AI workloads.
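To illustrate why a scheduled, cell-based fabric balances traffic better than hash-based Ethernet forwarding, here is a minimal, hypothetical toy model (not DriveNets code; all parameters are made up). It compares per-flow ECMP-style hashing, typical of a Clos Ethernet fabric carrying a few large AI "elephant" flows, with spraying fixed-size cells evenly across the fabric links:

```python
import random
from collections import Counter

# Toy parameters (illustrative only, not vendor specifications)
NUM_UPLINKS = 8        # fabric-facing links on one leaf node
NUM_FLOWS = 16         # large, long-lived AI training flows
CELLS_PER_FLOW = 1000  # each flow split into equal-size cells

random.seed(7)

# Per-flow ECMP: every cell of a flow is pinned to one hashed uplink,
# so hash collisions between elephant flows overload some links.
ecmp_load = Counter()
for flow in range(NUM_FLOWS):
    uplink = random.randrange(NUM_UPLINKS)  # stand-in for a flow hash
    ecmp_load[uplink] += CELLS_PER_FLOW

# Cell spraying: each cell is distributed round-robin across all uplinks,
# so load stays nearly even regardless of flow sizes.
spray_load = Counter()
for cell_id in range(NUM_FLOWS * CELLS_PER_FLOW):
    spray_load[cell_id % NUM_UPLINKS] += 1

def imbalance(load):
    """Max-to-mean uplink load ratio: 1.0 means perfectly balanced."""
    values = [load.get(i, 0) for i in range(NUM_UPLINKS)]
    return max(values) / (sum(values) / NUM_UPLINKS)

print(f"Per-flow ECMP max/mean uplink load: {imbalance(ecmp_load):.2f}")
print(f"Cell spraying max/mean uplink load: {imbalance(spray_load):.2f}")
```

In this sketch the hashed approach typically leaves some uplinks carrying well above the average load while others sit idle, whereas cell spraying keeps the ratio at 1.0; the hot links are what show up as congestion, jitter, and longer job completion times in real fabrics.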

DriveNets has also employed Broadcom's newer Jericho3-AI chip and Ramon 3 cell-based switching component, giving DriveNets' DDC the ability to support 32,000 GPUs in a single cluster, each connected with an 800-Gb/s Ethernet networking port. The scheduled fabric provides perfect load balancing and scheduling across the cluster, which yields optimized performance, in terms of Job Completion Time (JCT), for deep AI training workloads. This fabric performs better than traditional Ethernet fabric switches, with lower latency, lower jitter, and better reliability. Furthermore, Network Cloud-AI doesn't need to be adjusted for specific LLMs, which DriveNets says can be a requirement with InfiniBand.
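As a rough sanity check on the scale those numbers imply, the aggregate access bandwidth of such a cluster can be worked out directly (simple arithmetic using only the GPU count and port speed cited above):

```python
# Back-of-envelope scale check using the figures cited above:
# 32,000 GPUs, each attached via one 800-Gb/s Ethernet port.
gpus = 32_000
port_gbps = 800

aggregate_tbps = gpus * port_gbps / 1_000  # total access bandwidth in Tb/s
print(f"Aggregate access bandwidth: {aggregate_tbps:,.0f} Tb/s "
      f"(~{aggregate_tbps / 1_000:.1f} Pb/s)")
```

That works out to roughly 25,600 Tb/s (about 25.6 Pb/s) of access bandwidth that the fabric must carry without congestion.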

An independent datacenter simulation lab, Scala Computing, set up a range of scenarios comparing Network Cloud-AI's DDC with a leaf-spine Ethernet fabric. The results: DriveNets' solution showed 10% to 30% better job completion time (JCT) in a simulation of an AI training cluster with 2,000 GPUs. The improved JCT, DriveNets said, can translate into roughly 100% return on investment (ROI) on the networking spend: since networking represents about 10% of total system cost, a roughly 10% JCT improvement frees up compute capacity worth about as much as the entire network.
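A short worked example of that ROI logic, as I read it (the dollar figure is a made-up assumption; the 10% shares come from the claims above):

```python
# Illustrative ROI arithmetic; the total cost is an arbitrary assumption,
# the 10% networking share and 10% JCT gain are the figures cited above.
total_cluster_cost = 100_000_000   # assumed total cluster spend, in dollars
networking_share = 0.10            # networking as a fraction of total cost
jct_improvement = 0.10             # low end of the 10-30% JCT improvement

networking_cost = total_cluster_cost * networking_share
# Finishing jobs ~10% faster is roughly equivalent to gaining ~10% more
# usable cluster capacity for the same spend.
value_of_extra_capacity = total_cluster_cost * jct_improvement

roi_on_network = value_of_extra_capacity / networking_cost
print(f"Networking cost:          ${networking_cost:,.0f}")
print(f"Value of 10% faster JCT:  ${value_of_extra_capacity:,.0f}")
print(f"ROI on networking spend:  {roi_on_network:.0%}")  # -> 100%
```

Under these assumptions, the value of the reclaimed compute capacity equals the entire networking investment, which is where the "100% ROI" framing comes from; at the 30% end of the JCT range the implied return would be correspondingly higher.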

These test results show that Ethernet can indeed become a viable standard for AI networking implementations. With AI infrastructure still at an early stage of growth, one should expect Ethernet to play an expanding role in supporting AI.

This ongoing evolution of AI networking technologies and architectures will be one of the most exciting elements to watch as the AI wave unfolds.

FAQs for AI Networking

  • What technology is emerging to address the challenges of AI networking?
    Ethernet is emerging as the preferred choice for AI networking (even Nvidia has hedged its bets by building an Ethernet-based solution), but it will require some optimization to support the high-performance needs of AI.
  • What Ethernet solution does DriveNets offer?
    DriveNets offers an Ethernet-based solution that works today in high-scale AI infrastructures. It is based on the company's field-proven scheduled fabric, which is part of the DriveNets Network Cloud solution.
  • What chipset does DriveNets use?
    DriveNets employs Broadcom's Jericho3-AI chip and Ramon 3 cell-based switching component, giving DriveNets' DDC the ability to support 32,000 GPUs in a single cluster, each connected with an 800-Gb/s Ethernet networking port.

Additional Resources for AI Networking

White paper: Distributed Disaggregated Chassis (DDC) as an Effective Interconnect for Large AI Compute Clusters