April 11, 2024

Head of Product Strategy

What’s happening with AI Networking & the UEC?

The Ultra Ethernet Consortium (UEC), part of the Linux Foundation’s Joint Development Foundation, was formed in July 2023 by a group of nine mega-companies with the purpose of building a complete Ethernet-based solution for AI networking.

In October 2023, the UEC added dozens of companies to its member list, making it the fastest-growing consortium in the history of the Linux Foundation.

So, what is the UEC all about and where is it going? Here’s the personal view of a 25-year veteran of the telecommunications industry.

To learn more download the white paper
Utilizing Distributed Disaggregated Chassis (DDC) for Back-End AI Networking Fabric

What is the Ultra Ethernet Consortium about?

The nine founding members are Arista, AMD, Broadcom, Cisco, Eviden, HPE, Intel, Meta, and Microsoft: the king and queen of CPU and GPU challengers, the king, prince and duke of Ethernet networking, a prime integrator to tie it all up, probably the two largest consumers of Ethernet equipment, and an international French player to make it a global effort. Any two of these names in a consortium would probably be a good indication that it’s on the right path. Such a select group creates so much gravity that fear of missing out (FOMO) starts to kick in.

But seriously, there’s a real challenge here to tackle. And there’s a lot to gain, both for vendors and for consumers of the new technology.

Ethernet has always been the protocol of choice for telecommunications. Esoteric technologies like ATM, frame relay, Fibre Channel, InfiniBand and others had their 15 minutes of glory in certain niche applications. Yet it was always Ethernet that stepped up, evolved and outpaced these alternatives to maintain its reign as the king of networking.

What we see in artificial intelligence (AI) is something different. InfiniBand, a niche protocol implemented by a single vendor over the last decade, has within a year grown out of its niche by an order of magnitude and is not showing any signs of slowing down. It’s safe to say that Ethernet would eventually take the right steps towards overtaking InfiniBand in the AI space. Yet at Ethernet’s historical pace of change, it could take a decade to catch up. Given the speed of the AI market, I would not even risk guessing what will happen 10 years from now. So there is a clear call to accelerate Ethernet evolution – and the UEC is set on doing exactly that.

How is the Ultra Ethernet Consortium going?

From my “seasoned veteran” point of view, dozens of engineers and architects, including some of the industry’s best, are putting in a real effort. The UEC is fully dedicated to building something that really works, is feasible to implement, is easily testable, solves the right problems, and holds up against both the known and the somewhat unknown future. All opinions are heard, every idea is investigated, and no stone is left unturned as the working groups settle on the right definitions.

From the perspective of a software (SW) company, which is free to work with any ASIC vendor, original design manufacturer (ODM) or solution integrator, there is little concern about the definitions at this stage. Anything can be done in SW. However, much should be done in hardware if the UEC wants to achieve its performance and scale targets. I would expect discussions to be a bit more vocal from the hardware vendors, as each is trying to pull the definitions in its direction.

Eventually, definitions will be put in place and ASIC vendors will follow through with implementation.

Well, so far so good. We seem to have the right companies allocating the right people who are working in the right direction. That gives us every reason to believe that we will see the UEC bringing Ethernet “back to glory” within the AI networking space in no time.

The concern is that while the consortium is very large, it is still missing a few key players…

What is the Ultra Ethernet Consortium not?

The UEC is not a standardization body. As I noted in the previous post, introducing a new Ethernet spec through the IETF or ITU could take a decade. Standardization bodies do, however, have an advantage over the UEC – their work is conducted in the open and not behind closed doors.

You might assume that any company wishing to become a member of the UEC is eligible. It could apply and would likely be granted membership. And yet, the UEC is aiming to solve AI networking – but the #1 vendor doing compute for AI workloads is missing. The UEC is all about Ethernet – but the #1 Ethernet NIC vendor is missing. The UEC is targeting data center Ethernet – but the #2 switch ASIC vendor is missing. These companies are Mellanox and Nvidia, and the reason they are absent is that they are both Nvidia.

Dubbed “the anti-Nvidia” consortium by several bloggers, the UEC has taken on another challenge, a political one in this case, of relieving the grip that Nvidia holds over the AI industry. As hard as it is to invent a new Ethernet spec, this “political putsch” is probably the biggest challenge facing the UEC.

Nvidia has business relationships with every one of the nine founding companies of the UEC and with many of the additional members as well. Whatever actions Nvidia is taking to create this dominance are not frowned upon by any of these companies as they are considered good business decisions. It’s just that Nvidia’s dominance is so extreme, and the financial stakes so high, that the industry cannot allow this to continue.

So when will the Ultra Ethernet Consortium be implemented?

All definitions need to be settled. Every dispute needs to be ironed out. Implementation in silicon will require a full cycle, with a 6-month penalty for manufacturing alone. Then come SW, network operating systems (NOSs), control and management systems, testing, and productization. I don’t want to play the role of a prophet, but I can estimate that we’ll see hundreds of thousands of GPUs deployed before we see the first implementation of UEC specs in production. I am not being pessimistic – just look at Nvidia’s earnings reports and see for yourself.

Now what for Ethernet?

In line with the strategy to break the monopoly over the long term, tactics need to follow that will fill in the gaps for the coming years. Ethernet components do exist from various vendors besides Nvidia.

Grouping such components into a solution could mean some level of compromise on performance, solution tuning, management overhead, or skillset gaps. The most painful compromise appears to be performance. A GPU utilized at only 50% (for whatever reason) effectively costs as much as two GPUs for the same useful work. And with the soaring GPU prices of this monopolized industry, this performance attribute grabs the most attention.
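The arithmetic behind the "50% utilization costs like two GPUs" point can be sketched in a few lines of Python. This is only an illustration: the function name and the price figure are hypothetical placeholders, not vendor data. The single assumption is that the cost of useful work scales with the inverse of utilization.

```python
def effective_gpu_cost(price: float, utilization: float) -> float:
    """Cost of one fully utilized GPU's worth of work.

    price       -- purchase price of a single GPU (hypothetical figure)
    utilization -- fraction of time the GPU does useful work, in (0, 1]
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    return price / utilization


# Hypothetical placeholder price, chosen only to make the ratio visible.
HYPOTHETICAL_GPU_PRICE = 30_000.0

full = effective_gpu_cost(HYPOTHETICAL_GPU_PRICE, 1.0)
half = effective_gpu_cost(HYPOTHETICAL_GPU_PRICE, 0.5)
print(f"ratio at 50% utilization: {half / full:.1f}x")
```

Whatever the actual price, halving utilization doubles the cost of the work delivered, which is why network-induced GPU idle time draws so much attention.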

One Ethernet alternative to InfiniBand stands out when it comes to performance. Often referred to as distributed scheduled fabric (DSF), it is a Clos topology consisting of packet forwarders (acting as top-of-rack, or ToR, switches) and fabric elements (acting as spines). It was originally defined about six years ago at the Open Compute Project (OCP) under the name of distributed disaggregated chassis (DDC). It has been deployed in AT&T’s core network, so if you are reading this post in America you are probably using it.
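As a rough illustration of the two-tier Clos shape described above, here is a minimal Python sketch. It assumes the usual leaf-spine wiring in which every packet forwarder connects to every fabric element; that wiring, the device names, and the function names are assumptions made for the example and are not taken from the OCP definition.

```python
from itertools import product


def build_clos(num_forwarders: int, num_fabrics: int) -> set[tuple[str, str]]:
    """Return the links of a two-tier Clos as (forwarder, fabric) pairs.

    Assumption: each packet forwarder (ToR role) has one link to each
    fabric element (spine role).
    """
    forwarders = [f"pf{i}" for i in range(num_forwarders)]
    fabrics = [f"fab{j}" for j in range(num_fabrics)]
    return set(product(forwarders, fabrics))


def paths_between(links: set[tuple[str, str]], src: str, dst: str) -> list[str]:
    """List the fabric elements that can carry traffic from src to dst."""
    via_src = {fab for (pf, fab) in links if pf == src}
    via_dst = {fab for (pf, fab) in links if pf == dst}
    return sorted(via_src & via_dst)


links = build_clos(num_forwarders=4, num_fabrics=3)
print(len(links))                           # 4 forwarders x 3 fabrics = 12 links
print(paths_between(links, "pf0", "pf3"))   # one path per fabric element
```

In this wiring, any two packet forwarders are always exactly two hops apart, and the number of parallel paths between them equals the number of fabric elements.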

So why is the UEC aiming to define performant Ethernet if it already exists? While DDC/DSF is well defined under OCP, it relies on an internal fabric interface between ToR and spine that is unique to the ASIC vendor (and there is more than one vendor here). The solution is Ethernet on the outside: it can interface with any NIC running Ethernet, peer with any router over IP routing protocols, and interact with any management platform over standard interfaces. Still, there is some vendor “ego” under the hood that the UEC is trying to “clean up.”

Deploy alternative Ethernet solutions today!

The UEC has brought together an exceptional assembly of professionals around the worthy cause of defining an Ethernet flavor that meets the demands of today’s AI networks, while putting an end to the compromises of underperforming or vendor-locked networks.

While the industry is pursuing this challenging target, which can easily take a couple of years, high-performing Ethernet-based solutions do exist and are available for deployment today. This doesn’t mean that the UEC is irrelevant; rather, the industry can and should deploy alternative solutions today, ensuring that an ecosystem will be waiting for the UEC once it starts introducing its deliverables.

Learn more about DriveNets’ DDC/DSF solution here.
