September 20, 2023

VP of Product Marketing

Panel Recap: Next Generation System Design for Maximizing ML Performance

Last week, I attended the AI Hardware & Edge AI Summit in Santa Clara. I was honored to participate in the “Next Generation System Design for Maximizing ML Performance” panel, together with Drew Matter from Mikros Technologies, Albert Chen from Amphenol, and Greg Stover from Vertiv.


At first glance, this group of people and vendors may seem unrelated. Yet there is a significant commonality across the diverse fields in which we work, which made the panel lineup insightful, interesting and fun. In case you did not have a chance to be there, I’d like to share my main takeaways from the panel.

First, to give you a bit of context, here’s the panel’s abstract: 

Hyperscalers, cloud providers, industrial giants and large enterprises are in the midst of an AI arms race, investing serious time, effort and budget into building larger and larger AI/ML clusters. Building those enormous supercomputers is not only about stacking thousands, or even tens of thousands of GP-GPUs. Under the hood there are many infrastructure elements that, while accounting for <20% of the cluster total cost, dramatically enhance or limit the overall performance, cost, utilization and ultimate value of those computing achievements. Join us for a tips and tricks session on how to carefully build those make-or-break elements, which include power, cooling, cabling, racks, connectors and connectivity. 

Here are my main takeaways from the panel session.

A surprising networking bottleneck 

There is an underlying bottleneck in AI infrastructure that is gaining more and more attention. It has to do with everything that “feeds the beast.” When building AI compute clusters of hundreds, thousands or even tens of thousands of general-purpose GPUs (or any other AI compute chipset), most of the attention and budget go to those compute elements. Such elements (and specifically, Nvidia’s GPUs) are in short supply today, which seems like the major bottleneck in the evolution of AI and ML. It turns out, though, as more and more hyperscalers point out, that the real bottleneck may lie elsewhere: in the network.

Several underlying components are essential for bringing up those massive compute capabilities. The ones discussed in our panel were power, cooling, cabling and networking. The common factor for all those elements is that, while they are a minor part of the total cost of an entire AI infrastructure (networking, for example, is typically around 10% of the total cost), they can affect the performance, or even the feasibility, of the entire project. For example, an extremely expensive GPU is a waste of money if it stands idle awaiting networking resources, or if it cannot be powered or cooled and sits in a warehouse instead.
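To make the idle-GPU point concrete, here is a minimal back-of-the-envelope sketch in Python. The dollar figure and idle fraction are illustrative assumptions, not numbers from the panel; the point is only that network-induced idle time directly inflates the effective cost of every useful GPU-hour.

```python
# Illustrative only: assumed price and idle fraction, not panel data.
GPU_HOURLY_COST = 3.00        # assumed cost of one GPU-hour, in dollars
NETWORK_IDLE_FRACTION = 0.15  # assumed share of time the GPU stalls waiting on the fabric

utilization = 1.0 - NETWORK_IDLE_FRACTION
effective_cost_per_useful_hour = GPU_HOURLY_COST / utilization

print(f"Nominal GPU-hour cost:              ${GPU_HOURLY_COST:.2f}")
print(f"Effective cost per useful GPU-hour: ${effective_cost_per_useful_hour:.2f}")
# With 15% network-induced idle time, every useful GPU-hour costs roughly 18% more.
```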

There’s a saying that a high-performance AI back-end networking fabric (such as DriveNets Network Cloud-AI) can “pay for itself.” This holds because it improves GPU performance (in terms of job completion time) by more than 10%, a gain worth more than the fabric’s cost. But it is not only a cost consideration. Since power and cooling are often the limiting factors when building AI infrastructure, you must seek the best infrastructure combination in order to use fewer GPUs for the required task while fitting into the power envelope.
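A rough way to sanity-check the “pays for itself” argument is a break-even comparison: if the fabric shortens job completion time by some fraction, the same workload needs proportionally fewer GPU-hours, and that saving can be weighed against the fabric’s share of the cluster cost. The sketch below uses assumed cost splits purely for illustration; only the ~10% networking share echoes the panel discussion.

```python
# Illustrative break-even check: all figures are assumptions, not vendor numbers.
total_cluster_cost = 100_000_000  # assumed total cluster cost, in dollars
fabric_share = 0.10               # networking at roughly 10% of total cost (per the panel)
gpu_share = 0.85                  # assumed share of the cost sitting in the GPUs themselves
jct_improvement = 0.12            # assumed "more than 10%" improvement in job completion time

fabric_cost = total_cluster_cost * fabric_share
gpu_cost = total_cluster_cost * gpu_share

# A shorter job completion time means the same workload consumes proportionally
# fewer GPU-hours, i.e. less GPU capacity is needed for that workload.
gpu_value_recovered = gpu_cost * jct_improvement
breakeven_improvement = fabric_cost / gpu_cost

print(f"Fabric cost:          ${fabric_cost:,.0f}")
print(f"GPU value recovered:  ${gpu_value_recovered:,.0f}")
print(f"Break-even JCT gain:  {breakeven_improvement:.1%}")
```

Under these assumed cost splits, any job-completion-time gain above roughly 12% offsets the fabric’s price; the actual threshold depends, of course, on the real cost breakdown of a given cluster.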

A nonstop AI-ML race 

The AI-ML race does not stop. Multiple technologies are available (or under development) across multiple infrastructure areas. This applies to all of the fields covered in the panel, and many more. But there is no time to wait for a winning technology to be announced. The result is that hyperscalers are pursuing multiple technologies in parallel and building multiple AI clusters, knowing that the winning technology is already in their datacenter, even if they do not yet know which one it is. A good example of such a strategy was given by Petr Lapukhov, a software engineer from Meta, who described how Meta is building AI clusters with both InfiniBand and Ethernet-based fabrics.

A critical era for AI-ML infrastructure, applications and scale 

The abovementioned strategy, which is interesting though somewhat borderline chaotic, will not last forever. One of our main takeaways from this panel (and from the event in general) is that we are now in a critical era for AI and ML infrastructure, applications and scale. 2023 and 2024 will prove to be critical for the formation of this industry. A year from now, we expect, the way forward will be clearer. Until then, we have a very busy year ahead of us.


Download: Utilizing Distributed Disaggregated Chassis (DDC) for Back-End AI Networking Fabric