Season 3 Ep 2: Challenges facing AI Networking Infrastructure
Challenges of AI Networking
With the fast growth of AI workloads, network solutions used in the fabric of AI clusters need to evolve to maximize the utilization of costly AI resources and support standard connectivity that enables vendor interoperability. AI clusters are supercomputers built to perform complex, large-scale AI jobs. These systems are composed of thousands of processing units (predominantly GPUs) that need to be highly utilized. This poses significant challenges to the AI networking infrastructure, which needs to support thousands of high-speed ports (400 and 800 Gbps).
Full Transcript
Hi and welcome back to CloudNets, where networks meet cloud.
And today we’re going to talk about AI, and specifically about AI networking.
And we have our chat bot. Yeah, we have our very own artificial expert, Run. Thank you for joining.
My pleasure.
So, Run, AI is a big thing lately, with ChatGPT and Bard and everything, and it’s growing.
And we want to talk about the challenges behind the AI infrastructure, because AI/ML training is a very compute-intensive task.
But the networking, the back-end networking that connects all the parallel computers, is also a very important and critical part of the infrastructure for AI.
So what are the three, let’s say, main challenges in AI networking that companies like the hyperscalers going into this market now need to resolve?
Okay, I’ll break it down to three things.
First off, in AI, as in HPC, the application is king, and it boils down to three main ideas or three main pillars.
The first is that you have a variety, a flexibility, of different applications running on that same network. An AI network or an AI cluster
needs to be more connected to the outside world than an HPC cluster, which is more of a back-end kind of deployment.
Okay, so classic HPC is isolated, while AI is something that has the back end and the compute and the training, but also the connectivity to the online realm.
And the workloads are more varied in AI: more different applications running at the same time, with different sizes and different durations, whereas HPC is more uniform.
So basically, the network needs to be connected, online. So preferably the back end and the front end use the same technology, but the network is also very flexible, scaling up and down
and accommodating various applications.
Yes.
Okay, this is one.
That’s number one.
Item number two is that an AI network is big.
An HPC network is also big, right.
Size alone is not a differentiator, but AI is so big that when you take an application and run it at a very small scale, you get a certain level of performance.
That’s good.
When you expand the scale, the application performance degrades almost linearly as the network grows.
So when you talk about application performance, you talk about job completion time. Job completion time overall kind of sums it up.
For job completion time in the networking domain, we’re talking about nonstop, predictable connectivity, even under failure. The network needs to be completely transparent to the application.
Like I said, application is king.
The network needs to stay out of the way.
Do not disturb.
Do not disturb the application.
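To make the job completion time point concrete, here is a small illustrative sketch (not from the episode; the numbers are hypothetical). In a synchronous training step, every worker must finish both its compute and its network exchange before the step completes, so one congested link delays the entire job:

```python
def job_completion_time(compute_s, comm_s):
    """JCT of one synchronous step, in seconds: the step ends only when
    the slowest worker has finished its compute plus its network exchange."""
    return max(c + n for c, n in zip(compute_s, comm_s))

# 8 workers, uniform compute (1.0 s) and uniform network time (0.2 s):
# a predictable JCT of 1.2 s.
uniform = job_completion_time([1.0] * 8, [0.2] * 8)

# Same cluster, but one congested link makes a single worker's
# exchange 5x slower: the whole job now waits for that one link.
congested = job_completion_time([1.0] * 8, [0.2] * 7 + [1.0])

print(uniform)    # 1.2
print(congested)  # 2.0
```

This is why the hosts keep saying the network must "stay out of the way": the application only ever sees the worst link, so jitter or packet loss anywhere in a large fabric shows up directly in job completion time.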
So this is basically number two.
Item number three is that AI networks are big investments.
They are deployed by the largest players in the industry.
They need to be rock solid, built on technology which is open and allows multiple vendors to chime in.
Nothing that locks you into a certain application, a certain technology, or a certain vendor; anything which avoids lock-in, anything which is field proven, by multiple players and over a long duration.
This is what an AI network needs. Okay, so, wow, those are some challenges.
Not easy.
Just to sum up, we will have another episode, I think, in order to understand how we resolve those challenges, because it’s a fairly big task.
But just to sum up the challenges: first of all, we’re talking about flexibility and online connectivity, meaning that you need to accommodate multiple
applications, and you need them to be connected to the Internet in order to be interactive, et cetera. This is one big challenge.
The second challenge is performance at scale. That means it should be non-stop and predictable, with zero packet loss, very low jitter, et cetera.
And it needs to keep this performance as the scale grows, which is the main pillar of this challenge.
And lastly, it needs to be a safe bet. You don’t take chances here. You don’t rely on one vendor, and you don’t rely on non-field-proven technology.
You need things to work. And as you said, you need to be able to forget about the networking part, because the compute is king.
The GPUs need to feel that they are connected and that nothing interrupts their connectivity.
The network needs to be there, and it needs to be transparent.
So these are big challenges, and a very big investment.
As such, you need to reduce the risk.
Absolutely.
So thank you very much, Run, for this pleasure, and thank you for watching.
Stay tuned for our next episode, in which we will talk about those same challenges, but from the resolution angle.
So see you next time.
Don’t miss it.
Bye.