
CloudNets Videos | August 25, 2023

Season 3 Ep 8: Performance at Scale

How does the Network Cloud-AI solution make performance at scale possible?

For AI networking you need performance, very high performance, and you need very high scale. But you need this performance to be consistent even at this high scale.

Listen on your favorite podcast platform

Listen on Apple Podcasts
Listen on Spotify

Full Transcript

Hi and welcome back to CloudNets, where networks meet cloud, in this case the AI cloud, because we are going to talk about Network Cloud-AI
and AI networking backend fabric in general.
And we’re going to talk about a specific phrase we are using, which is Performance at Scale.
Because for AI networking you need performance, very high performance, and you need very high scale.
But you need this performance to be consistent even at this high scale.
And we have our AI expert, Run, here to explain how Network Cloud-AI makes this performance at scale possible.
So let’s start with DDC (Distributed Disaggregated Chassis), where performance at scale kind of applies inherently, right?

AI brought this problem to a higher level: higher scale and definitely higher performance.
So first off, when we talk about AI networking, we’re talking about DDC as a fabric versus the alternative, which is actually a network.
When you have a network, you have all sorts of network issues: protocols, negotiation, distribution of messages, because it’s an Ethernet network which has multiple hops and, above all, multiple brains.
And when you have a fabric, there is no path that you need to choose.
You just run throughout the entire fabric.
It’s like we’re connecting all the GPUs to a single, very large Ethernet node or switch.
In fact, that is exactly what we’re doing.
So that’s item number one.
Item number two is that we are introducing a new level of scale.

First off, there are the new ASICs we are using, the Jericho 3 and Ramon 3, which boost the overall bandwidth capacity of the DDC solution.
Second, we are introducing a topology which has two tiers of fabric, one above the other.
So in terms of fan out, we can reach up to 32,000 endpoints in this case.
That’s a lot of endpoints.
Yeah, 800 gig each.
Okay, so that’s topic number two.
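As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. The 32,000 endpoints at 800 Gbps each are the figures quoted above; the per-tier radix values are purely illustrative assumptions, not DriveNets’ published design.

```python
# Back-of-the-envelope fan-out math for a two-tier fabric.
# The per-stage radix values below are illustrative assumptions only;
# the 32K x 800G endpoint figure is the one quoted in the episode.

ENDPOINTS = 32_000          # GPU-facing ports quoted above
PORT_SPEED_GBPS = 800       # per-port speed quoted above

aggregate_tbps = ENDPOINTS * PORT_SPEED_GBPS / 1_000
print(f"Aggregate endpoint bandwidth: {aggregate_tbps:,.0f} Tbps")
# -> 25,600 Tbps, i.e. 25.6 Pbps of endpoint capacity

# Why a second fabric tier helps: if each leaf element serves E endpoints
# and each fabric tier multiplies reach by a factor F, one tier reaches
# roughly E * F endpoints and two tiers roughly E * F**2.
E, F = 32, 32               # hypothetical radix numbers for illustration
print(f"One tier : ~{E * F:,} endpoints")
print(f"Two tiers: ~{E * F**2:,} endpoints")   # ~32,768, in the 32K range
```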
And the third item is failover.
When you have a network which is at a capacity of 30,000 nodes, failure is not the exception, it’s the steady state.
You always have somewhere some sort of a failure scenario.
And you need a network that knows how to identify these issues as fast as possible and react to them.
When you have a fabric, you don’t need to divert traffic from the failed path onto a new one, because there is no path; you just use the entirety of the fabric.
So you just need to react by removing the broken link.
And there you have it.
So failure recovery is instant; we’re talking about microsecond-level recovery from failure.
Okay?
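To make the “no path to re-route” idea concrete, here is a minimal sketch, with hypothetical data structures rather than DriveNets code. Traffic is sprayed over every live fabric link, so recovery is just shrinking the spray set; there is no path to recompute.

```python
# Minimal sketch of fabric-style failure handling (hypothetical, not
# DriveNets code). Cells are sprayed across every live fabric link, so
# "recovery" is simply removing the dead link from the spray set --
# no path recomputation, no table rehashing.

class FabricSpray:
    def __init__(self, links):
        self.live = set(links)          # all fabric links start live

    def fail(self, link):
        self.live.discard(link)         # O(1): shrink the spray set

    def send(self, cells):
        # Spray cells round-robin over whatever links are still live.
        live = sorted(self.live)
        return {cell: live[i % len(live)] for i, cell in enumerate(cells)}

fabric = FabricSpray(["link0", "link1", "link2", "link3"])
fabric.fail("link2")                    # failure detected
print(fabric.send(range(6)))            # remaining links absorb the load
# Contrast with a routed network, where the same failure would trigger
# protocol convergence and ECMP rehashing before traffic flows again.
```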
And this is very important for AI, because if a job fails, it needs to go back to the beginning or to the last checkpoint. That failure impact means JCT (job completion time) is badly degraded.
AI is not really a forgiving application. When you hit it, it hits you back.
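To see why this matters for JCT, here is a toy calculation; every number in it is a made-up assumption, chosen only to show the shape of the rollback cost.

```python
# Toy JCT calculation -- every number here is an illustrative assumption.
# On a failure, the job rolls back to its last checkpoint, so on average
# half a checkpoint interval of work is lost, plus restart overhead.

checkpoint_interval_min = 30     # assumed time between checkpoints
restart_overhead_min = 10        # assumed reload/restart time
failures = 4                     # assumed failures over the job's life
ideal_jct_hours = 100            # assumed failure-free completion time

lost_per_failure = checkpoint_interval_min / 2 + restart_overhead_min
total_lost_hours = failures * lost_per_failure / 60
print(f"Lost per failure : {lost_per_failure:.0f} min")
print(f"JCT inflation    : {total_lost_hours:.2f} h "
      f"({100 * total_lost_hours / ideal_jct_hours:.1f}%)")
# Microsecond-level link recovery means the job never sees the failure,
# so none of this rollback cost is incurred.
```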
So this is why Network Cloud-AI is a very good solution for AI networking and AI fabric.

Three reasons.
One is the performance, which derives from the fact that it is not a network but rather a very big, very scalable fabric or chassis (a distributed chassis, actually).
So you have one Ethernet hop from any GPU to any GPU, and everything inside is a cell-based fabric with lossless connectivity.
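As a simplified illustration of what “cell-based” means here (with a hypothetical cell size, and ignoring the scheduling and credit mechanisms that make the fabric lossless): the ingress element chops each packet into fixed-size cells, sprays them across the fabric, and the egress element reassembles them in order.

```python
# Simplified illustration of cell-based transport. The cell size is a
# made-up assumption; real fabrics also add scheduling/credits for
# losslessness, which this sketch ignores.

CELL_BYTES = 256                          # hypothetical fixed cell size

def to_cells(packet: bytes):
    """Ingress: chop a packet into cells tagged with sequence numbers."""
    return [(seq, packet[i:i + CELL_BYTES])
            for seq, i in enumerate(range(0, len(packet), CELL_BYTES))]

def reassemble(cells):
    """Egress: order cells by sequence number and rebuild the packet."""
    return b"".join(data for _, data in sorted(cells))

pkt = bytes(range(200)) * 5               # a 1,000-byte packet
cells = to_cells(pkt)                     # 4 cells, sprayed over any links
assert reassemble(reversed(cells)) == pkt # arrival order does not matter
print(f"{len(pkt)} bytes -> {len(cells)} cells of <= {CELL_BYTES} bytes")
```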

The second is scale.
Scale is achieved with a new chipset from Broadcom, the Jericho 3 and Ramon 3, and with a new architecture with a two-tier fabric, which means we can scale at this point up to 32K GPU ports of 800 gigabits per second each.
And the third is failover, which is super important.
It’s almost a seamless failover because everything is managed within this fabric and you do not need to reroute or rehash the tables.
And that means that we’re talking microseconds.
Microsecond-level, always-on behavior.
Yeah.

Okay, so this is very interesting.
We’re going to talk about each of those topics at length in a separate video series.
Stay tuned.
But for now, thank you very much, Run.
My pleasure.
Thank you for watching. See you next time at CloudNets and CloudNets AI.
Bye.