Season 4 Ep 2: AI Network Fabric
What are the different ways to implement AI fabric?
Large clusters of GPUs, usually used for training, create an AI networking problem: the fabric that connects them has to be lossless and fully scheduled. This episode looks at that problem and explores how an Ethernet-based solution can resolve it by building a chassis that is distributed (a disaggregated, distributed chassis). This approach gives you a lossless, fully scheduled fabric, but without the scale limitation of a chassis.
CloudNets S4E2: Resolving the AI networking problem with large clusters of GPUs
The three key points are: the problem itself, which derives from the fact that we use RDMA and therefore need a lossless, scheduled, high-performance fabric; the endpoint scheduling solution; and the network-based solution for resolving this issue.
Key Takeaways
- AI networking problem: derived from the fact that we use RDMA, which means we need a lossless, scheduled, high-performance fabric, and from the elephant-flow nature of the information distribution within the cluster
- Endpoint scheduling: relies on the endpoints, which need to be very smart and very compute- and power-hungry, like DPUs
- Network-based solution: practically building a chassis that is distributed, hence a disaggregated, distributed chassis, which gives you a lossless, fully scheduled fabric with no packet loss
Full Transcript
Hi, and welcome back to CloudNets-AI, where networks meet cloud.
And today we’re going to talk about
AI and specifically about AI network fabric and the different ways to implement it.
 We have our AI network fabric expert, Yossi.
 Hi, Yossi.
 Hey, everyone.
 Thank you for joining us.
 Thank you for having us.
 So we have an issue with AI
network fabric, right.
 We have some requirements.
 We have some specific things we need to know about.
 Let’s understand first, what is the problem?
 What problem are we trying to resolve?
 Great question.
 Essentially, AI networks rely on two fundamentals.
First one, they use RDMA, Remote Direct Memory Access.
 Now, the reason AI networks use RDMA
is because we want to reduce latency
 of read/write operations as much as we can.
Now, the second thing we have, or the second characteristic we have in RDMA or AI networks, is elephant flows.
These folks, the GPUs that participate in a cluster, usually send very long flows of data.
Now, these two characteristics that I mentioned, the RDMA nature of it and the elephant flows, cause several problems.
 You want to talk about it?
 Oh, yeah, please.
 Okay.
So essentially, RDMA is not tolerant of loss.
RDMA works with an algorithm called Go-Back-N (GBN), so when a single packet is lost, everything from that packet onward has to be retransmitted.
What happens is we lose a lot of time, and time is expensive when you're talking about AI networks.
 Because job completion time means the
utilization of the GPUs.
 And this is very expensive.
 Exactly.
So first things first: you're not allowed to lose any packet when you're talking about AI networking, or RDMA networking specifically.
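To see why Go-Back-N makes even a single drop so expensive, here is a minimal back-of-the-envelope sketch in Python; the window size and loss position are made-up numbers for illustration, not measurements from a real RDMA stack.

```python
# Rough illustration (not a real RDMA stack): with Go-Back-N, a single lost
# packet forces the sender to retransmit everything from that packet onward,
# up to the end of the in-flight window.

def gbn_retransmitted(window_size: int, lost_index: int) -> int:
    """Packets resent after one loss at position `lost_index` within the window."""
    return window_size - lost_index

# Hypothetical numbers: 256 packets in flight, the 10th one is dropped.
window = 256
lost_at = 10
resent = gbn_retransmitted(window, lost_at)
print(f"{resent} of {window} in-flight packets resent "
      f"(~{resent / window:.0%} of the window) because of one drop")
```

The wasted retransmissions translate directly into longer job completion time, which is why keeping the fabric lossless matters so much.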
 Second thing I was mentioning is elephant
flows.
Now, the problem with elephant flows is that they naturally have low entropy, right?
Which means you cannot efficiently load balance, which means you will have packet loss, which contradicts the first requirement.
 Exactly.
So essentially, with the classic or standard ECMP or hashing mechanism that you have today, what happens is you bombard some specific links in your network, while other links or other resources in the network are essentially idle.
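To make the low-entropy point concrete, here is a small Python sketch; the flow tuples, link count, and hash function are illustrative stand-ins, not real switch silicon. With only a handful of long-lived flows, a 5-tuple ECMP hash can pile traffic onto a few uplinks while the rest sit idle.

```python
import hashlib
from collections import Counter

# Hypothetical 5-tuples for a few GPU-to-GPU elephant flows.
# With so few distinct flows, the hash has very little entropy to spread.
flows = [
    ("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP"),
    ("10.0.0.2", "10.0.1.2", 4791, 4791, "UDP"),
    ("10.0.0.3", "10.0.1.3", 4791, 4791, "UDP"),
    ("10.0.0.4", "10.0.1.4", 4791, 4791, "UDP"),
]
num_links = 8  # uplinks in the ECMP group

def ecmp_link(flow, links):
    """Toy stand-in for a switch's 5-tuple hash -> ECMP member selection."""
    digest = hashlib.md5(str(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % links

usage = Counter(ecmp_link(f, num_links) for f in flows)
print("link usage:", dict(usage))
print("idle links:", num_links - len(usage))
```

Four elephant flows can never occupy more than four of the eight links, and hash collisions can make the imbalance even worse.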
 Okay, so we understand the problem, we
understand what we need to do.
 And basically there are two main
philosophies about how to resolve this
 problem.
 One is based on the endpoints and
things we do there in order to
 mitigate congestion and to ensure all the
things you mentioned.
 The other is based on a fabric,
the network itself.
 So let’s talk about both of them.
 Let’s start from the endpoints.
 What do we do there?
Yeah, so you put it perfectly: you have two types of approaches.
The first type, which is endpoint congestion control or endpoint scheduling mechanisms, is about how to solve the problem once it occurs.
Okay, then you have a type of solution that is about how to proactively prevent the problem from happening.
 Let’s deep dive into the NIC based
or the endpoint based solution.
 So if you look at the industry
today, you’ll see all types of vendors,
 you’ll see NVIDIA offering their
SpectrumX, you’ll see all sorts
 of collaborations between switch vendors
and NIC vendors trying to
 somehow integrate the NIC into the switch
in order to solve it.
 But there are a few fundamental problems
with it, right?
 First one, it’s very costly, right?
 It’s costly because the SuperNIC or the
DPU costs a lot of
 money, right?
That's the first.
Second, it's costly because the DPU usually consumes a lot of power and requires a lot of cooling.
 And that’s basically the worst solution
you can choose.
 And I’m being specific here when you’re
talking about TCO, okay?
Now, it solves the problem of elephant flows, because then you can activate on your network some kind of smarter load balancing mechanism, like packet spraying, and then somehow reorder the packets on the NIC side, right?
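As a rough sketch of what that pushes onto the endpoint, here is a toy Python illustration, not any vendor's actual NIC logic: packets of one flow are sprayed across paths and arrive out of order, and a receive-side buffer restores the sequence before delivery.

```python
import random

# Toy model: a flow's packets are sprayed across many fabric paths, so they
# arrive out of order; a SuperNIC/DPU-style reorder buffer restores the sequence.
random.seed(0)

sent = list(range(20))          # sequence numbers of one flow's packets
arrived = sent[:]
random.shuffle(arrived)         # per-packet spraying reorders arrivals

reorder_buffer = set()
next_expected = 0
delivered = []
peak_buffered = 0

for seq in arrived:
    reorder_buffer.add(seq)
    peak_buffered = max(peak_buffered, len(reorder_buffer))
    # Release the longest in-order prefix currently held.
    while next_expected in reorder_buffer:
        reorder_buffer.remove(next_expected)
        delivered.append(next_expected)
        next_expected += 1

assert delivered == sent
print(f"delivered in order; peak reorder-buffer occupancy: {peak_buffered} packets")
```

The buffering and per-flow state this requires is exactly the compute, power, and tuning cost that the endpoint-based approach pushes onto the DPU or SuperNIC.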
 So it does solve the problem, but
then it introduces another problem that we
 haven’t mentioned.
 And this is an operational problem.
 Now the operational problem that I’m
referring to is specifically with fine
 tuning the network.
 If you want to have a good
congestion control mechanism or a good
 reordering mechanism in your endpoints,
you need skillset, you need people to
 maintain that, right?
 You need people to go ahead and
prepare your infrastructure every time you
 want to run a model.
 So it solves the problem.
 On the technical side, it’s costly and
it requires some decent
 expertise and decent skillset.
 Okay, so this is a valid solution,
but it has its flaws.
 What about resolving it or avoiding the
problem altogether in the network, in the
 fabric itself?
So if you ask me, when I'm thinking about AI networks, I think the optimal solution would be a chassis, right?
 Yeah.
 Imagine just taking a bunch of GPUs
and connecting it into one chassis, right?
 And then the chassis does all the
magic.
Yeah, it's a single hop Ethernet.
The backplane is…
Exactly, it's a single hop from NIF to NIF, or from port to port.
 Right.
 From Ethernet port to Ethernet port.
 It has no congestion in it.
 Right.
The connection from the NIF to the fabric and then to the NIF is scheduled end to end, and you lose no packets.
And in terms of operations, in terms of how to maintain that?
It's…
Plug and play.
It's a given; anyone with a CCNA can do it.
 But if I have , GPUs, there
is no such chassis.
 Exactly.
 Now that’s the major problem.
 And we have a solution for that.
 You want to hear about it?
 Oh, yeah.
 Never heard of it.
 So essentially what we did with DriveNets
is we took a chassis, we distributed
 it, we disaggregated it, but that’s a
whole different topic.
So we disaggregated it, we distributed it, and essentially we made it scalable to an extent the industry has never seen before.
 Right?
 So essentially our solution is based on
two building blocks.
We have the NCP and the NCF: the NCPs are equivalent to the old-fashioned line cards, and the NCF is equivalent to the fabric board that we're used to from the backplane of the chassis.
And essentially this distribution of the chassis gives us the benefits of a chassis, which is an end-to-end VoQ system, fully scheduled, lossless by nature, and also scalable.
In fact, you can have a chassis-like solution, which is optimal in terms
 of operations and in terms of technical
abilities in AI networks, and you can
 have it scale up to , GPUs
in a single cluster.
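As a very simplified sketch of the scheduled-fabric idea, here is a toy Python model; it is not DriveNets' actual scheduler, and the class and port names are illustrative. Ingress elements keep a virtual output queue (VoQ) per egress port and transmit only what the egress grants, so no path is oversubscribed and nothing needs to be dropped.

```python
from collections import deque

# Toy model of an end-to-end VoQ, grant-based fabric: ingress elements queue
# traffic per egress port and only send what the egress has granted, so the
# fabric never overloads a path and never has to drop.

class IngressNCP:
    def __init__(self, name):
        self.name = name
        self.voqs = {}                      # one virtual output queue per egress port

    def enqueue(self, egress_port, packet):
        self.voqs.setdefault(egress_port, deque()).append(packet)

    def send_granted(self, egress_port, credits):
        q = self.voqs.get(egress_port, deque())
        return [q.popleft() for _ in range(min(credits, len(q)))]

class EgressPort:
    def __init__(self, name, drain_rate):
        self.name = name
        self.drain_rate = drain_rate        # packets it can accept per cycle

    def grant(self):
        return self.drain_rate              # credits offered this cycle

# Two ingress NCPs target the same egress GPU-facing port; the grants are split
# between them instead of letting both blast at full rate into one port.
ncp1, ncp2 = IngressNCP("ncp1"), IngressNCP("ncp2")
egress = EgressPort("gpu-port-0", drain_rate=4)
for i in range(10):
    ncp1.enqueue(egress.name, f"ncp1-pkt{i}")
    ncp2.enqueue(egress.name, f"ncp2-pkt{i}")

for cycle in range(5):
    credits = egress.grant()
    sent = (ncp1.send_granted(egress.name, credits // 2)
            + ncp2.send_granted(egress.name, credits - credits // 2))
    print(f"cycle {cycle}: egress accepted {len(sent)} packets, nothing dropped")
```

Because transmission is pulled by egress grants rather than pushed by ingress, congestion never builds up inside the fabric, which is what makes the chassis-like behavior possible at cluster scale.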
 Okay, this is very cool.
 So thank you, Yossi.
 This was mind blowing.
The three things we need to remember about resolving the AI fabric, or the AI networking problem with large clusters of GPUs usually used for training, are the following.
One, the problem itself is derived from the fact that we use RDMA, which means that we need a lossless, scheduled, high-performance fabric, and from the elephant-flow nature of the information distribution within the cluster, which means classic load balancing like ECMP would not work.
 So we have two solutions.
The second and third points.
The first solution is endpoint scheduling, which relies on the endpoints, which need to be very smart and very compute- and power-hungry, like DPUs.
This is a congestion control or congestion mitigation solution, which brings you far in terms of performance, but it costs you a lot and is also very complicated to manage.
 This is coming from vendors like NVIDIA
with their SpectrumX, and also other
 vendors that are cooperating in the Ultra
Ethernet Consortium, for instance.
And the third point is the network-based solution for resolving this issue.
The network-based solution is practically building a chassis that is distributed, hence a disaggregated, distributed chassis, which gives you a lossless, fully scheduled fabric with no packet loss, but without the scale limitation of a chassis.
And this is coming, of course, from DriveNets with our DDC, but also from other vendors.
We will talk about Arista DES in the next episode.
 So this is what you need to
remember.
 Thank you very much Yossi.
 Thank you for having me and thank
you for watching.
 See you next time on CloudNets.