Season 4 Ep 2: AI Network Fabric
What are the different ways to implement AI fabric?
Large clusters of GPUs, usually used for training, create an AI networking problem: the fabric that connects them has to be lossless and fully scheduled. This episode looks at that problem and explores how an Ethernet-based solution can resolve it by building a chassis that is distributed (a disaggregated, distributed chassis). This approach gives you a lossless, fully scheduled fabric, but without the scale limitation of a chassis.
CloudNets S4E2: Resolving the AI networking problem with large clusters of GPUs
The three key points are: the problem itself, which derives from the fact that we use RDMA and therefore need a lossless, scheduled, high-performance fabric; the endpoint scheduling solution; and the network-based solution for resolving this issue.
Key Takeaways
- AI networking problem: derived from the fact that we use RDMA, which means we need a lossless, scheduled, high-performance fabric, and from the elephant-flow nature of the information distribution within the cluster
- Endpoint scheduling: relies on the endpoints, which need to be very smart and very compute- and power-hungry, like DPUs
- Network-based solution: practically building a chassis that is distributed, hence a disaggregated, distributed chassis, which gives you a lossless, fully scheduled fabric with no packet loss
Full Transcript
Hi, and welcome back to CloudNets-AI, where networks meet cloud.
And today we’re going to talk about
AI and specifically about AI network fabric and the different ways to implement it.
 We have our AI network fabric expert, Yossi.
 Hi, Yossi.
 Hey, everyone.
 Thank you for joining us.
 Thank you for having us.
 So we have an issue with AI
network fabric, right.
 We have some requirements.
 We have some specific things we need to know about.
 Let’s understand first, what is the problem?
 What problem are we trying to resolve?
 Great question.
 Essentially, AI networks rely on two fundamentals.
First one, they use RDMA, Remote Direct Memory Access.
 Now, the reason AI networks use RDMA
is because we want to reduce latency
 of read/write operations as much as we can.
Now, the second thing we have, or the second characteristic we have in RDMA or AI networks, is elephant flows.
These folks, the GPUs that participate in a cluster, usually send very long flows of data.
Now, these two characteristics that I mentioned, the RDMA nature of it and the elephant flows, cause several problems.
 You want to talk about it?
 Oh, yeah, please.
 Okay.
So essentially, RDMA is not tolerant of loss.
RDMA works with an algorithm called Go-Back-N (GBN), so when a single packet is lost, everything from that packet onward has to be retransmitted.
What happens is we lose a lot of time, and time is expensive when you're talking about AI networks.
 Because job completion time means the
utilization of the GPUs.
 And this is very expensive.
 Exactly.
So first things first: you're not allowed to lose any packet when you're talking about AI networking, or RDMA networking specifically.
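To see why Go-Back-N makes even a single drop so expensive, here is a minimal back-of-the-envelope sketch in Python; the window size and loss position are made-up numbers for illustration, not measurements from a real RDMA stack.

```python
# Rough illustration (not a real RDMA stack): with Go-Back-N, a single lost
# packet forces the sender to retransmit everything from that packet onward,
# up to the end of the in-flight window.

def gbn_retransmitted(window_size: int, lost_index: int) -> int:
    """Packets resent after one loss at position `lost_index` within the window."""
    return window_size - lost_index

# Hypothetical numbers: 256 packets in flight, the 10th one is dropped.
window = 256
lost_at = 10
resent = gbn_retransmitted(window, lost_at)
print(f"{resent} of {window} in-flight packets resent "
      f"(~{resent / window:.0%} of the window) because of one drop")
```

The wasted retransmissions translate directly into longer job completion time, which is why keeping the fabric lossless matters so much.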
 Second thing I was mentioning is elephant
flows.
Now, the problem with elephant flows is that they naturally have low entropy, right?
Which means you cannot efficiently load balance, which means you will have packet loss, which contradicts the first requirement.
 Exactly.
So essentially, with the classic or standard ECMP or hashing mechanism that you have today, what happens is you bombard some specific links in your network, while other links or other resources in the network are essentially idle.
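To make the low-entropy point concrete, here is a small Python sketch; the flow tuples, link count, and hash function are illustrative stand-ins, not real switch silicon. With only a handful of long-lived flows, a 5-tuple ECMP hash can pile traffic onto a few uplinks while the rest sit idle.

```python
import hashlib
from collections import Counter

# Hypothetical 5-tuples for a few GPU-to-GPU elephant flows.
# With so few distinct flows, the hash has very little entropy to spread.
flows = [
    ("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP"),
    ("10.0.0.2", "10.0.1.2", 4791, 4791, "UDP"),
    ("10.0.0.3", "10.0.1.3", 4791, 4791, "UDP"),
    ("10.0.0.4", "10.0.1.4", 4791, 4791, "UDP"),
]
num_links = 8  # uplinks in the ECMP group

def ecmp_link(flow, links):
    """Toy stand-in for a switch's 5-tuple hash -> ECMP member selection."""
    digest = hashlib.md5(str(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % links

usage = Counter(ecmp_link(f, num_links) for f in flows)
print("link usage:", dict(usage))
print("idle links:", num_links - len(usage))
```

Four elephant flows can never occupy more than four of the eight links, and hash collisions can make the imbalance even worse.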
 Okay, so we understand the problem, we
understand what we need to do.
 And basically there are two main
philosophies about how to resolve this
 problem.
 One is based on the endpoints and
things we do there in order to
 mitigate congestion and to ensure all the
things you mentioned.
 The other is based on a fabric,
the network itself.
 So let’s talk about both of them.
 Let’s start from the endpoints.
 What do we do there?
Yeah, so you put it perfectly: you have two types of approaches.
The first type, which is endpoint congestion control or endpoint scheduling mechanisms, is about how to solve the problem once it occurs.
Okay, then you have a type of solution that is about how to proactively prevent the problem from happening.
 Let’s deep dive into the NIC based
or the endpoint based solution.
 So if you look at the industry
today, you’ll see all types of vendors,
 you’ll see NVIDIA offering their
SpectrumX, you’ll see all sorts
 of collaborations between switch vendors
and NIC vendors trying to
 somehow integrate the NIC into the switch
in order to solve it.
 But there are a few fundamental problems
with it, right?
 First one, it’s very costly, right?
 It’s costly because the SuperNIC or the
DPU costs a lot of
 money, right?
That's the first.
Second, it's costly because the DPU usually consumes a lot of power and requires a lot of cooling.
 And that’s basically the worst solution
you can choose.
 And I’m being specific here when you’re
talking about TCO, okay?
Now, it solves the problem of elephant flows, because then you can activate on your network some kind of smarter load balancing mechanism, like packet spraying, and then somehow reorder the packets on the NIC side, right?
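As a rough sketch of what that pushes onto the endpoint, here is a toy Python illustration, not any vendor's actual NIC logic: packets of one flow are sprayed across paths and arrive out of order, and a receive-side buffer restores the sequence before delivery.

```python
import random

# Toy model: a flow's packets are sprayed across many fabric paths, so they
# arrive out of order; a SuperNIC/DPU-style reorder buffer restores the sequence.
random.seed(0)

sent = list(range(20))          # sequence numbers of one flow's packets
arrived = sent[:]
random.shuffle(arrived)         # per-packet spraying reorders arrivals

reorder_buffer = set()
next_expected = 0
delivered = []
peak_buffered = 0

for seq in arrived:
    reorder_buffer.add(seq)
    peak_buffered = max(peak_buffered, len(reorder_buffer))
    # Release the longest in-order prefix currently held.
    while next_expected in reorder_buffer:
        reorder_buffer.remove(next_expected)
        delivered.append(next_expected)
        next_expected += 1

assert delivered == sent
print(f"delivered in order; peak reorder-buffer occupancy: {peak_buffered} packets")
```

The buffering and per-flow state this requires is exactly the compute, power, and tuning cost that the endpoint-based approach pushes onto the DPU or SuperNIC.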
 So it does solve the problem, but
then it introduces another problem that we
 haven’t mentioned.
 And this is an operational problem.
 Now the operational problem that I’m
referring to is specifically with fine
 tuning the network.
 If you want to have a good
congestion control mechanism or a good
 reordering mechanism in your endpoints,
you need skillset, you need people to
 maintain that, right?
 You need people to go ahead and
prepare your infrastructure every time you
 want to run a model.
 So it solves the problem.
 On the technical side, it’s costly and
it requires some decent
 expertise and decent skillset.
 Okay, so this is a valid solution,
but it has its flaws.
 What about resolving it or avoiding the
problem altogether in the network, in the
 fabric itself?
So if you ask me, when I'm thinking about AI networks, I think the optimal solution would be a chassis, right?
 Yeah.
 Imagine just taking a bunch of GPUs
and connecting it into one chassis, right?
 And then the chassis does all the
magic.
Yeah, it's a single hop Ethernet.
The backplane is…
Exactly, it's a single hop from NIF to NIF, or from port to port.
 Right.
 From Ethernet port to Ethernet port.
 It has no congestion in it.
 Right.
The connection from the NIF to the fabric and then to the NIF is scheduled end to end, and you lose no packets.
And in terms of operations, in terms of how to maintain that?
It's…
Plug and play.
It's a given; anyone with a CCNA can do it.
 But if I have , GPUs, there
is no such chassis.
 Exactly.
 Now that’s the major problem.
 And we have a solution for that.
 You want to hear about it?
 Oh, yeah.
 Never heard of it.
 So essentially what we did with DriveNets
is we took a chassis, we distributed
 it, we disaggregated it, but that’s a
whole different topic.
So we disaggregated it, we distributed it, and essentially we made it scalable to an extent the industry has never seen before.
 Right?
 So essentially our solution is based on
two building blocks.
We have the NCP and the NCF: the NCPs are equivalent to the old-fashioned line cards, and the NCF is equivalent to the fabric board that we're used to from the backplane of the chassis.
And essentially this distribution of the chassis gives us the benefits of a chassis, which is an end-to-end VoQ system, fully scheduled, lossless by nature, and also scalable.
In fact, you can have a chassis-like solution, which is optimal in terms
 of operations and in terms of technical
abilities in AI networks, and you can
 have it scale up to , GPUs
in a single cluster.
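As a very simplified sketch of the scheduled-fabric idea, here is a toy Python model; it is not DriveNets' actual scheduler, and the class and port names are illustrative. Ingress elements keep a virtual output queue (VoQ) per egress port and transmit only what the egress grants, so no path is oversubscribed and nothing needs to be dropped.

```python
from collections import deque

# Toy model of an end-to-end VoQ, grant-based fabric: ingress elements queue
# traffic per egress port and only send what the egress has granted, so the
# fabric never overloads a path and never has to drop.

class IngressNCP:
    def __init__(self, name):
        self.name = name
        self.voqs = {}                      # one virtual output queue per egress port

    def enqueue(self, egress_port, packet):
        self.voqs.setdefault(egress_port, deque()).append(packet)

    def send_granted(self, egress_port, credits):
        q = self.voqs.get(egress_port, deque())
        return [q.popleft() for _ in range(min(credits, len(q)))]

class EgressPort:
    def __init__(self, name, drain_rate):
        self.name = name
        self.drain_rate = drain_rate        # packets it can accept per cycle

    def grant(self):
        return self.drain_rate              # credits offered this cycle

# Two ingress NCPs target the same egress GPU-facing port; the grants are split
# between them instead of letting both blast at full rate into one port.
ncp1, ncp2 = IngressNCP("ncp1"), IngressNCP("ncp2")
egress = EgressPort("gpu-port-0", drain_rate=4)
for i in range(10):
    ncp1.enqueue(egress.name, f"ncp1-pkt{i}")
    ncp2.enqueue(egress.name, f"ncp2-pkt{i}")

for cycle in range(5):
    credits = egress.grant()
    sent = (ncp1.send_granted(egress.name, credits // 2)
            + ncp2.send_granted(egress.name, credits - credits // 2))
    print(f"cycle {cycle}: egress accepted {len(sent)} packets, nothing dropped")
```

Because transmission is pulled by egress grants rather than pushed by ingress, congestion never builds up inside the fabric, which is what makes the chassis-like behavior possible at cluster scale.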
 Okay, this is very cool.
 So thank you, Yossi.
 This was mind blowing.
The three things we need to remember about resolving the AI fabric, or the AI networking problem with large clusters of GPUs usually used for training, are the following.
One, the problem itself is derived from the fact that we use RDMA, which means that we need a lossless, scheduled, high-performance fabric, and from the elephant-flow nature of the information distribution within the cluster, which means classic load balancing like ECMP would not work.
 So we have two solutions.
The second and third points.
The first solution is endpoint scheduling, which relies on the endpoints, which need to be very smart and very compute- and power-hungry, like DPUs.
This is a congestion control or congestion mitigation solution, which brings you far in terms of performance, but it costs you a lot and is also very complicated to manage.
 This is coming from vendors like NVIDIA
with their SpectrumX, and also other
 vendors that are cooperating in the Ultra
Ethernet Consortium, for instance.
And the third point is the network-based solution for resolving this issue.
The network-based solution is practically building a chassis that is distributed, hence a disaggregated, distributed chassis, which gives you a lossless, fully scheduled fabric with no packet loss, but without the scale limitation of a chassis.
And this is coming, of course, from DriveNets with our DDC, but also from other vendors.
We will talk about Arista DES in the next episode.
 So this is what you need to
remember.
 Thank you very much Yossi.
 Thank you for having me and thank
you for watching.
 See you next time on CloudNets.