OCP Global Summit: Congestion Management in an Ethernet based network for AI Cluster Fabric
In this panel, Dudy Cohen, VP Product Marketing at DriveNets, explores how AI cluster fabrics pose unique challenges to network infrastructure.
Full Transcript
Okay, I’ll start.
Okay.
Hi everyone, my name is Dudy Cohen from DriveNets.
This is Larry from Accton.
We’re going to talk about congestion management, congestion control or
congestion avoidance in large AI GPU clusters.
If you were here in the last session with Meta, some of the content
is similar and builds on top of what they talked about with regards
to DSF and DDC.
So it’s a good follow up.
So throughout the event we have talked about networking as a possible
source of problems, or bottleneck, in an AI cluster.
Specifically, when you build a large AI GPU cluster for training purposes,
you might have your GPUs waiting for networking resources and standing idle.
And this is not something you want to do.
The main motivation that we see when we come to resolve, or to plan,
our networking architecture is to maximize the utilization of the GPUs.
You want the GPUs to work as hard as they can.
You don’t want any idle cycles on the GPUs, especially not caused by a
lack of networking resources.
And those networking resources can become problematic because of the different
characteristics of the traffic pattern of an AI cluster.
This can lead to packet drops, to out-of-order delivery, to jitter, and
all of these cause the GPUs to be underutilized and to stand idle.
So what are the solutions for this?
If we look specifically at congestion as the cause of all the trouble
networking can cause, there are two main methods of dealing with it.
One has to do with avoiding the congestion altogether and the other has to
do with mitigating the congestion.
The first is the DDC approach.
This is the distributed disaggregated chassis.
As mentioned, one implementation of it is DSF, which was discussed in the Meta
session just now.
This is basically taking the entire architecture of the top-of-rack and
end-of-row switching and turning it into a single network entity that acts
like a very large chassis, but it is still distributed in the sense that it
is made up of different switches.
The traffic towards the NICs is plain Ethernet.
There is no special requirement from the NIC for additional processing.
But the traffic in the fabric uses a cell-based architecture, as mentioned,
developed by Broadcom with the DNX family.
This cell-based approach takes any ingress packet, cuts it into evenly sized
cells and sprays the cells evenly across the entire fabric.
Together with VOQs and a grant-based scheduling mechanism, this ensures there is no congestion
and no packet drop within the fabric.
So this is a way to avoid congestion.
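To make the cell-spraying idea concrete, here is a minimal illustrative Python sketch (not DriveNets or Broadcom code; the cell size and fabric-link count are invented for the example) of cutting an ingress packet into evenly sized cells and spraying them round-robin across the fabric links:

```python
from itertools import cycle

CELL_SIZE = 256     # bytes per cell -- illustrative value, not the real DNX cell size
FABRIC_LINKS = 4    # number of fabric links to spray across -- also made up

def packet_to_cells(packet: bytes, cell_size: int = CELL_SIZE):
    """Cut an ingress packet into evenly sized cells (the tail cell is padded)."""
    cells = [packet[i:i + cell_size] for i in range(0, len(packet), cell_size)]
    cells[-1] = cells[-1].ljust(cell_size, b"\x00")
    return cells

def spray(packet: bytes, links=range(FABRIC_LINKS)):
    """Spray the cells evenly (round-robin) across all fabric links."""
    link_iter = cycle(links)
    return [(next(link_iter), cell) for cell in packet_to_cells(packet)]

# A 1000-byte packet becomes four cells spread over links 0..3, so no single
# fabric path ever carries a whole elephant flow.
print([link for link, _ in spray(b"x" * 1000)])   # -> [0, 1, 2, 3]
```

The VOQ and grant-based scheduling side is what keeps the egress from being oversubscribed; the sketch only shows the spraying half.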
The other methodology is congestion control via the end devices.
The endpoints, in the case of a GPU cluster, are the NICs,
and the NICs spray the packets across the fabric from the end device.
This is another approach that is taken by different industry initiatives like the UEC,
the Ultra Ethernet Consortium; the first profile of UEC takes this approach.
And there are also other proprietary solutions that take the endpoint approach.
Those two approaches come to resolve the main challenges that come with the nature
of the flow of traffic between the GPUs.
And predominantly this is a low-entropy environment.
When we have low-entropy networking, we start to see bottlenecks, we start
to see elephant flows, et cetera.
Those are large flows that take the same path across the fabric and
cause head-of-line blocking, congestion, packet drops, et cetera.
All this, combined with the high bandwidth requirement of collectives like all-to-all
and even all-reduce, makes congestion a major problem.
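To put a rough, hedged number on that bandwidth demand (a generic textbook figure, not a result from this talk): for a ring all-reduce of an S-byte gradient buffer across N GPUs, each GPU has to send (and receive) about

```latex
\text{bytes per GPU} \;=\; 2\,\frac{N-1}{N}\,S \;\approx\; 2S \quad \text{for large } N,
```

so every GPU pushes roughly twice the gradient size through the network on every iteration, and the collective only completes when the slowest path does, which is why a few congested links stall all the GPUs.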
And what we did together with Accton in the Accton lab is try to
simulate the two approaches: the congestion avoidance of DDC, the
scheduled fabric, and the congestion management or mitigation of the endpoint
scheduling.
We tried to come up with a rule of thumb of when to use
which, or which use cases are suited for DDC and which are suited
for endpoint scheduling, and Larry will describe the testbed and the
results we got.
All right, well, thank you, Dudy.
So before I jump into the testbeds and all of that, let me just
also take a few seconds and talk about why we are actually doing this
lab experiment.
All right?
I mean, at this conference you can hear there are a lot of hyperscalers
already doing a lot of this.
They can afford, you know, to deploy 1,000 GPUs and all
that and scale it all the way out.
Why are we doing it in a little lab environment?
There are two motives for doing that.
One is actually related to the call to action at the very end.
Number one, unless you are a hyperscaler, or unless you are a startup
with a VC that pumps a billion dollars into your pocket,
not every customer that is interested in learning these kinds of things can afford to
have such an environment.
The community, OCP and all that, if you look around today, talks about a lot of
projects, a lot of good consortiums and initiatives, a lot of advancement in
technologies, but there is an actual lack of a hands-on place for
customers that cannot afford to pump in that much money at the beginning.
But they are curious; they want to learn about the data center, they want to scale
out.
So they need a place where they can experiment and run
some experiments.
We all learn as we go with these kinds of testbeds.
We will start small here.
We actually have bigger testbeds, but in the interest of time, we only
have a few minutes.
So what I will do today is just share some snapshots of what
we did and what we are doing.
The intent is for this to be the foundation for building out community testing.
This is really Edgecore's intention.
So in this case, what I am sharing with you today is very simple:
a simple topology with a simple leaf-and-spine Clos.
Right off the bat, the decision point we had to make is: are
we really going to invest and put in all these AI servers?
All right, that is decision point number one.
Or are we going to go with the different alternatives, that is,
emulators and testers?
So obviously there are pros and cons on all of that.
Cost is a factor.
Time to market, and being able to bring out all these results, is a
factor.
Also it’s not just about driving the workload to emulate and pump it through
the network to look at congestion management, it’s also about if you don’t
have it, think about collecting the results, all kinds of tuning knobs that
you have to tune to learn to get that experience.
And so with a tester we think that it actually a lot of these
are built in reporting and do all of that a repetitive script driving it
to change parameters, to do repetitive measurements.
So because of that, what we have done up to this point is to
team up with Spirent.
Are there any Spirent people in the room?
Hey, as a shout-out to Spirent: they very nicely partnered with DriveNets and
Edgecore, so we already have three companies working on this.
We need more companies to come in here and do this together.
We decided to use their solution, which is very nice, as you can see from
the highlights.
I am not going to go through all of it, but it covers everything from the speed
of the devices, to your basic RoCE congestion management features, to
the different types of workloads that they offer.
And so it is an end-to-end tester that is very nice;
in my opinion, one of the nicest
in the market.
In the spirit of OCP, there are others, and some of them are
actually on the floor, so go and experience them.
So we partnered with Spirent and did some of this work.
So first up is to show you a little bit of the endpoint-scheduled side.
What I am showing here are really the fundamentals.
If you want to start talking about endpoint scheduling and all of that, you
always start with the dynamic load balancing (DLB) modes, right?
In our testbed we use Tomahawk 5.
This is the state-of-the-art, highest-end XGS chip by Broadcom,
in an Edgecore switch that we have.
We used the Edgecore distribution of SONiC to drive this, and
we connected the Spirent tester directly with static routes,
eliminating the need for Layer 2 communication.
That simplified all of this.
There are so many variables.
You try to keep it simple to start, so that you can learn the
effect of what happens when you apply DLB.
Tomahawk 5 has different DLB modes.
So that is one of the first things we jumped into:
let us try to understand what they really mean for us.
And then, of course, you want to experiment with priority flow control and then,
of course, ECN and these kinds of things.
So I’m picking out, there’s so many results I can pick out.
So I’m just giving you one scenario in which how we study these things
going out first off is the KPI or the metrics that we use.
There’s no argument in here.
The first step that you have to do is drive the workload through and
then measure your job time, job completion time, the JCT that is the most
important.
So focus the graph on jct.
The bus bandwidth or also known as network bandwidth or good port.
I mean there’s all kinds of terminologies with that.
That’s just an inverse of that.
So the result is consistent.
So for this pitch let’s just focus on the jct.
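Since bus bandwidth is described here as essentially the inverse of JCT, here is a tiny hedged sketch of that relationship (a generic formula with placeholder numbers, not the exact definition the Spirent reports use):

```python
def bus_bandwidth_gbps(bytes_moved: float, jct_seconds: float) -> float:
    """Effective bus/network bandwidth for a job: data moved divided by job
    completion time. For a fixed amount of data, bandwidth is inversely
    proportional to JCT."""
    return bytes_moved * 8 / jct_seconds / 1e9

# Placeholder numbers: the same 500 GB of collective traffic, finished in 10 s vs 8 s.
print(bus_bandwidth_gbps(500e9, 10.0))   # 400.0 Gb/s
print(bus_bandwidth_gbps(500e9, 8.0))    # 500.0 Gb/s -- lower JCT, higher bandwidth
```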
What we did here is we established a baseline.
ECMP is the baseline without any congestion control.
You just run this through and this is the light green line.
It establishes the baseline, the worst-case scenario, for the workload that you are
driving through it.
And then from there we experiment with the other congestion control
option also provided by Tomahawk 5, which is
packet spray.
So you go to the other extreme and try to load balance at the
packet level.
You can see that it actually brings in better performance,
of course, than pure ECMP, which Tomahawk 5 calls fixed mode.
With the spray mode it has better performance.
But one of the problems with spray is packet ordering, because it randomly
sprays packets out across the links.
Just because they go out over multiple links does not mean they
arrive in the same order.
So there is an ordering problem.
This is all well documented.
And so Tomahawk 5 has this thing called eligible mode, which is the best
of both worlds.
There is actually another terminology for it, called flowlet.
It is interesting: with all these technologies, for every name there is an equivalent name
that you are going to have to keep track of and remember.
But in any case, that is the best of both worlds.
So you consistently see, at least in this little experiment, that
flowlet actually makes a difference.
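As a hedged way to picture the three modes compared here (fixed/ECMP, per-packet spray, and flowlet/eligible), here is a toy Python sketch rather than the actual Tomahawk 5 pipeline; the link count and inactivity gap are invented for illustration:

```python
import random
import zlib

LINKS = 8
FLOWLET_GAP_NS = 50_000   # invented inactivity gap; real switches expose this as a knob

def ecmp_fixed(flow_key: bytes) -> int:
    """Fixed/ECMP: hash the flow key once, so every packet of a flow takes the same
    link (elephant flows can pile onto one path)."""
    return zlib.crc32(flow_key) % LINKS

def packet_spray() -> int:
    """Spray: pick a link per packet -- best balance, but packets can arrive out of order."""
    return random.randrange(LINKS)

class FlowletDLB:
    """Flowlet ('eligible') mode: re-pick a link only after an idle gap long enough
    that in-order delivery is preserved; otherwise stick with the flow's current link."""
    def __init__(self):
        self.state = {}   # flow_key -> (link, last_seen_ns)

    def pick(self, flow_key: bytes, now_ns: int) -> int:
        link, last = self.state.get(flow_key, (None, None))
        if link is None or now_ns - last > FLOWLET_GAP_NS:
            link = random.randrange(LINKS)   # new flowlet: free to rebalance
        self.state[flow_key] = (link, now_ns)
        return link
```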
The other dimension, of course, is to look at PFC versus ECN.
And then of course we do the other combo, which is the basis of
this thing called DCQCN,
or whatever that thing is called.
It is basically a combination of PFC and ECN with additional tuning knobs.
So we tried to run it through all of that and then see what
happened.
The workload is small enough, and the topology is small enough, that ECN does not make a
difference.
In fact it may actually make it worse.
The threshold kicks in too early.
So what we did on the next slide is dive into the ECN
tuning, because ECN itself has a bunch of tuning knobs, the three basic
parameters.
So we need to understand what they are about.
Within ECN there are three primary parameters:
the buffer (queuing) thresholds, the minimum and the
maximum,
and then Pmax, the maximum marking probability.
What it does is watch the measured queue until it reaches
the minimum threshold.
That is when you start marking packets to declare that congestion is
possible.
So those packets will be treated differently.
The marking probability then goes up linearly as the buffer
continues to fill, up until the maximum threshold, which is the
second parameter.
And how fast the marking ramps up, how frequently it
actually marks these packets,
depends on the maximum marking probability.
The higher it is, the more rapidly the marking goes up.
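That marking behaviour is the classic WRED/ECN curve; here is a minimal sketch of it (the threshold and Pmax values below are placeholders, not switch defaults):

```python
def ecn_mark_probability(queue_bytes: int, kmin: int, kmax: int, pmax: float) -> float:
    """No marking below Kmin, a linear ramp from 0 to Pmax between Kmin and Kmax,
    and always mark once the queue exceeds Kmax."""
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes >= kmax:
        return 1.0
    return pmax * (queue_bytes - kmin) / (kmax - kmin)

# Placeholder parameters: 200K min and 5M max (the thresholds used later in the talk)
# with a modest 5% Pmax.
for q in (100_000, 1_000_000, 3_000_000, 6_000_000):
    print(q, round(ecn_mark_probability(q, 200_000, 5_000_000, 0.05), 4))
```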
Nobody uses a very high value.
All the literature will tell you to use a lower percentage, not too high
a percentage.
Okay, and rightfully so.
But in this case, look at what we have to do: three parameters.
How am I going to show you the tuning in two
graphs, or multiple graphs?
And if I show three graphs, we are running out of real estate on
the slide, or running out of time.
So what we did here is fix the minimum buffer threshold at just
200K, very standard, the small one, and then sweep the maximum threshold up,
increasing the buffer size while fixing a certain marking probability.
We chose 100% probability; that is the outlier, the extreme case.
Nobody really uses it that high.
But when you lower the probability, we see the same trend, in
which, as the buffer size increases, the KPI actually gets
worse.
So there is an optimal point in this particular experiment; by the way, again, do not
worry about the actual numbers.
This is just showing you, if you are starting to do
these kinds of things, where do you begin?
There are so many parameters.
You do not know which of them you should tune and which of
them you should not tune.
Where do you start?
And so what we do here is fix the
parameters, 200K as the minimum buffer threshold and
5M as the maximum threshold,
and then we vary the marking probability and look at the results of
the training.
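A hedged sketch of how that sweep could be scripted (this harness is hypothetical; run_training_job is a stand-in for whatever switch-configuration and Spirent-driven steps actually run the workload):

```python
# Hypothetical sweep harness: fix Kmin and Kmax, vary the ECN marking probability,
# and collect the JCT reported for each run.
KMIN = 200_000       # 200K minimum buffer threshold (value used in the talk)
KMAX = 5_000_000     # 5M maximum buffer threshold (value used in the talk)

def run_training_job(kmin: int, kmax: int, pmax: float) -> float:
    """Stand-in: configure ECN on the switch, drive the workload through the tester,
    and return the measured job completion time in seconds."""
    ...  # replace with real switch-config and tester calls
    return 0.0

def sweep_marking_probability(pmax_values=(0.01, 0.05, 0.10, 0.25, 1.00)):
    """Run the same workload once per Pmax setting; returns {pmax: JCT} ready to plot."""
    return {pmax: run_training_job(KMIN, KMAX, pmax) for pmax in pmax_values}
```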
All of these results and reporting come from the Spirent test tools;
their solution actually has much better reporting than what we have.
So enough on the endpoint; let us jump into the fabric-scheduled approach.
Okay, so the fabric-scheduled setup is actually very similar in terms of the
physical layout.
You still have the top of rack and then the line, et cetera.
But the difference is that all the switches in the setup form the
same Ethernet entity.
So it is a large chassis that is distributed across the data center.
We use, as mentioned, the DriveNets network operating system and the Broadcom
DNX family, Jericho 3 and Ramon 3.
We have been using Jericho 2 and Ramon 1 in the field already, and
this is the first implementation of Jericho 3.
And the rest of the setup and tests are basically the same.
Now what’s interesting to see are the results compared to the Tomahawk
architecture.
So if we move to the next slide, we see that basically in terms
of job completion time, we have very good results on the left part.
With regards to the scheduled fabric, when you compare it to the Tomahawk endpoint-scheduling
architecture, you see that in the fixed mode there is a significant
difference: the Tomahawk is underperforming because of everything we mentioned earlier.
But if you do all the work that Larry mentioned and fine-tune the
Tomahawk, or endpoint-scheduling, architecture, you can get very
close to that performance, at least at this small scale.
At larger scales it is a bit harder to fine-tune.
But at these small scales, you can get very close to the performance of
the fabric scheduling.
The main difference is the amount of effort you need to put into fine-tuning
and twisting the knobs of this architecture.
This is the same as with InfiniBand, if any of you have
experienced fine-tuning InfiniBand.
So this leads us to the question, when do we use what?
Because the scheduled fabric is very simple, it takes no fine tuning
and performs well.
And the Tomahawk architecture is very simple to implement, but needs a lot
of fine tuning.
So Larry will sum up and say when do we use what?
Right.
So again, our objective here for this pitch is not to tell you which
one, which approach is better.
That’s not the case.
It all depends on a lot of things.
So what we’re trying to do here nonetheless is trying to leave you with
some guidelines.
First off, define your workload.
Are you running similar types of jobs through your fabric every single
time, more or less?
Or are you a GPU-as-a-service type of cloud service operator where
you have multi-tenancy?
So I know my time is up, give me 30 seconds.
I don’t have anybody that’s come in behind me, right?
So that’s my beauty about this.
Sorry, I’m just teasing you.
So if you have a multi-tenancy type of environment, where the jobs
running in your fabric change and fluctuate a lot, we think that the fabric-scheduled
approach is probably better at this point because it does not require a
lot of tuning.
The other part, of course, is latency:
whether the workload is latency-sensitive, whether you are running
training or inference, what kind of jobs you are running.
That is also important.
Cooperation between the servers, or the GPUs, and the
network side, on the other hand, makes it possible to do a lot.
However, endpoint tuning then
takes resources away from your compute stack.
Is that something that you want to do?
So a lot of these are your decision points.
It is too early.
So, calls to action.
Really quickly, a few calls to action.
I already outlined that the community needs a place for people who want
to experiment with these kinds of things, to learn how to tune, especially on the
endpoint side, to come in and work with it.
We would love to have these kinds of interop test labs.
OCP used to have them, for the old-timers, but in the last couple of years
I do not see them anymore except during the event.
So cross-company collaboration is also needed.
Not to mention, again on the fabric side, the SONiC community has had VOQ in
there for a long time.
But it has been stuck for the last four years.
I just gave that feedback to the community yesterday.
You want to advance SONiC?
It is a widely deployed NOS in the data center.
This is something that SONiC needs to pick up.
So that’s a summary.
Thank you.
Thank you.