Season 3 Ep 13: Failure Recovery
How does our Network Cloud-AI provide a predictable, lossless, and very fast convergence failure recovery?
Failure recovery is a very big issue when it comes to AI clusters, because there are always failures, and when a failure comes it’s a big deal: you need to stop the calculation and go back to the last checkpoint. You lose a lot of time and money, with resources sitting idle and time wasted. And the networking part is crucial in order to create a fail-safe environment.
Full Transcript
Hi, and welcome back to CloudNets-AI, the miniseries spinoff we have in order to go deeper into AI infrastructure and, specifically, the AI fabric. And today we have a very special guest star, Yuval, our head of product.
Hi, Yuval. Thank you for joining.
Thank you. Thank you for having me on the show.
Exactly. Well trained. So, Yuval, we want to talk about failure recovery, and failure recovery is a very big issue when it comes to AI clusters, because, as you know, there are always failures.
And when a failure comes, it’s a big deal, because you need to stop the calculation and go back to the last checkpoint. You lose a lot of time and money, with resources standing idle and time wasted, et cetera. And the networking part is crucial in order to create a fail-safe environment. And our Network Cloud-AI provides predictable, lossless, very fast-converging failure recovery.
How do we do that?
Done! You gave all the answers.
So maybe let’s talk first about the problem that cloud providers will experience, or are experiencing today, when trying to build a big training cluster.
First of all, you went out and purchased a very large number of GPUs.
Let’s say 8,000, 16,000, maybe 32,000.
That’s millions of dollars of investment in infrastructure that needs to be 100% utilized all the time.
Now, what’s the problem?
Like you mentioned, what’s the problem they’re trying to solve, or why do they need to take care of failure recovery? Today, take any infrastructure or any architecture, like Clos or InfiniBand: if there is a failure, you just stop. Now, when you stop, money is being spent, and what you’re trying to look at is how fast can I recover my service, or my model training, or the specific layer that I’m trying to calculate, and bring it back into action so my GPUs can run again.
So from the networking side, is it safe to say that we want to be below the threshold above which the entire job needs to be restarted and reset to the last checkpoint?
Yes, there’s a time-to-recovery threshold, and you need to stay below it. You want to make sure that every time there is some kind of failure, whether it’s the spine itself, the connectivity to the GPUs, or the links between the leafs and the spines, there is no interruption to the model or the job that’s running.
So you want to keep that running at all costs. But what happens today is that most cloud providers are trying to solve that, not on the infrastructure itself.
They’re trying to solve it on the endpoints. They’re trying to change the way they’re building checkpoints.
They want to bring the storage closer to the actual GPUs to make sure that copying the checkpoints is faster than it used to be.
So they’re trying to make a lot of changes in their infrastructure, but not in the actual fabric. What we are offering with our solution is a set of advantages that, in most cases, make the failure recovery seamless: you don’t even know there is a failure on the fabric, and the job keeps on running. So it’s a bit redundant to invest a lot of effort in building all kinds of mechanisms that work around failures, instead of just investing in a fabric that gives you that flexibility. You have speed-up between the leaf and the spine, so you have multiple links, and you have cell spreading. If one of the links fails, the hardware detects it automatically in less than a millisecond and switches all the traffic to the remaining uplinks, so there’s no impact on the actual job.
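To put rough numbers on why that matters, here is a small sketch. Every value in it is hypothetical and only for illustration (the checkpoint interval, restart overhead, and GPU pricing are assumptions, not figures from the episode); it simply contrasts a sub-millisecond in-fabric recovery with a rollback to the last checkpoint.

```python
# Illustration only: all numbers are hypothetical, not measured or quoted values.
# Compare a seamless in-fabric recovery against a rollback to the last checkpoint.

fabric_recovery_s     = 0.001      # hardware detection + reroute, sub-millisecond
checkpoint_interval_s = 30 * 60    # assume a checkpoint every 30 minutes
restart_overhead_s    = 10 * 60    # assume reloading weights and restarting the job

# Worst case for a rollback: you lose almost a full interval plus the restart overhead.
rollback_cost_s = checkpoint_interval_s + restart_overhead_s

gpus = 16_000
gpu_hour_cost = 2.0                # assumed $/GPU-hour

def idle_cost(seconds: float) -> float:
    """Dollars spent on GPUs that sit idle while the job makes no progress."""
    return gpus * gpu_hour_cost * seconds / 3600

print(f"In-fabric recovery:  ${idle_cost(fabric_recovery_s):,.2f}")   # effectively zero
print(f"Checkpoint rollback: ${idle_cost(rollback_cost_s):,.2f}")     # tens of thousands
```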
Okay, so let’s just explain the term speed-up. It means the uplinks are over-provisioned relative to the ingress traffic: we have more fabric links than we need, so we can accommodate any failed link with the rest, without impact.
Yeah, so let’s take an example. You have a leaf, and you have 20 GPUs connected to it with 400-gig links.
Now, usually in a Clos topology, the same 20 links you have towards the GPUs, which we call downlinks, are matched by the same 20 uplinks from that leaf to the spines. When you’re talking about DDC, you’re going to have more, maybe 10 percent, maybe 15 percent more: 22 links, 24 links. That means that in case of a failure, going down from 24 links to 23 links, nothing is impacted, because the amount of traffic you have only needs 20 links. So there’s no impact on the actual traffic. That’s one point. The other point is how fast you can actually move traffic from that failed link, number 24, to the other 23 links.
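To make the capacity arithmetic in that example concrete, here is a small sketch using the numbers above (the helper function is just for illustration, not part of any product):

```python
# Sketch: uplink headroom on a leaf with "speed-up", using the numbers from the example:
# 20 GPUs x 400G of downlink traffic, cell-sprayed evenly over N fabric uplinks.

def uplink_utilization(downlinks: int, uplinks: int, link_gbps: int = 400) -> float:
    """Fraction of total uplink capacity used when all downlink traffic is spread evenly."""
    ingress = downlinks * link_gbps        # worst case: every GPU sends at line rate
    capacity = uplinks * link_gbps
    return ingress / capacity

# Classic Clos leaf: 20 downlinks, 20 uplinks -> losing one uplink oversubscribes the rest.
print(uplink_utilization(20, 20))   # 1.00  (no headroom)
print(uplink_utilization(20, 19))   # ~1.05 (oversubscribed after a single failure)

# Leaf with ~10-20% speed-up: 24 uplinks -> one failure still leaves spare capacity.
print(uplink_utilization(20, 24))   # ~0.83
print(uplink_utilization(20, 23))   # ~0.87 (still below 1.0, so no loss)
```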
First, you need to detect it. And we have a hardware mechanism with very fast detection. Then, using software, you move the traffic onto the remaining links, and that needs to be seamless so the job is not impacted.
And I think that’s the key point.
From our perspective, there are a lot of failures in the infrastructure, especially in a very big 16,000-GPU environment. There are hundreds of leafs and spines and thousands of GPUs; it’s a very big network, so there are going to be failures. So every time you experience a failure in a leaf, a failure in a spine, or a failure in one of the uplinks, if it’s seamless and the job doesn’t even notice there was a failure, then you’re not losing money.
And I think that’s the key point. You need to make sure the GPUs are running 100 percent of the time. So we detect the failure much faster than any external entity that monitors the network, because it’s hardware-based.
There’s also the convergence and reroute speed, because if you run some kind of interior gateway protocol in order to sync between the leafs and spines, it will take time for it to converge, while we do it hardware-assisted, so it’s immediate.
You’re right. So if you try to compare what we have, in terms of a hardware and software solution, versus the other alternatives in the market: you might have an SDN-controller-like solution which monitors the entire network, but then it takes time to detect the failure and notify the entire network, and there now needs to be some kind of reconvergence. You can use BGP across the network, but then again, that’s a routing protocol built to converge routing entries for the Internet, not specifically for AI workloads. You need something fast, very fast, sub-millisecond. What we have is detection using hardware, so there’s no external controller involved; it happens immediately, locally, on every one of the boxes. The other piece is a software solution that we built that is very, very fast and synchronizes the entire infrastructure: there was a failure, move the traffic aside. And that’s a key point, because every decision is made locally on each one of the boxes. They don’t need to wait for the entire network to converge. So once you have that hardware detection and that software that makes the decision, you’re much faster than any alternative in the market.
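A minimal sketch of that local-decision idea, assuming a hypothetical structure (the class and names here are mine for illustration, not the actual implementation): each box reacts to its own link-down event instead of waiting for a controller or a routing protocol to reconverge across the fabric.

```python
# Sketch of local failover: on a link-down event, the box that owns the link simply
# removes it from the set of uplinks it sprays cells over. No controller round-trip,
# no network-wide routing convergence. Hypothetical structure, not a vendor API.

class LeafUplinks:
    def __init__(self, uplinks: list[str]):
        self.active = set(uplinks)

    def on_link_down(self, link: str) -> None:
        # Triggered by local hardware detection; the decision is made on this box only.
        self.active.discard(link)

    def spray_targets(self) -> set[str]:
        # Cells keep being spread over whatever uplinks remain.
        return self.active

leaf = LeafUplinks([f"uplink{i}" for i in range(24)])
leaf.on_link_down("uplink7")          # sub-millisecond hardware detection fires this
print(len(leaf.spray_targets()))      # 23 links still carry all the traffic
```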
So on all the steps, detection, decision, and propagation, we provide very fast convergence compared to any alternative.
And this is basically how we stay below the threshold that would affect the upper-layer workload.
Exactly. So instead of trying to fix or recover from the failure after it has already disrupted the job, we prevent it from having an impact in the first place.
And that’s the key point.
We want to save money, pretty much.
Absolutely.
Okay, thank you very much.
Thank you.
Thank you for watching. This was how we handle very fast fault recovery on all levels: on detection, on decision, and on propagation. And the bottom line is we allow the workloads to work seamlessly and not stop and go back to the last checkpoint.
Thank you for joining us. Thank you for watching. We’ll be back with additional CloudNets-AI soon.
Thank you. Bye.