Cloud Nets Videos
June 20, 2023

Season 3 Ep 5: Fault Detection Isolation and Recovery

Recovering from major faults

What are three things we need to know about how Network Cloud can make your life safer, easier, and quieter when it comes to recovery?

Full Transcript

Hi, and welcome back to CloudNets, where Networks meet Cloud.
Today we’re going to talk about recovering from major faults, those things that keep you up at night.
And we have Calin, our expert, all the way from Toronto, Canada, to talk about this.
So, Calin, what are three things we need to know about how Network Cloud can make your life safer, easier, quieter when it comes to recovery?
Let’s talk about it.
It boils down to three major things.
It always does.
What it boils down to is a small blast radius.
If something happens, it’s very locally isolatable, you know where it is.
That speaks to the second idea, that the DDC has really simple failure detection.
We can look all the way into all the piece part components, all of the disaggregated components, and find out where the fault really is.
This is really hard to do in a chassis because it’s all inside one big black box and it’s hard to figure out what actually happened here sometimes without getting a lot of sleepless nights in the process.
And the third thing is really fast recovery because the DDC is one big orchestrated set of white boxes.
When we apply things like hot patches, we can apply them once from an operator perspective and have the software distribute them on its own to all of the leafs that we have in these really humongous routers, so that you don’t have to do it one at a time, router by router, maintenance window after maintenance window.
Okay, that’s great.
So let’s see if I got it right.
So the three things you need to know about how Network Cloud makes your life quieter at night.
One is that once you have a major fault, it’s very isolatable.
First of all, it’s a smaller blast radius.
And second of all, you can isolate it and take care of it without it affecting the rest of the chassis, router, cluster, whatever, because everything is containerized: each network function resides in a container and you can isolate it.
So this is in terms of blast radius.
The second is that your ability to detect and to run a root cause analysis, et cetera, is much greater with Network Cloud because you have visibility even into the fabric, which you don’t have in a chassis.
So you see all the intrinsics of the cluster and you can isolate and identify the problem very easily.
And third is that when you need to apply a solution or hot patch or whatever, unlike in a Clos architecture, in which you need to go one by one, router by router, and apply it, here this is a single managed entity in which you can apply the patch to all the boxes in the cluster at once, automatically, and resolve the issue on your own time.
The network operator does it once and then the software does it box by box on its own so you can sleep better at night.
Yes, this is our main goal: to sleep better at night.
Thank you very much, Calin, for coming all this way.
Thank you for watching.
See you next time on CloudNets.
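
The "operator does it once, software does it box by box" idea from the conversation can be sketched roughly as follows. This is only an illustration under stated assumptions, not actual Network Cloud code: the `WhiteBox` class, its `apply_patch` method, the `roll_out` function, the box names, and the patch identifier are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WhiteBox:
    """Hypothetical stand-in for one white box in the cluster."""
    name: str
    patches: list = field(default_factory=list)

    def apply_patch(self, patch_id: str) -> bool:
        # Assumption: a real box would report success or failure here.
        self.patches.append(patch_id)
        return True


def roll_out(cluster, patch_id):
    """Apply one patch to every box, one box at a time.

    Stops at the first failure so the remaining boxes are left untouched.
    Returns the names of the boxes that were patched successfully.
    """
    patched = []
    for box in cluster:
        if not box.apply_patch(patch_id):
            break
        patched.append(box.name)
    return patched


# The operator issues a single call; the loop handles each box in turn.
cluster = [WhiteBox("leaf-1"), WhiteBox("leaf-2"), WhiteBox("fabric-1")]
done = roll_out(cluster, "hotfix-001")
print(done)
```

Stopping at the first failure is a design choice made for this sketch: it keeps a bad patch from spreading past the box where it failed, in the spirit of the small blast radius discussed in the episode.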