
CloudNets S5 E3: Minimizing Blast Radius

Minimize the blast radius

In this episode, we dive into blast radius and how to design networks that prevent failures from cascading and impacting customers. We explore how network segmentation and distributed chassis architectures dramatically reduce blast radius and improve resiliency.

Dudy Cohen and Brad Riapolov cover how network operators should best minimize blast radius.


Key Takeaways

  • Segment the network to avoid large failure domains, so a single failure doesn’t cascade across the entire infrastructure.
  • Use routing and architectural “islands” to contain the blast radius when issues occur, isolating failures by design.
  • A fully redundant, clustered architecture (distributed chassis) limits an outage to the customers directly connected to the failed element, not the whole site.
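
To make the last takeaway concrete, here is a minimal Python sketch that counts how many customers lose service when one element fails. It is an illustration only: the customer names, the box names, and the `blast_radius` helper are all hypothetical. It also assumes each customer is attached to exactly one box and that the rest of the cluster keeps forwarding traffic, as described in the episode.

```python
def blast_radius(attachments, failed_box):
    """Return the customers attached to `failed_box` (hypothetical helper).

    `attachments` maps a customer name to the single box it connects to.
    """
    return {cust for cust, box in attachments.items() if box == failed_box}


# Assumed example site with 8 customers (names are made up).
# Case 1: every customer hangs off one large chassis.
monolithic = {f"cust{i}": "chassis-1" for i in range(8)}

# Case 2: the same customers spread across a 4-box cluster.
cluster = {f"cust{i}": f"box-{i % 4}" for i in range(8)}

print(len(blast_radius(monolithic, "chassis-1")))  # 8: the whole site is down
print(len(blast_radius(cluster, "box-0")))         # 2: only cust0 and cust4
```

In this toy model the failure of one box in the cluster affects a quarter of the customers, while the failure of the single chassis affects all of them.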

Full transcript
Hi and welcome to CloudNets, where networks meet cloud. Today we’re going to talk about blast radius, my favorite topic. And we have the blast radius expert, Brad, here with us. So Brad, let’s talk about how to minimize the effect on your customers when a failure occurs, and failures do occur in the network. So how do you make it affect as few customers as possible?

Yeah. So Dudy, the idea here is to avoid a large single failure domain in your network. You’ve got to split it out in some way, whether you’re doing IGP, whether you’re doing BGP, whether you’re doing some sort of islands or pockets of your network. But you do not want a single event somewhere on your network to cascade down and affect your entire network domain.

So basically cut the network into pieces and to some extent isolate them, so when a failure happens, it will not propagate across the entire network. Okay, that’s very good advice. Do we have something to help operators minimize the blast radius even further? Let’s say a big site is failing, like a big chassis is failing. All the customers connected to this chassis are down. Do we have a way around it?

Yes, we do. With our DDC distributed chassis approach, we’re taking white box elements and constructing a large cluster. The cluster is fully redundant. It has redundant data path elements as well as redundant control plane elements. So for any failure in any part of the network, whether it’s in the core or in the edge, the cluster has the intelligent mechanisms to quickly recover and carefully reroute traffic away from the failure.

So basically when a failure happens, let’s say someone pulled the plug out of a box, the only customers affected are those directly connected to this box and not the ones that are connected to the other boxes in the same virtual chassis.

That’s the idea and that’s what we expect our networks to do in 2025 and beyond.

Okay, thank you very much, Brad. It was very insightful. Thank you for watching. See you next time on CloudNets. Bye bye.