CloudNets S5 E3: Minimizing Blast Radius
Minimize the blast radius
In this episode, Dudy Cohen and Brad Riapolov cover how network operators can minimize blast radius and design networks that prevent failures from cascading and impacting customers. We explore how network segmentation and distributed chassis architectures reduce blast radius and improve resiliency.
Chapters:
- 00:00 – Intro
- 00:34 – Avoid large, single failure domains to limit impact
- 01:37 – Distributed chassis further reduces failure exposure
Key Takeaways
- Segment the network to avoid large failure domains, so that a single failure does not cascade across the entire infrastructure.
- Use routing and architectural “islands” to contain the blast radius when issues occur, isolating failures by design.
- A fully redundant, clustered architecture (distributed chassis) limits an outage to directly connected customers rather than the whole site.
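As a rough illustration of the first two takeaways, the sketch below treats blast radius as the fraction of customers that share the failed failure domain. The customer names, the domain names, and the `blast_radius` helper are all hypothetical and are not taken from the episode; real networks have overlapping dependencies that this toy model ignores.

```python
def blast_radius(customer_domain: dict[str, str], failed_domain: str) -> float:
    """Return the fraction of customers that sit in the failed domain."""
    if not customer_domain:
        return 0.0
    affected = sum(1 for d in customer_domain.values() if d == failed_domain)
    return affected / len(customer_domain)


# Hypothetical population of eight customers.
customers = [f"cust{i}" for i in range(8)]

# Flat design: every customer depends on one large failure domain.
flat = {c: "domain-all" for c in customers}

# Segmented design: customers are spread across four isolated "islands".
segmented = {c: f"island-{i % 4}" for i, c in enumerate(customers)}

print(blast_radius(flat, "domain-all"))      # 1.0
print(blast_radius(segmented, "island-0"))   # 0.25
```

In the flat layout a single failure reaches every customer, while in the segmented layout the same kind of event is contained to one island.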
FAQs
- What is blast radius in networking?
Blast radius refers to the scope of impact that a failure, misconfiguration, or outage can have across a network. A large blast radius means a single failure can affect many customers or services, while a small blast radius limits the impact to a localized area.
- Why is minimizing blast radius important for networking?
Minimizing blast radius reduces the number of customers affected when failures occur. It improves network resilience by preventing localized issues from cascading across the entire network.
- What causes a large blast radius in a network?
A large blast radius is typically caused by large single failure domains, where many services or customers depend on the same network elements. In such designs, a single failure can propagate and impact the entire network.
- How can network design reduce blast radius?
Blast radius can be reduced by splitting the network into smaller, isolated domains. Techniques include segmenting the network using routing protocols (such as IGP or BGP) or creating architectural “islands” that prevent failures from propagating beyond their local domain.
- How does a distributed chassis architecture reduce blast radius?
In a distributed chassis, failures are isolated to individual elements rather than impacting the entire system. If a single box fails, only customers directly connected to that box are affected, while traffic for other customers is rerouted through redundant elements in the cluster.
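The distributed chassis behavior described in the last answer can be sketched as a toy model. This is not the vendor's implementation: the `Cluster` class, the box names, and the rule that customers stay up as long as at least one fabric box survives are simplifying assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Customer -> edge box it is directly attached to (hypothetical names).
    attachments: dict[str, str]
    # Interchangeable fabric boxes that interconnect the edge boxes.
    fabric: set[str]
    # Boxes that are currently down.
    failed: set[str] = field(default_factory=set)

    def fail(self, box: str) -> None:
        """Mark an edge or fabric box as failed."""
        self.failed.add(box)

    def affected_customers(self) -> set[str]:
        """Return the customers that lose service given the failed boxes."""
        # Assumption: with no surviving fabric box, edge boxes cannot reach
        # each other, so every customer is treated as affected.
        if not (self.fabric - self.failed):
            return set(self.attachments)
        # Otherwise only customers attached to a failed edge box are down;
        # everyone else is assumed to be rerouted over the remaining fabric.
        return {c for c, box in self.attachments.items() if box in self.failed}


cluster = Cluster(
    attachments={
        "cust-a": "edge-1",
        "cust-b": "edge-1",
        "cust-c": "edge-2",
        "cust-d": "edge-3",
    },
    fabric={"fabric-1", "fabric-2"},
)

cluster.fail("fabric-1")
after_fabric_failure = cluster.affected_customers()  # nobody: fabric-2 remains

cluster.fail("edge-1")
after_edge_failure = cluster.affected_customers()    # only edge-1 customers

print(sorted(after_fabric_failure), sorted(after_edge_failure))
```

The point of the model is the asymmetry: losing a redundant shared element affects no one, and losing an edge box affects only the customers attached to it.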
Read the full transcript
Hi and welcome to CloudNets, where networks meet cloud. Today we’re going to talk about blast radius, my favorite topic. And we have the blast radius expert, Brad, here with us. So Brad, let’s talk about how to minimize the effect on your customers when a failure occurs, and failures do occur in the network. So how do you make it affect as few customers as possible?
Avoid large, single failure domains to limit impact
Yeah. So Dudy, the idea here is to avoid large single failure domains in your network. You’ve got to split it out in some way, whether you’re doing IGP, whether you’re doing BGP, whether you’re doing some sort of islands or pockets of your network. But you do not want a single event somewhere on your network to cascade down and affect your entire network domain.
So basically cut the network into pieces and to some extent isolate them, so when a failure happens, it will not propagate across the entire network. Okay, that’s very good advice.
Distributed chassis further reduces failure exposure
Do we have something to help operators minimize the blast radius even further? Let’s say a big site is failing, like a big chassis is failing. All the customers connected to this chassis are down. Do we have a way around it?
Yes, we do. With our DDC distributed chassis approach, we’re taking white box elements and constructing a large cluster. The cluster is fully redundant. It has redundant data path elements as well as redundant control plane elements. So for any failure in any part of the network, whether it’s in the core or in the edge, the cluster has the intelligent mechanisms to quickly recover and carefully reroute traffic away from the failure.
So basically when a failure happens, let’s say someone pulled the plug out of a box, the only customers affected are those directly connected to this box and not the ones that are connected to the other boxes in the same virtual chassis.
That’s the idea and that’s what we expect our networks to do in 2025 and beyond.
Okay, thank you very much, Brad. It was very insightful. Thank you for watching. See you next time on CloudNets. Bye bye.