CloudNets
September 19, 2022
Season 2 Ep 10: Network Cloud Operations
Building a Cluster
Now that we’ve established that building a Network Cloud cluster is easy, let’s talk about what happens afterwards, because the operations guys are the ones who have been doing the hard work all those years, after the engineering plans and installation.
Hi, and welcome back to CloudNets, where networks meet cloud. Today we’re going to talk
about the operational side of a Network Cloud cluster. And again we have Ryan.
Hi Ryan, thank you for joining. You remember Ryan, our SVP of operations, who has a lot of experience
in all kinds of operations. Now that we’ve established that building a cluster, or
building a Network Cloud, is easy, let’s talk about what happens afterwards, because between us,
the operations guys are the ones who do the hard work
all those years after the engineering plans and installation.
So Ryan, what happens afterwards? How is it to
operate cloud-native networking infrastructure?
It’s different, but I think there are some benefits to doing it right. One of the things that I think
we always underestimate from a design perspective and have to deal with as operators is what
happens when things break. And one of the challenges that I’m sure we’ve all been faced with is
when we have a big vertically integrated chassis and something goes wrong, how do you figure out what part it is that’s broken?
Is it software? Is it hardware? What piece of hardware?
Is it root cause analysis?
Yeah, these things are really hard and not just root cause analysis, but also in the moment, trying to
restore service to your customers. Right. So if I’ve got a big chassis, it could be a line card, it could be an
optic, it could be a cable, it could be a fabric element, it could be a controller card, power supply.
All these different elements. And the only indicator that I necessarily might have
is that there’s something wrong inside this chassis. With a disaggregated model,
it’s probably pretty clear to me that there is a misbehaving box in the fabric someplace, and I can just replace
that single box relatively easily versus trying to replace discrete elements all the time.
So operationally, I think that probably speeds time to resolution for us. And I like that.
So basically you can isolate the cause of the fault and deal with
it much more easily when everything is distributed than when
it’s a backplane or a card, because there you cannot really physically isolate
the cause. What about upgrades and
maintenance windows? We have benefits there, right?
I mean, take replacing a fabric module, for example. I can pop out a 2U box, pop a new
one in, and just drain traffic off that element, versus trying to replace a
half-rack chassis that requires a forklift in some cases. Right. That’s not a lot of fun.
So from a maintenance perspective and an upgrade perspective, there are benefits there too.
Okay. And bear in mind that in terms of blast radius, in terms of
the effect of different events in a cluster, this
could be very easily contained in comparison to a big chassis
in which everything is affected whenever something goes wrong.
Yes, definitely. Awesome.
So we are optimistic. Thank you very much, Ryan, for that. You’re welcome.
Thank you for watching. Remember, building a network like a cloud brings benefits not
only in terms of innovation and savings, but also in terms of operations, from
planning and installation through the ongoing operation of the infrastructure.
Thank you for watching. See you next time on CloudNets. Thank you.