March 28, 2023

President of CIMI Corporation

An Outside-In View of Network High Availability

This isn’t a tutorial, and it isn’t a propaganda exercise.  Think of it as a conversation, and an important one for operators who want to optimize their networks and retain their customers.  If there’s one thing that every network operator agrees on, it’s the importance of high availability (HA) in their networks.  The number one cause of churn in network services is poor availability, and customers who experience it are very unlikely to come back to the carrier that had the problem.  Of course, HA has always been important, but there’s a new approach to improving it that every network operator should consider.

Threats to high availability

There are three specific threats to availability, according to the operators themselves.  Number one is equipment failure, number two is operations errors, and number three is device upgrades, whether hardware or software.  Traditional availability management measures focus on the alternate routing capabilities of IP networks, but these measures can fail if satisfactory alternate routes aren’t available, or if the failure rate is high enough that alternate routing can’t hide problems from customers.  And while some router vendors have made a big thing about in-service software updates (ISSU), operators know that feature alone won’t have a major impact on availability.  We need to address those threats explicitly and completely.

The three major threats to HA are often intermingled in the real world.  Both equipment failures and upgrades cause outages, and both also usually require specific remediation by the network operations center (NOC) personnel.  When that’s done under pressure, as it often is, mistakes can be made.  If failures create a problem with a large “blast radius”, the NOC is often bombarded with messages and even determining what’s wrong can be a challenge.  If operators want to optimize availability, they have to effectively address all three of these threats at the same time. 

Making network hardware more available 

Let’s start with the top of our list, equipment failure.  There is nothing, repeat nothing, that has more impact on HA than equipment failure, so it has to be minimized.  How do you reduce the failure rate?  By making the network hardware, the monolithic routers, more available?  That’s what data center server vendors tried, but it wasn’t enough, and customers turned to the cloud for high availability.  It’s time that network customers did the same thing, with DriveNets’ Network Cloud.

The Network Cloud is just what the name suggests, a replacement of the monolithic router model with a shared-hosting, pooled-resource model.  Take your typical monolithic router.  How many components does it have that, if they fail, take down the whole box?  How many of these can be made 1:n redundant to raise availability to five nines and beyond?  Can you plug and unplug cards without impacting the rest of the box?  Will you need spares for each router type in your network?  The fact is that monolithic routers are a veritable nest of single points of failure, and when failure occurs, it’s usually total.  If you don’t have the right units in your inventory of spares, you could be in for a long period of service degradation, even service outage.
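
To put rough numbers on the "single points of failure" and "1:n redundant" questions, here is a minimal sketch of the standard availability arithmetic.  The per-unit availability of 0.999 and the group of four working units are purely illustrative assumptions, not figures for any real router or for Network Cloud, and the model assumes units fail independently and that a spare takes over instantly.

```python
from math import comb

MINUTES_PER_YEAR = 365 * 24 * 60


def k_of_n_availability(k: int, n: int, unit_availability: float) -> float:
    """Probability that at least k of n identical, independent units are up."""
    a = unit_availability
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime implied by an availability figure."""
    return (1 - availability) * MINUTES_PER_YEAR


UNIT = 0.999   # assumed availability of one unit (illustrative only)
NEEDED = 4     # assumed number of units that must all be working

# No spare: every unit is a single point of failure for the group.
no_spare = k_of_n_availability(NEEDED, NEEDED, UNIT)

# 1:n sparing: one extra unit can stand in for any one failed unit.
one_spare = k_of_n_availability(NEEDED, NEEDED + 1, UNIT)

for label, value in (("no spare", no_spare), ("1:4 spare", one_spare)):
    print(f"{label}: {value:.6f} "
          f"(~{downtime_minutes_per_year(value):.0f} min/year down)")
```

Under these assumptions, the unspared group sits at roughly 99.6%, while adding a single shared spare lifts it to about five nines.  Real hardware won't fail this neatly, but the sketch shows why it matters whether a component can be spared at all.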

Looking at Network Cloud 

Now look at the Network Cloud.  It’s a cluster of white box switches, each running a common operating system (DNOS) and supported by a common network orchestration architecture (DNOR) that supports the automation of operations, deployment, and redeployment.  You can combine the white boxes to create clusters to serve every router mission in a network, so one set of spares covers every mission.  Every component can be as redundant as necessary.  Every white box can be swapped out without impacting the rest of the cluster, and new boxes can be added without impact as well.  Even the fabric of the cluster itself is modular; try to find a modular backplane in a monolithic router!  If you design a Network Cloud cluster properly, nothing much short of a catastrophic environmental failure will take the whole cluster down.

Network Cloud brings major availability benefits 

Think about that point for a minute.  Monolithic-fault-equals-router-down, so all the traffic that was handled by that router will now have to be rerouted through the adaptive topology discovery process of IP networks. Router networks respond to the failure of routes by creating alternates, and the network has to “settle” or “converge” on a new topology.  That takes time, and in some cases there may not be enough capacity on alternate routes to carry all the new traffic.  While convergence is going on, some packets may be delayed enough to break the applications they’re supporting, and some may be lost completely.   

A cloud failure is a failure of one resource in a pool. Because Network Cloud failures are so limited in scope, there are fewer route changes to accommodate, less traffic to absorb through redistribution, and less impact on the network overall.  Not only that, what goes down will eventually be returned to service, and here again the Network Cloud brings major availability benefits.  Because only a small portion of a Network Cloud cluster is impacted by a failure, restoration will affect only a small amount of traffic as the network adapts to the restored element. 
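
The difference in failure scope can be sketched with simple arithmetic.  This is only an illustration: the function name, the 4,000 Gbps load, and the ten-box cluster are hypothetical, and it assumes traffic is spread evenly across the units, which a real cluster may not do.

```python
def displaced_traffic_gbps(total_gbps: float, units: int,
                           failed_units: int = 1) -> float:
    """Traffic that must be redistributed when some units fail.

    Assumes (hypothetically) that traffic is spread evenly across units.
    """
    if units < 1 or not 0 <= failed_units <= units:
        raise ValueError("failed_units must be between 0 and units")
    return total_gbps * failed_units / units


TOTAL_GBPS = 4000.0  # assumed load, for illustration only

# Monolithic router: the whole box is the unit of failure.
monolithic = displaced_traffic_gbps(TOTAL_GBPS, units=1)

# Pooled cluster: one box out of ten (assumed size) fails.
cluster = displaced_traffic_gbps(TOTAL_GBPS, units=10)

print(f"monolithic failure displaces {monolithic:.0f} Gbps")
print(f"single-box cluster failure displaces {cluster:.0f} Gbps")
```

With these assumed figures, the single-box failure displaces a tenth of the traffic that a whole-router failure does, which is the sense in which there is less to reroute on failure and less to absorb again on restoration.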

Network operations in the cloud is transformational 

Moving on now to operations, the Network Cloud has yet another advantage with two distinct levels of operations support.  Think of a higher-level “Network” operations framework and a more refined and specific one for the “Cloud”.  This means that at the highest level, a Network Cloud cluster is managed like any other router.  At the same time, it can be managed as a cloud resource pool, and it’s this duality that makes DriveNets special in terms of operations practices and Opex.

In the high-level “Network” operations view, a Network Cloud cluster preserves current operations practices, facilitates integration with legacy router networks, and shortens the learning curve when Network Cloud is introduced.  Operators can design their networks, lay out topologies, work through failure modes, and set adaptive traffic policies just the way they always have.  But as I’ve already noted, the Network Cloud is distributed, so a problem’s scope is contained, and that means fewer customers are impacted.  That takes the heat off the operations center, and everyone knows that the demands of dealing with a massive failure contribute to haste, and to errors.

The “Cloud” view of operations is transformational.  In monolithic routers there’s not much point in root cause analysis because there’s nothing much you can do if you manage to isolate a problem.  This is a monolithic router, remember?  With the Cloud view of operations, DNOR automation lets you narrow a problem down to a specific white box, separate hardware issues from software issues, and take inside-the-cluster measures to remediate.  If you need to, you can replace one of those white boxes without disrupting anything else, and you can load its software and configuration from a library to ensure it’s correct.  In effect, the Network Cloud lets operators build their own routers, dynamically, with whatever level of scalability, resilience, redundancy, and availability they like.

In the “Cloud” view of the Network Cloud, individual cluster switches can be spared within the device itself, standing ready to operate when the appropriate interfaces are connected.  New devices can be added to the cluster as fast as they can be connected because of the plug-and-play software loading and library-based configuration. 

In that same view, a Network Cloud can be partitioned into multiple virtual routers, and these virtual routers will appear in the “Network” view as real additional devices.  The same Network Cloud can be configured to be both a core and an aggregation/metro router, for example, or two core routers supporting different connectivity.  From that starting point, it’s easy to move switches from one virtual router to the other, so all the resources can be dynamically configured under changing conditions or load levels.  The combination of the cluster architecture and the virtual-router capability that Network Cloud provides makes it possible to design networks by designing network equipment, all to the operators’ own specifications and needs. 

Bringing the availability of the cloud to networks 

Networks need to be modernized just as data centers did.  Where do we look today for high-availability application hosting?  A monolithic, enormous server?  No, a cloud, a resource pool.  What DriveNets has done is bring the availability of the cloud to networks, without changing the way networks are run, while providing all the agility and flexibility that pooled cloud resources can offer.  You can now build HA networks like you build HA hosting.  There is nothing like this on the market today, and that’s why you need to look at it.
