June 23, 2021

Run Almog

Head of Product Strategy

Highly Available Network Infrastructure

When building a network infrastructure for mission-critical traffic, “carrier grade” means high availability

Many years ago, when I set out to define what “carrier-grade Ethernet” means, I heard a simplified explanation: “it is Ethernet run by a carrier.” It took me several years to realize just how accurate this simple definition really was.

In this post I’ll start with a description of what “carrier grade” means, then explain how it stands out with the DriveNets Network Cloud solution, and lastly review what makes this simple definition so accurate.


How do we Define Carrier-Grade?

A textbook definition of carrier-grade networking treats failure “blast radius” and recovery speed as the two primary attributes, first among equals. Other attributes include the ability to scale, easy operations and management (O&M) of the network, and the use of standard interfaces.

To learn more download the white paper:

NCNF: From Network on a Cloud to Network Cloud

Defining scale is both trivial (more is better) and elusive (how big is big?). Standard interfaces serve interoperability with other network elements. And O&M boils down to making network OpEx as low as possible.

It is failure impact and recovery speed that directly translate to the high availability of the network, commonly expressed as “five nines” (99.999%). These two attributes are what boost the confidence of operators to run mission-critical workloads on the network infrastructure.
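To put a number on “five nines,” here is a minimal Python sketch that converts an availability figure into a yearly downtime budget. It is not part of any DriveNets tooling; the function name and the 365.25-day year are my own illustrative assumptions.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime budget implied by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, value in [("three nines", 0.999),
                         ("four nines", 0.9999),
                         ("five nines", 0.99999)]:
        minutes = downtime_minutes_per_year(value)
        print(f"{label} ({value:.3%}): {minutes:.2f} minutes per year")
```

Under these assumptions, five nines leaves a budget of a little over five minutes of downtime per year, which is why both the size of a failure and the time to recover from it matter so much.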

What does Carrier-Grade Mean for DriveNets?

Now let’s look at how the above relates to DriveNets Network Cloud. Implementing OCP’s public open specification of a distributed disaggregated chassis (DDC) covers two of the above attributes.

First is scale. A DDC is defined as a breakdown of a chassis into its building blocks in a way that enables building a Clos topology cluster of any size. Size can range from the capacity of a stand-alone device up to the limit set by the ASIC radix, which removes the limitation of the chassis’ metal enclosure. New ASICs will raise the maximum capacity further, which makes this design, literally, as big as it gets.
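As a rough illustration of how ASIC radix bounds cluster size, the sketch below sizes a simplified two-tier folded Clos. The assumptions are mine, not taken from the OCP DDC specification: every port runs at the same speed, each leaf box splits its ports evenly between external interfaces and fabric uplinks, and each leaf has exactly one link to every fabric (spine) box. The function name and the radix values are hypothetical.

```python
def two_tier_clos_size(leaf_radix: int, spine_radix: int) -> dict:
    """Size a non-blocking two-tier folded Clos under the stated assumptions."""
    if leaf_radix < 2 or spine_radix < 1:
        raise ValueError("radix values are too small to build a fabric")
    uplinks_per_leaf = leaf_radix // 2             # ports facing the fabric
    external_per_leaf = leaf_radix - uplinks_per_leaf
    spines = uplinks_per_leaf                      # one uplink to each spine
    leaves = spine_radix                           # each spine port serves one leaf
    return {
        "leaves": leaves,
        "spines": spines,
        "external_ports": leaves * external_per_leaf,
    }


if __name__ == "__main__":
    # Hypothetical radix values, used only to show the trend.
    for radix in (32, 64, 128):
        print(radix, two_tier_clos_size(radix, radix))
```

In this simplified model, doubling the radix quadruples the number of external ports, which is why a new ASIC generation lifts the ceiling of the whole cluster.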

Second is standard interfaces. OCP creates a clear and open definition of all internal interfaces so that every component of the network cloud is clearly defined and easily replaceable. It also defines all external interfaces to run standard protocols so communication from the outside into the network cloud is seamless. This is what protects the customer from vendor lock-in.

The third item is the orchestration package that takes this Clos topology cluster of elements and makes them all behave and appear as a stand-alone network device. This translates to fewer network elements to manage; topped with automation tools, the O&M of a Network Cloud-based solution is in fact easier than that of a traditional network.

Fast Recovery Further Increases Network Availability

The next two items are where Network Cloud shines the most. “The bigger they are, the harder they fall” (and the harder they are to pick up…) is an accurate description of a monolithic chassis architecture.


Figure #1: A derailed train in California*

(*Happily, no fatalities were recorded in this train wreck as it was not running a mission-critical task.)

By separating the components of the chassis and having each act as a stand-alone element, the failure radius is limited to the biggest building block rather than the whole building. Creating mutually exclusive data plane paths for network services becomes trivial, which shrinks the failure radius to practically zero, and it is an attribute the operator controls. The OCP-driven uniformity of the white box building blocks makes spare parts handling a simpler task, and recovery from a physical failure can be handled in a matter of hours by local non-network-professional staff. This fast recovery further increases network availability.
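The two effects described above can be sketched separately: a shorter repair time raises the availability of the failing element, and a smaller building block reduces how much capacity a single failure takes down. The sketch uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR). All figures (50,000 hours between failures, 72 versus 4 hours to repair, 16 white boxes) are hypothetical values I picked for illustration, not DriveNets measurements.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime divided by total time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def blast_fraction(failed_elements: int, total_elements: int) -> float:
    """Share of total capacity affected by a single failure event."""
    return failed_elements / total_elements


if __name__ == "__main__":
    mtbf = 50_000.0  # hypothetical hours between failures of one element

    # Monolithic chassis: one failure affects everything, repair takes days.
    print("chassis  :",
          f"availability={availability(mtbf, 72.0):.6f}",
          f"capacity affected={blast_fraction(1, 1):.1%}")

    # Disaggregated cluster: one of 16 white boxes fails, swapped in hours.
    print("white box:",
          f"availability={availability(mtbf, 4.0):.6f}",
          f"capacity affected={blast_fraction(1, 16):.1%}")
```

The model ignores everything except repair time and block size, but it shows why swapping a uniform white box in hours counts for more than it might first appear.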

On the control plane side, the same distribution is achieved by building the DriveNets Network Operating System (DNOS) from the ground up as a cloud-native NOS. The control plane resides on multiple devices running different or redundant functions. This removes the coupling between functions, so a failing function (rare, but it does happen) does not drag the others down with it. And when a function does fail, there is always an alternative instance to step in and keep the control plane active.
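The benefit of redundant instances can be approximated with a simple probability sketch. It assumes that instances fail independently and that failover is instantaneous, neither of which holds exactly in a real system. The function name and the 99% per-instance figure are hypothetical and say nothing about how DNOS is actually implemented.

```python
def redundant_availability(instance_availability: float, instances: int) -> float:
    """Availability of a function that is up when at least one instance is up.

    Assumes independent failures and instant failover.
    """
    if instances < 1:
        raise ValueError("at least one instance is required")
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    all_down = (1.0 - instance_availability) ** instances
    return 1.0 - all_down


if __name__ == "__main__":
    # Hypothetical per-instance availability of 99%.
    for count in (1, 2, 3):
        print(count, f"{redundant_availability(0.99, count):.6f}")
```

Under these idealized assumptions, each extra instance multiplies the remaining unavailability by the failure probability of a single instance, so even modestly reliable instances add up to a highly available function.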


Figure #2: Resiliency of a distributed structure

Being a Truly Carrier-Grade Solution

Now let’s circle back to the seemingly naive definition of “carrier-grade networking” simply meaning “a network run by a carrier.” If a network solution is missing the above-described attributes, the typical carrier will simply decline to use it. If every failure is an “earthquake” whose recovery takes days, no carrier in its right mind would dare to mount mission-critical payloads over it.

Clearly, it is not the carrier using the solution that makes it carrier grade but vice versa. If a networking solution does not meet carrier-grade criteria that go above and beyond existing alternatives, no carrier will deploy it at full scale in its production core network. This is especially true if the solution is implementing a technological disruption and is being provided by a relatively small Israeli startup.

If and when a carrier does so, you can trust that high availability was brutally tested. Just ask AT&T.
