As Thomas related the story, it was right in the middle of woodpecker mating season, and he was told that a bird had just about scrubbed their mission. At first Thomas thought it was a joke, until his commander confirmed that, yes, STS-70 was off the pad for a few days or perhaps even several weeks.
NASA engineers were called out to the pad to review 71 deep “excavations” and many other beak and claw marks in the external tank’s foam insulation. Apparently, the woodpecker, trying to make a home for itself, had mistaken the fuel tank for … a tree. In the end, some 205 holes were logged in the fuel tank, all captured in more than 100 hours of footage that NASA had passively recorded without triggering a single alarm.
Once the shuttle was back on the launch pad a week later, a raft of measures was introduced to prevent any more woodpecker damage. But because the birds are a protected species, no harm could come to them directly. So NASA resorted to some less than cutting-edge solutions: staff were asked to stand at various levels on the pad over the weekend and blow air horns if woodpeckers came near, and fearsome-looking Predator Eye balloons were placed around the launch complex. This did the trick, and no more misfortune befell STS-70, which eventually launched on July 13th.
This story reminded me of what happens when there is a problem in a monolithic chassis-based network infrastructure: the whole system can come to a standstill. I wouldn’t say that the failure of one component impacts the whole infrastructure, but rather that the whole infrastructure can be looked at as really just one component.
Limitations of the Monolithic Chassis
The more networks grow, the more complex they become. In the past 30 years, networks have increased their reliance on hardware equipment, nodes and trunks, making them difficult to install, expand, manage and maintain. The old rule of a “single service router” typifies much of today’s network architecture: each service, whether mobile backhaul, provider edge, business service, internet peering or core network, is tightly coupled with its own batch of dedicated chassis-based routers.
Serving as the foundation for legacy networks, the traditional monolithic chassis adds extra operational complexity. Monolithic chassis-based devices come from a single vendor, which builds the whole system and sells it as a single black box that lacks granular visibility and control. With monolithic software serving as a potential source of compatibility issues and bugs, this looks like exactly the kind of superstructure that can be brought down by the proverbial ‘woodpecker’.
Clearly, when sending a shuttle into space, a certain size is believed to be mandatory, but networking is not like that (not anymore, at least). When a critical device or network service experiences problems, part of the network is impacted: the blast radius, or failure domain. The function of the device that initially fails determines the size of that failure domain. For example, a malfunctioning switch on a network segment normally affects only the hosts on that segment. So when a big chassis collapses, it takes down a complete sector of your network.
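To make the blast radius idea concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the two toy topologies, the node names (`core`, `chassis`, `fabric1`, `leaf1` and so on) and the helper functions are hypothetical and do not describe any real deployment or vendor API. It simply counts which hosts lose their path to the core when one device fails.

```python
from collections import deque


def undirected(edges):
    """Build an adjacency map from a list of (a, b) links."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph, start, failed):
    """Return every node reachable from `start` when `failed` is down."""
    if start == failed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor != failed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def blast_radius(graph, hosts, uplink, failed):
    """Hosts that lose their path to `uplink` when `failed` goes down."""
    alive = reachable(graph, uplink, failed)
    return {host for host in hosts if host not in alive}


HOSTS = {"h1", "h2", "h3", "h4"}

# Hypothetical monolithic design: every host hangs off one big chassis.
monolithic = undirected(
    [("core", "chassis")] + [("chassis", host) for host in sorted(HOSTS)]
)

# Hypothetical disaggregated design: two leaf boxes behind two fabric boxes.
disaggregated = undirected([
    ("core", "fabric1"), ("core", "fabric2"),
    ("fabric1", "leaf1"), ("fabric1", "leaf2"),
    ("fabric2", "leaf1"), ("fabric2", "leaf2"),
    ("leaf1", "h1"), ("leaf1", "h2"),
    ("leaf2", "h3"), ("leaf2", "h4"),
])

chassis_loss = blast_radius(monolithic, HOSTS, "core", "chassis")
fabric_loss = blast_radius(disaggregated, HOSTS, "core", "fabric1")
leaf_loss = blast_radius(disaggregated, HOSTS, "core", "leaf1")
```

In this toy model, losing the chassis cuts off all four hosts, losing one fabric box cuts off none, and losing one leaf box cuts off only the two hosts attached to it. Real networks are far messier, but the shape of the comparison is the point.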
Time to Repair
The “time to repair” is far longer with the huge chassis components in incumbent router models. Expert personnel are required on site to handle the swap. A large inventory warehouse needs to be maintained. Trucks need to roll to swap the faulty equipment for its replacement. There is no option to make repairs remotely when ‘woodpeckers’ affect the network.
Network Complexity Drives OpEx
The classic chassis, which was, and still is, the basis for most networking functions, is built to run as a “black box” with no indication as to what happens inside. This network complexity leaves operators blind when capacity needs to be increased, a new service launched or a network issue fixed.
As we saw with the outcome of the woodpecker incident, NASA’s OpEx exploded as it added layers of protection (Predator Eye balloons, air horns, watchmen…). The blindness in managing a monolithic chassis is costly, driving up the network’s operational cost. A large network admin staff needs to monitor the equipment, maintain an inventory of spare devices, carry out cumbersome replacements and stick to update cycles. As more networking scenarios are addressed, these costs expand, significantly draining operator investments while not yielding any benefit. In other words, wasted money.
With the potential failure domain of a network so immense, operators must do everything in their power to prevent failures, throwing everything they have at the ‘woodpeckers’ in the network.
Building Networks like Cloud
Building networks like cloud means that the architecture is disaggregated and, more importantly, distributed. When a problem occurs in one part of the network, it doesn’t impact the whole infrastructure. This lets operators perform certain operations, like an upgrade or a change in the network, without impacting the availability of the network at all. DriveNets Network Cloud can do this because it is a distributed network, which allows surgical accuracy. In a way, it breaks one big problem into many small problems, resulting in higher reliability, which is important for everyone at a time when the internet serves every aspect of our lives.
The smaller blast radius, or failure domain, together with network maintenance procedures, isolates issues from the rest of the network. Since DriveNets’ disaggregated router model provides independent console access to any router component, operators can investigate and repair issues remotely, without requiring an on-site presence, dramatically shortening the time to restore a device and significantly reducing OpEx.
Network disaggregation means that service providers can choose their vendors to build a best-of-breed solution that includes silicon, white boxes, and software. It also brings a new operational model, delivering the same or greater efficiency as incumbent routing solutions. The ecosystem built around a disaggregated model provides a comprehensive solution that supports the critical steps of network rollouts and maintenance, with a single point of contact for any support issue.
Where a traditional chassis “hides” all its “insides” from the user, DriveNets Network Cloud offers deep insight down to the component level, detecting failures (the ‘woodpeckers’) faster, before their impact is felt by the end user.
It is worth noting the woodpecker mission badge that was produced to mark the incident, a “shameful” sign of how inefficient practices result in costly processes. There is no parallel to this badge in networking: while NASA has the ability to joke at its own expense, network admins need to hide these colossal failures. After all, no one wants to prove the expression wrong: “Nobody ever got fired for buying Cisco.”
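The idea of upgrading part of a distributed system without affecting availability can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the capacity units, the `demand` threshold and the `rolling_upgrade` helper are all hypothetical and are not DriveNets’ implementation or API. The sketch drains one box at a time and refuses any step that would leave too little capacity to carry the traffic.

```python
def rolling_upgrade(nodes, demand, upgrade):
    """Upgrade nodes one at a time while keeping enough capacity online.

    nodes:   dict mapping node name -> capacity (arbitrary units)
    demand:  capacity that must stay available during every step
    upgrade: callable invoked with each node name while it is drained
    Returns the list of upgraded node names, in order.
    """
    total = sum(nodes.values())
    done = []
    for name, capacity in nodes.items():
        # Taking this node offline must not drop us below demand.
        if total - capacity < demand:
            raise RuntimeError(
                f"draining {name} would leave too little capacity"
            )
        upgrade(name)
        done.append(name)
    return done


# Hypothetical distributed design: three boxes share the load.
upgraded = []
order = rolling_upgrade(
    {"box1": 10, "box2": 10, "box3": 10}, demand=18, upgrade=upgraded.append
)

# Hypothetical monolithic design: one chassis carries everything,
# so there is no way to upgrade it without an outage.
try:
    rolling_upgrade({"chassis": 30}, demand=18, upgrade=upgraded.append)
    monolithic_ok = True
except RuntimeError:
    monolithic_ok = False
```

With three boxes, each step leaves 20 units online against a demand of 18, so the upgrade proceeds box by box. With a single chassis, draining it leaves nothing, and the sketch refuses to proceed.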