What’s the problem with network upgrades?
A network device like a router is always “on.” While some deployment scenarios could position a router in an environment where downtime is acceptable, in most use cases a router is allowed minimal downtime even when it is undergoing a software upgrade.
The common availability criterion for a router in the telco space is referred to as “five nines,” meaning the service should be up 99.999% of the time. This translates to a maximum of roughly five and a quarter minutes of downtime per year.
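To make the “five nines” budget concrete, here is a minimal sketch of the arithmetic. The function name and the choice of an average 365.25-day year are my own assumptions for illustration; with them, 99.999% availability works out to about 5.26 minutes of allowed downtime per year.

```python
# Average year length in minutes (365.25 days, an assumption that averages in leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare three, four and five nines side by side.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten, which is why a single failed upgrade can consume the whole yearly allowance.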
The software running on the router is by itself complex. It is running multiple protocols on different interfaces and operating on hundreds of components while not allowed to stop for recalibration or a reset of counters and registers.
Now take that stringent set of requirements and add the task of swapping the software for a newer version.
What can possibly go wrong with network upgrades?
The new software is typically, well, new. It has less run time, and has run in fewer scenarios and for shorter durations, than its years-old predecessor. The likelihood that the exact network scenario currently running was tested beforehand is close to zero. Moreover, an upgrade introduces new functionality, and this might change the behavior of existing functionality.
Databases are accumulated while the system is being configured and run. The upgrade needs to know how to replicate these databases and recover them into the new version. There is no room for error here – they need to be 100% identical.
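One way to check the “100% identical” requirement is to compare a digest of the state before and after the upgrade. The sketch below assumes the databases can be exported as JSON-serializable dictionaries; the function names and the sample snapshot are hypothetical and do not represent any vendor’s tooling.

```python
import hashlib
import json

_MISSING = object()  # sentinel so a missing key differs from a key set to None


def snapshot_digest(snapshot: dict) -> str:
    """Hash a snapshot in canonical form so key order does not affect the result."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_keys(before: dict, after: dict) -> list[str]:
    """List top-level keys whose values differ, were dropped, or were added."""
    all_keys = before.keys() | after.keys()
    return sorted(
        key for key in all_keys
        if before.get(key, _MISSING) != after.get(key, _MISSING)
    )


# Hypothetical snapshots taken before and after an upgrade.
before = {"static_routes": {"10.0.0.0/8": "192.0.2.1"}, "hostname": "edge-1"}
after = {"hostname": "edge-1", "static_routes": {"10.0.0.0/8": "192.0.2.1"}}

if snapshot_digest(before) == snapshot_digest(after):
    print("state preserved")
else:
    print("state differs in:", changed_keys(before, after))
```

The digest gives a quick yes or no answer; the key-level comparison is only needed when the digests disagree and someone has to find out what was lost.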
An upgrade can be done from the previous version or from a few versions earlier. Upgrading is such a painful action that operators prefer to skip releases and only upgrade when they must. This means their current version can be 3-5 generations old; the changes between two such versions are much greater, which leads to a higher mistake rate.
Sometimes, the system just doesn’t recover from the reset, or some interfaces fail to come up with the new software. When the failed interface is a data link, you can reset it or even revert to the previous version. If it’s your management link, you’ve lost your system completely. It might be forwarding data perfectly, but you won’t know about it.
These problems are not unique to distributed disaggregated chassis (DDC) solutions; they apply to routers in general. There are a few pointers I can share that are helpful in general, and some that rely on unique attributes of DriveNets Network Cloud, our implementation of a DDC.
Tips and pointers for network upgrades
- Create a local config as backup
- Create a local config as backup – this isn’t a typo but a point of emphasis!
- Keep a repository with the upgraded element’s config as well as that of the entire network
- Make sure this repository is up to date
- Download the new software beforehand – don’t wait for the maintenance window to open
- Use canary upgrades – both for applications using the network and with a less central element of the network
- Use DriveNets Network Orchestrator (DNOR) to control which components are being upgraded (firmware, base OS, protocol stack, etc.)
- Use protocol metrics to divert traffic from the upgraded element before the upgrade, and gradually return traffic to the element once it’s up and running
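The last tip, returning traffic gradually, can be sketched as a schedule of metric values. This is a minimal illustration that assumes a link-state protocol in which a higher metric makes a path less preferred; the function name and the numbers are hypothetical, and applying each value to a device is left out because it depends on your tooling.

```python
def metric_ramp(normal: int, drained: int, steps: int) -> list[int]:
    """Return metrics from `drained` back down to `normal` in `steps` equal moves."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if drained <= normal:
        raise ValueError("drained metric must be higher than the normal metric")
    span = drained - normal
    # Index 0 is the drained value; the last entry is the normal metric.
    return [drained - round(span * i / steps) for i in range(steps + 1)]


# Hypothetical values: normal cost 10, drained cost 1000, restored in 3 moves.
for metric in metric_ramp(normal=10, drained=1000, steps=3):
    print(f"set metric to {metric}, then verify traffic and alarms before continuing")
```

With shortest-path routing, traffic tends to shift in jumps when the metric crosses the cost of an alternative path, so the intermediate values should be chosen with your own topology in mind and each step verified before the next.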
Reducing risk for network upgrades with distributed disaggregated chassis
Part of the risk factor is the fact that the router is a very complicated machine. Simplifying the router would surely reduce the overall risk of an upgrade – and this is exactly what the distributed disaggregated chassis (DDC) model does. Each component of the DDC is upgraded as a standalone device. This means the upgraded device is a lot simpler as a machine than a vertically integrated chassis-based router. It also means that the impact of a failed upgrade is reduced to a secluded island.
Diverting traffic from the upgraded device can be done within the DDC itself. This, in fact, creates a scenario where there is no network downtime at all and no maintenance window is needed. “No maintenance window” is not a myth – it’s just around the next DDC corner.
Sleep better at night by simplifying and reducing the impact of errors
Taking precautions is highly recommended before taking on such a complicated task.
Beyond that, simplifying the task as much as possible and reducing the impact of errors will allow you to sleep better while your team is performing the upgrade. In the future, you may even be performing upgrades during regular working hours thanks to DDC.