DriveNets Taps Disaggregation to Build Networks like Cloud
Why is it time to take the disaggregated model seriously?
Heavy Networking talks with DriveNets about why it’s time to take the disaggregated model, where you buy white box hardware and put a network operating system of your choice on it, seriously. Along the way, we’re going to hit DriveNets network architectures and operating models, and get you thinking about why disaggregated networking might make sense for you.
Full Transcript
[Ethan Banks]: If you’re the average network engineer working on the average network, you probably haven’t shifted to the disaggregated model. Most folks are still buying and installing vertically integrated networking hardware and software; that is, if you buy a box from, say, Juniper, you buy it with Junos on board. It’s integrated. That’s the model we’re talking about.
Our sponsor today is DriveNets, and we’re talking about why it is time to take the disaggregated model, where you buy white box hardware and put a network operating system of your choice on it, seriously. Along the way, we’re going to hit DriveNets network architectures and operating models and get you thinking about why disaggregated networking might make sense for you and the business you run a network for.
Our guests are Dudy Cohen, Senior Director, Product Marketing, and Run Almog, Head of Product Strategy, both at DriveNets.
Dudy, it’s been a while since DriveNets has been on Packet Pushers, and so we need to do a review here. You guys haven’t been on since episode 517 back in May 2020, so not too long, but long enough that– give us the elevator pitch of what DriveNets is. Network engineers want to hear this, so give us that nice punchy 10,000-foot view.
[Dudy Cohen]: Sure. So, in one sentence, we build networks like cloud. That means that we take the networking functions, be it a BGP router or a firewall or whatever network function you have in your network, and we run them inside containers. We run them as microservices over shared infrastructure, which is built from commercial off-the-shelf white boxes that you can buy from your favorite ODM provider.
We do it in order to create an environment in which you can put any network function over the infrastructure you have.
[Ethan Banks]: So, any network function? When you say that, I’m thinking by reflex: routing, firewalls, load balancers, these sorts of things?
[Dudy Cohen]: Absolutely. DriveNets builds the virtualization layer or the hypervisor that abstracts the white boxes, the hardware, towards an application that runs on top of it, and those applications can come from DriveNets with the routing functions, or from a third party, which provides the load balancer, or the firewall, or the DDoS mitigation function.
[Ethan Banks]: Okay. Since we chatted with you folks just over a year ago, has anything fundamentally changed, or any major new products that you’ve brought to market?
[Run Almog]: Wow, a lot has changed, a lot has been done. The previous call we had was roughly when I joined the company, and it’s been a rollercoaster ever since. First of all, AT&T has gone public; they announced that they are, in fact, running our solution in their core network. We can dive into the details of this a little bit more as we go. We introduced multiservice routing, so multiple different routing services can run on the same instance of the network cloud. Previously, it was one instance; the concept was there, but now it’s also implemented. As Dudy mentioned, we can run multiple different services as well: not only things being developed here at DriveNets, but also from third-party companies who are very good at building firewalls. We’re not trying to compete with these or build something alternative, but just use what exists as a network function, as an instance that runs on that infrastructure we call Network Cloud.
Finance: we are a unicorn now, and we weren’t before. We raised 208 million dollars at an over-one-billion-dollar valuation, which is very cool.
[Greg Ferros]: So, let’s just turn that into something that people can leverage. If somebody’s going to give you 208 million in funding, then they believe that you’ve got a story going forward. Now, it’s not like they gave you a wallet full of cash and said “here, go and spend it”; that money comes over time. But the point is that you’ve now got a business story that says, “we’ve got a product that’s viable. We’ve got key anchor tenants. We’re a strategic business partner to key people, so now we’re ready to partner with more organizations,” right?
[Run Almog]: Absolutely, that’s exactly the point. It goes beyond AT&T. AT&T is one example, there are other customers where we are already in deployment and many others where we are engaged. There is a wider span of our people globally, there is a wider network of partners that we are working with in various places. So, the go-to-market strategy is much more robust, the target market is better defined and it’s big. That would explain why investors are looking to kind of jump in on the wagon and take this ride with DriveNets.
One other aspect I can mention is that disaggregation is becoming a sort of standard. Not a standard by the protocol definition of it, but TIP, the Telecom Infra Project, launched an RFI a couple of months back that is a very close definition of what we are doing, and of course, they are defining this as the de facto standard in the industry. The results of this have not been published yet, but we are…
[Greg Ferros]: So, this is, again, just to make that relevant to an engineering audience. The TIP project is actually not so much defining a technology stack as defining the APIs between the sections. So, in the same way that the IETF sort of defines protocols, TIP, the Telecom Infra Project, defines the components and then the sort of bonding that goes between each of those components: APIs, models, data exchange, that type of stuff. So, it is key that you’re participating in that, because if you’re into telecoms infrastructure, or into that sort of large system, you need to understand how your suppliers are going to fit into that model.
[Run Almog]: Precisely.
[Ethan Banks]: So we mentioned disaggregation, we’re talking about it being a standard, we mentioned TIP, and you’d think everybody in the world was all of a sudden going with this model, but that actually isn’t the case. So, for those folks who are trying to get their heads around this, why would they go with the disaggregated model? Explain it to me from a business perspective. What opportunities does disaggregation open up for network architectures, for businesses? What can they do that they couldn’t do before?
[Dudy Cohen]: Okay. So, actually quite a lot. When you go for a disaggregated architecture, you gain scalability and service flexibility in your network. That means, for instance, that if you want to deploy a new service across your network, you do not have to run truck rolls and dispatch crews to install new hardware, because the hardware is there; the hardware becomes a generic part of your network. And when you take disaggregation to a cloud-native architecture, that means you can do service placement according to the resources and parameters of the network, like capacity and latency, that are available at a given time, and locate those network functions wherever you need them in the network.
Now, I agree that disaggregation is not the de facto model in most of the network, but in other network domains, for instance 5G core architecture or data center connectivity, it is there. 5G core networks are disaggregated; everything that happens inside and between data centers is disaggregated. So, different network domains adopt disaggregation at a different pace, but I think operators are starting to understand the great value of disaggregation and to adopt it.
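As a rough illustration of the resource-aware service placement Dudy describes, here is a minimal Python sketch; the site names, numbers, and selection rule are hypothetical, not DriveNets code:

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_gbps: int     # spare forwarding capacity already deployed at this site
    latency_ms: float  # latency from this site to the service's users

def place_service(sites, required_gbps, max_latency_ms):
    """Pick the site with the most headroom that still meets the latency budget."""
    candidates = [s for s in sites
                  if s.free_gbps >= required_gbps and s.latency_ms <= max_latency_ms]
    if not candidates:
        return None  # no existing site fits; this is where new hardware would be needed
    return max(candidates, key=lambda s: s.free_gbps)

sites = [Site("edge-1", free_gbps=200, latency_ms=2.0),
         Site("metro-1", free_gbps=800, latency_ms=6.5),
         Site("core-1", free_gbps=3000, latency_ms=14.0)]

print(place_service(sites, required_gbps=400, max_latency_ms=10.0).name)  # metro-1
```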
[Greg Ferros]: It’s been a long ride though, right? Getting people to accept that the hardware is separated from the operating system, that it’s separated from the applications, that it can be separated from the software-defined controller… it has been a long ride. We’re 10 years into that, or maybe even longer, I think. When did we start talking about it, Ethan? 2008? Something like that. We started to see glimmers of SDN around then, and in 2011 we really saw the first generation of controllers come along.
It’s been a long, slow road, but is there anything happening now? Is there a tipping point? It feels to me like everything is software-defined, like the end of finger-defined networking is in sight.
[Dudy Cohen]: I think there are two tipping points. One has to do with the business: the fact that operators are not just counting the money on the way to the bank anymore, but struggling to maintain their profitability, means they need to change something in how they build the network. But also from the technology aspect: if you talk about SDN and, more than that, about NFV and VNFs, which were supposed to bring disaggregation to the networking world, that was okay for the control plane, to some extent. But when you wanted to do it with the data plane, it was just not scalable enough and not cost-efficient enough. What we have today is hardware that is networking-optimized: you have ASICs, you have NPUs (network processing units), and that means you can have a very efficient networking box that is still a commercial off-the-shelf white box, and then you can use it to separate the hardware from the software, and the software from the application, etc.
[Greg Ferros]: So, this comes down to the idea of hardware disaggregation, and the answer is that most networking vendors are actually all using the same ASICs and the same motherboard designs today. There are only three or four makers of ASICs, and there are some differences between them, but really, they’re all the same these days.
[Run Almog]: Well, in a way, yes, but I’d like to go back to your question. I was facing this question almost 15 years back, when disaggregation really just started. The question was, who does it belong to? I mean, who should be the user of disaggregation? And my answer was “not who, but when.” Over time, the place in the network where disaggregation fits evolves. Now, you mentioned only a few ASIC vendors, and when I look into the data center domain, where this area is more evolved and there are multiple players, the differences between them are not that vast, which is actually good, and there is also a common layer that interconnects the relevant software to any of these potential ASICs, which enables more options in terms of choice. That choice gives the power, or the control, over the network back to the hands of the user, back to the hands of the customer, whereas in the vendor lock-in model, which has been predominant for the last 30 years, it was in the hands of the vendor.
[Ethan Banks]: Let’s define disaggregation from the DriveNets perspective. We’ve been talking about it as if everybody understands exactly what it is, but in fact, it can mean several different things. It could be just an open-source NOS running on a white box, and that’s disaggregation. Or it could be, as we’ve been mentioning as we go here, a complete separation of control and data planes, where what’s happening in the control plane is on a completely separate box from what’s happening in the data plane. What is your model for disaggregation?
[Dudy Cohen]: No, I would say it’s all of the above and more, because as you mentioned disaggregation starts from disaggregating hardware from software. What we added to that is the ability to cluster, to distribute the hardware and use very simple building blocks. We have just two main building blocks in our hardware portfolio. In order to scale those up, we simply rack and stack them and look at the whole cluster, the whole bunch of white boxes, as a single hardware entity. So, this is one very important phase of what we do.
[Ethan Banks]: So, hang on. So, just to read that back to you. That sounds like sort of like the chassis model, only I’ve got a bunch of fixed configuration switches that are sort of like line cards in a chassis and I’m managing one gigantic entity. That’s how I heard that, is that about right?
[Dudy Cohen]: Yeah, but you manage it in a manner that reflects the resources of all those boxes: all the NPUs, all the compute resources, all the TCAM resources are gathered into a shared pool of resources. So, you don’t care where your TCAM is, on what white box your TCAM sits or what white box serves your port connectivity, NPU, CPU, etc. This is transparent to the application. What DriveNets brings is actually an abstraction layer, or a hypervisor if you will, that mimics what VMware did for the compute world, but reflects NPU, CPU, and TCAM resources towards the applications that run on top of it, in containers. Those applications can come from DriveNets, but they can also come from other vendors, like security vendors.
[Ethan Banks]: Now, wait a minute though, man, because TCAM is finite: a box only has so many entries that you can plumb in with forwarding entries. If I need to forward from one box with a certain amount of TCAM in it and I’m out of TCAM, I can’t just say “Oh, I’ll forward it through a different box”, because I might not be plumbed that way, right?
[Run Almog]: Yes and no. You’re right, you’re absolutely right. The thing is, all of these boxes are not acting as separate boxes in the network. The internal algorithm we’re using can allocate different portions of the TCAM to different applications within the same cluster of what we call NCPs, or line cards, if you want to compare it to a chassis. To your point, a chassis is not necessarily a bad thing. I mean, a chassis has some good things to it: having a single entity, simply managed as a single entity, with huge capacity and a lot of capabilities, is a good thing. The problems are that it’s heavy, it’s expensive, it’s limited in its ability to scale and change its capacity or topology. It’s rigid; it has a metal enclosure around it. What we’re trying to do is take in all the good things about a chassis.
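As a minimal sketch of the “cluster as one pool” idea, here is some illustrative Python; the box names, TCAM sizes, and allocation policy are invented for the example and are not DriveNets internals:

```python
from dataclasses import dataclass, field

@dataclass
class WhiteBox:
    name: str
    tcam_entries: int          # total TCAM entries on this forwarding box
    tcam_used: int = 0

    @property
    def tcam_free(self):
        return self.tcam_entries - self.tcam_used

@dataclass
class Cluster:
    """Many boxes, presented to the application as one pool of TCAM."""
    boxes: list = field(default_factory=list)

    @property
    def tcam_free(self):
        return sum(b.tcam_free for b in self.boxes)

    def allocate(self, entries):
        """Place a forwarding-table allocation on whichever box has room.
        The application only sees the pool; which box actually holds the
        entries is decided here, behind the abstraction layer."""
        box = max(self.boxes, key=lambda b: b.tcam_free)
        if box.tcam_free < entries:
            raise RuntimeError("pool exhausted")
        box.tcam_used += entries
        return box.name

cluster = Cluster([WhiteBox("ncp-1", 80_000), WhiteBox("ncp-2", 80_000)])
print(cluster.allocate(60_000))   # lands on ncp-1
print(cluster.allocate(60_000))   # lands on ncp-2; the caller never asked which
print(cluster.tcam_free)          # 40000 entries left across the pool
```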
[Ethan Banks]: Oh, you won’t get a fight from me about that. I mean, I completely agree.
I’ve installed my share of gigantic chassis that had to live in that rack for the next 10 years because of what we spent on them. Being able to manage something like a chassis, but have a bunch of fixed-configuration switches that are 1U or 2U that I can swap out at will as needed, that– yeah, yeah, I get it, I’m with you on that.
[Greg Ferros]: Oh, my favorite… This backplane is completely passive, except for the times when it’s actually got active electronic components on the back, like clocks, or there’s that one chip which fails and now I have to replace it. Or my other favorite one was, “oh, yeah, no, we know about that chassis. Every time you put a card in it bends the pins, it’s quite well known, if you’re not very careful.”
[Ethan Banks]: Yeah, I’ve had that.
[Greg Ferros]: I’m not a huge fan of chassis, just in case you didn’t notice. So, yeah, I’m with you.
[Dudy Cohen]: You may say that we took the good things from the chassis and threw away the bad things. It’s important to understand that the cluster is not just a bunch of white boxes connected to a switch. Those white boxes have two roles: one is the packet-forwarding role, which is the NCP, the network cloud packet forwarder, and then the NCF, the network cloud fabric, plays the fabric role. The connectivity between them is fabric connectivity. So, if you need to transfer traffic from one white box that has the interface to another that has the TCAM resources or other resources required for this traffic, this is done over the fabric. Even though the fabric is distributed, it is still not traffic that burns your valuable ports.
[Ethan Banks]: So, it’s not quite right to think of DriveNets as leaf spine exactly. There’s more to the story here is what I just heard.
[Run Almog]: Precisely the point. It’s about taking multiple boxes and making all of them behave as one, as a single network node.
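Here is a small, purely illustrative Python sketch of that “many boxes, one node” wiring, with NCPs acting like line cards and NCFs as the fabric; the names, counts, and path logic are assumptions for illustration, not DriveNets internals:

```python
import itertools

def build_cluster(num_ncp, num_ncf):
    """Every NCP (packet-forwarding box) has a fabric link to every NCF (fabric box),
    much like line cards and fabric cards in a chassis, just in separate 1U/2U boxes."""
    ncps = [f"ncp-{i}" for i in range(1, num_ncp + 1)]
    ncfs = [f"ncf-{i}" for i in range(1, num_ncf + 1)]
    fabric_links = set(itertools.product(ncps, ncfs))
    return ncps, ncfs, fabric_links

def internal_path(src_ncp, dst_ncp, ncfs):
    """Traffic entering on one NCP and leaving on another crosses an NCF over
    dedicated fabric links, not over the customer-facing ports."""
    if src_ncp == dst_ncp:
        return [src_ncp]
    return [src_ncp, ncfs[0], dst_ncp]  # any NCF will do; real systems spray across all

ncps, ncfs, links = build_cluster(num_ncp=4, num_ncf=2)
print(len(links))                             # 8 fabric links for 4 NCPs x 2 NCFs
print(internal_path("ncp-1", "ncp-3", ncfs))  # ['ncp-1', 'ncf-1', 'ncp-3']
```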
[Ethan Banks]: Okay. So, what are my hardware choices then? We’ve been talking about white boxes, we mentioned chipsets, but you can’t use every chipset that’s out there I’m guessing, or can you?
[Run Almog]: Potentially, or theoretically, yes, you can. In practice, obviously, this is about implementation and how the software relates to the hardware, so currently we’re working with Broadcom-based devices. Within the existing portfolio we have the Jericho2 type of chipsets and the Ramon for the fabric, both coming from Broadcom; the new ASIC from Broadcom, the J2C+, is coming in as additional boxes being added to the portfolio. There are some additional–
[Dudy Cohen]: Those are some nice boxes. We have a two-rack-unit white box that holds 36 interfaces of 400 gigabit Ethernet. That’s 14.4 terabits per second in two rack units; that’s a nice box. I like this box.
[Run Almog]: And to that– Yeah, nice density. What’s important to note is that, as opposed to a chassis, and Greg, you mentioned that a few minutes ago, the backplane is not a limitation, because here the backplane is a passive or active electrical cable that connects to the fabric, which is placed remotely. So, the same fabric will act regardless of what kind of line-card boxes you’re using.
[Dudy Cohen]: You should tell him about the largest router in the world, we have it right here in the building.
[Greg Ferros]: Yeah. Well, I was going to say that because you’re building line cards from standard 1RU, 2RU off-the-shelf switches and bonding them together, you can actually scale beyond the size of a backplane. A chassis is limited by the ability of the signals to travel up and down its backplane, so it can’t be more than a couple of meters long, and most chassis backplanes can only be a meter because of the speed of the clock signal. But you’re talking about coordinating the configuration carefully, coordinating the TCAM tables and only downloading into the TCAMs what needs to be downloaded, and carefully considering what needs to be in each line card (which is actually what happens in a chassis, by the way, or used to, back in the old days), so you can actually scale this disaggregated idea beyond that, to a bonkers sort of size, 30, 40, 50 slots?
[Run Almog]: The number of slots, if there is any limitation, comes down to the radix of the ASIC, that’s it.
And even that is something that can be broken into sub-lanes at the SerDes level. So, we can take a 4-SerDes interface and break it into four, which quadruples the number of interfaces on the fabric. In practice, it’s hundreds of terabits. In theory, it’s not really limited.
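As a back-of-the-envelope illustration of how radix and SerDes breakout set the ceiling, here is a tiny Python calculation; the figures are invented examples, not the actual Jericho2 or Ramon numbers:

```python
def max_cluster_capacity_tbps(fabric_radix, breakout, port_speed_gbps):
    """Roughly: the number of fabric-facing links one fabric element can terminate,
    times the per-link speed. Breaking each SerDes group into sub-lanes multiplies
    the number of fabric interfaces, the 'quadruple' effect mentioned above."""
    fabric_interfaces = fabric_radix * breakout
    return fabric_interfaces * port_speed_gbps / 1000  # Tbps

# Example figures only: a fabric ASIC with 192 links, 4x breakout, 400G-class links.
print(max_cluster_capacity_tbps(fabric_radix=192, breakout=4, port_speed_gbps=400))
# -> 307.2, i.e. hundreds of terabits from a single fabric tier
```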
[Greg Ferros]: So, really what you’re referring to there is the number of ports on an ASIC and their speed, and whether everything’s a two-tier leaf-spine architecture, right? So, it’s not like there’s any magic going on here in terms of the physical hardware; the magic is all in the software that is able to make effectively an infinite-sized chassis.
[Run Almog]: Exactly, exactly that.
[Dudy Cohen]: And because an infinite-sized chassis is overkill for any single application, the thing is, we enable multiple applications, practically any network function, to run on this chassis. So, you build a very large chassis, but this is the only thing you need to build.
[Greg Ferros]: So, coming back to what you said before about you’ve got AT&T as a reference customer now and you can talk about the things you can do. I imagine that what attracts them to you is the fact that they can start with a very small two, four, six, you tell me, six switch leaf spine, but scale it up to 20, 30, 40 and it’s the same software pattern, it’s the same design pattern regardless of where they deploy that solution?
[Run Almog]: The design is targeted for, or what they started with is, roughly 200 terabits per second. That’s not to say the implementation was as big on day one, but that was the targeted design. So, they can wiggle their way around this capacity or smaller. They didn’t ask for anything bigger than that until now; bigger is, as we mentioned, possible, but hasn’t been implemented just yet. Actually, they want to go small in many cases, because we started with the core, and the core is the highest capacity, so we’re looking into other applications within AT&T or other customers, and they’re actually looking for something a bit smaller or different in size. Scaling is not only scaling up, it’s also scaling down. Changing the size of the implementation or changing the location also has an impact in terms of scale. And one more comment: it’s not just about scale, it’s also, and it’s a lot about, the functionality. A core router has certain features running, an aggregation router is a little bit different, a provider edge router is different, and a peering router is different in terms of the feature set. I’m not even diving into the options of multiple different third-party services which can be mounted on top. So, the variance is great not only in terms of scale, but also in terms of the potential capabilities.
[Dudy Cohen]: And by the way, when you talk about scale, it’s important to understand that we’re not talking about what we know as network engineers as an upgrade process, in which you need to plan, you need to notify the customer, you need to come in at night, you need to roll back from time to time. This is a never-upgrade-again approach while you scale your capacity and functionality. When you need more hardware, you just add boxes to the cluster; this is not service affecting. You can do it, and the system orchestrates and automates bringing them into service. When you need to upgrade the software, you upgrade a specific container, a specific piece of functionality that is isolated from the rest of the functionality on this cluster. So, those are baby steps in which you can upgrade the network forever, without the brutal act of a rip-and-replace upgrade.
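As a purely illustrative Python sketch of the “upgrade one container, leave the rest running” idea, with made-up service names and versions rather than actual DriveNets components:

```python
services = {
    "bgp":       {"version": "2.3.1", "state": "running"},
    "isis":      {"version": "2.3.1", "state": "running"},
    "telemetry": {"version": "1.9.0", "state": "running"},
}

def upgrade_service(services, name, new_version):
    """Replace a single containerized service instance. Everything else keeps
    running; there is no monolithic OS image to reboot into."""
    svc = services[name]
    old = svc["version"]
    svc["state"] = "restarting"   # only this one container blips
    svc["version"] = new_version
    svc["state"] = "running"
    return f"{name}: {old} -> {new_version} (other services untouched)"

print(upgrade_service(services, "telemetry", "2.0.0"))
print({name: svc["version"] for name, svc in services.items()})
```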
[Greg Ferros]: And I think the other side here, too, is that if I was operating a network with that many devices in it, I don’t have one operating system for the six-slot chassis and another operating system for a 12-slot. I don’t have one API that runs on a different operating system per router, like, if I’ve got an edge router over here, an MPLS edge, and over here is a core router and it comes from– and it’s got different properties, so it runs a different operating system and a different API, and I’ve got this–
[Dudy Cohen]: You have different spare parts.
[Greg Ferros]: Yeah, but to me it’s the software problem, right? Spare parts is spare parts and that’s a solvable problem, or at least that’s what executives say. Stupidly, but that’s what they’ll say, right? And they’ll pretend that “oh, it’s just a hardware problem, we could sort that out.” Of course they can’t, they’re incompetent, they can’t even organize an Excel spreadsheet properly most of the time.
But, I mean, the point here is that if I’m writing an application on top of it, a software app that can do the configuration… Telcos have the operational consoles that they use to do network provisioning, and if you’re trying to talk down to 20 different brands or, even if it’s a single brand, but you’ve got 10 different operating systems or versions of operating systems, that’s a hard problem to solve in terms of deployment and maintenance, right? Whereas if I’m going with DriveNets, the software that you’ve got and we’re going to talk more about and move away from architecture to the software architecture, this idea of container-driven, software-centric, not so much dependent on the hardware, means that that model changes around somehow.
[Run Almog]: Absolutely right. And the biggest challenge we face is getting our customers, who are network-centric or network-oriented, to grasp this concept, which is very cloud-like by nature. The fact that software controls everything, the fact that containerized functions can be turned on and off, this is something that network engineers find– I wouldn’t say hard to understand, but different from what they’re used to.
[Ethan Banks]: Well, so describe Network Cloud, then. I know that’s one of your products; I was reading up on it to prep for this show. As I was reading up: we’ve got the network operating system, the orchestration layer, and then the hardware, and it’s all cloud native. So, put that together for us and help us as network engineers understand how that works. You alluded to it a minute ago and you caught my attention with the “yeah, take a container, upgrade it, and you don’t have to do a full monolithic OS rebuild.” We’re seeing more and more of that in the networking industry, and that is attractive, but again, back to DriveNets Network Cloud, walk us through it.
[Dudy Cohen]: Okay. So, we have a great slide for this, but try to imagine a layered model in which, at the bottom, you have the white boxes. This is a group of resources that includes compute resources, networking resources, TCAM, whatever you need in order to build a networking function. On top of this layer, you have the hypervisor, the virtualization or hardware abstraction layer. This comes from DriveNets. The white boxes come from any certified ODM vendor; the basic hypervisor functionality comes from DriveNets, and it takes all the resources in the cluster and abstracts them into a shared pool of resources. On top of this, as part of the solution, we have the service instance layer. This is where the actual network functionality is created. It comes in containers: you have multiple service instances on top of this shared hypervisor that abstracts the hardware, and each service instance can be a BGP router, an IS-IS router, a firewall, a DDoS mitigation, a 5G function, whatever you have, and each can come from a different vendor.
DriveNets provides the hypervisor as well as the routing service instances, and it interoperates with third parties that provide the non-routing network functions. Now, this is quite a mess if you try to imagine it; there are a lot of blocks and a lot of stuff to manage and bring up, and this is why we have another layer on top of it, which is the orchestration. This is super important, and we put a lot of effort into it, because this is the layer that acts as the virtual chassis, the one that wraps it all up in a very easy-to-manage, easy-to-plan, and easy-to-maintain manner. That means that when you add a white box, you don’t need to go and configure the white box, then update the hypervisor that you have more resources, and then allocate those manually to the different SIs that run on top of it. The orchestrator does it all, and of course, it has northbound interfaces, gNMI, gRPC, NETCONF/YANG, what have you, towards an upper-layer end-to-end orchestration system or management system or OSS/BSS, whatever.
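To keep the layers straight, here is a compact Python sketch of the stack as described: white boxes at the bottom, an abstraction layer pooling them, containerized service instances on top, and an orchestrator facing north. The class and method names are illustrative only, not DriveNets APIs:

```python
from dataclasses import dataclass, field

@dataclass
class WhiteBox:                  # bottom layer: ODM hardware with NPU/CPU/TCAM
    name: str
    npus: int
    cpus: int

@dataclass
class ServiceInstance:           # containerized network function (routing, firewall, ...)
    name: str
    vendor: str

@dataclass
class HardwareAbstractionLayer:  # pools the boxes into one set of resources
    boxes: list = field(default_factory=list)
    def total_npus(self):
        return sum(b.npus for b in self.boxes)

@dataclass
class Orchestrator:              # wraps it all up; would expose NETCONF/gNMI/gRPC northbound
    hal: HardwareAbstractionLayer
    services: list = field(default_factory=list)
    def add_box(self, box):
        # adding hardware is one call: no per-box config, no manual re-allocation
        self.hal.boxes.append(box)
    def launch(self, svc):
        self.services.append(svc)

orch = Orchestrator(HardwareAbstractionLayer())
orch.add_box(WhiteBox("ncp-1", npus=2, cpus=8))
orch.add_box(WhiteBox("ncp-2", npus=2, cpus=8))
orch.launch(ServiceInstance("bgp-core", vendor="DriveNets"))
orch.launch(ServiceInstance("ddos-scrub", vendor="third-party"))
print(orch.hal.total_npus(), [s.name for s in orch.services])
```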
[Ethan Banks]: Now, all of those layers as they’re stacked, do they all run on the white box hardware or does some of it run on the white box and some of it run on, I don’t know, an x86 cluster or something?
[Dudy Cohen]: So, the nice thing about the white box hardware is that it has networking resources, NPUs and CPUs, but if you have some functionality like DNOR, the orchestration, or the control plane functions, there’s not much point in running those on the white boxes. So, as part of the cluster, we also have servers. But this is just the optimization of the hardware resources you put into the cluster. You still look at everything as a gigantic pool of resources.
[Ethan Banks]: Yeah, yeah, the cluster is the thing, and you put into the cluster the resources you need to run the components in the most efficient way. So, right, for control plane functions, for orchestration functions, you’ll have some x86, some regular old compute in there, but from the standpoint of network engineering and architecture, the way I was thinking about it, it kind of doesn’t matter. I have a DriveNets cluster that does these things; I need to have a certain amount of resources in there. The abstraction layer, the hypervisor if you will, is going to put all the different functions where they need to be. That’s just a different way of thinking, and you’ve got to get your head around it if you’re me, because, again, so many network engineers are so used to the model where everything runs on the device, on the white box in this case, or on my legacy router or switch, that separating all those functions like that is a thing.
Now, here’s another interesting question. If I can separate things like that, separate this cluster out, is my cluster typically going to be contained in a common data center or would I have some parts of the cluster up in the cloud, let’s say, and others on premises somewhere?
[Run Almog]: It’s a possibility. The components that are very CPU-heavy can run in a cloud, but in most cases, or at least in what we have deployed, it’s a dedicated server on premises, where the cluster is actually built. In theory, it’s definitely doable that these elements, these heavy CPU resources, will run somewhere in the cloud. The thing is, you don’t need to worry about that. I mean, it’s a decision whether you want it on premises or not, but in general, the orchestrator takes care of running the right sessions or the right flows on the right resource, the best resource that can handle that mission. That’s the target, that’s the purpose of what we are doing here. Otherwise, we are not doing our job right.
[Dudy Cohen]: And I think this is maybe another angle that is very relevant lately: the fact that operators like AT&T, for instance, are pushing some functionality outside of the network towards the cloud, like they did with the 5G core going onto the Microsoft Azure infrastructure. This is a trend that I believe will be limited to control plane functionality, because when it comes to the data plane, to networking-intensive functions, you cannot find the right resources in the public cloud. Public clouds are based on x86 or ARM CPUs and GPUs, but they lack NPUs. So, for the foreseeable future at least, I think the networking-intensive functionality will remain on premises, on a cluster, a private cloud you can call it, but still, it will remain on premises.
[Run Almog]: Just to kind of comment on this. When you’re saying that these sessions can run in the cloud, the question is, where physically is this cloud located? And with edge compute taking cloud closer and closer to the network, we’re getting to the point where the cloud and the network are co-located.
[Ethan Banks]: Yeah, where it’s not as if you’re using public cloud, some AWS or Azure resource that’s whatever amount of latency away, and then asking it to do that work, yeah.
[Run Almog]: Exactly
[Ethan Banks]: But the point is, with disaggregation you can run it anywhere, and it could be, as you say, on a cloud that is in-house pretty much. We’ve got to call it edge; we used to just call it a data center, now we’ve got to call it edge computing. But yeah, you can put it right inside and latency is no longer an issue. It’s just architecturally whatever is nice or convenient for you to do, if you’ve got a cluster of compute there to take advantage of.
[Run Almog]: Exactly. So, one thing is this location item, which we just covered; the other is that the benefit of running in the cloud is flexibility, because you’re basing your entire functionality on software. Software is as flexible as it gets. So adding new functionality or making modifications is all software-based and therefore a lot more flexible than in existing networks.
[Ethan Banks]: So, you said containers are where all the processes are running, they’re living in a container. Does that mean I have a Kubernetes cluster sitting off to the side, I can throw my DriveNets control plane onto a Kubernetes cluster? Is that a thing?
[Run Almog]: Almost. There is some implementation that we have done internally, it’s not all Kubernetes-based. We’re adding more functionality around this area continuously. The thing is that when we started, Kubernetes was not there, so there’s a lot of implementation which we’ve done in-house. Kubernetes is also not very network-oriented by its capabilities. So we kind of compensated for the gaps of Kubernetes with our own implementation.
[Ethan Banks]: Going to the orchestration layer then. You mentioned we got northbound and southbound interfaces as one would expect. How do I consume that orchestration layer typically? Is there a UI that you’re providing for me so I do a lot of that interfacing with the DriveNets cluster in that way or is it an API and I kind of need to build my own layer, so that I can interface with orchestration?
[Dudy Cohen]: So, the answer is yes [Laughs]
[Ethan Banks]: Of course [Laughs]
[Dudy Cohen]: So, there are actually multiple UIs that we provide because, when you think of it, while we grouped everything into a single cluster, from the operational perspective you still need to manage the router, you need to manage the firewall, and you need to manage the infrastructure layer. There are multiple UIs that provide different groups in the NOC with different views of the system. They still see the router as a router, they still see the firewall as a firewall, and now they see the infrastructure as a group of white boxes and the hypervisor layer. So, those are views that are provided to the operator, but more than that, if you don’t want to use our views, or if you want to use a system that has a more end-to-end view, we have the northbound interfaces to do it. I mentioned earlier we have NETCONF/YANG, we have gNMI, gRPC. Whatever you need, just pick and choose.
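For a sense of what consuming a NETCONF/YANG northbound interface might look like, here is a minimal sketch using ncclient, a real Python NETCONF library; the host, credentials, and any DriveNets-specific YANG content are placeholders, so treat this as an assumption-laden example rather than vendor documentation:

```python
from ncclient import manager

# Hypothetical orchestrator address and credentials; substitute real values.
with manager.connect(
    host="orchestrator.example.net",
    port=830,
    username="admin",
    password="admin",
    hostkey_verify=False,
) as nc:
    # Pull the running configuration exposed by the northbound NETCONF interface.
    running = nc.get_config(source="running")
    print(running.xml[:500])  # print the first part of the returned XML
```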
[Greg Ferros]: And I think that, because of your architecture, you can easily add a shim in there. So if you need to change the API or adapt it to add functionality or capabilities, it’s really– you’re not restricted to some arcane CPU architecture like MIPS, like a lot of these legacy boxes are. You’re just running standard Linux on, you know, some x86 architecture in your controller, in the DNOS and DNOR pairing. It’s not mystical that you can be flexible to some extent.
[Ethan Banks]: Let’s touch on AT&T a bit more. You’ve mentioned that they’re a marquee customer, they’re someone that you’re using– okay, well, everybody knows AT&T: global provider, huge. What are they actually doing with DriveNets? Did they replace their entire infrastructure, their whole backbone, and everything is now DriveNets? Or is it more of a special use case thing?
[Run Almog]: It’s a special use case, it’s called core. [Laughs]
It’s the core of the network; you’re currently running on DriveNets, you guys being based in the U.S.
[Dudy Cohen]: It’s kind of a niche, but [Laughs]
[Greg Ferros]: [Laughs] So, when you say core, you’re saying actual core transport or is it 5G core or is it a DWDM core? There’s lots and lots of cores in a big telco network like AT&T, let’s drill into that just a little if we can.
[Run Almog]: It’s the very core of the fixed network—
[Dudy Cohen]: The IP core
[Run Almog]: The IP core. Enterprise traffic and residential traffic run over us, and mobile traffic essentially runs over us as well, even though it’s not the mobile part of the implementation. Eventually it all boils down to this backbone, and this is exactly where we are located. This is why they needed that huge capacity of hundreds of terabits per second.
[Ethan Banks]: I’m guessing it wasn’t just about capacity, I mean, yeah, that’s very cool, but it’s also they must be layering on whatever customer services, peering agreements, and this kind of thing they’re pumping through the DriveNets orchestration.
[Run Almog]: Not only that. The most extreme requirement from AT&T was redundancy, or reliability, of the solution. And this is where most of the effort was put in throughout the stages of deployment: several layers of redundancy, recovery from all sorts of failures, multiple failure scenarios. We were really put to the test before we got certified, or qualified, to run at full capacity within their network. In terms of additional functionality, AT&T is also using this technology for peering. That’s known; it’s been publicized as well. Still TBD in terms of other areas in the network, but this is moving forward.
[Ethan Banks]: Let’s drill into the redundancy component for a minute here, because there’s a million standard protocol ways you can do redundancy. You got, I don’t know, fast reroute, for example. Are we talking about just implementing more or less industry standard stuff or is there special DriveNets redundancy magic we should talk about?
[Run Almog]: Keep in mind that the solution needs to be standard, because we are interoperating with practically everything else out there, and that other equipment is running standard protocols. So it’s definitely within the boundaries of the standards, but there is a huge advantage to the fact that you are spread over multiple devices. The failure of a single device is always a very small failure domain, whereas when a chassis collapses, although it’s not very common, when it happens your entire network, or at least half of it, is down. This doesn’t happen in a distributed model, where even a box collapsing has a very minor impact, and because of the inherent redundancy and how we build clusters, we can work around it. So, if you take the recovery rate of fast reroute and apply it to boxes, a box failure is practically something that goes unnoticed in our type of network.
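To put a rough number on the failure-domain point, here is a toy Python comparison; the capacities and box counts are arbitrary examples, not AT&T figures:

```python
def capacity_lost_on_failure(total_tbps, forwarding_boxes):
    """With a monolithic chassis (forwarding_boxes=1), one failure takes the whole node.
    With the same capacity spread over N forwarding boxes, one box failing removes
    roughly 1/N of it, and the cluster keeps running on the rest."""
    return total_tbps / forwarding_boxes

print(capacity_lost_on_failure(192, 1))    # 192.0 Tbps gone: the whole chassis
print(capacity_lost_on_failure(192, 16))   # 12.0 Tbps gone: one box out of sixteen
```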
[Dudy Cohen]: And it’s not only about blast radius or high-availability protocols; it’s also about the inherent fact that we built this software ourselves, from scratch, to serve a purpose. We don’t have the baggage of a huge operating system with many unnecessary functions and features that crash from time to time and cause service failures throughout the system. We have very distributed and focused software, and, knock on wood, this is something that has proven itself at AT&T and other customers. This system works, and it works with extremely high availability.
[Ethan Banks]: Yeah, I wish I didn’t know what you meant about a big monolithic NOS running 5,000 services, most of which I never use, causing my box to crash.
But I know exactly what you mean, yeah.
[Greg Ferros]: I know what you mean, because I once worked in an institution that had a very large chassis-based switch, and the upgrade process meant taking it offline for approximately 22 hours, right? And the way the upgrade process worked was that, because they had delayed the upgrade, and the software needed a custom architecture in the chassis, and there were 18 line cards active, and blah, blah, blah, in the end we were going to have to shut the entire financial operation down for a whole day, and the rollback process was– there was no rollback process.
[Dudy Cohen]: This is crazy. Amazing that business continuity was even a phrase.
[Greg Ferros]: Yeah, and this was a chassis that was sold as a highly available, maximum uptime, maximum stability, maximum reliability– yeah, I have scars, personality scars, mental scars from working with chassis.
I am here for this type of product. Just thinking.
[Run Almog]: This is a common question we face: where is the NOS actually running? Is it running on the NCP, is it running on the fabrics, is it running on the server, or is it flying in the cloud? And the answer is yes, just like Dudy said before: it’s running in all of these places. So, there is no single point where the NOS can collapse.
[Greg Ferros]: Right. The flip side of that is that, I assume, you’ve been able to prove to your customers to date that you can handle that complexity. Because when you distribute the functions around the place, that complexity then becomes a failure point in its own right. But you’ve obviously been able to convince customers that it’s not a real problem, that it’s just a perceived one.
[Dudy Cohen]: Well, it is a real problem, but we solved it with an excellent orchestration system. So, the orchestration does just that. It mitigates the inherent complexity of the disaggregated system and then goes further with automating tasks you are now doing manually, etc. It is a problem, but it is a solved problem.
[Run Almog]: If you want to look at where the magic is, this is where the magic is, right? This is where our patents are located.
[Greg Ferros]: Not in the NOS that you put on the switch, not in the APIs, not in any of that; it’s not some routing protocol that’s got patents on it, right? It’s in the software and the algorithms that drive the downloads.
[Run Almog]: Exactly. Everything towards the outside is completely standard, the inside is where the magic happens.
[Ethan Banks]: What’s on the road map for DriveNets? You guys have been busy building, building, building, so what are you building next?
[Run Almog]: All right. When it comes to services, there is a lot coming; there are multiple services and different companies we’re working with in that direction, and the plan is to put all of this into a kind of marketplace, so it’s easily consumable. A service provider running Network Cloud will simply log into this portal, choose what kind of service it wants, and launch it into its existing network. So, launching a service will be as easy as clicking a button, as opposed to, I don’t know, rolling trucks and running a pilot and potentially failing that pilot, rolling back the equipment purchases, and so on. So, a one-click introduction of a new service. A network API is something that we’re working on with several other vendors. This is something that will need to become a standard, an open API, obviously, that anybody can use, to make this into something more official and more public.
[Ethan Banks]: Good luck with adoption on that, but I hold out hope. I would love to see something like that happen. So, yeah, bravo on that. For the people in the audience that are listening and they’re keen, they want to engage DriveNets, they want to try out disaggregated networking, what’s the process? Where would you send them? How can they find out more info?
[Run Almog]: Well, first off, they can meet us; we are planning to attend in person: NANOG in November, the OCP Summit, Cable-Tec Expo.
[Dudy Cohen]: MWC-LA, Total Telecom in London and many other events that we hope will take place.
[Run Almog]: It seems to be picking up. Hopefully it will actually happen, with Covid limitations and so on. You can actually meet us in these places and, of course, on our website. Follow us on social media; there’s a lot of stuff we’re pushing out all the time: white papers, discussion topics, blogs, videos, and so on. So, we are very active in creating collateral. Perhaps Covid kicked us in that direction, but there is a lot of collateral being created and pushed out.
[Dudy Cohen]: And if you are ready for it, we run remote demos, on-site demos, proofs of concept; you can see it with your own eyes.
[Ethan Banks]: You mentioned some kind of an AT&T event as well, I think.
[Run Almog]: Yeah, AT&T. Part of their deployment and push towards the technology also includes a disaggregation summit event that they are hosting. It’s going to be not just AT&T, not just DriveNets, but other vendors, as well as other service providers. This event is set for September 22nd, which is actually two days before this podcast should air. So, this should be already in the past.
[Ethan Banks]: Great stuff. Well, thank you for joining us today on Heavy Networking. If you’re listening and you want to find out more about DriveNets, they’re all over the socials and so on: DriveNets.com, or search for DriveNets, that’s pretty much their handle just about anywhere, and you can find lots of great information there. And if you call them up and are interested, make sure you let them know you heard about them on Packet Pushers; we would appreciate that. And our thanks to DriveNets for sponsoring today’s episode. Our sponsors keep the lights on, and your hosts here at the Packet Pushers podcast network fed and warm.
Now, maybe you’re listening and you’d like to have access to the Packet Pushers global community of IT network engineers, and if you’d like that, well, you can join our Slack group. It is premium priced at $0 dollars. For your $0 dollars a month, you can chat with other IT folks just like you about your most difficult IT problems, get advice from folks who’ve been there before, and share your own wisdom. And again, premium priced at $0 dollars… That’s a joke, it’s free, right? You’re with me? Okay.
Act now, next month we’re gonna double the price, that’s all at packetpushers.net/slack. Read the rules on that page, just a few of them. Read the rules and then sign up. Last but not least, remember that too much networking would never be enough.