How DriveNets leverages Ethernet fabrics for AI Networking
Packet Pushers Heavy Networking explores how DriveNets optimizes Ethernet for AI networking. To run AI workloads, a network needs thousands of GPUs and those GPUs must operate in sync. If there is congestion or dropped frames, very expensive efforts could be delayed or disrupted. While there are advantages to using Ethernet for AI networking (including engineers well-trained in the protocol and a robust ecosystem), it wasn’t designed to be lossless. Sponsor DriveNets’ fabric control mechanism puts Ethernet in play. A scheduled Ethernet fabric eliminates the potential for congestion, and employs other techniques to ensure network traffic is balanced across the fabric.
Full Transcript
Welcome to Heavy Networking, the flagship podcast from the Packet Pushers. Discover all of our podcasts for IT engineers at packetpushers.net. Elect your favorites and broadcast them to your ears with high priority each and every week. I am Ethan Banks with co-host Drew Conry-Murray. Connect with us on LinkedIn, where we post about briefings we’ve taken, covering vendor news, podcasts we’ve released, industry events we’re attending, and other nerdy stuff for your education and amusement.
On today’s show, we discuss building an Ethernet network for AI computing stacks with sponsor DriveNets. DriveNets is a disaggregated networking vendor: run their NOS on a white box switch, do it at scale, and do it cost effectively. The DriveNets product set serves several networking use cases, including our AI use case today.
And to help us understand the unique requirements of AI computing are our guests from DriveNets. Run Almog is the head of product strategy and Yuval Moshe is the SVP of products and head of AI/ML infrastructure solutions. And Yuval, the first question goes to you.
Could you explain the AI computing problem to us? Because we keep hearing about this, not just, you know, AI is cool, but that if you’re trying to do AI on your network, it’s a thing you gotta pay attention to as a network engineer. What’s different about AI workloads that network engineers need to be aware of?
The interesting part about networking in AI is that it’s not just speeds and feeds, and maybe low latency, like we’re used to from the age of storage and compute. You as a network engineer really need to know the AI language, the application that runs on top. You have to understand the iterations within the communications between the GPUs. You have to understand the internal connectivity within the server between the different GPUs, whether it’s a proprietary interconnect or an open one like PCIe or generic Ethernet. You have to understand GPU-to-CPU connectivity. You have to understand the NIC that is involved, NIC versus SmartNIC and so on. Only then can you start talking about communications and networking between the GPUs, whether that’s Clos networking, InfiniBand, or any other type of Ethernet. So it’s much more complicated to understand the use case that really runs on top of those networks.

So when we’re dealing with AI workloads, I know part of the challenge is the fact that it’s a cluster, a GPU cluster, with several different systems working on the same math problem.
Can you elaborate on that and explain the challenge?
Let’s take a model as an example. Maybe one of the large ones, like a Llama model, that has several billion parameters it needs to run the compute on. So you need to parallelize the model across several GPUs. And then you have to do the same for the data if you have a very large data set. So now you have to make sure there is communication between all of those different GPUs to calculate the weights of the model. And you also have to make sure the data set is really spread across all of those GPUs, and that they all run within the same iteration before they do the next calculation or go up the stack within their language model. So you have to make sure that every time you run part of the model, all of the GPUs are really connected, talking to each other, communicating at very high bandwidth, what we used to call bisectional bandwidth, across the entire infrastructure. So imagine connecting 16,000 or 32,000 GPUs within a single network, which is a very big network, and all of the GPUs have to support full speed, 100% traffic, at 400 gig or 800 gig. And once that’s done, you just move to the next step within your model. So you have to make sure the network keeps running 100% of the time. Because if you don’t, and you don’t have high performance or utilization of your network, you’re pretty much losing money. The GPU is most of the cost of the infrastructure, let’s say around 85% to 90%. If you’re not running it at 100% because your network is not fully optimized, then you’re pretty much losing money on your investment.

So the idea is that these AI workloads aren’t really tolerant of delay. So anything that introduces delay essentially affects what they call the job completion time, meaning how quickly you can get that job done.

Yes, exactly. And it propagates from one to the other. So if you have one GPU that is lagging, or there is some jitter introduced on the line, then all of the other GPUs are going to suffer for it, because as part of the model calculation they are actually waiting to get an answer from that specific GPU. So the entire GPU cluster is waiting for one to answer. You cannot tolerate a failure in one of the GPUs, unlike in standard compute or storage, where you can hand off or apply some back pressure on the user’s request to access that resource in the server. In the GPU and AI use case, that’s impossible. You just slow down the entire job completion time of the model, and the training activity is going to take longer to complete. Keep in mind that there is a scheduler involved, and the scheduler assumes that everything works as planned. It allocates GPUs according to the job that is currently running, and the assumption is that the network is pitch perfect, meaning that every packet sent into the network arrives at its destination, 100%. Anything outside this perfect situation is a flaw. And that flaw affects how the job runs and how many reruns are reprocessed by the platform, by the cluster. All of this is degradation in GPU utilization levels.
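To make that straggler effect concrete, here is a minimal Python sketch of a synchronous training step. It is illustrative only, not DriveNets code, and the timings are invented numbers: the point is simply that the step time of the whole cluster is set by the slowest GPU.

```python
def iteration_time(per_gpu_compute_ms, per_gpu_network_ms):
    """Synchronous training step: every GPU must finish its compute and its
    gradient exchange before anyone starts the next iteration, so the step
    time is governed by the slowest participant."""
    return max(c + n for c, n in zip(per_gpu_compute_ms, per_gpu_network_ms))

num_gpus = 1024
compute = [20.0] * num_gpus          # identical compute work per GPU (ms)
healthy_net = [2.0] * num_gpus       # ideal, loss-free gradient exchange (ms)
congested_net = list(healthy_net)
congested_net[7] = 50.0              # one GPU behind a congested or lossy link

print("ideal step time :", iteration_time(compute, healthy_net), "ms")    # 22.0 ms
print("one slow GPU    :", iteration_time(compute, congested_net), "ms")  # 70.0 ms
# A single delayed gradient exchange stretches the step for the whole cluster,
# which is why job completion time is so sensitive to network jitter and loss.
```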
Okay, so you’re talking about the criticality of the network here, and then we’re talking about Ethernet as the transport of choice, which is like, wait, why would you use Ethernet for this, when every frame’s got to be delivered in a very timely fashion? Because of course we know Ethernet is not built for that. Now you guys have a solution for this, and there are other people working on this challenge, but before we get into those details, we should talk through what people typically deploy when they need to build a network to support these AI workloads.
Traditionally, when it came to data centers, it was all Ethernet, and connectivity was kind of loose. Ethernet, as you said, is indeed lossy, but the system as a whole would be able to compensate for packet loss. When it comes to AI, the system is less tolerant of any of this. I think the example, if I go back, I don’t know, ten years, is high performance computing applications, and that’s essentially the same type of requirement from the network. What we had back then was either being optimistic and running Ethernet, or running something proprietary, starting with Omni-Path from Intel, or InfiniBand from Nvidia, or back then it was Mellanox. Other proprietary solutions came from Cray or QLogic, lots of other options, all within that HPC space. As it evolved into AI, a lot of these options deteriorated or vanished, actually. And what we have today is either InfiniBand, which is very predominant, coming now from Nvidia, or attempts at running this over Ethernet with the add-on of a smart endpoint introduced into the network as something that dictates or manages congestion before it’s created. Kind of a prevention mechanism by a smart endpoint.
Does that mean like a DPU or a SmartNIC that’s optimized for this kind of use case?
Yeah, SmartNIC, SuperNIC, IntelligentNIC or DPU. All sorts of names that essentially mean the same thing. To add to Run’s point, when you look at it from a technical or network perspective, you would deep dive into the details of InfiniBand versus Ethernet, or enhanced Ethernet, or scheduled fabric, and try to diagnose the best technical solution for the technical problem you have. But I think, and this is the major discussion we have with most customers, in the end it’s a business question. You’re paying a lot of money for the GPUs you bought. Now you want to get the best job completion time, the highest performance you can, out of those GPUs. You can buy a proprietary solution from one company that may have really superior performance compared to other solutions, but you know you’ll be tied to that vendor for many, many years. And having just that specific type of GPU and just that specific type of interconnect is going to create a problem in the long term from a supply chain perspective and from a cost perspective, because you’re not just buying the GPUs; you’re buying the GPUs, you’re buying the InfiniBand, you’re buying the NICs, you’re buying the whole lot as one single solution. Whereas other alternatives, like enhanced Ethernet or scheduled fabric, are more open. There are more vendors out there in the market, and you know you can have diversity in your supply chain. So in all honesty, today network engineers need to decide what’s most critical from a business perspective in terms of time to market, openness, and supply chain diversity. And then you have job completion time, performance, failure recovery, and other networking parameters that differ between those technologies.
But the critical question is, what helps the business? What really supports the business for the long run, and not just in ’23 and ’24? So I think that’s the critical question you need to answer. Okay, so what helps the business?
I know from a cost perspective, if you get away from InfiniBand and stay with Ethernet, that’s going to save you some dollars. And you’ve got more engineers out there who understand how Ethernet works and how to operate an Ethernet network. But as we’ve established, there are challenges with Ethernet. Now, you mentioned enhanced Ethernet in passing, and I’m aware of the Ultra Ethernet Consortium.
When you talk about enhanced Ethernet, are you talking about the work that the UEC is doing?
That, and not only that. Enhanced Ethernet is a term coined a couple of years back, and it’s about adding congestion control mechanisms to Ethernet. These are standards that already exist, like priority flow control and congestion notification mechanisms. All of this is with the purpose of turning Ethernet from a lossy technology into something which is lossless. It doesn’t work 100%, but it does improve performance and reduce packet loss in the network. This is what’s known as enhanced Ethernet. What the Ultra Ethernet Consortium is doing is trying to define a completely new standard, a new transport layer, that is supposed to manage communication between all the endpoints in a way that keeps the whole network lossless. The whole network prevents congestion before it even happens. That is the bigger aspiration of the Ultra Ethernet Consortium, and there is a large group of companies involved. I do assume it will eventually happen; it’s just a matter of how long it will take. Our expectation is to see something available, and perhaps even in deployment, within a couple of years, which is record breaking when it comes to defining new standards. But from the perspective of the AI world as it is today, or as it is rampaging today, two years is an eternity. So we’re somewhere in between these timelines.
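As a rough illustration of the "enhanced Ethernet" knobs mentioned here, below is a hedged Python sketch of the two classic mechanisms: ECN marking when an egress queue crosses a threshold, and a PFC pause when it keeps growing. The queue model and thresholds are invented for illustration and are not taken from any standard or product.

```python
class SwitchQueue:
    """Toy egress queue with ECN-marking and PFC-pause thresholds.
    Numbers are illustrative only; real devices express these in
    bytes or cells, per traffic class."""
    def __init__(self, ecn_threshold=50, pfc_threshold=90, capacity=100):
        self.depth = 0
        self.ecn_threshold = ecn_threshold
        self.pfc_threshold = pfc_threshold
        self.capacity = capacity

    def enqueue(self):
        if self.depth >= self.capacity:
            return "drop"                    # plain Ethernet behaviour: tail drop
        self.depth += 1
        if self.depth >= self.pfc_threshold:
            return "pause_upstream"          # PFC: tell the upstream port to stop sending
        if self.depth >= self.ecn_threshold:
            return "ecn_marked"              # ECN: mark the packet so the endpoint slows down
        return "forwarded"

q = SwitchQueue()
events = [q.enqueue() for _ in range(95)]
print(events[45], events[60], events[92])    # forwarded ecn_marked pause_upstream
```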
Well, and to your point then, if we’re waiting two years, or potentially more, for the Ultra Ethernet Consortium to come up with new standards, you have a solution that works now. So why don’t we dive into what the DriveNets solution for AI networking is all about?
You’ve got something called the distributed disaggregated chassis, and that’s not a proprietary DriveNets thing. The DDC is an Open Compute thing that was contributed to the Open Compute Project by AT&T, if I remember right. Tell us more about that architecture.

DDC was indeed defined by AT&T and contributed to the OCP. It’s about seven or eight years old. It’s not a new technology that we evolved just now for the purpose of AI networking; it’s something that has been in deployment for several years. It essentially mimics a chassis distributed, or broken, into its subcomponents, with every subcomponent acting as a standalone device. What this gives us is the flexibility to build essentially any size of chassis that we want, because there is no metal enclosure surrounding the components that limits the size, the power consumption, or the heat distribution. Every physical limitation is cast away. What we are left with is the performance of a single standalone chassis; it’s just that it’s comprised of multiple dedicated elements. When you take this and try to align it with an AI workload, the optimal configuration you would think to have would be to connect all the GPUs to the same device. Logically, this is what we are doing with DDC, because it is the same logical device. All these components are not building a network; they are building a scheduled fabric. An example of a scheduled fabric is a chassis. Another example of a scheduled fabric, only much, much larger, is DDC.

Okay, so let’s take a step back here. If we think about a chassis switch, a lot of engineers have worked with chassis switches that have line cards, and we’re saying, okay, there’s no chassis, but it’s going to function like a chassis.
So our line cards are scattered in racks around the data center, and we’ve got some kind of fabric mechanism that everything’s interconnected to. This isn’t just a Clos topology. It’s not just leaf-spine, because we’ve got Ethernet-facing ports and we’ve also got fabric ports. Is that right?
Yeah, this is correct. It is a Clos topology, just as you said. Interfaces towards the outside world are standard Ethernet interfaces; interfaces towards the internals of the DDC are fabric interfaces. If I were to compare this to a chassis, those would be the backplane of the chassis. You don’t really see it in a chassis; it’s under the hood, behind the scenes. In a DDC, these are standard QSFP interfaces running at 400 or 800 gig. Obviously you can monitor these interfaces, because they are real physical interfaces, but from a performance standpoint they behave just like the fabric of a chassis.

Yeah. The key point is that today, between a line card and a fabric within a chassis, there is internal connectivity that you don’t see as the user or the operator. But in the case of a DDC or a scheduled fabric, this is the first time you really expose those to the user, because there are actual physical interfaces on the leaf and spine components of the DDC. So now you can monitor them, and you can connect all kinds of optical connectivity, 400 gig or 800 gig. You’re pretty much using the same technology, but these are standalone boxes, very similar to what you have in a Clos formation in the cloud: leafs and spines with uplinks and downlinks. Just in this specific case, the uplinks from the leaf to the spine, or the other way around, are fabric, cell-based connectivity and not Ethernet-based connectivity, which is the building block for the end-to-end scheduling across that infrastructure. That’s the main thing there. Instead of a network of BGP across a multitude of leafs and spines, the connectivity between leafs and spines is actually cell based. And that’s the basis of the whole solution.

So it’s very much like the backplane of a switch, a chassis switch, and the fabric modules that would be in there. It feels very much like that. Only again, we’re not contained within a chassis.
Do I have to worry about oversubscription?
Same way as you have it in a chassis. You don’t assume you have oversubscription in a chassis. When you take a chassis, put it into a performance test, and inject full line-rate traffic on all interfaces, you don’t expect traffic to drop. Right? That’s how a chassis behaves. It’s the same here. You can inject traffic through all the interfaces, and as long as you’re not deliberately injecting more traffic than a single interface can tolerate, it will simply not lose any traffic, just like one network element does.
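A quick back-of-envelope check of what "no oversubscription" means here, in a hedged Python sketch. The port counts and speeds are illustrative numbers, not DDC specifications; the only point is that the fabric-side capacity of a leaf must be at least its Ethernet-side capacity.

```python
def is_nonblocking(num_ports, port_gbps, fabric_links, fabric_gbps):
    """A distributed fabric behaves like a non-blocking chassis only if each
    leaf's fabric-side capacity is at least its front-panel (Ethernet-side)
    capacity. Values passed in below are illustrative only."""
    front_panel = num_ports * port_gbps
    fabric_side = fabric_links * fabric_gbps
    return fabric_side >= front_panel, front_panel, fabric_side

ok, fp, fb = is_nonblocking(num_ports=18, port_gbps=800,
                            fabric_links=20, fabric_gbps=800)
print(f"front-panel {fp/1000:.1f} Tbps, fabric {fb/1000:.1f} Tbps, non-blocking: {ok}")
# Traffic can still be dropped if several senders deliberately target one egress
# port at more than its line rate -- the same limit a physical chassis has.
```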
So we’re going to get into the notion of scheduled fabric and cells and so on, which you referenced. But before we do that, can you describe the DriveNets DDC architecture you’ve mentioned? Leaf-spine: should that be the physical model I’m thinking of? And how am I connecting my GPUs in a rack, and how am I connecting GPUs between racks? Is it that leaf-spine model?
Yeah, exactly. There are two types of boxes. One is what we call the NCP, the Network Cloud Packet Forwarder. That acts like a line card, if I compare it to a chassis. The other element is the NCF, the Network Cloud Fabric. That’s the equivalent of the fabric of the chassis. These two elements can be located in your racks however you prefer. Typically, what we recommend, or what we get from customers, is to position the NCP as a top-of-rack device, so it’s actually connected to the GPUs, to the servers, and then all the fabric elements are located in some remote rack. Connectivity can span up to 200 meters between these two elements, so you can position the NCF devices at a certain location and just wire them with fiber from the NCPs, which are scattered throughout your data center, onto that one fabric device, or one group of fabric elements, which aggregates all the traffic.
So is that NCF acting kind of like a controller?
So the NCP is pretty much like the top of rack, or leaf, with the GPU connectivity. The NCF is pretty much like a spine; it just has connectivity between all the leaves within that single system.

Got it. As we were saying earlier, Ethernet’s not designed to be lossless or to deliver frames on a scheduled basis. That’s not part of what it does.
So what techniques are you guys using to optimize Ethernet for AI?
Because again, going back to our original problem, we’re trying to deliver frames in a timely, guaranteed way. So there’s magic here. You’ve mentioned scheduled fabric, and I think VOQs came up, and it sounded like, I don’t think you used the words credit-based forwarding, but that’s what I was thinking of. I was reminded of SAN architectures, Fibre Channel, and so on. So walk us through the magic that is making Ethernet work for us here.

So let’s first of all start by talking about VOQ. Virtual output queuing is not new. It’s been there for many, many years and has been used, obviously, in chassis on various incumbent silicon. VOQs are pretty much the building block for assigning queues across your chassis, or across your internal network. You could say they allow you to allocate resources for each one, in the AI case for each one of the GPUs. So if a GPU wants to communicate with other GPUs within that single system, it allocates a queue pair for that connectivity. VOQ helps us build an end-to-end system with all the VOQs predefined in advance. That helps us avoid situations of incast congestion at the end, where you suddenly send traffic towards a remote leaf or remote spine, you get congestion at that endpoint, and then you have to notify the sender to hold off for a second because there isn’t enough room in the buffer. In an end-to-end VOQ system you don’t have that issue, because the resources are pre-allocated. When you send a packet towards the remote GPU, you know with a 100% guarantee that the traffic is going to reach that GPU. That obviously helps with job completion time and performance, because the more bandwidth you can send across your infrastructure, the better your job completion time is going to be.
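A minimal sketch of the virtual-output-queue idea just described: the ingress keeps a separate queue per egress destination, so a backed-up destination only delays its own traffic, and packets are released toward an egress only when it has room. This is a conceptual Python illustration with invented names, not the actual chipset behaviour.

```python
from collections import defaultdict, deque

class IngressWithVOQ:
    """One queue per egress destination instead of a single FIFO.
    Conceptual model only -- real VOQs live in the forwarding ASIC."""
    def __init__(self):
        self.voqs = defaultdict(deque)

    def enqueue(self, egress_port, packet):
        self.voqs[egress_port].append(packet)

    def service(self, egress_has_room):
        """Send at most one packet to each destination that currently has room.
        A congested destination stalls only its own VOQ, so there is no
        head-of-line blocking of traffic headed elsewhere."""
        sent = []
        for port, queue in self.voqs.items():
            if queue and egress_has_room(port):
                sent.append((port, queue.popleft()))
        return sent

ingress = IngressWithVOQ()
ingress.enqueue("leaf7/port2", "grad-chunk-A")   # destination currently congested
ingress.enqueue("leaf3/port1", "grad-chunk-B")   # destination has room
print(ingress.service(lambda port: port != "leaf7/port2"))
# -> [('leaf3/port1', 'grad-chunk-B')]; chunk A waits at the ingress instead of
#    piling up at the congested egress and causing incast drops.
```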
Now, as a network engineer, do I have to design those VOQs and set buffers?
Because I’m having these horrible nightmares about QoS, and I could never get the buffers quite right, and so on. It was terrible.

So you’re absolutely right. When you look at a standard Ethernet Clos network with any type of low-latency switch, you have to tune those buffers, you have to configure each one of those switches along the way, and God forbid you have a problem, you have to understand exactly what happened. But in the case of a scheduled fabric, or end-to-end VOQ, that’s the whole point: you don’t have to do it. The system does it by design, so you don’t have to tune the infrastructure per model. This is something we saw from one of our customers. He has an engineer who, pretty much on a weekly basis, defines 20 parameters depending on the model they’re deploying that same Sunday or Monday, which is ridiculous. That’s a lot of manual work just to fine-tune your infrastructure, and there’s really no way to do it automatically per model. Once you’re using a scheduled fabric system, it does it for you. It already pre-allocates all the resources in your system. You don’t have to worry about the queue buffers, and you don’t have to worry about running out of resources on one of the switches, leafs, or spines in the infrastructure.

There’s one layer we are neglecting to talk about, and that’s the segmentation and reassembly layer. All packets are fragmented into cells as they are injected into the fabric, then regrouped on the receiving side and recreated as packets which are sent to the receiving-end GPU. This makes the fabric indifferent to the workload or type of workload. You mentioned buffer configuration, which is a big headache when it comes to packet-based networks. Indeed, this is a really sophisticated expertise, and when you multiply it by the number of different workloads you have running in the network, it’s a full-time job, not just for one engineer but for a group. We don’t have this situation. Packets, or workloads injected as packets into the DDC, are all fragmented into cells. So whatever runs inside the fabric, whatever the VOQ mechanism and the credit-based mechanism are applied to, is just agnostic cells, and therefore the fabric is completely agnostic to the workload, the type of workload, and the traffic pattern.
But do these cells still get some kind of encapsulation so it knows where it’s supposed to end up or that a node knows where to send it?
Yeah, absolutely. There is some internal overhead involved in fragmenting a packet into cells. This is resolved, or compensated for, by the fact that there is always additional capacity at the fabric layer versus the network interfaces layer. So there is always a kind of n-plus-one mechanism of bandwidth allocation.

Okay, so I just want to draw a diagram in my mind. I’ve got a GPU. It’s sending a workload up into, I guess, a top-of-rack switch. That top-of-rack switch is then doing this fragmentation and sending those cells onto a receiving top-of-rack switch.

It is actually, going back to what we mentioned, being sent to the spines, the NCFs, all of them. Those spines are cell switching, okay? That’s the key point. They’re not Ethernet devices; they’re cell-switching devices. So they do the cell switching and they distribute it, actually send it, to the remote top of rack on the other side. The communication between those two endpoints, those two top of racks, is really the VOQ, because on the egress interface on the remote side you already have resources assigned to you. So you know you can send the traffic towards that remote top of rack. And the fabric switch is really the brain that manages it. There is a sophisticated congestion control mechanism that runs automatically within the system. That’s the key point. If you were to take the same solution and try to implement it with InfiniBand, they have an external controller that does that. If you had to do it with an enhanced Clos or enhanced Ethernet, you would have to have something, maybe on the SmartNIC, that takes care of the congestion control mechanism. On the scheduled fabric, or DDC, that’s the magic behind the communication between the fabric switches and all the leafs connecting to the GPUs.

Okay. And then the receiving switch is doing the reassembly to pass it on to the GPU.

Exactly. That’s where it reassembles from cells back into actual Ethernet packets.
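To see why that cell overhead needs the extra "n plus one" fabric headroom, here is a rough Python calculation. The cell payload and header sizes are invented for illustration; actual cell formats are chipset specific and not stated in the conversation.

```python
def fabric_bandwidth_needed(ethernet_gbps, cell_payload_bytes, cell_header_bytes):
    """Fragmenting packets into fixed-size cells adds a header per cell, so the
    fabric has to carry slightly more than the Ethernet-side rate.
    The cell sizes used below are made up for illustration."""
    overhead_factor = (cell_payload_bytes + cell_header_bytes) / cell_payload_bytes
    return ethernet_gbps * overhead_factor

needed = fabric_bandwidth_needed(ethernet_gbps=800,
                                 cell_payload_bytes=256,
                                 cell_header_bytes=16)
print(f"~{needed:.0f} Gbps of fabric capacity per 800G front-panel port")
# About 6% extra in this toy example -- which is why the fabric layer is
# provisioned with spare capacity relative to the network interfaces.
```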
So then my next question is, why am I doing this if I’ve already got Ethernet frames that are already divided? Why do I want to subdivide those frames into something smaller or into something else?
It’s because Ethernet frames have a header, and in an Ethernet network that header dictates which spine device will accept that specific packet and forward it to the receiving end. This drives an unbalanced utilization level across the spines, and lack of balance eventually results in congestion. What we have in a DDC solution is that the NCP does not select which NCF to send the traffic to. It just sends the traffic, in the form of cells, to all the NCF devices. The traffic is spread across all of the NCF devices in the network and then recollected from all of them on the receiving end. This is how we keep the utilization level exactly the same on all the NCF devices and avoid potential congestion. Obviously the network can become exhausted, but all the NCF devices, all the fabric devices, will become exhausted at the same time.

So instead of ECMP, where I’ve got a hashing algorithm that’s hashing on the Ethernet header and pinning that flow to the same link all the way across, I’m saying I’m going to split this frame up into cells, spread the cells across all the different spine devices that are part of my backbone fabric, and now I’ve got perfectly even distribution across my four links or eight links or whatever I’ve got.

That’s spot on. One of the key parameters of AI workloads, and we haven’t discussed this, is the fact that these are elephant flows. Now try to put elephant flows on all of the GPUs using only ECMP, which has its ups and downs. It’s definitely not equally load balanced, and there are limitations on the number of ECMP paths you can have on a specific type of switch. Once all of your workload is based only on elephant flows, that makes the problem much, much worse. Having something which disassembles the packets into cells and reassembles them helps you completely mitigate that problem, so you don’t have to use standard ECMP. And keep in mind, we’re always talking about maybe a two-stage Clos, a relatively simple solution. But think of very large-scale networks with the latest switches out there in the market. You’re going to reach a three-tier Clos and beyond. Now imagine a three-tier Clos based only on ECMP with no smart congestion control. That’s going to be a nightmare to operate and fine-tune.
So if you have a system that does that for you, and does the end-to-end VOQ and the cell segmentation and reassembly?
That pretty much solves that operational nightmare. This is what’s known as the cross-bisectional bandwidth problem. When you have Ethernet, or anything packet based or affected by flows, the cross-bisectional bandwidth degrades. When you have a DDC solution, the cross-bisectional bandwidth is equal to the bandwidth of your network interfaces, just one to one. Actually a bit more, as I said before, because of the overheads.

Okay. And so, just so I understand, in the little network diagram I’ve got in my head, my top-of-rack switch has divided this frame into cells. Cell A is going to spine one, cell B is going to spine two, and so on.

Yes, exactly. And each one of those cell switches just sends it to the destination top of rack, let’s call it X, and it reassembles everything.

Got it. Okay.

If you compare it to a Clos, you would also often hear talk about reordering. I think that’s exactly where this matters: there’s no reordering needed at the endpoint or at the top of rack. You don’t need a SuperNIC or a DPU to handle the reordering for you. It happens in the fabric itself. So the key point is that all of those functionalities, like VOQ, congestion control, and reordering, are handled in the network layer; they’re not handled on the server or on the GPU. So you don’t need unique hardware for that. You can pretty much support any NIC you have out there, and it doesn’t have to be a special NIC specifically for AI, because the network handles it for you.
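To illustrate the load-balancing difference described above, here is a toy Python comparison between hashing whole elephant flows onto spine uplinks, ECMP style, and spraying fixed-size cells across all spines. The flow counts, cell counts, and the PRNG standing in for a flow hash are all illustrative assumptions, not measured data.

```python
import random
from collections import Counter

random.seed(1)
num_spines = 8
elephant_flows = 6          # a handful of huge, long-lived gradient flows
cells_per_flow = 100_000

# ECMP-style: a per-flow hash pins each flow to one spine uplink for its lifetime.
# (A seeded PRNG stands in for the 5-tuple hash here.)
ecmp_load = Counter()
for _ in range(elephant_flows):
    ecmp_load[random.randrange(num_spines)] += cells_per_flow

# Cell spraying: every cell of every flow is distributed across all spines.
spray_load = Counter(cell % num_spines
                     for cell in range(elephant_flows * cells_per_flow))

print("ECMP  busiest / idlest spine:", max(ecmp_load.values()), "/",
      min(ecmp_load.get(s, 0) for s in range(num_spines)))
print("Spray busiest / idlest spine:", max(spray_load.values()), "/",
      min(spray_load.values()))
# With only a few elephant flows, flow hashing can pile several onto one uplink
# while other uplinks sit idle; per-cell spraying keeps every spine equally
# loaded, which is the "all fabric devices exhaust at the same time" property.
```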
So I think this notion of scheduled fabric also has something called credits, credit based forwarding. Can we get into how that works and what role it plays?
Yeah. As traffic comes in, there is an internal signaling mechanism between the endpoints of the fabric. Indications are signaled from the receiving end as to the capacity, the amount of bandwidth, it can absorb. Then, and only then, is traffic sent onto the fabric layer, the spine layer of the DDC. Up until that point, traffic is held at the ingress and not injected into the network. These mechanisms are hardware based and active all the time, so endpoints are always aware of what’s going on at the farthest end, the receiving side of a packet.
And when you say an endpoint, are you talking about the GPU, the NIC, or the top-of-rack switch? What constitutes the endpoint?
In this case, it’s the network interface on the receiving NCP. Okay. This mechanism is active within the bounds of the DDC. It doesn’t extend to the GPU, or involve the SmartNIC, NIC, or any other server endpoint. Okay.
And in my mental model, the NCP I’m thinking of is like a top-of-rack switch. So the top-of-rack switch is a receiver, and it’s essentially controlling the rate at which traffic is coming into the fabric from the other end, from the ingress. Is that right?
Yes. It’s very similar to how a flow control mechanism would work. You know how much traffic you can absorb within your buffers, and you have to communicate back to the senders: hold off for a second, I don’t have enough room in my buffer. That entire logic exists within the networking infrastructure, between the top of rack and the fabric switches.

Okay, so I’m a receiver and I’m essentially signaling back to an ingress switch: wait, I’m not ready to receive traffic. But when I am ready, I send you, I guess, a credit that says you are now allowed to send to me.

Yes, exactly. And that credit pretty much tells you that you can now send traffic on that specific virtual queue. So you have visibility of all the egress queues in the system, on all the other top of racks, and once you have credits, you can send traffic. If there are none, you will not send any traffic. That pretty much eliminates the incast congestion scenario you usually see in a standard Ethernet network, where a packet is sent and you only see the issue on the remote side, because suddenly the buffers there fill up. Then you start seeing ECN or all kinds of other mechanisms that try to mitigate the problem after it has happened. That’s the key point: once there is a problem, there are systems and ways to try to optimize it later on, but the whole trick is to avoid it to begin with.

Right. The whole idea is that we want to make sure we don’t need to retransmit.

Exactly.

Yeah. It’s a funny mix of standard things and some not-so-standard things going on here to make all of this work. Things like ECN and congestion notification have been around forever, while breaking things into cells, and the way we’re handling VOQs and some of this other stuff, is a bit unusual. I think a lot of network engineers would not have run into this before, or it was happening inside their chassis switches and they just didn’t know.

Exactly, I think it’s the latter. It has existed for many, many years in every incumbent chassis; it was just never distributed. It was always just a single switch at the hyperscalers or in cloud infrastructure, and there was no communication between all the other switches, just pure routing. But if you look at a chassis from 15 years ago, it was the same architecture. Nothing has really changed.
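A minimal sketch of the credit loop just described, with the receiving side granting credits only while it has buffer space and the ingress holding traffic until a credit arrives. This is a conceptual Python illustration, not the actual hardware signaling.

```python
from collections import deque

class EgressPort:
    """Receiving side: grants credits only while it has buffer space."""
    def __init__(self, buffer_slots=2):
        self.free_slots = buffer_slots

    def grant_credit(self):
        if self.free_slots > 0:
            self.free_slots -= 1
            return True          # "you may send one unit towards this queue"
        return False             # no room -> sender keeps holding traffic

    def drain(self):
        self.free_slots += 1     # data handed to the GPU's NIC, slot freed


class IngressPort:
    """Sending side: traffic waits at the ingress until a credit is granted,
    so congestion never builds up inside the fabric or at the far end."""
    def __init__(self):
        self.pending = deque()

    def send(self, packet, egress):
        self.pending.append(packet)
        delivered = []
        while self.pending and egress.grant_credit():
            delivered.append(self.pending.popleft())
        return delivered


egress = EgressPort(buffer_slots=2)
ingress = IngressPort()
print(ingress.send("cells-1", egress))   # ['cells-1']  credit available
print(ingress.send("cells-2", egress))   # ['cells-2']  credit available
print(ingress.send("cells-3", egress))   # []           held at ingress, no credit
egress.drain()                           # receiver frees a buffer slot
print(ingress.send("cells-4", egress))   # ['cells-3']  oldest pending data goes first
```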
Well, talk to us about Ethernet chipsets. You guys don’t make hardware; you’re an operating system manufacturer. So what chipsets, what kind of switches, would I be running this on? Who do you ride with, as a software company?
We’re long-time partners with Broadcom. We’ve been running on top of their Jericho2 and Jericho2C+ for many years, and lately the Jericho3-AI and Ramon3, which are chips suited specifically for AI. They support 800 gig on day one, and we already have it out in the market. The types of platforms we have support from 30 terabits to 50 terabits to 100 terabits in a single system. So we have very large spines that can support a very large radix of leafs, or top of racks, within a single system. Using those building blocks and the Broadcom chipset, we can reach 32,000 GPUs at 800 gig. If you want to reference the latest Blackwell, we can support that: pretty much 32,000 of the latest B100s in a single system, a single network. Again, it’s all end-to-end scheduled. It’s not like having a three-tier Clos with hundreds of switches; this is a single system that can handle congestion control internally, on the network, at very large scale.

And to break that out, my understanding is it’s the Jericho3 chipset that understands the credit-based mechanism, understands VOQs, and essentially builds a scheduled fabric, and it’s the Ramons that understand how to very quickly switch or route those cells being sent into them.

Exactly. The Jericho3 sits at the top of rack and the Ramon sits at the fabric switch.

Okay, so this capability, this scheduled fabric, is essentially coming out of the Broadcom Jericho3-AI chipset.

Yes, it comes out of the Jericho, exactly. What we have, as a software company, is the solution that really builds the distributed system across all of those platforms. If you were to take a simple Clos switch, it’s just a single standalone platform. But in our case, we have to make sure that the entire system communicates internally and behaves as if it were a single, very large chassis. So we need software that runs on top, manages the platforms, manages the internal communication, and makes sure that the user, from a user experience perspective, doesn’t need to deal with any of the tuning or communication within the system. That’s what we do; that’s where our software comes in. And, as we said at the beginning, DDC is an OCP standard. The same components can be run by alternative software as well. We’re not the only game in town. A scheduled fabric is a well-known solution in the industry today. It’s been discussed by Cisco, as an example, and by Broadcom themselves, with more than one software solution as an option.
So other solutions do exist and the technology is bigger than what DriveNets is doing specifically. So when I’m getting a DDC solution from you, am I bringing my own hardware or are you building off sort of white box and providing that hardware to me?
So actually both. There are customers, especially in the AI market, that want to buy a single solution with hardware and software, full blown, already tuned. They want everything ready and deployed, because what’s important to them, from a business perspective, is time to market. That’s one type of customer. If you’re talking about the tier ones, the hyperscalers, they’re all looking into maybe even building their own hardware, their own white boxes, and just buying the software from us. So it depends on the customer. Usually the big ones want to build it themselves and maybe purchase the software to begin with. I mean, we all know Microsoft and Meta have their initiatives for building software for the long run, and it’s been like that for many years. The big cloud providers want to control their own destiny from an infrastructure perspective.

But I can just come up to DriveNets and say, give me the whole thing, hardware included?

Definitely. A lot of customers, most of the customers actually, are going in that direction today, simply because of time to market. They don’t necessarily want to build up supply chain relationships and so on and try to ramp everything from scratch; they won’t have a solution up and ready in a month. I mean, the timelines are really hilarious. The request is to get a system of 20,000 GPUs, and it’s let’s do it next week, I’m going to set up my racks, just ship it over. So timelines in AI are very different from what we’re used to. Try to think about the alternative. When you’re going into a deployment of AI infrastructure, you can get the whole solution from a one-stop shop called NVIDIA. That is exactly what customers are trying to move away from now. They don’t have the expertise to take every component as a standalone and try to assemble the whole thing. That expertise is evolving as we speak, so they don’t have it in house. In that respect, getting the GPUs from vendor A and network components from vendor B is too complicated. So they need one component called a network, and that network needs to work from day one with minimum configuration, minimum manipulation, minimum options, and minimum interaction with various vendors. In that respect, a one-stop shop providing the entirety of the network is exactly what potential customers are seeking.

Okay. And you can provide that to them using an Ethernet-based substrate as opposed to InfiniBand.

Precisely.
You mentioned that the solution can get big. Can you talk about how big? How big have you tested a DDC fabric for AI?
So the solution can reach up to 32,000 GPUs at 800 gig, which is the next generation, pretty much what’s not yet available from NVIDIA, but it’s already ready for 800 gig. We have tested, and I can’t specify the customers, but we’ve tested with several big hyperscalers. Usually the topologies are between 500 and a few thousand GPUs, because keep in mind, they need to set those GPUs aside just for testing, so they’re not really eager to do that. They usually test it for a couple of weeks, maybe a few months, just to see how the models run on top of it. So we’ve tested from a few hundred to several thousand, up to around 4,000 GPUs in a single system. But we’ve also done simulations with Scala. Scala does a simulation of the Broadcom chipset models and runs language models on top of it, and they’ve done a comparison, as an example, between standard Ethernet Clos switches, Tomahawk4 and Tomahawk5 if you wish, versus DDC switches, Jericho3-AI and Ramon3, and it definitely showed at least a 10% job completion time improvement compared to standard Ethernet Clos solutions. We did virtual simulations of the models using the chipset, and we’ve done physical POCs at customer sites to really prove out that there is a big difference from a performance perspective. Now keep in mind, a 10% job completion time improvement pretty much pays for the entire network, because the entire network solution is around 10, maybe 15%, of the entire AI solution including the GPUs. So imagine such an improvement in your job completion time; the GPUs just run more of the time versus any other solution. Even 10% is a very big improvement from the customer’s point of view.

So for the fans of InfiniBand out there who are maybe considering this solution, if I want to compare the two, what are the big highlights of why I’d go with the DDC Ethernet solution versus the InfiniBand that’s been around forever, and maybe I trust it, or whatever?

If you, as a big Fortune 500 enterprise, have no problem being, let’s say, strung along by a single company, and you will buy the entire solution from them and be happy with it, I’ll be very honest: it’s a very good product, really. InfiniBand has been there for many, many years. It’s been proof tested at all the cloud providers. So you would say there’s no reason for me not to choose it, right? And let’s assume there’s no supply chain shortage, and you can get all the GPUs and InfiniBand and everything in a single solution. But that does not fit most of the companies out there today, because they do not want to get tied to a single vendor with a proprietary technology. Keep in mind, if you buy InfiniBand, you’re going to stick with InfiniBand for many, many years. It’s not going to be for two years, because you invested tens or maybe hundreds of millions of dollars, or more, on the GPUs, and you bought InfiniBand with them. You want to switch to Ethernet in two years? That’s not going to happen. You’re making a decision now, and it has to be a strategic decision. You have to make sure you’re buying from at least two vendors. That happens a lot, by the way: the hyperscalers are buying both Ethernet and InfiniBand because they want to pull for both technologies. But let’s be honest, essentially they want to make sure they have diversity in their supply chain.
So that’s the big question from the business perspective: make sure you’re not always selecting the single vendor that rules the entire market, because you’re going to stick with it for many, many years. Technology wise, we know that InfiniBand works. It’s been around for what, 20 years. It’s been building supercomputers, the fastest supercomputers in the world. So it definitely works. But there is a learning curve with InfiniBand; getting an InfiniBand network to work is an occupation in itself. You don’t find people everywhere who know how to operate and build an InfiniBand network. They’re harder to find than Ethernet people; you can just throw a stone in Silicon Valley and hit an Ethernet guy. That doesn’t happen with InfiniBand. Now, when you’re running a job with InfiniBand, you need to tune your network to fit that job specifically. Job tuning is something that still needs to be done when it comes to an InfiniBand network. And when you have a deployment which runs multiple jobs, or an infrastructure layer where you don’t even know what kind of workloads you’re going to run on top, and over time those workloads are going to change, it becomes an endless job of tuning the network accordingly. That’s not to say it will not work; it’s just that it will not perform as well as InfiniBand can perform, and performance is exactly why you put InfiniBand in. So there is a lot of tuning and a lot of handling when it comes to an InfiniBand network, versus DDC, which is essentially Ethernet. Everybody knows how to operate and handle the Ethernet side of things, and anything to do with handling specific jobs or tuning the network, as we said before, is not part of what you have to do. It’s inherently done by the internal fabric.
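Two quick back-of-envelope numbers behind the scale and job-completion-time claims quoted earlier in the conversation. Only the headline figures (32,000 GPUs, 800 gig, ~10-15% network cost share, ~85-90% GPU cost share, 10% JCT improvement) come from the discussion; the rounding and the normalised cost units in this Python sketch are illustrative assumptions.

```python
# Aggregate front-panel bandwidth of the largest configuration mentioned:
gpus = 32_000
gbps_per_gpu = 800
total_pbps = gpus * gbps_per_gpu / 1_000_000
print(f"{gpus:,} GPUs x {gbps_per_gpu}G = {total_pbps:.1f} Pbps of GPU-facing bandwidth")

# Why "10% better job completion time pretty much pays for the network":
cluster_cost = 100.0        # normalised total; GPUs are ~85-90% of this, per the discussion
network_share = 0.10        # network is ~10-15% of the total
jct_improvement = 0.10      # 10% faster job completion time
# 10% faster completion is roughly 10% more useful work from the same GPUs,
# which on a GPU-dominated budget is in the same ballpark as the network cost.
extra_value = cluster_cost * 0.875 * jct_improvement   # using ~87.5% GPU cost share
print(f"network cost ~{cluster_cost * network_share:.0f}, "
      f"value of 10% faster JCT ~{extra_value:.1f} (same normalised units)")
```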
From a network engineering perspective, as I look at living with a disaggregated network fabric like you’ve described day to day, are there special considerations for me as I manage this thing, or monitor this thing, or need to troubleshoot it? Are there unusual concerns I might have?
I’d say there are unusual advantages. Handling lots of boxes, lots of network devices, is obvious when you’re building a huge data center; there are lots of devices, and it’s been like that for the past two decades, so that’s not news. Clearly, you need to integrate all of the components into a management platform of sorts to handle alarms and so on. That’s a given. The thing is that everything to do with failure detection is hardware based, whereas in a network, any sort of network, it’s going to be protocol based. Anything protocol based relies on software-based timers. Even an outage in the network is going to take something like three iterations to explicitly indicate that there is an outage and then dictate the outcome, how the network is going to handle that specific outage, and so on. Optimistically, that results in, let’s say, 50 milliseconds. Fifty-millisecond recovery is amazing in a network, and when it comes to DDC, we’re talking about microseconds of recovery, because everything is hardware based. There is a keepalive that keeps all of the interfaces between NCP and NCF always on, always active. When something goes wrong, the hardware knows what to do with it before the job experiences any packet loss. All of this results in jobs not needing to reset themselves, retransmit, or pause and roll back to the job’s latest checkpoint, which would eventually waste CPU or GPU cycles.

Let’s draw the network map. If I had a three-tier Clos, for example, with a multitude of Clos switches, the issue is not the standard monitoring. The operational issue comes when there is a failure, and not just from failure recovery times. If you now had to debug an issue across 1,000 switches, you’d have to start tracking your BGP sessions and IPs and traces, trying to figure out exactly what happened to your flow; somewhere along the way there is a bug or a glitch and something happened to your packet. In a scheduled fabric solution there isn’t any of that. There’s only one top of rack on the ingress side and one top of rack on the egress side. Everything else in the middle is cell switched on the fabric. So you don’t have to deep dive into each one of the links across that huge infrastructure. Keep in mind, 32,000 GPUs means tens of thousands of cables in a single network. Imagine managing such a network based on Clos switches. You’re talking about hundreds of switches with hundreds or thousands of BGP sessions, or really any type of routing protocol, that you would have to figure out by yourself: who configured something wrong across that infrastructure, where was the mistake in the script, and why isn’t the flow working right now. I think that’s the big pain point, because other than that, the look and feel is pretty much the same. These are standalone boxes. They have their own northbound APIs, which can be standard ones, NETCONF, gRPC, and so on, so you can do telemetry, monitoring, and provisioning, pretty much like any standard switch out there in the market. But when you have to manage such a large system as a single network, that’s when things start getting very complicated, so it’s easier if it acts and feels like a single system.
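To put the protocol-timer versus hardware-detection point in perspective, here is a rough Python calculation of GPU time stalled during a single link failure event. The recovery times are illustrative, taken from the orders of magnitude mentioned above (tens of milliseconds for an optimistic protocol-based reconvergence, microseconds claimed for hardware-based detection), and the "every GPU stalls for the full duration" simplification is an assumption.

```python
def stalled_gpu_seconds(num_gpus, recovery_seconds):
    """During a synchronous training step, a failover that stalls traffic
    effectively stalls every GPU in the job for roughly that long."""
    return num_gpus * recovery_seconds

gpus = 32_000
protocol_based = stalled_gpu_seconds(gpus, 0.050)      # ~50 ms optimistic reconvergence
hardware_based = stalled_gpu_seconds(gpus, 0.000010)   # ~10 us hardware keepalive detection
print(f"protocol-based: ~{protocol_based:,.0f} GPU-seconds stalled per event")
print(f"hardware-based: ~{hardware_based:,.2f} GPU-seconds stalled per event")
# Roughly 1,600 GPU-seconds versus well under one GPU-second per failure event,
# before counting any job restarts or rollbacks to the last checkpoint.
```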
So here’s a maybe slightly left-field question. We’ve been talking about molding Ethernet to fit the requirements of GPUs. But there are now startups out there building custom silicon for things like LLMs. Have you guys started thinking about whether new kinds of chips, non-GPUs built for AI modeling and training, are going to create new requirements for Ethernet? Or do you think we’ll still be solving essentially the same problems?

A GPU does not interface directly to Ethernet, right?
There is a network interface card somewhere in between. Any of the other accelerators also have a PCIe interface towards a NIC, and there is a NIC that eventually connects to the outside world. That most common interface is still going to be Ethernet. It still is Ethernet. Even those accelerator startups who are trying to create an alternative to the GPU are all building something which is inserted into an existing, standard type of server, or building one of their own, and the interface towards the outside world is still Ethernet. The reason is simple: it’s the economy of scale. They want to enjoy the ecosystem of Ethernet. They don’t want to reinvent the whole car. Reinventing the wheel is one thing; reinventing the entire automotive industry is a whole different ballgame for a startup.

So essentially you’re saying you’re confident that Ethernet as the interface to the network will smooth out any of those complications from a brand new kind of chip.

I think there isn’t a single voice in the industry that says otherwise, including NVIDIA.

That’s interesting. Well, guys from DriveNets, this has been a fantastic, nerdy, deep-dive kind of conversation. I thoroughly enjoyed all the discussion about how we took basically a chassis with a fabric inside and distributed it around the data center. Just a lot of fantastic discussion. Thank you very much for all the details. Now, if people want to learn more about the DriveNets solution, where do they go?

Well, tap into drivenets.com. We have a landing page for AI networking that gives you all the details of what we’ve been discussing.

Thanks very much, guys. Run Almog and Yuval Moshe from DriveNets joined us today. And thanks to you for listening all the way to the end. If you ring up DriveNets to find out more about their distributed disaggregated chassis, because you’ve got to build an AI networking fabric of your very own, be sure to tell them that you heard about it on the Packet Pushers podcast network. I would appreciate that. I’ve been Ethan Banks, along with Drew Conry-Murray. Follow or connect with us on LinkedIn, and to hear more from the Packet Pushers, spend a little time clicking around packetpushers.net. We refreshed the website earlier this year. It’s awesome. We’ve got free newsletters for you, including Human Infrastructure, covering the weekly goings-on in networking and tech, and Packet Capture, the roundup of everything we published on our podcast network, our blogs, and our YouTube channel. And hey, there is also a job board. Jobs.packetpushers.net is a growing list of employment opportunities for networkers. So get in there, look around, and spread the word if you’re looking for work or if you have a position to fill. Jobs.packetpushers.net is a new and growing resource. Thanks again for listening and have a great week, you awesome human. Last but not least, remember that too much networking would never be enough.