Podcasts | October 25, 2023

DriveNets data center fabric for AI workloads

Packet Pushers Network Break talks about DriveNets announcing a white box offering for building an AI Ethernet fabric to support AI workloads. The white box is built around Broadcom’s Jericho 3-AI ASIC for top of rack or leaf switches and the Broadcom Ramon ASIC for spines. DriveNets can build out a fabric to support 32,000 GPUs on 800G ports and promises lossless Ethernet.

Source: Packet Pushers Network Break 452: Whitebox offering from Drivenets for building a data center fabric for AI workloads

Full Transcript

Speaking of hardware, DriveNets has announced a white box offering for building an AI Ethernet fabric to support AI workloads. The white box is built around Broadcom’s Jericho 3-AI ASIC for your top of rack or leaf switches and the Broadcom Ramon ASIC for spines. The company says it can build out a fabric to support 32,000 GPUs on 800G ports and promises lossless Ethernet. All right, so to understand this, because this is way ahead of everybody else: all the other companies are basically tuning up their buffers to try not to drop Ethernet frames, and making promises that they’re going to have some sort of SDN whenever the Ethernet AI consortium gets its thing together.

Remember Ultra Ethernet? Ultra Ethernet Consortium. Yeah. So what DriveNets is saying is, our existing architecture already does this. And what’s basically happened is some customers took their architecture, did a test on it for AI, and found that it is actually well suited and high performance, far more performant than most other alternatives for AI networking. And remember that AI networking is unique. You’re transferring data between GPUs, which happens constantly, because the GPUs all have to run the same data set. And then once the data is out, they communicate the results with each other, and then the processing continues. So any delay, any loss in the network is extremely damaging to the AI processing. And so the way that DriveNets works is it uses white box switches based around the Broadcom Jericho 3 chipset. And up until now, it’s been creating these massively large Ethernet fabrics for massively scaled up routing, that is, literally thousands of physical ports made out of 1RU boxes to build a non-blocking fabric. Right.
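To put rough numbers on why delay and loss hurt so much here, a minimal Python sketch follows. It assumes a simplified model in which every GPU computes, then exchanges results, and nobody proceeds until the slowest exchange has finished. The function name and all figures are hypothetical and chosen for illustration; none of them come from the episode.

```python
def step_time(compute_ms, transfer_ms_per_gpu):
    """Time for one synchronous step: compute, then wait for the slowest transfer."""
    return compute_ms + max(transfer_ms_per_gpu)


# Hypothetical figures: 8 GPUs, 100 ms of compute, 20 ms per result exchange.
healthy = [20.0] * 8
one_slow = [20.0] * 7 + [60.0]  # a single flow delayed by loss or queuing

print(step_time(100.0, healthy))
print(step_time(100.0, one_slow))
```

In this toy model one delayed flow out of eight stretches the whole step, which is the point being made about synchronized GPU traffic.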

And the way that they do that is they use fixed size cells inside the network fabric. That is, when the Ethernet frames come in, they chunk them up into consistently sized cells. I’m not going to go into the exact technology behind that. Talk to them about that. And that means that all of the queues and the buffers and all of the flow balancing inside of the network is optimal. You don’t have this idea inside of the internal ASIC architecture where you have different sized frames sitting in the buffers. If everything’s the same size, then all of a sudden your QoS and your virtual output queuing and all that sort of stuff gets dramatically simplified. That is a major difference with DriveNets, and they’ve been using that for their routing fabric. So as the Ethernet frames come in, they get turned into these cells. This idea is not new. It’s been around for 40 years. Chassis switches have been doing this for exactly the same reasons. They chunk everything up into fixed sized cells and then they know exactly what’s going to happen, deterministically. They can move stuff across the backplane and not lose frames in the backplane or inside of the chassis itself. And so this is not unusual. What’s unusual is doing this on a disaggregated network.

They say that customers have gone out, come back to them, and said this is at least a 10% improvement in AI processing performance. A 10% improvement in AI processing performance today, Drew, is basically the cost of the entire network. So keep in mind how much an Nvidia GPU cluster costs. If you can get a 10% performance improvement, that would, say, pay for the network.
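The fixed size cell idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 256-byte cell size, the round-robin spraying across links, and the function names are all hypothetical, and the actual Broadcom cell format and load balancing are not described in the episode.

```python
CELL_PAYLOAD = 256  # bytes per cell; hypothetical size chosen for illustration


def segment(frame: bytes, cell_payload: int = CELL_PAYLOAD):
    """Split a frame into (sequence_number, payload) cells, zero-padding the last."""
    cells = []
    for seq, start in enumerate(range(0, len(frame), cell_payload)):
        chunk = frame[start:start + cell_payload]
        cells.append((seq, chunk.ljust(cell_payload, b"\x00")))
    return cells


def spray(cells, num_links: int):
    """Distribute cells round-robin across the fabric links."""
    links = [[] for _ in range(num_links)]
    for i, cell in enumerate(cells):
        links[i % num_links].append(cell)
    return links


def reassemble(links, original_length: int) -> bytes:
    """Collect cells from all links, restore order, and strip the padding."""
    cells = sorted(cell for link in links for cell in link)
    return b"".join(payload for _, payload in cells)[:original_length]


frame = bytes(i % 251 for i in range(1500))  # stand-in for a 1500-byte frame
cells = segment(frame)
links = spray(cells, num_links=4)
print(len(cells), [len(link) for link in links])
print(reassemble(links, len(frame)) == frame)
```

Because every cell is the same size, each link carries a predictable share of the load regardless of how large the original frame was, which is the property that simplifies queuing and buffering.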

So this sort of response was enough for them to actually go out and commission an independent testing lab. And that’s what the press release is saying: we commissioned an independent testing lab. Which of course I’m always dubious about, because it’s always possible to get an independent testing lab to say exactly what you want them to say. Yes, but here’s some data for you. If you’re looking at AI networking, maybe this is a path to add to your list. It’s got a credible story. There’s a good storyline behind it, and they’re doing something innovative, something new that other companies aren’t doing. Well, it’s my understanding that the cell division capability is actually a feature in the Jericho 3-AI chipset itself. I don’t know that it’s something DriveNets is doing; it’s a feature in the Jericho chipset, the 3-AI. But as far as I know they’re the only company using it today. Could be. I think companies like Google have been doing this with their internal architecture. They moved to a cell oriented approach a while ago. Their Aquila: if you do a search for Google Aquila data center, you’ll find their white paper. And this is a step in that direction, or on a similar track.

One thing to note, I almost forgot: DriveNets is talking about what happens in the event of a network collapse. This is what happens when you’re doing mass data transfers at the same time, and you get some sort of AI data spike, or you get an incast condition where everybody’s transferring data to a small group of ports and they overload. Apparently their customers are saying that DriveNets’ performance is up to 35% better than their competitors in this situation. Because of this cell based architecture, the fabric recovers 35% better than competitors. And that’s very important, because once you overload your AI fabric, recovering from that overload situation is just as important. Sometimes you can actually have catastrophic failures where the whole fabric seizes. I’ve heard. So yeah, interesting.

Yeah. There’s a link in the show notes. I wrote an article about the Jericho 3-AI and some of the interesting things they’re doing. That article also has a link to a Tech Field Day presentation from Broadcom about the stuff they’re doing with this AI chipset. So if you are interested, it is some pretty unique stuff. I would recommend checking it out. I would imagine that we’ll see Cisco and Juniper and Arista heading down this path. Oh, absolutely, yeah, for sure. Yeah, 100%. There’s no reason that an Arista, which is already a merchant silicon company anyway, wouldn’t take advantage of these features in Broadcom. I’d say Cisco ACI would have a much harder time, I think. Anyway, lots of links in the show notes. We’ll move on.
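For anyone unfamiliar with incast, the overload mentioned above can be shown with a toy queue model in Python. It assumes a single egress port with a finite buffer and several senders bursting at it at once. It does not model DriveNets’ cell based scheduling or the 35% recovery figure, and every name and number in it is hypothetical.

```python
def incast_drops(senders, burst_bytes, drain_bytes_per_tick, buffer_bytes, ticks):
    """Simulate one egress port; each sender offers burst_bytes on every tick.

    Returns (total_bytes_dropped, bytes_left_in_queue).
    """
    queue = 0
    dropped = 0
    for _ in range(ticks):
        arriving = senders * burst_bytes
        space = buffer_bytes - queue
        accepted = min(arriving, space)
        dropped += arriving - accepted   # whatever does not fit is lost
        queue += accepted
        queue -= min(queue, drain_bytes_per_tick)  # the port drains at line rate
    return dropped, queue


# Hypothetical figures: a port that drains 1000 bytes per tick, 16000-byte buffer.
print(incast_drops(1, 1000, 1000, 16000, ticks=4))  # one sender at line rate
print(incast_drops(8, 1000, 1000, 16000, ticks=4))  # eight senders converge
```

With one sender the queue stays empty; with eight the buffer fills within a few ticks and the port starts dropping, and the backlog left in the queue is what the fabric then has to recover from.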