Why networking is now crucial for AI
At the SC25 event, DriveNets’ Dudy Cohen (VP Product Marketing) and Andy Holland (Sales Director for AI Infrastructure) explore how Ethernet is expanding its reach beyond traditional scale-out and scale-across applications to challenge InfiniBand in the demanding scale-up domain.
Chapters:
- 0:00 – Intro
- 01:15 – Tipping point between two technologies
- 03:07 – How DriveNets fits into AI infrastructure
- 05:02 – The first RoCE v2 Ethernet deployment to be more performant than InfiniBand
- 06:33 – The benefits of the DriveNets solution
- 09:14 – Around for 10 years and deployed in the AT&T core network
Read the full transcript
I’m Dudy Cohen and I’m the VP of Product Marketing at DriveNets.
I’m Andy Holland. I’m the Sales Director for our AI Infrastructure business.
We are DriveNets, and we are here in St. Louis at SC25, which is basically the largest event in the HPC and AI industry. We are participating in this event with this booth, but we are also part of SCinet, which is basically the network infrastructure for the entire event, and also a very cool testbed for new technologies. In SCinet we actually have two deployments. One is our solution for the service provider market, which is basically a router based on white boxes. The second is our AI infrastructure networking solution. The two solutions are based on the same technologies, but at different sizes and for different use cases, and they are both available here as part of the SCinet network. I think the industry has realized that while compute resources are very important to the performance of AI, both in training and in inference, networking is the crucial part that makes things work, and work well.
Just to top up on that, I think critically I’m hearing from a lot of customers there’s a tipping point right now between two technologies.
Right.
We have InfiniBand on one side, a proprietary system, and we have Ethernet on the other side, an open system. And I think what’s really changed is the idea that if you can create an open system that is performant, you get much more engagement from the community, because inherently people want to deploy an open system.
Absolutely. I think this trend is visible in all layers of communication. What you mentioned, InfiniBand toward Ethernet, is a scale-out technology trend.
100%.
We see it also in scale-up. Until a year ago, scale-up was synonymous with NVLink. Now we see Ethernet-based solutions that are getting close to NVLink’s level of performance. So you can have this open ecosystem, an open standard, and a common basic skillset across all types of networks, scale-up and scale-out. And we also need to mention scale-across.
Sure.
Because scale-across is the technology you go to when you run out of power in the data center. You need to extend your…
Seems to be a common challenge right now.
And scale-across asks for Ethernet, but it also asks for very high performance and low latency. And this is something that we see here as well.
Yeah, I think more than anything it’s the curiosity of the people here. I meet people who are responsible for large data deployment systems, people who are responsible for infrastructure, but the common problem they are all trying to solve is how to integrate stacks of technology into what they want to achieve. And I think that’s the most amazing thing here at the conference: you get these different experiences from people and see how we at DriveNets can play into them.
And it’s a great example, because I personally had a few people come to us with a certain mindset about how infrastructure should be built. And you can see the mindset shift as they realize they have better options in terms of performance, in terms of supply chain, in terms of skillset. So it’s a pivotal moment for the entire industry, I agree with you. So, Andy, there’s a lot of talk about performance, especially from people who come with InfiniBand in mind and say this is the benchmark for performance. And there we have news for them, right?
We do, we do. And it’s become a really interesting conversation. As you know, classic HPC has been around since the 80s, and there is a very stalwart approach to how you do networking. And to my earlier point about proprietary versus open systems, there’s a lot of curiosity right now around whether you need to do classic HPC networking for AI. The answer is no, because the reality is that you can deploy a network that is distributed, that is performant, that is open, and that meets the guidelines for what classic HPC performance actually enables. And I think the most important piece there is that at DriveNets we have figured out a way to use the most common hardware assets and create the most performant network. Since joining, I’ve been very proud to see that through explicit NCCL testing of GPU-to-GPU communication, we were able to show through certain collective tests that we are the most performant. I mean, that’s an amazing thing.
Even more than InfiniBand?
Even more than InfiniBand.
I think we’re actually the first RoCE v2 Ethernet deployment to be more performant than InfiniBand. And I think what that has done is create a ripple through the communities of people building in this space, to your point that there are more options.
Yeah. And those options are not magic. They are new, because two years ago Ethernet was nowhere near InfiniBand in terms of performance, but since then there have been significant, solid developments.
We have the talent. Yeah.
And for DriveNets it’s not new, because we’ve been doing this for almost a decade in the service provider market with the same technology: this fabric-scheduled technology that takes packets, breaks them into cells, and sprays them across the fabric. You can learn about it on our website.
We’ll point everyone to the website, by the way.
So this technology is a super fit for exactly what AI infrastructure needs. You can have fabric scheduling, or you can have endpoint scheduling in the NICs, which is also valid, like the Ultra Ethernet approach. But basically, once you add this extra flavor of scheduling to Ethernet, it suddenly becomes a performant solution that you can use without any performance tax when you move away from InfiniBand.
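The cell-spray idea described above, breaking packets into cells, spraying them evenly across all fabric links, and reassembling them in order on the far side, can be sketched in a few lines. This is a toy illustration only, not DriveNets code; the cell size and link count are arbitrary values chosen for the example:

```python
# Toy sketch of a scheduled fabric: a packet is split into fixed-size cells,
# the cells are sprayed round-robin across all fabric links (instead of
# pinning the whole flow to one link), and the receiver reassembles them
# in order using a sequence number.

CELL_SIZE = 256  # bytes per cell; arbitrary value for illustration

def spray(packet: bytes, num_links: int):
    """Split a packet into cells and assign them round-robin across links."""
    cells = [packet[i:i + CELL_SIZE] for i in range(0, len(packet), CELL_SIZE)]
    links = [[] for _ in range(num_links)]
    for seq, cell in enumerate(cells):
        links[seq % num_links].append((seq, cell))  # seq enables in-order reassembly
    return links

def reassemble(links) -> bytes:
    """Receiver side: merge cells back into the original packet by sequence number."""
    cells = sorted((seq, cell) for link in links for seq, cell in link)
    return b"".join(cell for _, cell in cells)

packet = bytes(range(256)) * 4          # a 1 KiB example packet
links = spray(packet, num_links=4)      # 4 cells spread evenly over 4 fabric links
assert reassemble(links) == packet      # lossless end to end
```

Because every link carries an equal share of every flow, no single link becomes a hotspot, which is the load-balancing property the fabric-scheduled approach relies on.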
So, Dudy, consistently when I’m talking to customers, the first thing they ask me is: great, you can prove performance to me, but what, truly, are the benefits of working with DriveNets?
So first and foremost, performance, because this is the entry ticket to the game. So as we mentioned, we have better performance than InfiniBand.
Great, the deal is done.
But more than that, the fact that we are an open system means you can basically source the hardware, say the optics, from anyone. You’re not locked into a specific vendor. So it has to do with price, with supply chain, with time to first token.
But that’s a great point. Time to first token is inherently important, specifically when you’re building clusters for AI. And that point can’t be overstated: regardless of the technology stack, how you get to first token matters.
Absolutely. So it’s not only a matter of supply chain and lead times, it’s also a matter of how simple it is to build a cluster and how much time you need to spend fine-tuning parameters, which could be a lot. In our case, again from experience, it is very fast. So time to deploy, time to first token, time to revenue are all significantly shortened. We also deal very well with multi-tenant environments. If you have multiple workloads or multiple tenants, if you are a NeoCloud, for instance, we can inherently separate the workloads from each other. So no noisy-neighbor phenomenon, and much better performance for all of the workloads and tenants. Also multi-site, the scale-across I mentioned.
We have a great deployment doing that. Oh yeah.
These days we are actually deploying, we cannot say the customer’s name, a scale-across system over multiple data centers, because no single data center has enough power to power everything. And the workload, or the GPU cluster, acts as if it is in a single data center.
Yeah, I think the way we define it is one lossless fabric across a classic metro cluster. What’s important there is that when space and power become premium assets in your data center strategy, you can really weigh the cost of power and the cost of real estate, and know that when you deploy the infrastructure you get one lossless fabric, so the accessibility of those assets is not undermined by their being spread apart, and you stay within the power budget you have.
Yeah, I agree. And I think one last thing that customers see as value when working with DriveNets is where we come from, because people who don’t know us think we are a small startup and a very risky bet. But the fact that we’ve been around for 10 years, and the fact that we are deployed in the AT&T core network, for instance, where over 80% of their traffic goes over our solution, means we have a very solid solution. We actually use the same technology in this solution that we used in the service provider market, which is why we were very fast to market with it, and why, I think, almost every customer that uses us for AI infrastructure comes back for more.
It’s the experience of solving very complex problems in highly regulated industries. Customers understand that we have the experience to help them get to the point where they are successful in their own line of business.
Want to learn how DriveNets is reshaping AI networking infrastructure?
Explore DriveNets AI Networking Solution