Rethinking Data Center Network Architecture
In a conversation with NextGenInfra.io, Dudy Cohen, VP of Product Marketing at DriveNets, explains how Ethernet is evolving to support AI infrastructure, with GPU clusters scaling from about 8,000 to nearly one million GPUs. He covers three directions: faster line rates (800G to 3.2T), integration of co-packaged optics (CPO), and scheduling layers that address Ethernet’s inherent lossiness. He also describes how DriveNets has adapted its scheduled fabric technology for AI networking, which he says delivers better time-to-first-token performance and lower cost-per-million-tokens than proprietary technologies.
0:00 Ethernet evolution: speed, optics, and scheduling
1:24 Is Ethernet really suited for very large GPU clusters in AI data centers?
3:02 Why is ESUN needed?
4:00 What is the difference between fabric scheduling and endpoint scheduling?
6:04 DriveNets strategy: how does the scheduled fabric design fit across customer segments?
7:37 How important is the fabric architecture to the success or failure of data centers?
Full transcript
Ethernet evolution: speed, optics, and scheduling
So basically, Ethernet is undergoing change, or evolution, along multiple paths. One path is feeds and speeds. Ethernet has always been the fastest-growing technology in terms of line rates. Now it’s moving to 800G, which is the latest. Next year we will see 1.6T, and later on 3.2T and beyond.
And it is outpacing every other technology in that respect. There is also the integration of optics into switches and end devices with co-packaged optics, CPO; we see that happening this year. And of course, as I mentioned, there is the introduction of scheduling layers, additional layers on top of plain Ethernet, which resolve its inherent problem of being a basically lossy technology. Scheduling, either at the endpoints or in the fabric, is helping Ethernet tackle new use cases that it did not fit until now.
These could be scale-up, scale-out for backend networks connecting GPUs, scale-across connecting different data centers, and so on.
Is Ethernet really suited for very large GPU clusters in AI data centers?
So when it comes to cluster size, sizes are basically growing, and we see different ballparks for different types of customers. For enterprise or NeoCloud customers, we see several thousand GPUs; the sweet spot seems to be around 8K GPUs per cluster.
But for the larger NeoClouds, for the hyperscalers, and for the giants that develop LLMs, we see exponential growth in cluster size: tens of thousands, hundreds of thousands, and I think we will soon cross the mark of 1 million GPUs in a single cluster. There, the need for a network technology that is highly scalable and simple to scale is very evident. Ethernet, as the ruling technology in the data center domain and the technology that scales across the entire infrastructure of the internet, is a very easy choice when it comes to scalability. There are different scales, of course. When you go to tens or hundreds of thousands of GPUs, you need endpoint scheduling, because there is basically no limit to the number of endpoints you can connect.
For smaller scales, such as the 8K sweet spot I mentioned, fabric scheduling provides much better performance. The two technologies coexist, and each customer selects the variant of scheduled Ethernet that suits its needs.
Why is ESUN needed?
ESUN is needed in order to scale Ethernet up to the level required by scale-up networks, because scale-up networks have very specific requirements in terms of capacity, latency, and robustness, meaning packet loss, or rather the lack of packet loss.
So ESUN tackles those points and allows Ethernet, which was not designed to be a scale-up protocol, to be an alternative to technologies like NVLink, CXL, or UALink that were designed from the ground up for scale-up.
I think ESUN marks the first time that Ethernet is considered a valid solution and a good alternative to those very closed, very limited solutions designed specifically for scale-up networks.
What is the difference between fabric scheduling and endpoint scheduling?
Let’s talk a bit about the difference between fabric scheduling and endpoint scheduling. With fabric scheduling, the scheduling is done by a cell-based fabric. That means basically any packet that enters the system is broken into cells, spread across the entire fabric, and reassembled on the other side. You could say this is a very simple technology, because it is controlled by the switches themselves, and it ensures that even at very high utilization of the entire fabric you will not suffer from packet loss, jitter, or delay. This results in very good job completion time and very good bus bandwidth, if you look at the collective communication library within the server itself.
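The segment, spray, and reassemble steps described above can be sketched in a few lines of Python. This is a minimal illustrative model under stated assumptions, not DriveNets’ implementation or the behavior of any switch silicon: the 256-byte cell size, round-robin spraying, the random shuffle standing in for unequal link delays, and all function names are hypothetical, and the sketch leaves out the scheduling and buffering that actually prevent loss in a real fabric.

```python
import random
from dataclasses import dataclass

# Hypothetical cell payload size; real fabrics define their own cell formats.
CELL_SIZE = 256


@dataclass
class Cell:
    packet_id: int
    seq: int        # position of this cell within its packet
    last: bool      # marks the final cell of the packet
    payload: bytes


def segment(packet_id: int, packet: bytes, cell_size: int = CELL_SIZE) -> list[Cell]:
    """Ingress: break one packet into fixed-size cells tagged with sequence numbers."""
    chunks = [packet[i:i + cell_size] for i in range(0, len(packet), cell_size)] or [b""]
    return [Cell(packet_id, n, n == len(chunks) - 1, c) for n, c in enumerate(chunks)]


def spray(cells: list[Cell], num_links: int) -> list[list[Cell]]:
    """Spread cells round-robin over all fabric links so each link gets an even share."""
    links: list[list[Cell]] = [[] for _ in range(num_links)]
    for i, cell in enumerate(cells):
        links[i % num_links].append(cell)
    return links


def deliver(links: list[list[Cell]], seed: int = 0) -> list[Cell]:
    """Stand-in for unequal link delays: cells reach the egress in a shuffled order."""
    arrived = [cell for link in links for cell in link]
    random.Random(seed).shuffle(arrived)
    return arrived


def reassemble(cells: list[Cell]) -> dict[int, bytes]:
    """Egress: group cells by packet, restore order by sequence number, and rebuild."""
    by_packet: dict[int, list[Cell]] = {}
    for cell in cells:
        by_packet.setdefault(cell.packet_id, []).append(cell)
    packets = {}
    for pid, parts in by_packet.items():
        parts.sort(key=lambda c: c.seq)
        if [c.seq for c in parts] != list(range(len(parts))) or not parts[-1].last:
            raise ValueError(f"packet {pid} is missing cells")
        packets[pid] = b"".join(c.payload for c in parts)
    return packets


if __name__ == "__main__":
    packets = {0: bytes(range(256)) * 4, 1: b"hello fabric"}
    cells = [c for pid, p in packets.items() for c in segment(pid, p)]
    links = spray(cells, num_links=4)
    print([len(link) for link in links])          # → [2, 1, 1, 1]
    print(reassemble(deliver(links)) == packets)  # → True
```

Spraying every packet across all links keeps per-link load even regardless of how large individual flows are; the price is that cells arrive out of order, which is why the egress side must reorder them by sequence number before handing the packet on.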
Fabric scheduling is the technology that provides the highest performance. When it comes to endpoint scheduling, as with the Ultra Ethernet standard, this scheduling, the management of how traffic runs through the fabric, is done by the endpoints, specifically by the NICs or SmartNICs in the servers. They are fed information from telemetry that monitors the network and finds congestion hotspots. Based on this telemetry data, they engineer the traffic that goes into the network in order to bypass those hotspots and better balance the load across the fabric links. So here again, you get much better results than with any plain vanilla Ethernet.
This gives you the ability to achieve very high fabric utilization. And on the workload side, it shortens job completion time, because GPUs spend less time waiting for network resources and more time processing data.
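To make the endpoint-scheduling idea concrete, here is a toy Python model of a NIC that reads per-path congestion telemetry and steers each new flowlet onto the least-loaded path. It is a sketch under assumptions, not taken from the Ultra Ethernet specification or any NIC: the class name, the telemetry format (integer percent load per path, for example per spine switch), the fixed per-flowlet cost, and the greedy least-loaded rule are all hypothetical.

```python
class EndpointScheduler:
    """Toy NIC-side scheduler (hypothetical model, not a standard's algorithm).

    `telemetry` maps a path ID (e.g. a spine switch) to its reported load in
    percent. Each flowlet the NIC places adds `flowlet_load` to its local
    estimate for that path, so traffic spreads out until fresh telemetry arrives.
    """

    def __init__(self, telemetry: dict[int, int], flowlet_load: int = 5):
        self.flowlet_load = flowlet_load
        self.estimate = dict(telemetry)

    def update_telemetry(self, report: dict[int, int]) -> None:
        """Overwrite local estimates with the latest telemetry for the reported paths."""
        self.estimate.update(report)

    def pick_path(self) -> int:
        """Choose the least-loaded path (ties go to the lowest ID) and account for the new flowlet."""
        path = min(self.estimate, key=lambda p: (self.estimate[p], p))
        self.estimate[path] += self.flowlet_load
        return path


if __name__ == "__main__":
    # Telemetry reports a hotspot on path 2.
    nic = EndpointScheduler({0: 20, 1: 30, 2: 90, 3: 25})
    print([nic.pick_path() for _ in range(10)])  # → [0, 0, 3, 0, 1, 3, 0, 1, 3, 0]
    nic.update_telemetry({2: 10})                # hotspot has cleared
    print(nic.pick_path())                       # → 2
```

The local running estimate matters because telemetry arrives only periodically: without incrementing it per flowlet, this NIC would send every flowlet to path 0 until the next report, creating a new hotspot of its own.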
DriveNets strategy: how does the scheduled fabric design fit across customer segments?
So DriveNets actually started about 10 years ago in the service provider domain, with customers like AT&T. What we did there was implement this scheduled fabric, the same scheduled fabric, in order to build scalable routers. We based our solution on basic white boxes, very basic building blocks.
In order to scale them, not in a chassis format but in a scalable format, we connected them to a fabric white box and implemented the scheduled fabric in between. The result was very large routers deployed in service providers’ networks, in the core of communication networks, and so on. When the AI industry started growing exponentially, we found that this scheduled fabric solution is very suitable for backend networks that connect GPUs. That is why we were very fast to implement this solution and to have working deployments with hyperscaler, NeoCloud, and enterprise customers. So DriveNets’ strategy is basically to modernize, or reinvent, networking.
This is what we did with service providers, and what we still do with service providers. And we use the same technology, with some tweaks to functionality and scale, for AI infrastructure, and it is very successful in both markets.
How important is the fabric architecture to the success or failure of data centers?
At the end of the day, you need to connect the infrastructure to the business, and you need to do it efficiently in order to be successful, because, let’s face it, this market is full of NeoClouds, hyperscalers, and enterprises rushing to invest in AI infrastructure. This infrastructure needs to be very efficient, first of all in terms of time to first token. You need a networking technology that lets you move very fast and start producing with the infrastructure you build very quickly. So time to first token is a very important parameter, and it is influenced by the supply chain, by the availability of hardware, by the simplicity of the setup, and by the time and effort you need to put into fine-tuning the architecture.
That is one thing. The other thing, at the end of the day, is cost per million tokens. This parameter is crucial because of the scale: when you run such large-scale infrastructure projects, the marginal cost is ultimately what defines your business. On both parameters, going to an Ethernet-based solution, an open solution, a very high-performance Ethernet solution, makes your road to first token (the time it takes you to start producing from the infrastructure) and, later on, the marginal cost of each transaction, of each token in your infrastructure, much better in terms of economics than any other solution.
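The cost-per-million-tokens parameter is simple arithmetic, which makes the economic argument easy to check. The Python sketch below assumes the infrastructure’s cost can be expressed as an all-in hourly figure (amortized hardware, power, operations) and its output as sustained tokens per second; the function name and all figures are hypothetical, not DriveNets or customer data.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost of producing one million tokens at a given all-in hourly cost and sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    millions_per_hour = tokens_per_second * 3600 / 1_000_000
    return hourly_cost_usd / millions_per_hour


if __name__ == "__main__":
    # Hypothetical cluster: $3,600 per hour, 1M tokens/s sustained.
    print(cost_per_million_tokens(3600, 1_000_000))  # → 1.0
    # Same spend, 25% more throughput (e.g. GPUs waiting less on the network).
    print(cost_per_million_tokens(3600, 1_250_000))  # → 0.8
```

Because cost per million tokens scales inversely with sustained throughput, any network improvement that keeps GPUs busier lowers it proportionally at the same hourly spend.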