DriveNets on Scaling AI Networks
At the AI Infra Summit 2025, DriveNets’ VP of Product Marketing, Dudy Cohen, outlines three complementary approaches to scaling AI infrastructure: scale-up, scale-out, and scale-across. He describes how scale-up domains are growing from today’s 72-GPU racks toward 576 GPUs, how fabric-scheduled scale-out architectures remove that ceiling to enable clusters of tens of thousands of GPUs, and how scale-across solutions let organizations manage GPU resources distributed over multiple data centers as one unified system.
Chapters:
- 0:00 Intro
- 0:25 DriveNets’ role in the AI Infrastructure ecosystem
- 1:05 How disaggregated networking changes the economics of hyperscale networks
- 1:50 The importance of scale and flexibility for AI workloads
- 2:30 Where DriveNets is headed next in product innovation
- 2:43 Closing thoughts from Dudy Cohen
Key Takeaways from the interview
- Scale-Up vs. Scale-Out Evolution
Traditional scale-up domains, once limited to a single server, now extend to racks of up to 72 GPUs, with future generations expected to reach 576 GPUs. Beyond this point, robust scale-out networking becomes essential to handle larger clusters.
- High-Performance Scale-Out Networking
Modern fabric-scheduled architectures with VOQ systems and end-to-end performance guarantees now outperform InfiniBand in scale-out environments. This enables tens of thousands of GPUs across a data center to operate with high utilization and low job completion times (a rough fabric-sizing sketch follows these takeaways).
- Scale-Across for Geo-Redundancy
To extend beyond a single data center, scale-across networking is emerging. It allows workloads to run seamlessly across geographically dispersed data centers, using hybrid deep- and shallow-buffer designs to manage inter-data-center latency.
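To put rough numbers on the scale-out point, here is a minimal sizing sketch for a non-blocking two-tier leaf/spine fabric. The switch radix and port speeds are illustrative assumptions, not DriveNets product figures:

```python
# Rough sizing sketch for a non-blocking two-tier leaf/spine scale-out
# fabric. Radix and port speeds below are assumed for illustration.

def two_tier_gpu_capacity(radix: int) -> int:
    """Max GPUs behind a non-blocking two-tier Clos built from one switch radix."""
    downlinks_per_leaf = radix // 2  # half the leaf ports face GPUs, half face spines
    max_leaves = radix               # each spine port connects to one leaf
    return downlinks_per_leaf * max_leaves

# e.g. a 51.2 Tbps switch exposing 64 x 800 Gbps ports (assumed figure):
print(two_tier_gpu_capacity(64))  # 2048 GPUs per two-tier pod; adding a
                                  # third tier multiplies this into the
                                  # tens of thousands cited below
```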
Read the full transcript
We are here in Santa Clara at the AI Infra Summit, and a lot of the discussion here is around the scalability of AI clusters, AI workloads, et cetera. The discussion is usually around scale-up versus scale-out and scale-across, in which scale-up is gaining ground with higher and higher scalability. Scalability was always the limit for scale-up, because it used to be confined to a single server.
Now it is bound by a single rack of up to 72 GPUs, and in the next generations it will grow to up to 576 GPUs. But still, when you grow past this limit, you need a very robust scale-out network. And this is where we come into play. A scale-out network provides basically unlimited scalability, but with the caveat of lower performance compared to scale-up. And this is where the focus of scale-out networking is today:
making the performance as good as scale-up, and freeing the workloads and the AI infrastructure from the boundaries of scaling limits across the data center. So scale-out networks today, in particular fabric-scheduled architectures, are very high performance. Actually, they perform even better than InfiniBand, which used to be the benchmark for scale-out performance. This is thanks to a scheduled fabric, a VOQ system and an end-to-end performance guarantee.
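A minimal, illustrative Python model of the VOQ idea mentioned here (not DriveNets code): each ingress keeps a separate queue per egress port and transmits only on scheduler grants, so a congested egress stalls only its own queue instead of causing the head-of-line blocking a single shared FIFO would suffer.

```python
from collections import deque

# Toy model of virtual output queues (VOQs) on a fabric ingress,
# assuming a simple credit/grant scheduler.

class Ingress:
    def __init__(self, num_egress_ports):
        # One virtual output queue per egress port.
        self.voqs = [deque() for _ in range(num_egress_ports)]

    def enqueue(self, egress, cell):
        self.voqs[egress].append(cell)

    def dequeue_granted(self, grants):
        # Send one cell from each VOQ whose egress port granted credit.
        return [self.voqs[e].popleft() for e in grants if self.voqs[e]]

ingress = Ingress(num_egress_ports=4)
ingress.enqueue(0, "cell->0")  # egress 0 is congested: no grant this cycle
ingress.enqueue(1, "cell->1")  # egress 1 has buffer credit: granted
print(ingress.dequeue_granted(grants={1}))  # ['cell->1'] -- not blocked by port 0
```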
So basically you can scale out your infrastructure across the entire data center, to tens of thousands of GPUs, and still maintain the high performance required to keep your GPUs highly utilized and your workloads performing well in terms of job completion time, time to first token and the other parameters you usually measure AI by. And when you are done with the data center, because, for instance, you are out of power or out of rack space, you need to scale across. That means you want to allow geo-redundancy and add GPUs in other data centers, which could be miles apart from the original data center. This is where scale-across networks come in. This is quite a challenge compared to scale-out, because you need a hybrid deep-buffer and shallow-buffer networking infrastructure in order to compensate for the high latency of the inter-data-center connectivity.
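A back-of-the-envelope sketch of why those inter-data-center links need deep buffers, using the bandwidth-delay product; the link rate and distances below are assumed for illustration, not figures from the talk:

```python
# Bandwidth-delay product (BDP): roughly the buffering needed to absorb
# a full round-trip of in-flight traffic on a link.

FIBER_KM_PER_MS = 200.0  # light in fiber covers roughly 200 km per millisecond

def bdp_megabytes(link_gbps, distance_km):
    rtt_seconds = 2 * distance_km / FIBER_KM_PER_MS / 1000
    return link_gbps * 1e9 * rtt_seconds / 8 / 1e6

print(bdp_megabytes(800, 0.1))  # ~0.1 MB for an intra-DC hop: shallow buffers suffice
print(bdp_megabytes(800, 100))  # ~100 MB for DCs 100 km apart: deep buffers needed
```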
And this is, again, something that has developed in the last year or so: the ability to scale your GPU infrastructure across multiple data centers, but still look at it logically as a single entity and run a workload across multiple data centers spread out geographically. Again, this is another topic that is heavily discussed in Santa Clara this week. Thank you very much.
Want to learn how DriveNets is reshaping AI networking infrastructure?
Explore DriveNets AI Networking Solution