ODCC: Scheduled Fabric for AI Backend Networks – an InfiniBand Alternative
Watch the Presentation
In his session, Yossi Kikozashvili, Head of Product & Solution Engineering, AI Infrastructure at DriveNets, explores how scheduled fabric serves as the best solution, in terms of both performance and cost, for large-scale AI training clusters. It is not vendor-locked and does not require SmartNICs to do the heavy lifting. This type of fabric makes the AI infrastructure lossless and predictable, without needing additional technologies to mitigate congestion.
Full Transcript
Now, the most common way used to be InfiniBand, right, by Nvidia and Mellanox.
Then there was another way, utilizing Ethernet Clos, right?
Different spine topologies. And now DriveNets, in collaboration with ByteDance and Broadcom, is presenting a new, and we argue better, way to build AI backend networks.
Now, this might not say a lot to some of you; some of you might not be familiar with this technology, because, as I said, this is the first production deployment of such a technology in the world.
But my guarantee for this presentation is that by the end of it, you will understand what scheduled Ethernet is and why ByteDance chose to deploy it and not the other alternatives. So before we go ahead and explain the technology, explain our positioning in the market, etcetera, let me talk a bit about the market.
You probably heard a lot of presentations about the market, about the clash between
Ethernet and InfiniBand, the pros and cons, etcetera.
But frankly speaking, the industry is speaking for itself.
The ecosystem is speaking for itself.
We all see the numbers, right?
We read analysts, we hear industry experts, everybody’s talking about it.
AI networking is shifting from InfiniBand to Ethernet.
And every day that goes by, this shift increases.
Now, if you look at the numbers, you will see that even though today the majority of deployments happening specifically for AI backend networking utilize InfiniBand, by 2026, 2027, 2028 and onwards this will shift towards Ethernet.
Now, there are many reasons to move from InfiniBand towards Ethernet.
One of them is obviously to avoid lock-in, right? The only vendor really pushing InfiniBand forward is Nvidia, and nobody wants a vendor lock-in.
And then there are a couple more reasons that force people, or force organizations, to move to Ethernet. But the shift is here now.
When this shift is happening, when this move from InfiniBand to Ethernet is happening, it’s not one to one, right? There are a lot of Ethernet technologies out there that can be utilized for AI backend networking.
And we’ll speak about it in a minute.
So before we dive into the different technologies that one can utilize for AI backend networking, and specifically Ethernet technologies, let’s start by talking about what are the needs or what are the challenges that one is facing when coming to deploy backend networking for AI.
So I've divided it into three different sections.
The first section is high performance, right? First and foremost, maybe the most important thing is: how do I not make my network a bottleneck? I spent so much money on GPUs, so much money on the compute infrastructure; I don't want to let the network become a bottleneck.
I don't want it to throw away my investment.
Right?
So AI networks and AI workloads tend to have three fundamental characteristics.
First and foremost, they use RDMA, Remote Direct Memory Access.
Probably the majority of the audience is familiar with the technology, and what it
means is that the network should be lossless.
So RDMA uses an algorithm called Go-Back-N (GBN).
Every time a packet is dropped, the algorithm identifies which packet was lost, goes back to that packet, and resends every packet from the dropped one onwards.
Right?
So losing a packet in an RDMA-based network means losing a lot of time and losing a lot of resources.
That’s first.
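To make that cost concrete, here is a minimal sketch in Python of the Go-Back-N behavior just described; the packet counts are purely illustrative.

```python
# Minimal sketch of Go-Back-N (GBN) retransmission cost.
# Illustrative numbers only: a single drop forces the sender to rewind
# and resend every packet from the dropped one onwards, even packets
# that were delivered intact.

def gbn_resend_count(total_packets: int, dropped_index: int) -> int:
    """Packets that must be resent after one drop under Go-Back-N."""
    return total_packets - dropped_index

# One drop early in a 1,000-packet transfer wastes almost the whole transfer:
print(gbn_resend_count(1000, dropped_index=10))  # -> 990 packets resent
```

This is why a single lost packet on an RDMA fabric translates directly into lost time and wasted bandwidth.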
Second, AI networks tend to use "elephant flows".
So you can have the full capacity of a NIC, 400 gig or 800 gig, being used for a single flow for seconds or tens of seconds.
Right? And these “elephant flows” tend to have their own problems.
We’ll talk about it in a second.
Just a hint: ECMP is not good enough, hashing is not good enough.
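As a taste of that ECMP problem, here is a minimal Python sketch, with hypothetical flow 5-tuples and a stand-in hash, of how a handful of elephant flows can collide onto the same uplink:

```python
# Minimal sketch: static ECMP hashing a handful of elephant flows onto
# 4 uplinks. With thousands of small flows the hash averages out; with
# 8 elephant flows, collisions routinely overload some links while
# others sit idle. The 5-tuples and the hash are illustrative only.
NUM_LINKS = 4
flows = [("10.0.0.%d" % i, "10.0.1.1", 4791, 49152 + i) for i in range(8)]

link_load = [0] * NUM_LINKS
for flow in flows:
    link = hash(flow) % NUM_LINKS   # stand-in for the switch's ECMP hash
    link_load[link] += 1            # each elephant flow saturates its link

print(link_load)  # e.g. [3, 1, 0, 4] rather than the even [2, 2, 2, 2]
```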
And then the third thing that really impacts high performance is the fact that when we're building large clusters, hundreds or thousands of GPUs, these clusters tend to have a lot of hardware failures.
So links go down, switches go down, a lot of things go wrong when these clusters operate, and we need to see how we can overcome that without spending or wasting too many resources.
Now, the second thing or the second topic I’d like to address as a challenge for AI networking is flexibility, openness and TCO.
So I just mentioned that the reason everybody is leaving InfiniBand behind is the fact that it's not flexible.
You cannot really have diversity in supply chain, you cannot have diversity in the chip.
You're forced to use specific NICs and specific GPUs when working with InfiniBand; you're forced to work with specific white-box ODMs.
So all these things that impact the flexibility, openness and overall TCO of a solution really drive us toward certain decisions and specific technologies.
And the third thing is obviously trusted solutions. That's always the case, right?
Every time we deploy a new technology, we want to know if it’s field proven that it was tested in production, that the risk is low.
And those are basically the three things that really impact the decision of what networks to deploy for AI networking.
Now, there are a few ways to build an AI backend network.
I think the sessions before me really mentioned all of them.
The first technology, and maybe today the most common one, is the non-Ethernet option, InfiniBand.
So what you have is a bunch of boxes working in a scheduled manner, coming from Nvidia, and it actually gives you great performance.
The problem, and maybe the only problem, with InfiniBand is the fact that it introduces a vendor lock.
Other than that, great performance, right? But again, bad TCO, bad pricing, and bad economics.
Now, the second solution is Ethernet Clos, right? A bunch of switches, layer 2 switches, sometimes layer 3.
Working together, standard Ethernet gives you great scalability, gives you great TCO, but at the end of the day introduces really poor performance.
Now, the third way to build AI networks, and I think that’s where the industry is heading right now, especially with what’s happening with the Ultra Ethernet Consortium, is the integration between an Ethernet Clos and a DPU.
So the industry understood that the Ethernet Clos, the standard Ethernet cannot really give us the performance we’re looking for.
And so we’ll put some logic on top of a DPU and make it behave and make it a bit more performant for our needs.
Now, the problem with that technology or with that architecture is first and foremost, it’s expensive.
It forces you to use top-notch NICs, usually DPUs or SuperNICs, and this introduces, in turn, high costs and high expenditure.
Now, in addition to that, it doesn't really give the best performance one can get. So if you compare the Ethernet Clos with the DPU integration to InfiniBand, you will observe, and I saw it with my own eyes in labs, in customer labs, that it's really not performing as well as InfiniBand.
Now, the last way I want to talk about, and actually that’s really happening in the industry as we speak, is using the chassis.
Now, chassis, really, frankly speaking, are perfect in order to build AI networks.
Think about it, it’s a single hop from line card to line card.
It’s really easy to manage, really easy to maintain.
Everybody, every guy with a CCNA, can really handle that.
And it’s, frankly speaking, a perfect solution.
The only problem with the chassis is the fact that it’s not scalable.
The largest chassis in the market right now really has 576 interfaces.
And that's the top end it can give you.
And if you want to grow your cluster beyond 576 interfaces, that means you need to kind of cluster chassis together, or use a chassis to build a Clos together with pizza boxes, or something in that manner.
Now, what we asked ourselves when we started our journey in the AI arena was: is there an optimal solution that can actually give us, first and foremost, the best performance, and then better TCO, some flexibility, some diversity? And the answer we gave ourselves was yes.
DriveNets has been selling scheduled fabrics for about eight years now.
For eight years, we’ve been deploying scheduled fabrics with some of the largest tier one organizations in the world that include AT&T, Comcast, KDDI and all the big ones.
In the arena of the service providers, we utilize scheduled fabrics for service provider networks, for core use cases, for aggregation use cases, for building data center interconnect.
And frankly speaking, when the AI boom happened, when OpenAI kind of launched the race to AI, we realized something really important.
We realized that the technology we already have in hand is maybe the best fit for AI networking.
So what we did is we adjusted the technology we already had to the new use case of AI networking.
So let me explain really briefly how the technology works, because I'm not sure everyone in the audience is familiar with it.
So what we do, what our cluster does, is essentially schedule traffic from end to end, from leaf to leaf.
So "elephant flows" enter the fabric from different GPUs in the network.
These flows will be received by what we call an NCP.
This is the leaf card, this is the leaf white box.
The NCP will take the flow, split the packets into cells, or as we call them, microcells, spread these microcells all across the fabric and send them to the spine.
Now, one of the major things that causes our technology, the DDC, the fully scheduled Ethernet, to be more performant than the alternatives is that we’re not utilizing ECMP, we’re utilizing packet spraying, or cell spraying.
So actually we’re utilizing 100% of the fabric links that we have in the cluster.
Now, the cells will be spread across all the active links in the cluster and will reach the upper layer, the spine layer, which we call the NCF.
And by the way, NCP and NCF are terminologies that DriveNets invented.
NCP means Network Cloud Processor.
Network Cloud is the name of our product.
I saw some other vendors using this terminology, and it's a bit funny, because NCP, Network Cloud Processor, has been the name of our product for eight years now.
So the cells will go to the spine layer, and then the NCF, the spine layer, will take all the cells, do simple cell switching, and forward these cells to the egress, the endpoint NCP, where the cells will be reassembled into packets; the egress NCP will then, in turn, send the packets towards the destination GPU.
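To visualize the path just described, here is a simplified Python sketch of the spray-and-reassemble idea; the cell size, sequence numbering and link choice are illustrative, since in the real DDC this all happens in the forwarding silicon:

```python
# Simplified model of cell spraying (ingress NCP) and reassembly
# (egress NCP). Illustrative only: real cell formats and scheduling
# live in the switch hardware, not in software like this.
from itertools import cycle

CELL_SIZE = 256  # hypothetical cell size in bytes

def ingress_ncp_spray(packet: bytes, fabric_links: list) -> list:
    """Split a packet into cells and spread them over ALL fabric links."""
    cells = [packet[i:i + CELL_SIZE] for i in range(0, len(packet), CELL_SIZE)]
    links = cycle(fabric_links)
    # Each cell carries a sequence number so the egress side can reorder.
    return [(seq, next(links), cell) for seq, cell in enumerate(cells)]

def egress_ncp_reassemble(sprayed: list) -> bytes:
    """Reorder cells by sequence number and rebuild the original packet."""
    return b"".join(cell for _, _, cell in sorted(sprayed, key=lambda c: c[0]))

packet = bytes(1500)  # one full-size Ethernet frame
sprayed = ingress_ncp_spray(packet, fabric_links=list(range(16)))
assert egress_ncp_reassemble(sprayed) == packet  # arrives intact, in order
```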
So a few things happen here.
First and foremost, 100% load balancing, right? We're not utilizing hashing, we're not utilizing ECMP, and that means we're actually utilizing the entire infrastructure that we put down, all the fabric links that we invested in.
Second is the hardware.
The recovery of failed links in our technology, in the DDC, is based on hardware.
So what happens is there are keepalive microcells running all the time between NCPs and NCFs, making sure that all the links, all the fabric links, are actually active.
Now, why is this important?
If you take, let's say, an 8K, 8,000-GPU AI cluster, you will find that every couple of minutes a fabric link goes down or starts to flap.
And what that means is an impact on training time, an impact on job completion time.
So the fact that the recovery in the DDC or the fully scheduled Ethernet is based on actually hardware and not software is dramatic.
So the recovery time for a link in a DDC is less than ten microseconds.
Compare that to, let's say, Ethernet Clos: the recovery time for a link in an Ethernet Clos might be seconds, because in an Ethernet Clos you have to converge protocols such as STP or BGP, etcetera.
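A quick back-of-envelope calculation, using the failure rate quoted above plus an assumed Clos reconvergence time, shows why that gap matters:

```python
# Back-of-envelope: disruption time over a 24-hour run, assuming one
# fabric-link event every ~2 minutes on an 8K-GPU cluster (the figure
# from this talk). The 2-second Clos reconvergence time is an assumption
# for illustration; actual convergence depends on the protocols in play.
RUN_HOURS = 24
events = RUN_HOURS * 60 // 2          # ~720 link events per day

ddc_recovery_s = 10e-6                # DDC hardware recovery (<10 microseconds)
clos_recovery_s = 2.0                 # assumed protocol reconvergence (seconds)

print(f"DDC:  {events * ddc_recovery_s:.4f} s lost")   # ~0.0072 s per day
print(f"Clos: {events * clos_recovery_s:.0f} s lost")  # ~1440 s, i.e. 24 minutes
```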
Now, the third thing I want to talk about is the nature of our solution.
Remember we spoke about chassis.
Chassis have line cards and then fabric cards in order to connect all the
line cards together.
Now, what happens inside the chassis doesn't interest the user, right? Because actually, when a packet traverses from one line card to another line card, you can be sure that it will get to its destination.
And why is that?
Because the internals of a chassis, the backplane, or the fabric of the chassis, are actually lossless, so you don't lose packets there, right? And that's exactly how the DDC, the fully scheduled Ethernet, works.
When a packet enters an NCP, you can be sure it will arrive at the egress NCP.
And what this means is that the solution is lossless by nature.
You don't need PFC and ECN and all sorts of technologies inside the fabric in order to ensure that you don't lose packets.
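One simplified way to picture why an end-to-end scheduled fabric cannot overflow its internal buffers is a credit loop: the ingress only transmits what the egress has already granted space for. This toy model is an assumed mechanism for illustration, not DriveNets' actual implementation, which runs in hardware:

```python
# Toy model of end-to-end scheduled, credit-based transmission (an
# illustrative mechanism, not the actual DDC implementation): the
# ingress NCP queues cells and only sends when the egress side has
# granted credit, so fabric buffers can never overflow and PFC/ECN
# are unnecessary inside the fabric.
from collections import deque

class ScheduledPath:
    def __init__(self, egress_buffer_cells: int):
        self.credits = egress_buffer_cells  # grants issued by the egress NCP
        self.queue = deque()                # cells waiting at the ingress NCP

    def enqueue(self, cell) -> None:
        self.queue.append(cell)             # bursts wait at ingress, not in fabric

    def transmit(self) -> list:
        sent = []
        while self.queue and self.credits:
            sent.append(self.queue.popleft())
            self.credits -= 1               # one credit consumed per cell
        return sent                         # can never exceed egress buffer space

    def egress_drained(self, n: int) -> None:
        self.credits += n                   # egress returns credits as it drains

path = ScheduledPath(egress_buffer_cells=4)
for cell in range(10):
    path.enqueue(cell)
print(path.transmit())   # [0, 1, 2, 3] -- only what the egress can hold
path.egress_drained(4)
print(path.transmit())   # [4, 5, 6, 7]
```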
So three things: 100% load balancing, failover that is based on hardware and not on software, and then the fact that it's a lossless solution by nature.
No need to use mechanisms like PFC and ECN in order to make it lossless or virtually lossless.
And these three things end up giving us up to a 30% improvement in job completion time.
So if you take a standard Ethernet Clos with a DPU and you compare it, you run a model on it, or just run NCCL on it, and you compare it to the DDC, the fully scheduled Ethernet, you will easily find in a lab that the fully scheduled Ethernet gives you 30% better job completion time.
Now, we spoke about performance, but what about the other characteristics? So let's talk about TCO.
The fact that the logic resides on the switch means that the solution is agnostic to the NIC and the GPU.
What I mean by that is, if you take the best NIC and the best-performing GPU that the industry can offer and use them with the DDC, and then compare that with the worst NIC and the worst GPU that the industry can give you, you will see the performance is exactly the same; the network will behave exactly the same.
And the reason it is agnostic to NIC and GPU is that the entire logic, the brain, is on the switch side and not the NIC side.
So to be practical, our users are able to buy the cheapest NIC that the industry can offer.
Now, scalability.
We mentioned the chassis, and why it is a great solution.
Now, actually, the only drawback of a chassis is the fact that it's not scalable.
What we did with our technology is replicate the chassis in its technical characteristics, right? The perfect load balancing, the hardware-based failover, the lossless internals, et cetera.
But we left behind the drawback of limited scalability.
So our solution can scale to essentially 32,000 interfaces, or 32,000 GPUs, in a single cluster.
And then there's the fact that we're talking about standard Ethernet, right? What's happening from the NCP downwards is standard Ethernet.
What's happening from the NCPs upwards is, frankly, cell switching.
Or you can call it magic, but it doesn't really matter, because you can utilize any standard Ethernet NIC that the industry can offer.
Now, when comparing between different solutions, there are probably three topics you want to compare.
And we discussed them a few minutes ago.
The performance: how does the solution perform at scale? The flexibility and openness, and in turn the TCO.
And then whether it's a trusted solution or a new one.
Now, I don't want to go over all the bullets on this slide, but essentially, as I just showed you on the previous slide, on these three topics, on these three KPIs, scheduled Ethernet is simply better.
Now, we spoke about the technologies, we spoke about why it's better, we spoke about ByteDance.
But let me introduce DriveNets.
So as I mentioned, DriveNets has been selling scheduled Ethernet for a long time.
Now, we started with AT&T, with a core use case, a core deployment at AT&T, utilizing DDC scheduled Ethernet.
And ever since we’ve been selling network infrastructure and taking it to production with the largest tier one service providers that are out there, including TurkCell, Comcast, AT&T, KDDI, NTT Data, really all the big names that are out there.
And as I mentioned, when we realized that AI was happening, when we realized that network infrastructure had started to matter in the AI arena, we decided to move into this area and this ecosystem.
So today DriveNets operates two business units.
The first business unit is continuing to take to production network infrastructure for the biggest tier one service providers in the world.
We are having a very successful journey in the service provider area.
And on top of it, we are also working with the biggest hyperscalers and enterprises in the world in the context of AI networking and backend networking.
So actually two solutions, two business units utilizing frankly the same NOS, the same network operating system and the same solution.
Now the way we do it and the way we assemble our solution is actually very simple.
We have two different product lines or two different products in our AI business unit.
The first one is based on a Jericho2C chipset.
This is for when we would like to build relatively small to medium clusters, up to 1.5K GPUs in a single cluster.
And then we have, and by the way, we're the only ones in the market that have really taken it to production already, the Jericho3-AI chipset by Broadcom, and we're running our NOS on top of it with some of the big hyperscalers in the market.
And the way we do it actually is we do it in a leaf and spine architecture.
So we can have a two-tier architecture, where we have a leaf and a spine on top of it, that can scale up to 8,000 GPUs in a single cluster.
And then if we want to scale beyond that, and there are definitely customers that are asking us to scale beyond that, we can go to a three stage network where we utilize a leaf and then additional two layers of spine.
That’s where we can scale up to 32,000 interfaces of 800 gig.
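Put as a simple sizing rule, the tiers just described map to cluster size like this; the thresholds come straight from the talk, while the function itself is only an illustration:

```python
# Sizing rule implied by the talk: pick the topology tier from the
# target cluster size. Thresholds are the ones quoted above; the
# function and its return strings are illustrative.
def ddc_topology(gpus: int) -> str:
    if gpus <= 1_500:
        return "Jericho2C cluster (small/medium, up to ~1.5K GPUs)"
    if gpus <= 8_000:
        return "two-tier leaf/spine on Jericho3-AI (up to 8K GPUs)"
    if gpus <= 32_000:
        return "three-stage: leaf plus two spine layers (up to 32K x 800G ports)"
    raise ValueError("beyond the scale discussed here")

print(ddc_topology(6_000))   # -> two-tier leaf/spine
print(ddc_topology(20_000))  # -> three-stage
```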
Now, on top of all these complicated clusters, we realized that sometimes it's relatively hard to manage or orchestrate distributed systems, right? So what we did is develop an orchestrator, which we call the DriveNets Orchestrator.
And it's really magic.
I mean, it's what you use to provision your cluster, to zero-touch provision your cluster.
It's what you use to troubleshoot hardware and software problems.
It's what you use to upgrade or downgrade software.
And really, everything is a single touch with that orchestrator.
So to wrap up, because this is my last slide, I want to say that even though clusters have been built up until now with specific technologies, specifically InfiniBand and Ethernet Clos, there is a new way to build AI backend networks, and it turns out to be a better way.
Thank you very much.