CloudNets-AI
Challenges facing AI Networking Infrastructure
With the fast growth of AI workloads, network solutions used in the fabric of AI clusters need to evolve to maximize the utilization of costly AI resources and support standard connectivity that enables vendor interoperability. AI clusters are supercomputers built to perform complex, large-scale AI jobs. These systems are composed of thousands of processing units (predominantly GPUs) that need to be highly utilized. This poses significant challenges to the AI networking infrastructure, which needs to support thousands of high-speed ports (400 and 800 Gbps).
Full Transcript
Hi and welcome back to CloudNets, where networks meet cloud.
And today we’re going to talk about AI, and specifically about AI networking.
And we have our chat bot. Yeah, we have our very own yeah, artificial expert, Run, thank you for joining.
My pleasure.
So, Run, AI is a big thing lately with all ChatGPT and Bard and everything, and it’s growing.
And we want to talk about the challenges behind the AI infrastructure, because AI/ML training is a very compute-intensive task.
But also the networking, the back end networking that connects between all the parallel computers is a very, very important and critical infrastructure for AI.
So what are the three, let’s say, main challenges in AI networking that companies like the hyperscalers that are going into this market now need to resolve?
Okay, I’ll break it down to three things.
First off, in AI, HPC as well, application is king, and it boils down to three main ideas or three main pillars.
First off is that you have a variety and flexibility of different applications running on that same network. An AI network or an AI cluster
needs to be more connected to the outside world versus an HPC, which is kind of more of a back end kind of deployment.
Okay, so classic HPC is isolated, while AI is something that has the back end and the compute and the training, but also the connectivity to the online realm.
And more varied, more different applications running at the same time, different sizes, different lengths, different durations of the applications; that’s more varied in AI, whereas HPC is more uniform.
So basically, the network needs to be some kind of connected or online. So preferably the back end and the front end are the same technology, but also very flexible, scaling up and down
and accommodating various applications.
Yes.
Okay, this is one.
That’s number one.
Item number two is that an AI network is big.
HPC network is also big. Right.
It’s not a differentiator, but AI is so big that when you take an application and run it at a very small scale, you get a certain level of performance.
That’s good.
When you expand the scale, the application performance degrades almost linearly as the network grows.
So when you talk about application performance, you talk about job completion time. Job completion time overall kind of sums it up.
In job completion time and in the networking domain, we’re talking about nonstop connectivity, predictable connectivity. The network needs to be completely transparent to the application.
Like I said, application is king.
The network needs to stay out of the way.
Do not disturb.
Do not disturb the application.
So this is basically number two.
Item number three is that AI networks are big.
They are deployed by the largest players in the industry.
They need to be rock solid. It needs to be technology which is open, which allows multiple vendors to chime in.
Nothing that locks you into a certain application with a certain technology, with a certain vendor, anything which is not a lock in, anything which is more field proven and proven by multiple players and for a long duration.
This is what an AI network needs. Okay, so, wow, those are some challenges.
Not easy.
Just to sum up, we will have another episode, I think, in order to understand how we resolve those challenges, because it’s a fairly big task.
But just to sum up the challenges, first of all, we’re talking about flexibility and connection online, meaning that you need to accommodate multiple
applications, you need them to be connected to the Internet in order to be interactive, et cetera. This is one big challenge.
The second challenge is performance and scale and performance at scale. That means that it should be non-stop predictable, zero packet loss, very low jitter, et cetera, et cetera.
And it needs to keep this performance as the scale grows, which is the main pillar of this challenge.
And lastly, it needs to be a safe bet. You don’t take chances here. You don’t rely on one vendor, you don’t rely on non-field-proven technology.
You need things to work. And as you said, you need to forget about the networking part because the compute is the king.
The GPUs need to feel they are connected and that nothing interrupts their connectivity.
The network needs to be there and needs to be transparent.
So these are the big challenges, a very big challenge and a very big investment.
As such, you need to reduce the risk.
Absolutely.
So thank you very much, Run, for
this pleasure and thank you for watching.
Stay tuned for our next episode in which we will talk about those challenges but from this resolution angle.
So see you next time.
Don’t miss it.
Bye.
Solutions for Challenges in AI Networking
With the fast growth of AI workloads, network solutions need to be ready to resolve issues including having a flexible and online architecture, being able to scale and maintain performance at scale, and having a field proven rock-solid solution.
Full Transcript
Hi and welcome back to CloudNets, where networks meet cloud.
And today we’re going to talk again about AI networking. And this time we have Run, our chatbot, that will provide the solutions for the challenges we mentioned last time.
So, Run, let’s dive into it right away.
We had three challenges.
What are the solutions?
All right, so challenge number one that we had is that the network needs to be very flexible and very much connected to the outside world.
First off, flexibility: Ethernet. And connected to the outside world: Ethernet. 600 million ports of Ethernet are deployed.
And each and every one of them cannot be wrong. They cannot be wrong, exactly.
You don’t need a gateway because the Internet is also interfacing to that network via, again, that same standard Ethernet. Ethernet can be built to any scale of a network that you need.
So in that sense, Ethernet is kind of a classic solution to this challenge.
As opposed to proprietary interfaces that in some cases are used in the back end.
The first time that you say proprietary, that closes the open.
Yeah.
Okay, so let’s go to challenge number two.
All right, scale.
Yeah.
Challenge number two was about very large scale, and performance, obviously performance at scale.
So in this case, a DDC type of a solution provides a fabric, whereas other solutions provide a network.
Network has an impact.
Okay.
A network has an impact because it has multiple hops.
Exactly.
You want a single chassis, many, many cables.
Exactly.
And network has nodes, network junctions. And these junctions bring in traffic from multiple locations onto multiple destinations.
And this crisscross or mishmash of traffic flows has an impact.
And this impacts the application.
Remember, application is king.
You don’t want to hurt the application. So those networks are not scheduled.
They suffer packet loss and jitter, et cetera, et cetera, which results in higher JCT.
They are networks.
Networks are networks.
Exactly.
And DDC is a fabric.
Fabric is a fabric.
It’s that simple, right?
Okay, that’s challenge number two.
Okay, so we want DDC. We want DDC.
Now the question. Which DDC?
And that was, in a way, challenge number three.
What you would like for your AI network is something which is robust, something which is reliable, something which is field proven.
And this is exactly where DriveNets comes into play with DDC AI-3.0, version 3.0 of the solution.
Actually, this is the same DDC that we used in AT&T only optimized for AI networking.
Hence the 3.0.
It’s actually the third generation.
And the second generation, which is basically the same architecture, is field proven.
We talked about AT&T numerous times.
This is what’s running the core of AT&T’s network in America.
Right?
So it’s field proven up to almost 700 terabits per second, and 3.0 will have even more.
Exactly.
So you can actually pull this up to 32,000 endpoints, Ethernet endpoints.
So that’s how large the cluster, how many GPUs you can interconnect directly to one fabric.
That’s huge.
That’s massive.
Okay.
And for sure, a lot larger than any chassis.
Not the same game. Not the same game.
Okay, great.
So now we are at ease, because those main challenges are resolved.
So just to recap!
The first one was to have a flexible and online architecture, which means you want an open Ethernet standard and not some kind of proprietary interface or protocol in the back end network.
The second one is that you want scale and performance and performance at scale. And this is where the DDC fabric, which is scheduled and predictable comes into place.
Unlike nodes and networks, which bring in a lot of jitter, a lot of packet loss, and nothing that resembles predictability.
And the third challenge is having a field proven rock solid AI Networking solution. And after we’ve established that we want a DDC, we now know that we need DriveNets DDC because this is the only DDC that’s actually field proven. AT&T and many others are already using it.
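As a rough back-of-the-envelope sketch, here is what the endpoint and port-speed figures quoted in this episode add up to in aggregate access bandwidth (treating the numbers as illustrative, not as a product specification):

```python
# Rough sizing sketch for an AI back-end fabric, using the figures quoted
# in this episode (32,000 endpoints at 800 Gbps each). Illustrative only.

endpoints = 32_000          # GPU-facing Ethernet ports
port_speed_gbps = 800       # per-port speed in Gbps

aggregate_tbps = endpoints * port_speed_gbps / 1_000
print(f"Aggregate access bandwidth: {aggregate_tbps:,.0f} Tbps "
      f"({aggregate_tbps / 1_000:.1f} Pbps)")
# -> Aggregate access bandwidth: 25,600 Tbps (25.6 Pbps)
```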
Thank you very much for resolving those questions.
Thank you for watching.
See you next time on CloudNets.
Thank you very much.
Cheers.
Achieving Performance at Scale for AI Networking
For AI networking you need performance, very high performance, and you need very high scale. But you need this performance to be consistent even at this high scale.
Full Transcript
Hi and welcome back to CloudNets, where networks meet cloud, in this case the AI cloud, because we are going to talk about Network Cloud-AI
and AI networking backend fabric in general.
And we’re going to talk about a specific phrase we are using, which is Performance at Scale.
Because for AI Networking you need performance, very high performance and you need very high scale.
But you need this performance to be consistent even in this high scale.
And we have our AI expert at scale, Run our chatbot, to explain how Network Cloud-AI achieves this performance at scale.
So let’s start with DDC where performance at scale kind of applies to DDC inherently, right?
AI brought this problem to a higher level, higher scale and definitely higher performance.
So first off, when we talk about AI Networking, we’re talking about DDC as a fabric versus the alternative which is actually a network.
When you have a network, you have all sorts of network issues, protocols, negotiation, distribution of messages,
because it’s within an Ethernet network which has multiple hops, and multiple brains above all.
And when you have a fabric, there is no path that you need to choose.
You just run throughout the entire fabric.
Like we’re connecting all the GPUs to a single very large ethernet node or switch.
In fact, it is what we’re doing.
It is what we’re doing.
So that’s item number one.
Item number two is that we are introducing a new level of scale.
First off, there’s a new ASIC that we are using, the Jericho 3 and Ramon 3, which kind of boosts up the overall capacity, bandwidth-wise, of the DDC solution.
Second is we are introducing a topology which has two tiers of a fabric, one above the other.
So in terms of fan out, we can reach up to 32,000 endpoints in this case.
That’s a lot of endpoints.
Yeah, 800 gig each.
Okay, so that’s topic number two.
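To see why adding a fabric tier multiplies the reachable endpoint count, here is a generic Clos-style fan-out sketch. This is textbook fat-tree math with an assumed switch radix, not the exact DDC (NCP/NCF) topology or the real Jericho 3 / Ramon 3 port counts:

```python
# Generic Clos / fat-tree fan-out sketch: why a second fabric tier above the
# leaves multiplies how many endpoints one scheduled fabric can reach.
# The radix below is an assumption, not an actual ASIC port count.

k = 64  # ports per switching element (assumed radix)

# Leaf + single fabric tier: each leaf splits its ports half down (to GPUs)
# and half up (to the fabric), and the fabric element radix caps the number
# of leaves, so endpoints ~ k^2 / 2.
endpoints_single_fabric_tier = k ** 2 // 2

# Leaf + two fabric tiers (a classic 3-stage fat-tree): endpoints ~ k^3 / 4.
endpoints_two_fabric_tiers = k ** 3 // 4

print(f"radix {k}: one fabric tier  -> {endpoints_single_fabric_tier:,} endpoints")
print(f"radix {k}: two fabric tiers -> {endpoints_two_fabric_tiers:,} endpoints")
# radix 64: one fabric tier  -> 2,048 endpoints
# radix 64: two fabric tiers -> 65,536 endpoints
```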
And third is when we’re talking about failover.
When you have a network which is at a capacity of 30,000 nodes, failure is not the exception, it’s the steady state.
You always have somewhere some sort of a failure scenario.
And you need a network that knows how to identify these issues as fast as possible and react to these.
When you have a fabric, you don’t need to divert the path from the failed path onto a new one because there is no path, just use the entirety of the fabric.
So you just need to react by removing the broken link.
And there you go, there you have it.
So failure recovery is instant, we’re talking about microsecond level of recovery from failure.
Okay?
And this is very important for AI, because if a job fails, it needs to go back to the beginning or to the last checkpoint. A failure has an impact,
which means JCT is badly degraded.
AI is not really a forgiving topology, a forgiving application.
When you hit it, it hits you back.
So this is why Network Cloud-AI is a very good solution for AI networking and AI fabric.
Three reasons.
One is the performance, which derives from the fact that it is not a network, but rather a very big, very scalable fabric or chassis (a distributed chassis, actually).
So you have one Ethernet hop from any GPU to any GPU, and everything inside is a cell-based fabric which has lossless connectivity.
The second is scale.
Scale is achieved with a new chipset from Broadcom, the Jericho 3 and Ramon 3.
And with a new architecture with a two-tier fabric, which means we can scale up at this point to 32K GPUs, with 800 gigabit per second ports.
And the third is failover, which is super important.
It’s almost a seamless failover because everything is managed within this fabric and you do not need to reroute or rehash the tables.
And that means that we’re talking microseconds.
Microsecond always on behavior.
Yeah.
Okay, so this is very interesting.
We’re going to talk about each of those topics in length in a separate video series.
Stay tuned.
But for now, thank you very much, Run, thank you.
My pleasure. Thank you for watching.
See you next time at CloudNets and
CloudNets AI.
Bye.
Cell-based AI Fabric
Ethernet is a lossy technology. What one needs is a lossless and predictable fabric. It will still be Ethernet towards the outside, as it needs to connect, standards-wise, to any other peripheral equipment. But on the inside it has to be something predictable, and the implementation of a cell-based fabric meets that function. The cell-based fabric also changes the way one ‘sprays’ the incoming packets between the different spines.
Full Transcript
Hi and welcome to Cloudnets AI.
This is a special miniseries spin off of CloudNets in which we’re going to talk deeper into AI and AI infrastructure and we have our very own chat bot, Run.
Hello our AI specialist.
Today we’re going to talk about cell based AI fabric.
We’ve been talking about that for Network Cloud-AI and how a cell-based network fabric performs better than the alternatives.
But today we want to touch on why and how. How come?
So yeah, how is it different from any other Ethernet-based alternative?
OK.
Well, let let me start with the fact that Ethernet is a lossy technology.
When connecting multiple Ethernet devices one to the other, you create a network.
That network is essentially lossy.
It’s a best effort idea.
You send the traffic and you hope for the best, and you have all sorts of mechanisms to kind of back up in case that traffic got lost or whatnot.
For AI workloads, what you need is a lossless and predictable fabric.
OK, that will still be Ethernet towards... Exactly, it will still be Ethernet towards the outside world; it needs to connect, standards-wise, to any other peripheral equipment.
But on the inside it has to be something predictable, and the implementation of a cell-based fabric is one that implements that function. OK. And the cell-based fabric also changes the way we spray the incoming packets between the different spines.
We do not do hashing and bouncing; we actually do spraying, which is equal and even. This is an approach where in Ethernet you typically get a load of traffic, multiple packets or multiple different flows, and then you apply a hash function on top of this, and based on that you select an aggregation device through which that traffic is going to traverse to the receiving end. That’s known as hashing.
When you have hashing and then you have different types of flows sometimes you have very large flows known as elephant flows so large elephants and small mice.
So you can have a kind of a lack of balance between the different aggregation devices inside your network.
So the result would be that one aggregation device can be exhausted, another could be left idle and the entirety of the network will be perceived as if there is a congestion situation.
Even though the resources can be utilized as low as 10%, you will still see losses.
And this is not something you can afford when you’re talking about AI workload.
You invest a lot into your network and then you only utilize a small portion of it.
What the cell-based approach enables you to do is to spread the traffic across all of the multiple spines or aggregation devices that you have.
So you can keep the utilization level of all these devices on the exact same level.
When traffic is increasing, the entire network increases.
That’s what’s known as cross-bisectional bandwidth, right?
So everything kind of goes up as one plateau or goes down as one plateau.
So it’s a nearly perfect load balancing.
Close to perfect.
Yeah, perfect balancing method.
That’s the idea.
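To make the hashing-versus-spraying point concrete, here is a small, self-contained simulation sketch. The spine count, flow count, and flow-size distribution are arbitrary illustrative choices, not measurements of any real fabric:

```python
# Toy comparison of flow hashing (ECMP-style) vs. per-cell spraying across
# spines. Flow sizes follow a heavy-tailed distribution to mimic elephant
# flows; all numbers are illustrative.
import random

random.seed(1)
spines = 8
flows = [random.paretovariate(1.2) for _ in range(200)]  # heavy-tailed sizes

# ECMP-style: each flow is pinned to one spine by a hash (random pick here).
hashed = [0.0] * spines
for size in flows:
    hashed[random.randrange(spines)] += size

# Cell spraying: every flow is chopped into cells spread evenly over all spines.
sprayed = [sum(flows) / spines] * spines

def imbalance(load):
    return max(load) / (sum(load) / len(load))  # busiest spine vs. average

print(f"hashing  imbalance: {imbalance(hashed):.2f}x the average spine load")
print(f"spraying imbalance: {imbalance(sprayed):.2f}x the average spine load")
# The hashed fabric hits congestion on its busiest spine long before the
# average utilization is high; the sprayed fabric loads all spines equally.
```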
OK that’s cool.
And this basically allows you to be predictable regardless of the overlay application. Correct.
Right.
That’s like the third... Well, potentially I would say that’s the outcome of this.
When you’re running an AI workload, you don’t know what’s going to be the application that’s running on top of it.
And even more so, even if you know what kind of applications you’re running today, you will not know what applications you’re going to be running in the near future, because there is a multitude of researchers working on this.
You build a billion dollars’ worth of infrastructure and then you have lots of researchers working on top of it.
Sometimes you even outsource that infrastructure as a cloud resource, so you don’t even know who that researcher is.
So the next-gen application that’s going to run on top of it is, by definition, an unknown.
So and you don’t want to fine tune your network according to each new application.
It goes beyond that.
You don’t want to, you cannot.
It’s practically impossible to have a tuning team which is working directly towards so many different researchers.
Everybody’s doing a different thing and you don’t know what’s going to happen tomorrow.
So it’s practically impossible to kind of fine tune.
You need to have something which is agnostic to the type of flow without any surprises in the network.
OK.
So I think it makes sense, and I think we came up with three main points, as we typically do, on why a cell-based fabric is so much better than the alternative when it comes to Ethernet for AI fabric or AI infrastructure.
The first is that it’s lossless, as opposed to the lossy nature of an Ethernet network, because it behaves as a single node.
The second is that we have near-perfect balancing, so we can utilize the infrastructure practically and spread everything evenly across the different spines. And the third one is that we are actually agnostic to the overlaying application or workload, and in that manner we are future-proof, because the next workload will, by definition, behave differently than the current ones.
So the network will not limit the research or the innovation.
Exactly.
OK.
Thank you very much Run.
Thank you for watching.
See you next time on CloudNets-AI.
Avoid congestion in AI workloads
Congestion demands attention; otherwise it can result in higher latency and packet loss. The main dilemma around congestion is whether to avoid it or to mitigate it. We’ll look at how scheduled fabrics make your AI infrastructure lossless and predictable, without bringing in additional technologies to mitigate the congestion.
Full Transcript
Hi, and welcome back to CloudNets-AI, our special miniseries spinoff of CloudNets, in which we talk about AI, but in greater detail. Yeah. And teach you, hopefully, some things.
So today we’re going to talk about congestion and the two methods of dealing with it. One is dealing with it and one is avoiding it. Like the saying attributed to Einstein: a clever person solves a problem; a wise person avoids it. But you get our drift. If you have an issue, you can avoid it altogether, or you can wait for it to happen and then deal with it. And what we are trying to do is to be the first one.
So we eliminate congestion altogether with the AI networking fabric, instead of waiting for congestion to happen, which it will eventually when it comes to Ethernet, and then inventing some mechanism to mitigate it. Some of those mechanisms are very good, but still, if you avoid congestion in the first place, it is better. And this is where Run comes in and explains. Let’s put it this way: when it comes to a network, the idea of best effort always persists. That’s the basic principle. You send traffic and you assume that everything is going to be all right. And then when something fails, and something always fails, definitely in large infrastructures, you start to react to it. And now everything rotates around how fast your reaction is, how good your way of collecting the inputs is, all the indicators of what’s going on in the network. From the very basic retransmission mechanisms, and on to very sophisticated telemetry systems. Alternative routes, resending of the packet, collecting of telemetry, tracking of buffers or buffer indicators showing to what extent the buffer is consumed at a given time. All of these are methods to understand that there’s something happening in the network right now and then react to it. Let’s try to solve it. Let’s try to fix the problem after something already cracked. Okay?
And on the other hand, when it comes to a cell-based fabric, or a scheduled fabric, the logic is: let’s not send the traffic before we have a guarantee that the traffic is valid, that the network is able to absorb it throughout the entirety of the path of that fabric, which is basically the concept behind a chassis. Right? This is how a chassis is built. A chassis, let’s kind of zoom out.
A chassis is a device which was built to fit into a network. So the network has all sorts of network behavior, but every network component is expected to work as a guaranteed device. Exactly. So when you build a chassis, although there are multiple components, you inherently build all these mechanisms into the chassis, because again, it needs to behave like a single network component. You don’t have congestion on the chassis backplane. Right. That’s exactly what that internal mechanism solves for you. Because again, a chassis is one network element in a larger network. Okay, so let’s talk about this mechanism which was implemented in a chassis, and is now implemented across the entire network with the DDC. The logic is that there is a scheduling mechanism. It’s called a virtual output queue. There is an indication from the output device, the receiving side of the traffic, to the transmitting side that the network in full is capable of accepting this amount of traffic, and only then is traffic transmitted into the network. There is a situation called head-of-line blocking. When you have multiple devices sending into one device, congestion is not caused by one entity but by multiple entities, all sending, not even to a specific destination. But there is a junction somewhere in the network which absorbs a lot of traffic, and then that traffic spreads, but that junction gets congested. This is the noisy neighbor problem. Yeah, and then when you propagate that congestion back to the sending side, when you’re just blindly blasting all of the senders that there is congestion, you might impact a good neighbor because the noise is coming from another neighbor, right? And you want to impact only that noisy neighbor. So that’s the scenario known as a noisy neighbor.
When you have an inherent VOQ, virtual output queue, mechanism built into your fabric or network, you avoid both these problems, head-of-line blocking and the noisy neighbor. And you also better utilize the fabric, because you do not send and waste fabric resources on a packet that will not be able to fulfill its journey. Right? Right. One of the criteria for a good, solid network is the cross-bisectional bandwidth, how much actual bandwidth is running through that aggregation layer of the network. One goal is to bring that cross-bisectional bandwidth level higher. When you’re sending traffic, you count that traffic as part of the cross-bisectional bandwidth; but when it gets to the end and then gets dropped and needs to be retransmitted, or even causes backlash traffic going back to the source, you exceed the capacity of the cross-bisectional bandwidth. You count it, but it’s useless traffic. Right. It’s a false utilization of the cross-bisectional bandwidth. So you only want to send... It’s throughput, not goodput. It’s throughput versus goodput. Exactly. So you want to keep that cross-bisectional bandwidth measurement a real measurement, and not just a falsely good indicator of something which is essentially not working well. So this sounds optimal, because we avoid a problem which we would have a very hard time resolving, and we get great performance.
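As a conceptual illustration of the VOQ idea described above, here is a toy Python model of grant-based scheduling. It is a sketch of the principle only, not the actual DDC scheduler; the rates, port names, and queue sizes are made up:

```python
# Minimal toy model of VOQ-style, grant-based scheduling: ingress ports keep
# a separate queue per egress (the virtual output queues) and send into the
# fabric only what the egress has granted.
from collections import defaultdict

EGRESS_RATE = 10                      # cells an egress can drain per tick
voqs = defaultdict(list)              # (ingress, egress) -> queued cells

def enqueue(ingress, egress, n_cells):
    voqs[(ingress, egress)].extend([1] * n_cells)

def schedule_tick(egress):
    """Grant credits for one egress: pull cells fairly from its VOQs."""
    granted = 0
    queues = [q for (i, e), q in voqs.items() if e == egress and q]
    while granted < EGRESS_RATE and queues:
        for q in list(queues):
            if granted == EGRESS_RATE:
                break
            q.pop()                   # this cell is granted and transmitted
            granted += 1
            if not q:
                queues.remove(q)
    return granted

# Three noisy ingresses all target egress 0; a well-behaved ingress targets 1.
for ingress in ("A", "B", "C"):
    enqueue(ingress, 0, 50)
enqueue("D", 1, 5)

print("egress 0 drains", schedule_tick(0), "cells this tick (rest stays queued)")
print("egress 1 drains", schedule_tick(1), "cells, unaffected by the hot egress")
# Excess traffic waits at the ingress VOQs instead of being dropped in the
# fabric, and the good neighbor (D -> egress 1) is never back-pressured.
```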
Is there any trade-off? For instance, what about latency? Because if you measure the latency of a specific ASIC, you might see lower results on an ASIC that was aimed at a network and not at a chassis. But is that the right way to look at it? To a certain extent it’s somewhat true. You could measure the lowest possible latency, and then a fully scheduled mechanism might push that minimum latency a little bit higher. But when you’re looking at the application, and explicitly at AI networking, that’s not the way to measure latency. You want to measure the tail latency: the last traffic that traverses the network and arrives at the receiving end, where a calculation is being done, is always the one that dictates when the calculation will begin. Right? So you need to get all of your data there; for that matter, this is what affects JCT, job completion time, in AI workloads. So what you want to measure when you’re measuring latency is in fact the jitter. What’s the difference between the lowest latency and the highest latency? That’s the count that matters the most. And when you have a fully scheduled mechanism like a VOQ, that latency variation, the jitter, is minimized. Right? In certain scenarios you can measure a very low latency when you’re using a plain Ethernet network, or a very high latency. So even if one packet arrives earlier, your workload still waits for all the packets to arrive. It’s meaningless. It’s like a basketball team: the first player being in is meaningless. Okay, thank you very much, Run.
This was our pitch about congestion avoidance versus mitigation. We talked about better utilization of the fabric. We talked about VOQs ensuring that the packets can come through to the destination port. We talked about latency. And in general, we talked about how scheduled fabrics make sure that your AI infrastructure is lossless and predictable, versus additional technologies that handle congestion and do not avoid it. There is a lot of development around this area of solving congestion once it happens; avoiding it simply works better. Yeah. Thank you very much, Run. Thank you for watching. We’ll be back with more CloudNets-AI. See you then.
Failure Recovery
Failure recovery is a very big issue when it comes to AI clusters, because there are always failures, and when a failure comes, it’s a big thing: you need to stop the calculation and go back to the last checkpoint. You lose a lot of time and money, and resources sit idle and are wasted. The networking part is crucial in order to create a fail-safe environment.
Full Transcript
Hi, and welcome back to CloudNets-AI, this miniseries spinoff we have in order to go deeper into AI infrastructure and
specifically AI fabric.
And today we have a very special guest star, Yuval, our
head of product.
Hi, Yuval. Thank you for joining.
Thank you. Thank you for having me on the show.
Exactly. Well trained. So, Yuval, we want to talk about failure recovery, and failure recovery is a very big issue when it comes to AI clusters, because, as you know, there are always failures.
And when a failure comes, it’s a big thing, because you need to stop the calculation and go back to the last checkpoint; you lose a lot of time and money, and resources that are standing idle, and wasted time, et cetera. And the networking part is crucial in order to create a fail-safe environment. And our Network Cloud-AI provides predictable, lossless, very fast convergence and failure recovery.
How do we do that?
Done! You gave all the answers.
So maybe let’s talk first about the problem that cloud providers are experiencing today when trying to build a big training cluster.
First of all, you went out and purchased a very large amount of GPUs.
You can say 8000, 16,000, maybe 32,000.
That’s millions of dollars of investment in infrastructure that needs to be 100% utilized all the time.
Now, what’s the problem?
Like you mentioned, what’s the problem they’re trying to solve, or why do they need to take care of failure recovery? Today, take any infrastructure or any architecture, like Clos or InfiniBand: if there is a failure, you just stop. Now, when you stop, money is being spent, and what you’re trying to look at now is how fast can I recover my service, or my model training, or the specific layer that I’m trying to calculate, and bring it back
into action so I can get my GPUs running again.
So from the networking side, is it safe to say that we want to be below the threshold above which the entire job needs to be restarted and reset to the last checkpoint?
Yes, there’s time to recovery, and you need to be below that. You want to make sure that every time, if there is some kind of a failure, whether it’s the spine itself, whether it’s connectivity, maybe to GPUs, or maybe between the leaf and spines, if there is a failure there, you want to make sure there is no interruption to the model or the job that’s running.
So you want to keep that running at all costs. But what happens today is that most cloud providers are trying to solve that, not on the infrastructure itself.
They’re trying to solve it on the endpoints. They’re trying to change the way they’re building checkpoints.
They want to bring the storage back or closer to the actual GPUs to make sure the copy of the checkpoints is faster than they used to have before.
So they’re trying to do a lot of changes in their infrastructure, but it’s not the actual fabric. What we are offering with our solution is the fact that we have a lot of
advantages that in most cases make sure that the failure recovery is seamless: you don’t know that there was a failure on the fabric itself, and the job keeps on running. So it’s maybe a bit redundant to invest a lot of effort in building all kinds of mechanisms that bypass that; just invest in the actual fabric that gives you those flexibilities. So you have speed-up between the leaf and spine, you have multiple links and you have cell spreading. If one of them fails, there is automatic detection by the hardware in less than a millisecond, and it switches all the traffic to the remaining uplinks, so there’s no impact to the actual job.
Okay, so let’s just explain the term speed-up. That means oversubscription of the uplinks versus the ingress traffic. So we have more fabric links than we need, so we can accommodate any failed link with the rest. Yeah, so let’s take an example. You have a leaf and you have 20 GPUs connected to it with 400 gig links.
Now usually in a cluster topology, the same 20 links you have, which we call downlinks towards the GPUs, are going to be the same 20 uplinks from that leaf to any spine that you have in a Clos topology. When you’re talking about DDC, you’re going to have more, maybe 10%, maybe 15%: 22 links, 24 links. That means that in case of a failure, going down from 24 links to 23 links, nothing is impacted, because the amount of traffic you have is only for 20 links. Okay? So it means there’s no impact to the actual traffic. That’s one point. The other point is how fast can you actually move traffic
from that failed link, number 24, to the other 23 links?
First you need to detect it. And we have a hardware mechanism that has very fast detection. And using software, obviously, you move the traffic onto the other remaining links, and that needs to be seamless, so the job is not impacted.
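The speed-up headroom from the example above can be sanity-checked with a few lines of Python. The 20/24-link and 400G figures are taken from the example in this episode and are illustrative:

```python
# Headroom ("speed-up") sketch for the leaf example above: 20 GPU-facing
# 400G downlinks vs. 24 fabric-facing 400G uplinks.

link_gbps = 400
downlinks = 20                 # GPU-facing traffic the leaf must carry
uplinks = 24                   # fabric-facing links, over-provisioned

ingress_gbps = downlinks * link_gbps

for failed in range(0, 5):
    remaining = (uplinks - failed) * link_gbps
    ok = remaining >= ingress_gbps
    print(f"{failed} failed uplink(s): {remaining} Gbps of fabric capacity "
          f"for {ingress_gbps} Gbps of GPU traffic -> "
          f"{'no impact' if ok else 'congested'}")
# With this speed-up, up to 4 uplinks can fail before GPU traffic exceeds
# the remaining fabric capacity.
```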
And I think that’s the key point.
From our perspective, if there is a failure in the infrastructures, and there is a lot of failures on the infrastructure, especially in a very big 16,000 GPU environment. There’s hundreds of leafs and spines, there’s thousands of GPUs, there’s going to be failures. It’s like a very big network. So every time you experience a failure in a leaf, a failure in a spine, a failure in one of the uplinks, if
it’s seamless and the job doesn’t even notice there was a failure, then you’re not losing money.
And I think that’s the key point. You need to make sure the GPUs are 100% running at all times. Okay? So we detect the failure much faster than any external entity that monitors it because it’s hardware based.
There’s also the convergence and reroute speed, because if you run some kind of internal gateway protocol in order to sync between the leafs and spine, it will take time for it to converge while we do it hardware assisted, so it’s immediate.
You’re right. So if you try to compare what we have in terms of hardware and software solution versus the other alternatives in the market, you might have an SDN controller like solution
which monitors the entire network, but then it takes time to detect the failure and notify the entire network.
That now needs to be some kind of reconvergence. You can use BGP across the network, but then again, it’s a routing protocol used to converge routing entries for Internet, not specifically for
AI workloads. You need something fast, very fast, sub millisecond. What we have is detection using hardware. So there’s no external controller that does that. It happens immediately locally on every one
of the boxes. The other aspect is a software solution that we built that is very, very fast and synchronizes the entire infrastructure: there was a failure, move traffic aside. And that’s a key point, because every decision is made locally on each one of the boxes.
They don’t need to wait for the entire network to converge. So once you have that hardware detection and that software that makes that decision, you’re much faster than any alternative in the market.
So on all the steps of detection, decision, and propagation, we provide very fast convergence as opposed to any alternative.
And this is basically how we stay below the threshold that affects the upper layer work.
Exactly. So instead of trying to fix or revert the failure once it happens, we’re trying to avoid the failure from happening.
And that’s the key point.
We want to save money, pretty much.
Absolutely.
Okay, thank you very much.
Thank you.
Thank you for watching. This was how
we handle very fast fault recovery on
all levels, on detection, on decision, and
on propagation. And the bottom line is
we allow the workloads to work seamlessly
and not stop and go back to
the last checkpoint.
Thank you for joining us. Thank you for watching.
We’ll be
back with additional CloudNets-AI soon.
Thank you. Bye.
CloudNets-AI: AI Network Fabric
We have an issue with resolving the AI fabric, or the AI networking problem, for large clusters of GPUs usually used for training. This episode looks at this issue and explores how an Ethernet-based solution can resolve it: by building a chassis which is distributed (a disaggregated, distributed chassis). This approach is lossless and fully scheduled, but without the scale limitation of a chassis.
Resolving the AI networking problem with large clusters of GPUs
The three issues are: the problem itself, which derives from the fact that we use RDMA and therefore requires a lossless, scheduled, high-performance fabric; the endpoint scheduling solution; and the network-based solution for resolving this issue.
Key Takeaways
- AI networking problem: derived from the fact that we use RDMA, which means that we need a lossless, scheduled, high-performance fabric, and from the elephant-flow nature of the information distribution within the cluster
- Endpoint scheduling: relies on the endpoints, which need to be very smart and very compute- and power-hungry, like DPUs.
- Network-based solution: practically building a chassis which is distributed, hence a disaggregated, distributed chassis, which gives you a lossless, fully scheduled fabric with no packet loss
Full Transcript
Hi, and welcome back to CloudNets-AI, where networks meet cloud.
And today we’re going to talk about
AI and specifically about AI network fabric and the different ways to implement it.
We have our AI network fabric expert, Yossi.
Hi, Yossi.
Hey, everyone.
Thank you for joining us.
Thank you for having us.
So we have an issue with AI
network fabric, right.
We have some requirements.
We have some specific things we need to know about.
Let’s understand first, what is the problem?
What problem are we trying to resolve?
Great question.
Essentially, AI networks rely on two fundamentals.
First one, they use RDMA, Remote
Direct Memory Access.
Now, the reason AI networks use RDMA
is because we want to reduce latency
of read/write operations as much as we can.
Now, the second thing we have, or
the second characteristics we have in RDMA
or AI networks is elephant flows.
These folks, the GPUs that participate in a cluster, usually send very long flows of data.
Now, these two characteristics that I
mentioned, the RDMA nature of it and the elephant flows, causes several problems.
You want to talk about it?
Oh, yeah, please.
Okay.
So essentially, RDMA is not tolerant of loss.
RDMA works with an algorithm called Go-Back-N (GBN).
So what happens is we lose a
lot of time, and time is expensive
when you’re talking about AI networks.
Because job completion time means the
utilization of the GPUs.
And this is very expensive.
Exactly.
So first thing first, you’re not allowed
to lose any packet when you’re talking
about AI networking or RDMA networking in
specific.
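To get a feel for why Go-Back-N makes even a single loss expensive, here is a rough, illustrative sketch; the window size and loss position are arbitrary assumptions, not RDMA defaults:

```python
# Rough illustration of why one lost packet is expensive under Go-Back-N:
# everything sent after the lost packet within the window is retransmitted,
# unlike selective repeat, which resends only the lost packet.

window = 256          # packets in flight (assumed)
lost_index = 10       # position of the single lost packet within the window

go_back_n_resend = window - lost_index   # lost packet plus everything after it
selective_resend = 1

print(f"Go-Back-N resends {go_back_n_resend} packets for 1 loss "
      f"({go_back_n_resend / window:.0%} of the window)")
print(f"Selective repeat would resend {selective_resend} packet")
```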
Second thing I was mentioning is elephant
flows.
Now, the problem with, with elephant flows
is that they naturally have low entropy,
right?
Which means you cannot efficiently load
balance, which means you will have packet loss,
which contradicts the first.
Exactly.
So essentially what happens is with the
classic or standard ECMP or hashing
mechanism that you have today, what
happens is, is you bombard some specific
links in your network, while other links
or other resources in the network are
essentially idle.
Okay, so we understand the problem, we
understand what we need to do.
And basically there are two main
philosophies about how to resolve this
problem.
One is based on the endpoints and
things we do there in order to
mitigate congestion and to ensure all the
things you mentioned.
The other is based on a fabric,
the network itself.
So let’s talk about both of them.
Let’s start from the endpoints.
What do we do there?
Yeah, so you mentioned perfectly, you have
two types of approaches.
The first type, which is endpoint
congestion control or endpoint scheduling
mechanisms, are talking about how to solve
the problem
once it occurs.
Okay, then you have
a type of solution that is talking
about how to proactively prevent the
problem from happening.
Let’s deep dive into the NIC based
or the endpoint based solution.
So if you look at the industry
today, you’ll see all types of vendors,
you’ll see NVIDIA offering their
SpectrumX, you’ll see all sorts
of collaborations between switch vendors
and NIC vendors trying to
somehow integrate the NIC into the switch
in order to solve it.
But there are a few fundamental problems
with it, right?
First one, it’s very costly, right?
It’s costly because the SuperNIC or the
DPU costs a lot of
money, right?
That’s first.
Second, it’s costly because the DPU usually
consumes a lot of power and
needs a lot of cooling.
And that’s basically the worst solution
you can choose.
And I’m being specific here when you’re
talking about TCO, okay?
Now it solves the problem of elephant
flows because then you can activate on
your network some kind of smarter load
balancing mechanism like some kind of
packet spraying and stuff like that, and
then somehow reorder it on the NIC
side, right?
So it does solve the problem, but
then it introduces another problem that we
haven’t mentioned.
And this is an operational problem.
Now the operational problem that I’m
referring to is specifically with fine
tuning the network.
If you want to have a good
congestion control mechanism or a good
reordering mechanism in your endpoints,
you need skillset, you need people to
maintain that, right?
You need people to go ahead and
prepare your infrastructure every time you
want to run a model.
So it solves the problem.
On the technical side, it’s costly and
it requires some decent
expertise and decent skillset.
Okay, so this is a valid solution,
but it has its flaws.
What about resolving it or avoiding the
problem altogether in the network, in the
fabric itself?
So if you ask me, when I’m thinking about
AI networks, I think the optimal solution
would be chassis, right?
Yeah.
Imagine just taking a bunch of GPUs
and connecting it into one chassis, right?
And then the chassis does all the
magic.
Yeah, it’s a single hop Ethernet.
the backplane is…
Exactly, it’s a single
hop from NIF to NIF or from
port to port.
Right.
From Ethernet port to Ethernet port.
It has no congestion in it.
Right.
The connection from the NIF to the
fabric and then to the NIF is
end to end scheduling and you lose no
packets.
And in terms of operations in terms
of how to maintain that.
It’s.
Plug and play.
It’s a given. Every guy with a CCNA
can do it.
But if I have 32,000 GPUs, there
is no such chassis.
Exactly.
Now that’s the major problem.
And we have a solution for that.
You want to hear about it?
Oh, yeah.
Never heard of it.
So essentially what we did with DriveNets
is we took a chassis, we distributed
it, we disaggregated it, but that’s a
whole different topic.
So we disaggregated it, we distributed it,
and essentially we made it scalable to
an extent the industry have never seen
before.
Right?
So essentially our solution is based on
two building blocks.
We have the NCP and NCF, while
NCPs are equivalent to the old fashioned
line cards, and the NCF is equivalent
to the fabric board that we’re used
to from the backplane of the
chassis.
And essentially this distribution of the
chassis gives us the benefits of a
chassis which is end to end VoQ
system, fully scheduled, lossless by
nature, and then scalable.
In fact, you can have a chassis
like solution, which is optimal in terms
of operations and in terms of technical
abilities in AI networks, and you can
have it scale up to 32,000 GPUs
in a single cluster.
Okay, this is very cool.
So thank you, Yossi.
This was mind blowing.
The three things we need to remember
about resolving the AI fabric, or
AI networking problem with large clusters of
GPUs usually used for training, are these:
One, the problem itself derives from the fact
that we use RDMA, which means that
we need a lossless scheduled, high
performance fabric, and the elephant
flow nature of the information
distribution within the cluster, which
means classic load balancing like ECMP
would not work.
So we have two solutions.
The second and third point.
The first solution is endpoint scheduling,
which relies on the endpoints, which need
to be very smart and very compute- and
power-hungry, like DPUs.
This is a congestion control or congestion
mitigation solution, which
brings you only so far in terms of
performance, but it costs you a lot
and also is very complicated to manage.
This is coming from vendors like NVIDIA
with their SpectrumX, and also other
vendors that are cooperating in the Ultra
Ethernet Consortium, for instance.
And the third point is the network
based solution for resolving this issue.
The network based solution is practically
building a chassis which is distributed,
hence disaggregated, distributed chassis,
which means you have no packet loss,
a lossless and fully scheduled fabric, but
without the scale limitation of a chassis.
And this is coming, of course, from
DriveNets with our DDC, but also for
other vendors.
We will talk about Arista DES in
the next movie.
So this is what you need to
remember.
Thank you very much Yossi.
Thank you for having me and thank
you for watching.
See you next time on CloudNets.
Comparing the industry’s leading scheduled fabrics
Arista recently launched their DES solution, the Distributed Etherlink Switch, which is essentially an end to end VOQ system – a large scale chassis. The approach is basically saying we should put as much logic as we can on the switch side. The switch will handle all the congestion control, all the reordering of packets, the load balancing of packets, which are all necessary for AI networking. So why choose DriveNets DDC over those solutions?
CloudNets-AI: Scheduled Fabric
Three things you need to know about the latest development in AI networking or AI backend fabric.
Key Takeaways
- DES: Arista has launched the DES, the Distributed Etherlink Switch, which is basically the same scheduled fabric concept as the DDC or the DSF or any other scheduled cell based fabric out there.
- Differences: The differences between the DES and the DDC from DriveNets. DriveNets is disaggregated so you are not locked into a specific hardware or optic vendor.
- DDC: The Distributed Disaggregated Chassis (DDC) from DriveNets. It actually works not only in service providers, but also in hyperscalers running AI workload, and it’s a great solution.
Full Transcript
Hi and welcome back to CloudNets-AI, where networks meet cloud.
And today we’re going to talk again about AI Networking Fabric and specifically about something
Arista launched just the other day, the DES, the Distributed Etherlink Switch.
And we have our Arista expert here, Yossi, thank you for joining.
Thanks for having me.
So Yossi, the DES sounds a lot like our DDC.
What is it?
What is the difference?
Okay, so let me take a broader view if I may.
Arista did launch their DES solution, the Distributed Etherlink Switch, which is essentially an end to end VOQ system.
You can call it a large scale chassis.
Right?
That’s what it is, it’s a fully scheduled fabric.
Now if you look at other companies that are competing in this AI networking market, like Cisco for instance, they also launched something they call the DSF, which is Distributed Scheduled Fabric.
And again, same thing here, Arista is doing it with some specific vendor chipset and Cisco is doing it with their own chipset, the Silicon One.
Right.
So basically DDC, DSF, DES, same thing.
It’s a different terminology for the same approach, let’s put it this way.
And the approach is basically saying we should put as much logic as we can on the switch side.
Right?
So the switch will handle all the congestion control, it will handle all the reordering of packets, it will handle the load balancing of packets, which are all necessary things when you’re talking about AI networking.
Okay, so basically we have a scheduled fabric.
So from DDC they have the distributed and they add the chassis, not so much the disaggregated.
Because their solution is monolithic, a black box.
Yeah.
Okay, so why choose DriveNets DDC over those solutions?
What do we have, that they do not have today?
Okay, so first thing first, we have the extra D, which means we are
Disaggregated.
Right?
DriveNets is a software company.
We do not manufacture or design hardware.
We work with multiple ODMs that are out there.
And so, no vendor lock-in; the freedom.
Exactly.
No vendor lock-in.
Not on the optics side, not on the whitebox side.
We are completely disaggregated.
The second thing is production experience. DriveNets has been running these systems, the DDC/DES/DSF, in production networks of some of the biggest Tier 1 service providers out there for the last, I would say, seven years.
Okay, right.
And also in hyperscalers that are running workloads.
You got it.
In the past year and a half we have gained tremendous experience with AI networks, specifically with Tier 1 OTT or hyperscale companies.
You’re right.
In production environments, this is very important.
And the third thing I want to ask you is:
okay, at the bottom line, what do you say about DES?
I would say it’s a great solution.
Same goes for Cisco, by the way, the DSF.
So yeah, it’s a great concept.
Yeah, the concept is right
and we encourage it.
So, three things you need to know
about the latest development in AI networking or AI backend fabric.
One is that Arista has launched the DES, the Distributed Etherlink Switch, which is basically the same scheduled fabric concept as the DDC or the DSF or any other scheduled cell based fabric out there.
The second is that the difference between the DES and the DDC from DriveNets is (A) that we are disaggregated so you are not locked into a specific hardware or optic vendor.
And (B) is that the DDC from DriveNets actually works not only in service providers, but also in hyperscalers running AI workloads.
And the third point is that it’s a great solution.
We do appreciate the industry endorsing this concept.
Cisco, Arista, I believe more to come and we think this is really the best solution for your AI training workload.
So thank you very much, Yossi, for this fascinating conversation, and thank you for watching.
See you next time on CloudNets.
Tail Latency
What’s the importance of latency in AI networks? It’s time to rethink our approach to latency: not as an inevitable limitation but as a solvable challenge. Latency, the delay in data transfer between systems, is a critical factor in AI back-end networking. In AI back-end networking, different types of latency metrics exist, including head, average, and tail latency. Understanding these latency types and their effects on packet loss and packet retransmission is essential for optimizing AI system performance.
CloudNets-AI: How do you optimize tail latency?
The three things you need to remember about latency are that there are multiple types of latency, that tail latency is the most important one, and that the solution is a scheduled fabric: a fabric that connects the GPUs, is scheduled, and can assure that all of the packets, or most of the packets, arrive around the same time.
Key Takeaways
- Types of Latency: In AI Back-End Networking, there are different kinds of latency that can impact AI workloads: head latency, average latency, and tail latency
- Tail Latency is Critical: This is particularly important in inference tasks. High tail latency can create delays and inconsistent user experiences, ultimately limiting model performance.
- Scheduled Ethernet Fabric: Can totally eliminate packet losses and offer predictable tail latency
Full Transcript
Hi and welcome back to CloudNets, where networks meet cloud.
And today we’re going to talk about AI.
And not just AI, we’re going to
talk about the importance of latency in AI. And we have our late expert latency.
We couldn’t resist it. Our latency expert, Sani.
Thank you for joining Sani.
Thank you for having me, Dudy.
And sorry for my latency.
Okay, apology accepted.
We’re going to talk about latency in AI networks.
What is it, why is it important and how can we maintain it?
What are the three things we need to know about latency?
Right, so first we need to understand that latency in traditional networks is being measured, being addressed in a normal way.
However, AI networks introduce new challenges that need different treatments of latency.
Three Types of Latency
And, and let’s look on the three types of latency that we see in networks.
– Head Latency
So the first one is the head latency. Head latency is actually the first packets that will arrive that has the lowest delay in the network.
Okay, this is easy.
This is easy.
It can be measured when there is no load on the network, which is not typical for AI workloads.
– Average Latency
The second one is the average latency. The average latency is actually the mean calculation of latency over time, and it includes the time that it takes for all the packets to arrive.
Exactly. So it holds some information about the network performance. However, it doesn’t tell you the full story.
It’s like the typical packet.
– Tail Latency
And that gets us to the third one, which is the most important one in AI workloads.
This is the tail latency.
The tail latency is actually the slowest packet that arrives to the destination from all the packets.
The last packet that arrived defines the tail latency. Exactly.
Tail Latency is Critical
Okay, so the first point is that there are different types of latency.
Now let’s talk about which latency is the most important.
We mentioned tail latency, why is it important?
Exactly.
So in AI workloads, there are a lot of compute resources that are doing parallel computing and they’re getting a lot of data.
This is a data heavy network.
So all this compute is being done, being sent over the network and, and arriving to the destination.
So this is the parallelism process of the compute.
Okay.
All the collective communication.
Okay, definitely.
So when all this process ends, then only after all the data arrives to the destination, the next task can start.
So basically the compute waits for everything to arrive.
Some GPUs may be idle at this time until the last packets arrive, and only then continue.
Okay, so tail latency is very important.
Because it defines how fast the workload can progress.
Exactly.
If it’s not optimized and it’s high, actually, the compute is waiting for the network.
Okay, and we don’t want that.
We don’t want that.
Okay.
So tail latency is the most important parameter.
– Scheduled Ethernet Fabric
Okay, so now let’s talk about how can we reduce the tail latency?
We can optimize it.
Right.
So we at DriveNets are addressing this point and we actually have an innovative solution that we call Scheduled Ethernet fabric.
And we are doing different multiple steps in order to optimize the latency.
It’s actually a strategy of how to handle the latency in the AI network.
So this is the same solution we talked about earlier when we talked about the ingress packet being cut into cells and sprayed across the fabric.
It actually means that all the packets are arriving at around the same time, right?
So, give or take, all the packets are arriving in a predictable time with low variation.
So as we can see in the graph, when comparing standard Ethernet to Scheduled Ethernet, we can see the differences in the head latency and in the tail latency.
In scheduled Ethernet, you can see the improvement where the latency is predictable and the variance is very low, unlike normal Ethernet.
So this will have a dramatic effect on the job completion time and the performance of the AI network.
So basically, even if you intuitively think you have deep buffers, you add latency, et cetera.
What is, what is important is the tail latency.
And because of the low variation, the tail latency is basically fixed or very, very low.
Exactly.
So this is very important for AI networks not to stay in the frame of traditional networks and just measure the element latency.
Here it’s much more important, and a strategy that actually trades off a little bit of the head latency gives you a huge benefit on the tail latency, ensures that the GPUs are constantly working, that they have no idle time, and that the job completion time, which is the most critical parameter in AI workloads, is improved dramatically.
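A small sketch can make the head/average/tail distinction concrete. The latency samples below are synthetic and purely illustrative; the point is that the collective, and therefore the JCT, is gated by the slowest arrival, not by the average:

```python
# Head, average, and tail latency on a synthetic sample set: the job waits
# for the slowest packet, so the tail is what matters for JCT.
import random, statistics

random.seed(7)
# A mostly-fast network with a few congested outliers (in microseconds).
samples = [random.gauss(10, 1) for _ in range(990)] + \
          [random.uniform(50, 200) for _ in range(10)]

head = min(samples)
avg = statistics.mean(samples)
tail_p99 = statistics.quantiles(samples, n=100)[98]   # ~99th percentile
worst = max(samples)

print(f"head (best)   : {head:6.1f} us")
print(f"average       : {avg:6.1f} us")
print(f"p99 tail      : {tail_p99:6.1f} us")
print(f"worst (gates the next compute step): {worst:6.1f} us")
```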
Okay, great.
So three things we need to remember.
This is very important when you build an AI infrastructure.
Three things you need to remember about latency.
First, there are multiple types of latency.
So beware at what you’re looking at.
Is it the head, the average, or the tail latency?
The second thing you need to remember is the tail latency is the most important one, because this is the one that defines how much time the GPUs are awaiting network resources.
And it dramatically affects the job completion time and the overall performance and utilization of the cluster.
The third thing is that the good news is that there is a solution.
The solution of a scheduled fabric, a fabric that connects the GPUs that is scheduled and can assure that all of the packets or most of the packets are arriving around the same time.
So the jitter, the latency variation, is very, very low.
Means that even if the head or mean latency are a bit higher, the tail latency is fixed and is much lower than in any other solution.
This goes to better job completion time, better utilization of your GPUs, and your money in general.
Thank you very much Sani.
Thank you very much, Dudy.
Thank you very much for watching.
Stay tuned for the next episode
of CloudNets.
I’m.
I’m late.
I have to go.
Data Center Interconnect (DCI)
Data centers are moving towards supporting AI workloads, which means an explosion in capacity and a need for more infrastructure to support this demand. The traditional very large chassis is not enough anymore. Network operators need limitless capacity, as well as flawless, lossless performance to meet AI requirements. They need to support Layer 3 traffic. And all of these are featured in the Distributed Disaggregated Chassis (DDC) solution available from DriveNets.
CloudNets-AI: What are the 3 changes that AI made to DCI?
The three changes that AI made to DCI are capacity, performance, and one unified environment with Layer 3 capabilities.
Key Takeaways
- Increased Capacity Needs: The AI boom has led to large-scale workloads involving thousands of GPUs, generating substantial traffic. Consequently, DCI solutions must offer enhanced scalability to manage these increased traffic flows effectively.
- Enhanced Performance Requirements: AI applications demand lossless connections between data centers to ensure optimal performance. This necessitates DCI solutions with deep buffering capabilities to handle high-performance needs without packet loss.
- Layer 3 Capabilities and Unified Environment: Modern DCI solutions should function as routers, capable of managing numerous eBGP connections. This ensures seamless integration and communication across interconnected data centers, highlighting the importance of advanced Layer 3 capabilities in the AI era
Full Transcript
Hi and welcome back to CloudNets, where networks meet cloud.
Today we’re going to talk about DCI, Data Center Interconnect.
No, no, don’t go.
I know it seems like a boring subject, but DCI is going through something and this something is called AI.
And we have Shai, our interconnect and AI expert.
Again, thank you Shai for coming.
Thank. Thank you for having me.
3 changes that AI made to DCI
So Shai, what are three points or three changes that AI made to DCI and what do we need to do with the new DCI requirements?
So we have three things that we need to remember as you said.
First of all, we need to talk about capacity.
Okay.
Secondly, we need to talk about performance.
And thirdly, we need to talk about one environment and layer 3 capabilities.
Okay.
Okay.
1 Capacity
So let’s start with the first one with capacity.
Okay.
We all feel the AI boom that we have right now.
This means that those large scale workloads with thousands of GPUs generate a lot of traffic.
And we need to have DCI solution with enough scalability to handle those traffic flows.
Okay.
This is one: no longer is a single chassis enough for all DCI needs.
You need much more than that.
Okay, what about performance?
Yeah.
2 Performance
Secondly, we have performance.
No gigas.
Performance, AI, great performance.
This means you need a lossless connection between one data center to the other.
So you need deep buffering capabilities in the DCI.
Something that many of the DCI solutions that we have right now don’t have.
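One reason deep buffers matter on DCI links is the bandwidth-delay product of long-haul connections. Here is a rough sketch; the link speed and round-trip time are hypothetical values chosen only to show the order of magnitude:

```python
# Why DCI needs deep buffers: a bandwidth-delay-product sketch.
# Link speed and RTT are assumed values, for illustration only.

link_gbps = 800        # DCI link speed (assumed)
rtt_ms = 10            # round-trip time between data centers (assumed)

bdp_bits = link_gbps * 1e9 * (rtt_ms / 1e3)
bdp_gbytes = bdp_bits / 8 / 1e9
print(f"Bandwidth-delay product: {bdp_gbytes:.1f} GB of in-flight data per link")
# -> 1.0 GB per 800G link at 10 ms RTT; absorbing bursts at this scale is
#    what pushes DCI boxes toward deep-buffer designs.
```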
Okay, we talked about it a bit when we talk about AI workloads and the job completion time performance.
Now it is expanding to the DCI and we feel the heat here as well.
One environment and layer 3 capabilities
Third thing was one environment.
One like layer 3.
You need a router.
Basically.
Yeah.
You need the router capabilities, like, for example, 1,000 eBGP sessions.
You need to handle those thousands of eBGP connections and you need a real router to handle this.
This means that you need a solution that can handle all those three points: capacity,
lossless connection, and Layer 3 capability in one environment.
Let’s think about solutions.
No such a solution.
I don’t know.
DDC!
No.
Really?
Okay.
So we’ve been talking about it for 4 seasons.
Yes.
But now DDC is a good fit for this.
Yeah.
Imagine that.
Yeah.
Okay, so capacity.
Yeah, it can do it.
It’s in scale, basically.
Yeah.
Secondly, we have the performance.
The scheduled fabric offering.
And what better router do we have than DDC?
The best.
Okay.
Okay.
Wow.
This was an amazing revelation.
So, three things you need to remember about the new DCI.
The DCI in the AI era, the DCI that connects data centers that are moving towards AI.
One is an explosion in capacity needs.
No more one large, very large chassis; it is not enough anymore.
I think there are some operators that do eight chassis,
and still it is not enough.
So you need limitless capacity.
You need a flawless, lossless performance.
AI performance.
We talked about it a lot.
And you need Layer 3, because you need eBGP, et cetera, et cetera.
And all of these exist in the DDC solution available from DriveNets.
So.
Okay, this looks nice.
Okay, thank you very much, Shai, for joining again.
Thank you for watching.
See you next time on CloudNets.
Bye.