Resources

Fyuz 2023

FYUZ23: DriveNets Network Cloud-AI

At FYUZ23, DriveNets’ senior sales engineer, Juan Rodriguez, takes a look at a DDBR architecture repurposed for high-performance AI networking: the DriveNets Network Cloud-AI solution. Hyperscalers are starting to feel that connectivity solutions for GPUs are falling behind and not performing at the required scale, which hurts the key KPI in AI networking, job completion time (JCT). They need a solution that scales massively, with thousands of ports interconnecting GPUs in the biggest AI environments.

At DriveNets, we say that the Distributed Disaggregated Chassis (DDC), or DDBR, can scale practically without limits, and it is being put to the test in these types of environments at unprecedented numbers. For AI networking, for AI connectivity, we’re talking about 32K 800-gig ports. So it’s huge!

Full Transcript

And now let’s dive into the world of AI. Please welcome Juan Rodriguez Martinez.
Okay, thank you very much. Hello everybody. For those of you who may be a little bit disappointed not to see my colleague Run Almog here: he also had issues traveling, so I will be replacing him.

My name is Juan Rodriguez.
I’m working here in Europe on the sales engineering team at DriveNets. What I want to talk about a little bit today is AI networking.
Connectivity solutions for AI deployments, for massive AI deployments.

Before that, a little bit about DriveNets.
DriveNets is a very young company, but it has been, let’s say, accompanying operators on this disaggregation journey for several years already.
And I can tell you firsthand, because I was working at Telefonica not so long ago and participating in TIP on the Telefonica side.
What TIP has done is provide use cases and telco requirements, so that disaggregated solutions know what to expect and can adapt to the needs of telco operators.
This work is not so old, but I want to remind you, or let you know, that it relies on previous work done in the Open Compute Project. There is a definition for DDC, the Distributed Disaggregated Chassis, published and open to everybody by AT&T, which defines a very robust hardware architecture able to support the telco use cases.

So what TIP did is simply work on top of that and provide all the requirements. As we have been hearing from Jose Angel, all the different groups (aggregation, DDBR, cell site router) are just a translation of the operators’ requirements onto this architecture.
Ian was also talking about the concern of adopters of these technologies regarding maturity, right? Well, this architecture has been deployed and is working at AT&T. They speak very openly about it.
More than half of the core of AT&T’s network is running over this type of solution using DriveNets software.

The idea is to show you today how we can evolve this concept of DDC, of DDBR and also apply it to AI networking.
Unfortunately or luckily, we don’t have a very strict definition from anybody of what AI networking is. So what I’m going to talk about is the requirements that we are seeing from the field, not only from the big hyperscalers on the left, who of course are working on AI, but also from very new companies that are appearing, talking about the metaverse and generative AI.
The number of applications that we are seeing is huge, and we have not even scratched the surface, right?
We don’t know where this will lead us. Even the telco operators that we are working with, mostly in Asia, are starting to try to understand how they can position themselves, how they can use AI to improve their services or their costs, or to offer new experiences to their customers. And the amount of investment that we are seeing is actually huge. So all the requirements that I’m going to talk about come directly from the field.
The challenges that all these stakeholders are seeing have to do mainly with the GPU. This is the key component in AI deployments. They need to maximize the time that the GPUs are working, that they are coping with workloads. Okay. Any millisecond that a GPU is not doing its work is money being lost, wasted from those billions of dollars that appeared on the previous slide.

So what they are starting to feel is that the connectivity solutions for those GPUs are starting to fall behind.
They’re starting not to perform at the required scale. They’re starting to affect the key KPI in AI networking, which is the job completion time.
If you have GPUs idle, you’re losing money.
Of course, some of the requirements that we are seeing have to do with openness. They need solutions that are interoperable, so that they can really choose what is the best approach for them.
They need solutions that scale massively.
So we are talking about thousands of ports that are interconnecting GPUs in the biggest environments for AI.
So we normally say at DriveNets that DDC or DDBR can scale practically without limit.
This is being put to the test in these types of environments, at numbers that are unprecedented.
In the first solution that we are building for AI networking, for AI connectivity, we’re talking about 32K 800-gig ports.
So it’s huge.

From requirements come the challenges, right.
To maximize the time that the GPUs are doing their work, to minimize the job completion time, what they need is a connectivity solution that is lossless.
Of course, if you need to retransmit a packet, it’s a packet that didn’t reach its GPU, a packet that is not being worked on. And you need predictable behavior. Okay.
A really low delay is not so critical; what is more critical is low jitter, a delay that is predictable and always within certain ranges.
Apart from this, well, there is everything that we’ve heard before: standard interfaces, interoperability, maturity. But the key topic is scale with performance.
So I will talk later on about the potential solutions that we are seeing being positioned for these types of applications, and how we believe that DDC/DDBR can cope better than those in this type of scenario. So, going back a little bit, I was talking about the OCP definition of DDC.

Again, this is a very robust hardware architecture based on white boxes.
In this case, using Broadcom technology. We have a leaf-and-spine type of topology, with the spines acting as the fabric nodes and the leaves as the equivalent of the line cards, so that connectivity between each pair of white boxes is achieved in a single hop through the fabric.
That architecture already solves the issue of the jitter.
Every pair of ports is at the same distance; it’s just one hop away.
So jitter is very predictable and under control. But it’s not only that, it’s also a matter of scale. If you need to scale this type of solution, you just add more boxes. If you need more line cards, just put in more fabric. Keep on adding white boxes.
You can have as many ports as you need interconnected on an infrastructure that, by the way, can be deployed as easily, from the physical point of view, as any deployment of switches in a data center, because these are small boxes, one or two rack units, but they behave like a single entity for management. So simplicity is also there.
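
To make that scale-by-adding-boxes idea concrete, here is a minimal Python sketch (not DriveNets tooling; the per-box port and fabric-link counts are purely illustrative assumptions) of how external capacity grows in a single-hop leaf/spine DDC as white boxes are added:

```python
# Illustrative sketch only: per-box numbers below are assumptions, not DriveNets specs.

def ddc_capacity(num_leaves: int,
                 ports_per_leaf: int = 18,
                 fabric_links_per_leaf: int = 20) -> tuple[int, int]:
    """Rough model of a DDC/DDBR cluster.

    Each leaf (line-card white box) exposes `ports_per_leaf` external ports and
    connects to the spine (fabric) boxes, so any two external ports are always
    exactly one fabric hop apart, which is what keeps jitter predictable.
    """
    external_ports = num_leaves * ports_per_leaf
    fabric_links = num_leaves * fabric_links_per_leaf
    return external_ports, fabric_links

# Scaling is just "add more boxes": double the leaves, double the external ports.
for leaves in (4, 16, 64):
    ports, links = ddc_capacity(leaves)
    print(f"{leaves:>3} leaf boxes -> {ports:>5} external ports, {links:>5} fabric links")
```

The single-entity management the talk describes is a property of the software layer; this sketch only illustrates the port arithmetic.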

As I said, what TIP did is create use cases. Okay: core node, edge node, Internet gateway, and now aggregation very soon. But it’s just adding the requirements from the telco point of view, adding the need for openness and for scale, and then the types of protocols that need to be put into these boxes to build routers. So we are talking about segment routing, the different flavors of BGP, IS-IS; all of that is put into DDBR.
What we believe, and it is what we are working on, is that that very same infrastructure can also be used to interconnect GPUs in the data centers.
Okay? It complies with the scale. As I said, you just add more boxes.
It complies with the simplicity: it behaves like a single entity. And it complies with the avoidance of vendor lock-in: the specs are open.
The idea behind the Telecom Infra Project is precisely openness, and it’s well proven.
As I said, it’s not only AT&T; DDBR is starting to grow.
Last year we had here Turkcell announcing the field trial in their network.

Very recently, KDDI also published openly, talking about a first deployment in Japan.
It’s been running at AT&T for many years already. So it’s a very reliable and trusted solution. What are the alternatives? Right.
What is being positioned as of today? The first solution I want to talk about is InfiniBand. This is extremely common in HPC environments. It definitely has the performance and the scalability. But the problem is that, apart from not being open and being proprietary, it is not so flexible.
It copes well with certain types of traffic, but the adoption curve is kind of tricky for operators; it’s something they are not used to outside of HPC environments. It is also not open to the Internet, which is an obvious demand from AI environments, and it remains to be seen how it can cope.

The second type of solution is Ethernet. Right? It’s definitely scalable; you can throw switches into a data center. But what happens when you throw in too many? The performance is just not there. In terms of jitter and delay, you have packet loss; again, it is probably not the most suitable fit for this type of application.
And also we have the chassis. Okay, chassis are very common in the financial market. They definitely have the performance, and everybody is used to working with chassis routers, but the scalability is just not there. It is limited by a physical chassis that doesn’t allow it to scale to the numbers we were talking about.

So, to close all these gaps, again, the figure that we were talking about: we still have our leaf-and-spine type of topology, we still have the same solution, the same lossless solution, the same predictable solution, where we connect lots of GPUs. With the new Jericho 3 and Ramon 3 technology, these are the scales we are achieving: 32,000 ports of 800 gig, as easy to deploy as switches and as easy to operate as a single router, not a network of routers. Here, if you connect 32 ports providing service to 32 GPUs, the infrastructure is still a single entity.
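
As a quick back-of-the-envelope check on that scale (raw port arithmetic only, ignoring encoding overhead, oversubscription, and fabric-side links):

```python
# Raw port arithmetic for the quoted scale: 32,000 ports at 800 Gbps each.
ports = 32_000
port_speed_gbps = 800

total_gbps = ports * port_speed_gbps      # 25,600,000 Gbps
total_pbps = total_gbps / 1_000_000       # 25.6 Pbps of aggregate interface capacity
print(f"{ports} x {port_speed_gbps}G = {total_pbps} Pbps aggregate")
```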

Final slide: a few numbers here, not from us. In terms of savings, this comes from a hyperscaler that is actually testing the solution. What they have found is that they can reduce job completion time with these types of technologies by a minimum of 10%. This is precisely the same share that is typically assigned to connectivity, in terms of cost, in these types of environments.
So if you have a solution that reduces job completion time by 10%, that means that to do the same job you can buy 10% fewer GPUs. That means the connectivity solution pays for itself; it’s essentially free. Okay, this is what they’re seeing.
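
That break-even argument can be written down as simple arithmetic. In the sketch below, only the two 10% figures come from the talk; the cluster size and GPU price are made-up placeholder numbers:

```python
# Hypothetical illustration of the "connectivity pays for itself" argument.
# Only the two 10% figures are from the talk; everything else is assumed.

gpus_baseline = 1_000        # GPUs needed with a baseline fabric (assumed)
gpu_cost = 30_000            # cost per GPU in dollars (assumed)
jct_improvement = 0.10       # ~10% lower job completion time (from the talk)
connectivity_share = 0.10    # connectivity ~10% of cluster cost (from the talk)

# Finishing the same job ~10% faster means roughly 10% fewer GPUs for the same throughput.
gpu_savings = gpus_baseline * jct_improvement * gpu_cost
connectivity_cost = gpus_baseline * gpu_cost * connectivity_share

print(f"GPU savings:        ${gpu_savings:,.0f}")
print(f"Connectivity spend: ${connectivity_cost:,.0f}")
# The two figures match, which is the sense in which the fabric is "for free".
```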
The good thing is that the number of changes we have to introduce to the architecture, compared to the OCP project, is zero.
Right? Maybe we need to increase the number of functionalities that are used to interconnect the GPUs.
Maybe we have to close some gaps in terms of monitoring the infrastructure. But from the physical point of view, there are zero changes compared to the original specification in OCP. If anything, we can just remove some parts of the white boxes that are not really needed, for example the TCAM, because these are not required in this type of environment. Okay, finally, again: if you need to build a solution that is as low-risk as possible, this one has been in the network for several years already, and we can discuss it anytime.

You’ll find me around or at the booth; you’re welcome to discuss it with me. I assume we have time for a couple of questions if somebody wants to ask.

Okay. Either it went really badly or really well. But in any case, thank you very much. Much appreciated.
Thank you.