Resources

Cloud Nets Videos | November 7, 2023

Season 3 Ep 11: Cell-based AI Fabric

Why is cell-based AI fabric so much better than the alternative?

Ethernet is a lossy technology. What one needs is a lossless and predictable fabric. It will still be Ethernet toward the outside, so that, standards-wise, it can connect to any other peripheral equipment. But on the inside it has to be something predictable, and the implementation of a cell-based fabric fulfills that function; the cell-based fabric changes the way one ‘sprays’ the incoming packets between the different spines.

Listen on your favorite podcast platform

Listen on Apple Podcasts
Listen on Spotify

Full Transcript

Hi and welcome to CloudNets-AI.

This is a special miniseries spin-off of CloudNets in which we’re going to go deeper into AI and AI infrastructure, and we have our very own chatbot, Run.

Hello, our AI specialist.

Today we’re going to talk about cell-based AI fabric.

We’ve been talking about that for cloud AI, and about how a cell-based network fabric performs better than the alternatives.

But today we want to touch on the why and the how. How come?

So yeah, how? How is it different from any other Ethernet-based alternative?

OK.

Well, let me start with the fact that Ethernet is a lossy technology.

When connecting multiple Ethernet devices one to the other, you create a network.

That network is essentially lossy.

It’s a best effort idea.

You send the traffic and you hope for the best, and you have all sorts of mechanisms to kind of back up in case that traffic got lost or whatnot.

What you need for AI workloads is a lossless and predictable fabric.

OK, so that will still be Ethernet towards... Exactly. It will still be Ethernet towards the outside world, because it needs to connect, standards-wise, to any other peripheral equipment.

But on the inside it has to be something predictable, and the implementation of a cell-based fabric is one that fulfills that function. And the cell-based fabric also changes the way we spray the incoming packets between the different spines.

We do not do hashing and bouncing; we actually do spraying, which is equal and even. In Ethernet you typically get a load of traffic, multiple packets or multiple different flows, and you apply a hash function on top of that to select the aggregation device through which the traffic will traverse to the receiving end. That’s known as hashing.

When you have hashing and different types of flows, sometimes you have very large flows, known as elephant flows, alongside small ones, the mice.

So you can end up with a lack of balance between the different aggregation devices inside your network.

So the result would be that one aggregation device can be exhausted while another is left idle, and the entire network will be perceived as congested.

Even though the resources can be utilized as little as 10%, you will still see losses.

And this is not something you can afford when you’re talking about AI workloads.

You invest a lot into your network and then you only utilize a small portion of it.
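To make the imbalance concrete, here is a minimal, hypothetical Python sketch (illustrative only, not vendor code) of ECMP-style per-flow hashing: each flow is pinned to one of four spines by a hash of its 5-tuple, so a single elephant flow can exhaust one spine while the others stay nearly idle.

```python
# Hypothetical illustration of per-flow hashing: each flow is pinned to one
# spine by a hash of its 5-tuple, so one elephant flow can exhaust a single
# spine while the others sit nearly idle.
import hashlib

SPINES = 4

def pick_spine(flow_5tuple: tuple) -> int:
    """Select a spine by hashing the flow identifier (per-flow load balancing)."""
    digest = hashlib.sha256(repr(flow_5tuple).encode()).digest()
    return digest[0] % SPINES

# One elephant flow (e.g. a large collective transfer) plus a few mice flows.
flows = {
    ("10.0.0.1", "10.0.1.1", 6, 40001, 40002): 9_000_000,  # elephant: ~9 MB
    ("10.0.0.2", "10.0.1.2", 6, 40003, 40004): 100_000,    # mouse
    ("10.0.0.3", "10.0.1.3", 6, 40005, 40006): 100_000,    # mouse
    ("10.0.0.4", "10.0.1.4", 6, 40007, 40008): 100_000,    # mouse
}

load = [0] * SPINES
for flow, bytes_sent in flows.items():
    load[pick_spine(flow)] += bytes_sent  # the whole flow lands on a single spine

for spine, bytes_carried in enumerate(load):
    print(f"spine {spine}: {bytes_carried:>9,} bytes")
# Typical outcome: whichever spine the elephant hashes to carries ~9 MB while
# the others carry at most ~0.1 MB, so that one spine congests even though
# aggregate utilization stays low.
```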

What the cell-based approach enables you to do is to spread the traffic across all of the multiple spines, or aggregation devices, that you have.

So you can keep the utilization of all these devices at exactly the same level.

When traffic increases, utilization across the entire network increases together.

That’s what’s known as cross-bisectional bandwidth, right?

So everything kind of goes up as one plateau, or goes down as one plateau.

So it’s nearly perfect load balancing.

Close to perfect.

Yeah, a close-to-perfect balancing method.

That’s the idea.
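By contrast, a cell-based fabric segments every packet into fixed-size cells and sprays them evenly across all spines (with the cells put back in order at the egress), so every spine carries an almost identical share regardless of flow sizes. Below is a simplified, hypothetical sketch of that spraying behavior, using the same illustrative traffic mix as above; the cell size and round-robin model are assumptions for illustration, and real hardware also handles sequencing and reassembly.

```python
# Hypothetical sketch of cell spraying: each packet is chopped into fixed-size
# cells and the cells are distributed round-robin across all spines, so every
# spine carries an almost identical share regardless of flow sizes.
from itertools import cycle

SPINES = 4
CELL_SIZE = 256  # bytes per cell (illustrative value)

packets = [9_000_000, 100_000, 100_000, 100_000]  # same elephant-plus-mice mix

load = [0] * SPINES
spine_picker = cycle(range(SPINES))  # simple round-robin spray across spines

for packet_bytes in packets:
    remaining = packet_bytes
    while remaining > 0:
        cell = min(CELL_SIZE, remaining)
        load[next(spine_picker)] += cell  # each cell may take a different spine
        remaining -= cell

for spine, bytes_carried in enumerate(load):
    print(f"spine {spine}: {bytes_carried:>9,} bytes")
# All four spines end up within a cell or two of each other, so utilization
# rises and falls as a single plateau across the whole fabric.
```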

OK that’s cool.

And this basically allows you to be predictable regardless of the overlay application, correct?

Right.

That’s like the third point; potentially, I would say, that’s the outcome of all this.

When you’re running an AI workload, you don’t know what application is going to run on top of it.

And even more so, even if you know what kind of applications you’re running today, you will not know what applications you’re going to be running in the near future, because there is a multitude of researchers working on this.

You build a billion dollars’ worth of infrastructure, and then you have lots of researchers working on top of it.

Sometimes you even outsource that infrastructure as a cloud resource, so you don’t even know who that researcher is.

So the next-generation application that’s going to run on top of it is, by definition, an unknown.

And you don’t want to fine-tune your network for each new application.

It goes beyond that.

You don’t want to, and you cannot.

It’s practically impossible to have a tuning team that works directly with so many different researchers.

Everybody’s doing a different thing, and you don’t know what’s going to happen tomorrow.

So it’s practically impossible to kind of fine-tune.

You need to have something that is agnostic to the type of flow, without any surprises in the network.

OK.

So I think it makes sense, and I think we came up with three main points, as we typically do, on why a cell-based fabric is so much better than the alternative when it comes to Ethernet for AI fabric, or AI infrastructure.

The first is that it’s lossless, as opposed to the lossy nature of an Ethernet network (as opposed to a single node).

The second is that we have near-perfect balancing, so we can practically fully utilize the infrastructure and spread everything evenly across the different spines. And the third is that we are agnostic to the overlaying application or workload, and in that manner we are future-proof, because the next workload will, by definition, behave differently than the current ones.

So the network will not be the limit on the research or the innovation.

Exactly.

OK.

Thank you very much, Run.

Thank you for watching.

See you next time on CloudNets-AI.