Season 4 Ep 5: Tail Latency
Tail Latency in AI Networks
What is the importance of latency in AI networks? AI networks introduce new challenges that require latency to be treated differently.
Full Transcript
Hi and welcome back to CloudNets,
where networks meet cloud.
And today we’re going to talk about AI.
And not just AI, we’re going to
talk about the importance of latency in AI.
And we have our late expert latency.
We couldn’t resist it.
Our latency expert, Sani.
Thank you for joining, Sani.
Thank you for having me, Dudy.
And sorry for my latency.
Okay, apology accepted.
We’re going to talk about latency in AI networks.
What is it, why is it important
and how can we maintain it?
What are the three things we need to know about latency?
Right. First, we need to understand that in traditional networks, latency is
measured and addressed in the usual way.
However, AI networks introduce new
challenges that need different treatments of latency.
Now let’s look at the three types of latency that we see in networks.
So the first one is the head latency.
Head latency describes the first packets to arrive, the ones with the lowest
delay in the network.
Okay, this is easy.
This is easy.
It can be measured when there is no load on the network, which is not typical for AI workloads.
The second one is the average latency.
The average latency is the mean of the latency, calculated over time.
And it includes all the time…
…that it takes for all the packets to travel the network.
Exactly.
So it holds some information about the network performance.
However, it doesn’t tell you the full story.
It’s like the typical packet.
And that gets us to the third one, which is the most important one in AI workloads.
This is the tail latency.
The tail latency is the latency of the slowest packet to arrive at the destination, out of all the packets.
The last packet to arrive defines the tail latency.
Exactly.
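To make the three definitions concrete, here is a minimal sketch, not from the episode, of how head, average, and tail latency could be computed from a set of measured per-packet latencies. The numbers are made up purely for illustration.

```python
# Illustrative per-packet latencies in microseconds (made-up values).
latencies_us = [12.0, 13.5, 12.8, 14.1, 55.0, 13.2, 12.9, 13.7]

head_latency = min(latencies_us)                           # fastest packet: head latency
average_latency = sum(latencies_us) / len(latencies_us)    # mean over all packets
tail_latency = max(latencies_us)                           # slowest packet defines the tail

print(f"head: {head_latency:.1f} us, "
      f"average: {average_latency:.1f} us, "
      f"tail: {tail_latency:.1f} us")
```

Note how a single slow packet barely moves the average but completely determines the tail.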
Okay, so the first point is that there are different types of latency.
Now let’s talk about which latency is the most important.
We mentioned tail latency, why is it important?
Exactly.
So in AI workloads, there are a lot of compute resources doing parallel computation, and they exchange a lot of data.
This is a data heavy network.
So all this computed data is sent over the network and arrives at its destination.
So this is the parallelism of the compute process.
Okay.
All the collective communication.
Okay, definitely.
So when this process ends, only after all the data arrives at the destination can the next task start.
So basically the compute waits for everything to arrive.
Some GPUs may be idle during this time; only when the last packets arrive does the work continue.
Okay, so tail latency is very important.
Because it defines how fast the workload can progress.
Exactly.
If it’s not optimized and it’s high, then the compute is actually waiting for the network.
Okay, and we don’t want that.
We don’t want that.
Okay.
So tail latency is the most important parameter.
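A small sketch can show why the tail dominates. Assuming a synchronized step across a number of GPUs, where the next task starts only after the last flow arrives, the step time is set by the slowest flow rather than the average one. All numbers below are hypothetical.

```python
import random

random.seed(0)

NUM_GPUS = 64
compute_ms = 5.0  # assumed per-step compute time

# Most flows arrive quickly; one straggler flow forms the tail (hypothetical values).
flow_latencies_ms = [random.uniform(1.0, 1.2) for _ in range(NUM_GPUS - 1)]
flow_latencies_ms.append(4.0)

avg_latency = sum(flow_latencies_ms) / len(flow_latencies_ms)
tail_latency = max(flow_latencies_ms)

# The step cannot finish before the last packet arrives, so everyone waits for the tail.
step_time_ms = compute_ms + tail_latency
idle_ms = tail_latency - avg_latency  # roughly how long a typical GPU sits idle

print(f"average latency: {avg_latency:.2f} ms, tail latency: {tail_latency:.2f} ms")
print(f"step time: {step_time_ms:.2f} ms, extra GPU idle time per step: {idle_ms:.2f} ms")
```

Even though the average latency here is close to 1 ms, every iteration pays the full 4 ms tail.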
Okay, so now let’s talk about how we can reduce the tail latency.
We can optimize it.
Right.
So we at DriveNets are addressing this point, and we actually have an innovative solution that we call a scheduled Ethernet fabric.
And we take multiple steps in order to optimize the latency.
It’s actually a strategy for how to handle latency in the AI network.
So this is the same solution we talked about earlier when we talked about the ingress packet being cut into cells and sprayed across the fabric.
It actually means that all the packets arrive at around the same time, right?
So, more or less, all the packets arrive within a predictable time with low variation.
So as we can see in the graph, when comparing standard Ethernet to scheduled Ethernet, we can see the differences in the head latency and in the tail latency.
In scheduled Ethernet, you can see the improvement where the latency is predictable and the variance is very low, unlike normal Ethernet.
So this will have a dramatic effect on the job completion time and the performance of the AI network.
So basically, even though you might intuitively think that deep buffers add latency, et cetera.
What is important is the tail latency.
And because of the low variation, the tail latency is basically fixed or very, very low.
Exactly.
So it’s very important for AI networks not to stay within the frame of traditional networks and just measure the per-element latency.
Here it’s much more important. A strategy that trades off a little bit of head latency but gives you a huge benefit on the tail latency ensures that the GPUs are constantly working, with no idle time, and that the job completion time, which is the most critical parameter in AI workloads, improves dramatically.
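The trade-off described here can be illustrated with two made-up latency distributions: one with a low head latency but occasional large outliers, and one with a slightly higher but tightly bounded latency. This is only a sketch under those assumptions, not a measurement of any real fabric.

```python
import random

random.seed(1)
NUM_FLOWS = 1000

# "Standard Ethernet"-like distribution: low head latency, rare large outliers (assumed).
standard = [random.uniform(1.0, 1.5) if random.random() > 0.02 else random.uniform(8.0, 12.0)
            for _ in range(NUM_FLOWS)]

# "Scheduled fabric"-like distribution: slightly higher head latency, very low variation (assumed).
scheduled = [random.uniform(1.8, 2.1) for _ in range(NUM_FLOWS)]

for name, lat in (("standard", standard), ("scheduled", scheduled)):
    print(f"{name}: head={min(lat):.2f} ms, "
          f"mean={sum(lat)/len(lat):.2f} ms, "
          f"tail={max(lat):.2f} ms")
```

Since a synchronized workload can only proceed at the tail, the tightly bounded distribution finishes each step sooner despite its higher head and mean latency.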
Okay, great.
So three things we need to remember.
This is very important when you build an AI infrastructure.
Three things you need to remember about latency.
First, there are multiple types of latency.
So be aware of what you’re looking at.
Is it the head, the average, or the tail latency?
The second thing you need to remember is that the tail latency is the most important one, because this is the one that defines how much time the GPUs spend waiting for network resources.
And it dramatically affects the job completion time and the overall performance and utilization of the cluster.
The third thing is the good news: there is a solution.
The solution is a scheduled fabric, a fabric that connects the GPUs, that is scheduled and can ensure that all of the packets, or most of the packets, arrive at around the same time.
So the jitter, the latency variation, is very, very low.
That means that even if the head or mean latency is a bit higher, the tail latency is fixed and much lower than in any other solution.
This leads to better job completion time, better utilization of your GPUs, and of your money in general.
Thank you very much, Sani.
Thank you very much, Dudy.
Thank you very much for watching.
Stay tuned for the next episode
of CloudNets.
I’m… I’m late.
I have to go.