
Cloud Nets Videos
November 14, 2023

Season 3 Ep 12: Avoid congestion in AI workloads

How do you best deal with congestion in AI workloads?

Congestion demands attention; otherwise it can result in higher latency and packet loss. The main dilemma around congestion is whether to avoid it or to mitigate it. We'll look at how scheduled fabrics make your AI infrastructure lossless and predictable, without bringing in additional technologies to mitigate congestion.

Listen on your favorite podcast platform

Listen on Apple Podcasts
Listen on Spotify

Full Transcript

Hi, and welcome back to CloudNets-AI, our special miniseries spinoff of CloudNets, in which we talk about AI, but in greater detail. Yeah. And, hopefully, teach you some things.

So today we're going to talk about congestion and the two methods of dealing with it: one is mitigating it and one is avoiding it. As Einstein supposedly said, a wise man avoids the problems that a smart man knows how to solve. But you get the idea: if you have an issue, you can avoid it altogether, or you can wait for it to happen and then deal with it. And what we are trying to do is be the first kind.

So we eliminate congestion altogether with the AI networking fabric, instead of waiting for congestion to happen, which it eventually will when it comes to Ethernet, and then inventing some mechanism to mitigate it. Some of those mechanisms are very good, but still, avoiding the problem in the first place is better. And this is where Run comes in and explains.

Let's put it this way: when it comes to a network, the idea of best effort always persists. That's the basic principle. You send traffic and you assume that everything is going to be all right. And then when something fails, and something always fails, definitely in large infrastructures, you start to react to it. Now everything revolves around how fast your reaction is and how good you are at collecting the inputs, all the indicators of what's going on in the network, from the very basic retransmission mechanisms up to very sophisticated telemetry systems: alternative routes, resending of packets, collecting telemetry, tracking buffers or buffer indicators of how much of a buffer is consumed at a given time. All of these are methods to understand that something is currently happening in the network and then react to it: let's try to solve it, let's try to fix the problem, after something has already cracked. Okay?
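To make the reactive model concrete, here is a minimal Python sketch of acting on a buffer indicator only after congestion is already visible. The thresholds and behavior are hypothetical, for illustration only, not any particular switch's implementation.

```python
# Reactive congestion handling: watch a buffer indicator and act only once
# congestion is already visible. Thresholds are hypothetical examples.

ECN_MARK_THRESHOLD = 0.70   # assumed: signal senders when buffer is 70% full
DROP_THRESHOLD = 0.95       # assumed: tail-drop when buffer is 95% full

def react_to_buffer(occupancy: float) -> str:
    """Decide what to do based on how full a buffer already is."""
    if occupancy >= DROP_THRESHOLD:
        return "drop"       # congestion happened; lost packets must be resent
    if occupancy >= ECN_MARK_THRESHOLD:
        return "mark-ecn"   # tell senders to slow down, after the fact
    return "forward"        # best effort: assume everything is fine

# The reaction always trails the event: by the time the indicator shows a
# nearly full buffer, traffic already in flight can still be dropped.
for occupancy in (0.30, 0.75, 0.97):
    print(f"buffer {occupancy:.0%} -> {react_to_buffer(occupancy)}")
```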

And on the other hand, when it comes to a cell-based fabric, or a scheduled fabric, the logic is: let's not send traffic before we have a guarantee that the network is able to absorb it throughout the entire path across that fabric. Which is basically the concept behind a chassis, right? This is how a chassis is built. So let's zoom out for a moment.

A chassis is a device that was built to fit into a network. The network has all sorts of best-effort behavior, but every network component is expected to work as a guaranteed device. Exactly. So when you build a chassis, although there are multiple components inside, you inherently build all these mechanisms into it, because, again, it needs to behave like a single network component. You don't have congestion on the chassis backplane, right? That's exactly what those internal mechanisms solve for you. Because, again, a chassis is one network element in a larger network.

Okay, so let's talk about this mechanism, which was implemented in a chassis and is now implemented across the entire network with the DDC (Distributed Disaggregated Chassis). The logic is that there is a scheduling mechanism called a virtual output queue. There is an indication from the output device, the receiving side of the traffic, telling the transmitting side that the network as a whole is capable of accepting this amount of traffic, and only then is traffic transmitted into the network.

There is a situation called head-of-line blocking. When you have multiple devices sending into one device, congestion is not caused by one entity but by multiple entities, all sending, even not to one specific destination. There is a junction somewhere in the network which absorbs a lot of traffic, and then that traffic spreads, but that junction becomes congested. This is the noisy neighbor problem. Yeah, and when you propagate that congestion back to the sending side, when you just blindly blast all of the senders with the message that there is congestion, you might impact a good neighbor, because the noise is coming from another neighbor, right? You want to impact that noisy neighbor only. So that's the scenario known as a noisy neighbor.
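As a rough illustration of the scheduling logic just described, here is a minimal Python sketch of a VOQ-style grant mechanism, where ingress traffic waits until the egress side confirms it can absorb it. The classes and the credit scheme are simplified assumptions, not any vendor's implementation.

```python
from collections import deque

class EgressPort:
    """Receiving side: grants credit only while it can absorb more traffic."""
    def __init__(self, credits: int):
        self.credits = credits

    def grant(self) -> bool:
        if self.credits > 0:
            self.credits -= 1
            return True
        return False

class IngressVOQ:
    """Ingress side: one virtual output queue per destination port."""
    def __init__(self, egress: EgressPort):
        self.egress = egress
        self.queue = deque()   # packets buffered at ingress, not in the fabric

    def enqueue(self, packet: str):
        self.queue.append(packet)

    def try_send(self):
        # Transmit only after the receiving side confirms it can accept the
        # packet; otherwise the packet waits at ingress instead of congesting
        # the fabric or blocking traffic headed to other destinations.
        while self.queue and self.egress.grant():
            print("sent:", self.queue.popleft())

port = EgressPort(credits=2)
voq = IngressVOQ(port)
for p in ("pkt-1", "pkt-2", "pkt-3"):
    voq.enqueue(p)
voq.try_send()   # pkt-1 and pkt-2 go out; pkt-3 waits for more credit
```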

When you have an inherent VOQ (virtual output queue) mechanism built into your fabric or network, you avoid both of these problems, head-of-line blocking and the noisy neighbor. And you also utilize the fabric better, because you do not waste fabric resources on a packet that will not be able to complete its journey. Right? Right.

One of the criteria for a good, solid network is its cross-bisectional bandwidth: how much actual bandwidth is running through the aggregation layer of the network. One goal is to bring that cross-bisectional bandwidth higher. But when you send traffic, you count it as part of the cross-bisectional bandwidth; if it gets dropped at the far end and needs to be retransmitted, or even causes backpressure traffic going back to the source, you consume cross-bisectional bandwidth on useless traffic. You count it, but it's a false utilization of the cross-bisectional bandwidth. So you only want to send traffic that will get through. It's throughput versus goodput, exactly. You want to keep that cross-bisectional bandwidth measurement a real measurement, and not just a falsely good indicator of something which is essentially not working well. So this sounds optimal: we avoid a problem which we would have a very hard time resolving, and we get great performance.
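Here is a toy calculation, with made-up numbers, of the throughput-versus-goodput point: retransmitted traffic still counts against the cross-bisectional bandwidth, but only usefully delivered bytes are goodput.

```python
# Throughput counts everything pushed into the fabric, including traffic
# that is dropped and resent; goodput counts only useful delivered bytes.
# All figures below are illustrative assumptions.

link_capacity = 1_000_000_000        # bytes the bisection carries per interval
bytes_offered = 1_000_000_000        # everything sent into the fabric
bytes_retransmitted = 150_000_000    # duplicates resent after drops
bytes_useful = bytes_offered - bytes_retransmitted

throughput = bytes_offered / link_capacity   # looks fully utilized: 100%
goodput = bytes_useful / link_capacity       # what actually got through: 85%

print(f"throughput utilization: {throughput:.0%}")
print(f"goodput utilization:    {goodput:.0%}")
```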

Is there any trade-off? For instance, what about latency? Because if you measure a specific ASIC's latency, you might see lower results on an ASIC that was aimed at a network rather than at a chassis. But is that the right way to look at it? To a certain extent it's somewhat true: you could measure the lowest possible latency, and a fully scheduled mechanism might push that minimum latency a little higher. But when you're looking at the application, and explicitly at AI networking, that's not the way to measure latency. You want to measure the tail latency: the last traffic that traverses the network and arrives at the receiving end, where the calculation is being done, is always the one that dictates when the calculation will begin. Right? You need all of your inputs to arrive. For that matter, this is what affects JCT, job completion time, in AI workloads. So what you want to measure when you're measuring latency is in fact the jitter: what's the difference between the lowest latency and the highest latency? That's the number that matters most. And when you have a fully scheduled mechanism like a VOQ, that latency variation, the jitter, is minimized. Right? On a plain Ethernet network you can measure, in certain scenarios, a very low latency or a very high latency. But even if one packet arrives earlier, your workload still waits for all the packets to arrive, so it's meaningless. It's like a basketball team: the first player being in means nothing until the whole team is there. Okay, thank you very much, Run.
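A small sketch, with hypothetical latency values, of why tail latency rather than best-case latency gates an AI workload: the computation can start only once the last input arrives, so jitter is the number to watch.

```python
# One collective step across several flows: the step is gated by the slowest
# arrival, not the fastest. Latency values (microseconds) are made up.

per_flow_latency_us = [2.1, 2.3, 2.2, 2.4, 9.8, 2.2, 2.3]  # one straggler

best_case = min(per_flow_latency_us)
tail = max(per_flow_latency_us)
jitter = tail - best_case   # the variation a scheduled fabric minimizes

print(f"best case: {best_case} us, tail: {tail} us, jitter: {jitter:.1f} us")
# Job completion time follows the tail: the step begins only at 9.8 us,
# no matter how early the other six flows arrived.
```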

This was our pitch about congestion avoidance versus mitigation. We talked about better utilization of the fabric. We talked about VOQs ensuring that packets can get through to the destination port. We talked about latency. And in general, we talked about how scheduled fabrics make sure that your AI infrastructure is lossless and predictable, versus additional technologies that handle congestion rather than avoid it. There is a lot of development around solving congestion once it happens; avoiding it simply works better. Yeah. Thank you very much, Run. Thank you for watching. We'll be back with more CloudNets-AI. See you then.