May 17, 2023

VP of Product Marketing

The Compute-Networking Gap: The Next Network Bottleneck – Part 2

In our last blog, we discussed how external and internal I/O mechanisms are becoming interchangeable as they converge on (roughly) the same bandwidth. Take, for instance, PCIe 5.0: it supports 128 GB/s (1,024 Gbps), while Ethernet supports 800 Gbps, which puts the two in the same ballpark.
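The comparison above is simple unit arithmetic, and a quick sketch makes it concrete. The figures here are the ones quoted in the text (128 GB/s for PCIe 5.0, 800 Gbps for Ethernet); the function name is ours, for illustration only:

```python
# Back-of-the-envelope comparison of intra-server vs. inter-server bandwidth.
# Numbers come from the text: PCIe 5.0 at 128 GB/s vs. 800 Gbps Ethernet.

def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert GB/s to Gbps (1 byte = 8 bits)."""
    return gbytes_per_s * 8

pcie5_gbps = gbytes_to_gbits(128)   # 128 GB/s -> 1024 Gbps
ethernet_gbps = 800

print(f"PCIe 5.0:     {pcie5_gbps:.0f} Gbps")
print(f"800G Ethernet: {ethernet_gbps} Gbps")
print(f"Ratio:        {pcie5_gbps / ethernet_gbps:.2f}x")
```

The ratio of roughly 1.3x is what "the same ballpark" means in practice: the external link is no longer an order of magnitude slower than the internal bus.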


This interchangeability has enabled the growth of parallel computing, in which a large number of compute elements (mainly GPUs) work in parallel on the same computational job. This cluster of servers is, essentially, one very large computer (a supercomputer). The intra-server I/O protocols (such as PCIe and NVLink) and the inter-server I/O protocols (such as Ethernet and InfiniBand) play similar, equally important roles in this architecture.

It turns out, however, that a new gap is developing in this architecture, as described by Meta's VP of Engineering, Infrastructure, Alexis Bjorlin, in her keynote session at the 2022 OCP Global Summit.

This growing gap is between compute capability (in FLOPS) and the bandwidth of memory access and interconnect, as shown in the timeline graph below.

[Figure: compute (FLOPS) growth outpacing memory and interconnect bandwidth over time]
Source: 2022 OCP keynote – Alexis Bjorlin, VP, Infrastructure, Meta.

This gap makes the network, once again, a bottleneck.

This bottleneck becomes acute in systems where the compute process depends heavily on inter-server, or inter-GPU, connectivity. This is the case in AI clusters, especially large-scale ones, where the networking performance lag causes GPU idle cycles.

In such systems, the phenomenal growth in hardware FLOPS is degraded by networking. If you have an extremely powerful GPU, it is a shame to see it sit idle for up to 50% of the time while it waits for data from another GPU in the same cluster, a delay caused by latency, jitter, or packet drops in the interconnecting network.
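The effect of idle cycles on delivered compute is straightforward to quantify. A minimal sketch, using a hypothetical peak throughput (the 50% idle figure comes from the text; the 1,000 TFLOPS peak is an assumption for illustration, not a measurement):

```python
# Illustrative only: how network-induced idle time erodes delivered compute.
# peak_tflops is a hypothetical number chosen for readability.

def effective_tflops(peak_tflops: float, idle_fraction: float) -> float:
    """Delivered throughput when the GPU stalls for a fraction of the time."""
    return peak_tflops * (1.0 - idle_fraction)

peak = 1000.0  # hypothetical peak, in TFLOPS
for idle in (0.0, 0.25, 0.50):
    print(f"idle {idle:.0%}: {effective_tflops(peak, idle):.0f} TFLOPS delivered")
```

At 50% idle, half of the (expensive) silicon's capability is simply thrown away, which is why the networking lag matters so much at cluster scale.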

As mentioned, this is extremely important in AI cluster networking. Fortunately, there are several solutions that can ease this pain, some more suitable than others.

We’ll discuss them next week.

Stay tuned!
