This interchangeability enabled the growth of parallel computing, in which a large number of compute elements (mainly GPUs) work in parallel on the same computational job. Such a cluster of servers is, in effect, one very large computer (a supercomputer). Intra-server I/O protocols (such as PCIe and NVLink) and inter-server I/O protocols (such as Ethernet and InfiniBand) play an equally important role here, tying the compute elements together.
It turns out, however, that a new gap develops in this architecture, as described by Meta’s VP of Engineering, Infrastructure, Alexis Bjorlin, in her keynote at the 2022 OCP Global Summit.
This growing gap is between compute capacity (in FLOPS) and the bandwidth of memory access and interconnect, as shown in the timeline graph below.
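As a rough illustration of this trend, the sketch below compares peak compute with memory bandwidth across three GPU generations. The figures are approximate public peak specs (FP16 tensor FLOPS, HBM bandwidth) and vary by SKU and precision; what matters here is the ratio, not the exact values.

```python
# Illustrative sketch: peak compute vs. memory bandwidth across three
# NVIDIA GPU generations. Numbers are approximate public peak specs
# and differ by SKU; the point is the trend, not the exact values.
gpus = {
    # name: (peak FP16 tensor TFLOPS, memory bandwidth in GB/s)
    "V100 (2017)": (125, 900),
    "A100 (2020)": (312, 2039),
    "H100 (2022)": (989, 3350),
}

for name, (tflops, gbps) in gpus.items():
    # FLOPs available per byte of bandwidth: the higher this ratio,
    # the more the chip depends on data arriving fast enough.
    flops_per_byte = (tflops * 1e12) / (gbps * 1e9)
    print(f"{name}: ~{flops_per_byte:.0f} FLOPs per byte of bandwidth")
```

Even though bandwidth keeps growing, the FLOPs available per byte moved keep climbing, which is exactly the gap the graph depicts.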

This gap makes the network, once again, the bottleneck.
This bottleneck becomes acute in systems where the compute process depends heavily on inter-server, or inter-GPU, connectivity. This is the case in AI clusters, especially large-scale ones, where the networking performance lag causes GPU idle cycles.
In this case, the phenomenal growth in hardware FLOPS is degraded by networking. If you have an extremely powerful GPU, it is a shame to see it sit idle for up to 50% of the time while it waits for data from another GPU in the same cluster, a delay caused by latency, jitter, or packet drops in the interconnecting network.
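To see how quickly network stalls eat into utilization, here is a minimal back-of-the-envelope model. It assumes a simple synchronous training step (compute followed by a blocking inter-GPU exchange, with no compute/communication overlap), and all the numbers in it are hypothetical.

```python
# Minimal sketch of how network stalls translate into GPU idle time,
# assuming a synchronous step: compute, then a blocking inter-GPU
# exchange, with no overlap. All numbers below are hypothetical.
def gpu_utilization(compute_ms: float, comm_ms: float) -> float:
    """Fraction of wall-clock time the GPU spends computing."""
    return compute_ms / (compute_ms + comm_ms)

step_compute_ms = 10.0   # time to compute one training step
ideal_comm_ms = 4.0      # exchange time on a healthy fabric

# Latency, jitter, or a packet drop (forcing retransmission) can
# stretch the exchange well past its ideal duration.
for comm_ms in (ideal_comm_ms, 6.0, 10.0):
    util = gpu_utilization(step_compute_ms, comm_ms)
    print(f"comm={comm_ms:4.1f} ms -> GPU busy {util:.0%}, idle {1 - util:.0%}")
```

In this toy model, the GPU hits the 50% idle mark as soon as the exchange takes as long as the compute step itself.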
As mentioned, this is critical in AI cluster networking. Fortunately, there are several solutions that can ease this pain, some more suitable than others.
And we’ll discuss them next week.
Stay tuned!