May 2, 2024

VP of Product Marketing

2024 OCP Regional Summit – My Takeaways

From Microsoft’s vision of future innovation to ByteDance’s test results for an Ethernet scheduled fabric.

Lisbon is a lovely city (though I’m not a big fan of sweet pastries like its famous pastel de nata).  I made the trip there last week for the 2024 OCP Regional Summit and did not regret it.

It was a great opportunity to get up to speed on the latest industry innovations, in particular the latest AI and data center developments from hyperscalers and leading vendors.

Here’s what I took away from the event, mainly from industry leaders such as Microsoft, Meta, Intel, NVIDIA and Broadcom.

Microsoft: Trends driving innovation – generative AI, security, and a holistic view of infrastructure

According to Bryan Kelly of Microsoft, there are five key trends that will fuel the next wave of innovation: generative AI; two aspects of security – quantum-resilient cryptography and confidential cloud; and, most interesting for me, two aspects of infrastructure – physical infrastructure and composable architectures.

The composable architectures concept is a holistic view of infrastructure and innovation. It extends the term “full stack” beyond software to a more holistic view that includes hardware, optics and networking, as well as sustainability and security.

This puts all the components of the system together into a composable architecture that needs to be optimized, which is especially important in generative AI clusters that are growing bigger and bigger.

Meta: Larger datasets call for larger training clusters

The reason that those AI training clusters are growing bigger and bigger was detailed by Meta’s Andrew Alduino, who mentioned the launch of two 24K GPU clusters by Meta this month.

It turns out that training an LLM (large language model) for more time/days can take you only so far. To really improve training results, you need a larger (and larger) dataset. Training on a larger dataset in a reasonable amount of time, though, requires a larger training infrastructure (GPU cluster), which in turn leads to an I/O bottleneck.
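
To put rough numbers on the “larger dataset needs a larger cluster” point, here’s a minimal back-of-envelope sketch (my own illustration, not Meta’s figures). It uses the common ~6 × parameters × tokens approximation for training FLOPs; the model size, token counts and per-GPU throughput below are assumed values chosen only to show the trend.

```python
# Back-of-envelope: why bigger datasets push you toward bigger GPU clusters.
# Assumptions (illustrative only): training FLOPs ~= 6 * params * tokens,
# and ~400 TFLOPS of sustained useful training throughput per GPU.

PARAMS = 70e9                        # hypothetical 70B-parameter LLM
SUSTAINED_FLOPS_PER_GPU = 400e12     # assumed sustained throughput per GPU
TARGET_DAYS = 30                     # keep wall-clock training time fixed

def gpus_needed(tokens: float) -> float:
    """GPUs required to finish training in TARGET_DAYS at the assumed throughput."""
    total_flops = 6 * PARAMS * tokens
    flops_per_gpu = SUSTAINED_FLOPS_PER_GPU * TARGET_DAYS * 24 * 3600
    return total_flops / flops_per_gpu

for tokens in (1e12, 5e12, 15e12):   # growing dataset sizes, in tokens
    print(f"{tokens/1e12:>3.0f}T tokens -> ~{gpus_needed(tokens):,.0f} GPUs")
```

The relationship is linear: keep the training time fixed, and every increase in dataset size translates directly into a larger cluster – which is exactly where the I/O bottleneck comes from.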

Intel: Performance at scale

Saurabh Kulkarni of Intel also talked about the growing size of GPU clusters. His point was that we should stop looking at GPUs or servers as the basic unit of compute and instead look at the complete data center as the basic unit, or component. So, the computer is no longer the computer…

Broadcom: The network is the computer

According to S. Kamran Naqvi of Broadcom, the network is the computer.

There’s been a major shift in how applications and hardware intertwine. Traditionally, multiple apps run on a single CPU, as in any computer, server or cloud instance we know. Now, with AI, a single app (or job) runs on multiple compute instances (e.g., GPGPUs), which is why generative AI (gen AI) clusters are growing in size.

This means that a critical part of the infrastructure is the one that allows all those GPUs to work together as a single compute resource – namely, the network.
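
As a minimal sketch of what “working together as a single compute resource” means in practice, here is an illustrative data-parallel gradient all-reduce using PyTorch’s torch.distributed (my own example, not something shown in the talk). Every training step, each GPU’s gradients have to cross the network before anyone can proceed, so fabric performance directly gates how well the GPUs are utilized.

```python
# Minimal illustration (not from the talk): one training job spread across
# many GPUs, kept in sync every step by an all-reduce over the network.
import os
import torch
import torch.distributed as dist

def sync_gradients(local_grad: torch.Tensor) -> torch.Tensor:
    # Sum this rank's gradients with everyone else's; each GPU waits here
    # until the network has moved all the data, so fabric performance
    # shows up directly as GPU idle time.
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)
    return local_grad / dist.get_world_size()

if __name__ == "__main__":
    # Assumes launch via torchrun, which sets RANK / WORLD_SIZE / LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    grad = torch.randn(1024, device="cuda")   # stand-in for a real gradient tensor
    avg_grad = sync_gradients(grad)
    dist.destroy_process_group()
```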

NVIDIA: The network defines the data center

Gilad Shainer of NVIDIA also highlighted the importance of networking. In fact, he asserted that “the network defines the data center.”

Broadcom and ByteDance: Scheduled fabric performs better

When it comes to which network is best for AI workloads, S. Kamran Naqvi of Broadcom presented a test conducted by ByteDance with Broadcom’s Jericho- and Ramon-based white boxes as well as DriveNets Network Cloud software.

The outcome of this benchmark was that this scheduled fabric, based on the OCP Distributed Disaggregated Chassis (DDC) architecture, provides higher all-to-all throughput than standard Ethernet. And the performance benefit grows as the cluster size grows.
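
For context, the collective being measured in such benchmarks is all-to-all, where every GPU exchanges a distinct slice of data with every other GPU. Below is a small timing-loop sketch of that traffic pattern using torch.distributed – my own illustration, not ByteDance’s actual benchmark; the message sizes and iteration count are arbitrary assumptions.

```python
# Tiny all-to-all timing loop (illustrative only, not the ByteDance benchmark).
# Each rank sends an equal slice of its send buffer to every other rank.
import os
import time
import torch
import torch.distributed as dist

def time_all_to_all(bytes_per_peer: int = 8 << 20, iters: int = 20) -> float:
    world = dist.get_world_size()
    elems = (bytes_per_peer // 4) * world     # float32 elements, split evenly across ranks
    send = torch.rand(elems, device="cuda")
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)        # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_to_all_single(recv, send)    # every rank talks to every other rank
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    wire_bytes = bytes_per_peer * (world - 1) # bytes this rank sends over the network per iteration
    return wire_bytes * iters / elapsed       # rough per-rank send throughput, bytes/s

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")   # assumes launch via torchrun
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    print(f"rank {dist.get_rank()}: ~{time_all_to_all() / 1e9:.2f} GB/s per rank")
    dist.destroy_process_group()
```

The point of the benchmark is that, for exactly this traffic pattern, the per-GPU throughput number stays high on the scheduled fabric even as the number of ranks grows.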

Summary of My Takeaways

A lot of information from a two-day event, but I left the OCP Regional Summit with a few major takeaways:

  1. Generative AI, and AI training in particular, is the next big challenge (OK, I didn’t really need to go to Lisbon for this one…)
  2. The network defines the data center – AI clusters need to grow as datasets grow, which makes the network more critical
  3. For these data centers to work well and for the GPU infrastructure to be utilized optimally, you need the best-performing network, at scale. This one has to be based on a scheduled fabric, and DriveNets’ software has been proven for that.

Download white paper

Utilizing Distributed Disaggregated Chassis (DDC) for Back-End AI Networking Fabric
