It was a great opportunity to get up to speed with the latest industry innovations, in particular hyperscalers and leading vendors’ latest AI and data center developments.
Here’s what I took from the event, mainly from industry leaders such as Microsoft, Meta, Intel, Nvidia and Broadcom.
Microsoft: Trends driving innovation – generative AI, security, and a holistic view of infrastructure
According to Bryan Kelly of Microsoft, five key trends will fuel the next wave of innovation: generative AI; two aspects of security – quantum-resilient cryptography and confidential cloud; and, most interesting for me, two aspects of infrastructure – physical infrastructure and composable architectures.
The composable architectures concept takes a holistic view of infrastructure and innovation. It extends the term “full stack” beyond software to include hardware and optical networking, as well as sustainability and security.
This puts all the components of the system together into a composable architecture that needs to be optimized, which is especially important in generative AI clusters that are growing bigger and bigger.
Meta: Larger datasets call for larger training clusters
The reason that those AI training clusters are growing bigger and bigger was detailed by Meta’s Andrew Alduino, who mentioned the launch of two 24K GPU clusters by Meta this month.
It turns out that training an LLM (large language model) for more time/days can take you only so far. To really improve training results, you need a larger (and larger) dataset. Training on a larger dataset, though, requires a larger training infrastructure (GPU cluster), which leads to an I/O bottleneck.
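To make the scaling pressure concrete, here is a rough back-of-envelope sketch (not from the talk) using the commonly cited ~6 × parameters × tokens FLOPs estimate for dense transformer training. The model size, token count, per-GPU throughput, and utilization figures are all illustrative assumptions.

```python
# Back-of-envelope: why bigger datasets push toward bigger GPU clusters.
# Uses the common ~6 * parameters * tokens FLOPs estimate for dense
# transformer training; all numbers below are illustrative assumptions.

def training_days(params, tokens, num_gpus,
                  flops_per_gpu=1e15, utilization=0.4):
    """Rough wall-clock training time in days for a dense model."""
    total_flops = 6 * params * tokens          # total training compute
    effective = num_gpus * flops_per_gpu * utilization
    return total_flops / effective / 86400     # seconds -> days

# A hypothetical 70B-parameter model trained on 2T tokens:
print(training_days(70e9, 2e12, 1024))   # roughly 24 days on 1,024 GPUs
print(training_days(70e9, 2e12, 24576))  # about a day on a 24K-GPU cluster
```

The point of the arithmetic: once the dataset (token count) grows, the only way to keep training time reasonable is to add GPUs, and every added GPU adds traffic the network has to carry.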
Intel: Performance at scale
Saurabh Kulkarni of Intel also talked about the growing size of GPU clusters. His point: we should stop looking at individual GPUs or servers as the basic unit of compute, and instead look at the complete data center as the basic unit, or component. So, the computer is no longer the computer…
Broadcom: The network is the computer
According to S. Kamran Naqvi of Broadcom, the network is the computer.
There’s been a major shift in how applications and hardware intertwine. Traditionally, multiple apps run on a single CPU, as in any computer and server (and cloud instance) we know. Now, with AI, a single app (or job) runs on multiple compute instances (e.g., GP-GPUs), which is why generative AI (gen AI) clusters are growing in size.
This leads us to the fact that an important part of the infrastructure is the one that allows those GPUs to work together as a single compute resource – namely the network.
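A small sketch (my own illustration, not from the talks) of why the fabric becomes the constraint: in an all-to-all collective, every GPU in the job exchanges a shard with every other GPU, so the number of simultaneous flows grows roughly with the square of the cluster size. The cluster sizes below are illustrative.

```python
# Sketch: why a single job spanning N GPUs stresses the network.
# In an all-to-all exchange, every GPU sends a distinct shard to every
# other GPU, so flow count grows roughly with N^2.

def all_to_all_flows(n_gpus: int) -> int:
    """Distinct sender->receiver pairs in one all-to-all exchange."""
    return n_gpus * (n_gpus - 1)

for n in (256, 2048, 24576):
    print(f"{n:>6} GPUs -> {all_to_all_flows(n):,} flows")
```

Quadratic flow growth is why fabric behavior, not GPU count alone, ends up determining how well the cluster performs as one compute resource.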
NVIDIA: The network defines the data center
Gilad Shainer of NVIDIA also highlighted the importance of networking. In fact, he asserted that “the network defines the data center.”
Broadcom and ByteDance: Scheduled fabric performs better
When it comes to which network is best for AI workloads, S. Kamran Naqvi of Broadcom presented a test conducted by ByteDance with Broadcom’s Jericho- and Ramon-based white boxes as well as DriveNets Network Cloud software.
The outcome of this benchmark was that this scheduled fabric, based on the OCP Distributed Disaggregated Chassis (DDC) architecture, provides higher all-to-all throughput than standard Ethernet. And the performance benefit grows as the cluster size grows.
Summary of My Takeaways
A lot of information from a two-day event, but I left the OCP Regional Summit with a few major takeaways.
- Generative AI, and AI training in particular, is the next big challenge (OK, I didn’t really need to go to Lisbon for this one…)
- The network defines the data center – AI clusters need to grow as datasets grow, which makes the network more critical
- For data centers to work well and for the GPU infrastructure to be utilized optimally, you need the best-performing network, at scale. This has to be based on a scheduled fabric, and DriveNets’ software has been proven for that.
Download white paper
Utilizing Distributed Disaggregated Chassis (DDC) for Back-End AI Networking Fabric