From experiments to mass produced AI clusters
Not that there weren’t any massive AI clusters built in 2024 – but in the last couple of years, most hyperscalers, and some enterprises, were essentially experimenting with AI infrastructure. Don’t get me wrong: most of what was built (and a lot was built) was fully production-oriented. Yet in terms of AI infrastructure strategy, the big players were still in experimental mode, building multiple clusters with different underlying technologies (cooling, networking, cabling, etc.).
I believe that in 2025 most of those architectural technologies will mature and streamline, so future AI clusters will be built with unified blueprints.
Hyperscale and/or AI cluster island consolidation
There are two different approaches to building AI clusters (mainly for training). The first is the hyperscale “the more the merrier” approach, which claims that every GPU added to the cluster yields marginal value, without limit. The second is the islands “optimal working point” approach, which claims that beyond a certain point (specifically, 8K GPUs in a single cluster) the marginal benefit is negligible.
While I can’t really judge which approach is right, I believe we’ll see some consolidation around the islands approach, with most clusters capped at around 8K GPUs. This means there will be many more clusters to build.
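The intuition behind the islands approach can be sketched with a toy Amdahl’s-law model of cluster scaling. Everything below is an illustrative assumption – the serial fraction in particular is made up for the example, not a measured value for any real training workload:

```python
# Toy Amdahl's-law model of diminishing returns when scaling a training
# cluster. The serial_fraction default is an illustrative assumption,
# not a measurement of any real workload.

def speedup(num_gpus: int, serial_fraction: float = 1 / 8192) -> float:
    """Amdahl's law: speedup over one GPU, given a fraction of the work
    (communication, synchronization, stragglers) that does not parallelize."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_gpus)

prev_n, prev_s = 1, speedup(1)
for n in [1024, 2048, 4096, 8192, 16384, 32768]:
    s = speedup(n)
    marginal = (s - prev_s) / (n - prev_n)  # speedup gained per GPU added
    print(f"{n:>6} GPUs: total speedup {s:7.0f}x, marginal gain/GPU {marginal:.3f}")
    prev_n, prev_s = n, s
```

Under these assumed numbers, the per-GPU marginal gain roughly halves with each doubling past 8K GPUs, which is the “optimal working point” argument in miniature; the hyperscale camp would counter that for their workloads the serial fraction is far smaller, so the curve stays nearly linear much longer.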
Supply chain bottleneck shifts to power supply
The main bottleneck over the past few years has been somewhere in Taiwan – specifically, NVIDIA’s GPU supply chain constraints – and it is what has been slowing the still-exponential growth of AI infrastructure. While this will remain a constraint in 2025, I believe the power shortage will eclipse it as the main bottleneck. Already today, power supply is the main consideration when building a new data center (or upgrading an existing one to support AI).
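A back-of-the-envelope estimate shows why power dominates data center planning. Every constant below (per-GPU draw, non-GPU server overhead, facility PUE) is an illustrative assumption, not a spec for any particular product:

```python
# Rough cluster power budget. All constants are illustrative assumptions.

def cluster_power_mw(num_gpus: int,
                     gpu_watts: float = 1000.0,     # assumed per-GPU draw
                     server_overhead: float = 0.5,  # assumed CPU/NIC/cooling fans per GPU-watt
                     pue: float = 1.3) -> float:    # assumed facility PUE
    """Estimated total facility power for an AI cluster, in megawatts."""
    it_watts = num_gpus * gpu_watts * (1.0 + server_overhead)
    return it_watts * pue / 1e6

print(f"8K-GPU cluster: ~{cluster_power_mw(8192):.0f} MW")
```

Even with these modest assumptions, a single 8K-GPU island lands in the tens of megawatts – the scale of a small power plant – which is why siting new clusters is increasingly a question of available power rather than available GPUs.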
AI cluster distributed across data centers
The power supply challenge also changes the way AI data centers are built, driving the massive introduction of distributed data centers: a single AI cluster is spread across multiple data centers to overcome the power supply constraint. This creates – or rather complicates – a new challenge…
Networking challenge and need for open fabric
While networking was a challenge from day one of the AI explosion era, it’s becoming a more significant challenge as this field evolves.
Distributed data centers pose a new challenge: extending a high-performance, low-latency, lossless, and predictable fabric to another site. But even without distribution, the need for a robust yet open fabric for both compute and storage is a significant consideration when building any type of AI infrastructure. This will be an even larger area of focus in 2025.
So, will I be right? Read my December 2025 recap post to find out…