[originally published on the 650 Group blog]
Size of the AI Networking Market for Ethernet and InfiniBand
To date, AI traffic and the associated networking opportunity remain relatively small. In 2022, the AI networking market reached $2B, with InfiniBand responsible for 75% of that revenue. By 2027, AI networking will surge past $10B in revenue, with Ethernet exceeding $6B; both Ethernet and InfiniBand will grow robustly over that period. At the same time, bandwidth for AI workloads will grow at over 100% per year, well above the typical data center bandwidth growth of 30-40% annually. This is key: AI will be the most significant growth driver in the Ethernet Switch market for the rest of the decade.
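To put those growth rates in perspective, here is a quick back-of-envelope calculation; the growth rates are the ones cited above, and the five-year horizon is our illustrative assumption.

```python
# Compound bandwidth growth for AI vs. traditional data center traffic,
# using the growth rates cited above. Illustrative only; the underlying
# figures are 650 Group market estimates.

ai_growth = 1.00        # >100% per year for AI workloads
dc_growth = 0.35        # midpoint of the typical 30-40% per year
years = 5               # roughly the 2022 -> 2027 window

ai_multiple = (1 + ai_growth) ** years   # ~32x
dc_multiple = (1 + dc_growth) ** years   # ~4.5x

print(f"AI bandwidth multiple over {years} years: {ai_multiple:.0f}x")
print(f"Typical DC bandwidth multiple over {years} years: {dc_multiple:.1f}x")
```

Compounding is what makes the gap dramatic: a 100%-per-year workload grows roughly 32x over five years, versus under 5x for traffic growing at typical data center rates.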
AI Networking Topologies are Different
AI clusters typically contain two distinct networks. The first, and more traditional, is the servers' external or outward-facing "front-end" network, which needs to be based on Ethernet and IP protocols because it faces the public Internet. The main difference with AI is the need to move large amounts of data into the cluster, so the pipe is larger than for a traditional web or email server. Future AI designs will drive multiple 112G SerDes lanes per server, manifesting as 100G or 400G ports. As a result, AI server speeds on this network will be one to two generations ahead of traditional computing.
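As a rough sketch of how SerDes lanes map to those port speeds, assuming approximately 100 Gb/s of usable Ethernet payload per 112G lane (the remainder being consumed by line coding and FEC overhead):

```python
# Rough mapping from SerDes lane count to nominal Ethernet port speed.
# A "112G" SerDes lane carries roughly 100 Gb/s of usable Ethernet
# payload, so port speed is approximately 100 Gb/s x lane count.
# A sketch for illustration, not a standards reference.

NOMINAL_PER_LANE_GBPS = 100  # usable rate per 112G SerDes lane

for lanes in (1, 2, 4, 8):
    print(f"{lanes} x 112G SerDes lane(s) -> "
          f"{lanes * NOMINAL_PER_LANE_GBPS}GE port")
# 1 -> 100GE, 2 -> 200GE, 4 -> 400GE, 8 -> 800GE
```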
The second, newer network is the internal or "back-end" network. This is a unique network connecting the AI cluster's resources together. Connecting the cluster's compute resources to shared storage and memory, and doing so rapidly and without deviations in latency, is critical to maximizing the cluster's performance. Future AI designs for this network will call for multiple 400G, 800G, or higher-speed ports per compute server.
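For a sense of scale, a minimal sketch of per-server back-end bandwidth, assuming a common design of one dedicated 400G NIC per GPU and eight GPUs per server (both figures are our assumptions for illustration):

```python
# Hypothetical back-end bandwidth per AI compute server, assuming one
# dedicated NIC per GPU. GPU count and port speed are assumptions.

gpus_per_server = 8
port_speed_gbps = 400          # could equally be 800 in next-gen designs

backend_gbps = gpus_per_server * port_speed_gbps
print(f"Back-end bandwidth per server: {backend_gbps} Gb/s "
      f"({backend_gbps / 1000:.1f} Tb/s)")   # 3200 Gb/s = 3.2 Tb/s
```

A single AI server on the back-end network can thus demand more fabric bandwidth than an entire rack of traditional compute servers.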
AI workloads are heavily dependent on this back-end network: packet loss or even jitter degrades workload performance as measured in JCT (Job Completion Time), because GPUs sit idle waiting on network resources. This calls for a predictable, lossless back-end networking solution, which at scale is a significant challenge for any networking technology.
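A toy model makes the JCT sensitivity concrete. All numbers below are invented for illustration; the point is that any added network delay during the communication phase of a training step is pure GPU idle time.

```python
# Toy model of how network stalls inflate Job Completion Time (JCT).
# A training step is compute + communication; extra network delay
# (retransmits after loss, tail latency from jitter) adds GPU idle
# time to every step. All figures are made up for illustration.

compute_ms = 80        # GPU busy time per training step
comm_ms = 20           # ideal collective-communication time per step
stall_ms = 10          # assumed added delay per step from loss/jitter
steps = 100_000        # steps in the job

ideal_jct = steps * (compute_ms + comm_ms)
actual_jct = steps * (compute_ms + comm_ms + stall_ms)

inflation = actual_jct / ideal_jct - 1
idle = (comm_ms + stall_ms) / (compute_ms + comm_ms + stall_ms)
print(f"JCT inflation from stalls: {inflation:.0%}")   # 10%
print(f"GPU idle fraction per step: {idle:.0%}")       # 27%
```

Because the stall penalty recurs on every step, even a small per-step delay compounds into a large, expensive increase in total job time.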
This is why AI networking needs new hardware and software solutions that increase the AI cluster's performance by maximizing the use of AI compute resources. Such a network can drive cost savings of up to 10% of the entire AI infrastructure.
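A back-of-envelope reading of that figure, under the assumption that GPU-hours dominate infrastructure cost and that the fabric improvement shows up as a roughly 10% JCT reduction:

```python
# Back-of-envelope for the "up to 10% of total AI infrastructure"
# figure: if a better fabric cuts JCT by ~10%, the same work needs
# ~10% fewer GPU-hours, and GPUs dominate cluster cost. The cluster
# cost is a hypothetical figure for illustration.

cluster_cost_musd = 100.0   # hypothetical total infrastructure cost
jct_reduction = 0.10        # assumed JCT improvement from the fabric

savings_musd = cluster_cost_musd * jct_reduction
print(f"Effective savings: ${savings_musd:.0f}M "
      f"on a ${cluster_cost_musd:.0f}M cluster")   # ~$10M
```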
DriveNets Enters the Market with a New Approach and a Proven Architecture
DriveNets enters the AI networking market with a unique and proven architecture. Current solutions are based either on an Ethernet Clos architecture, which is standards-based but cannot provide the required performance at scale, or on proprietary solutions that deliver the right scale but impose vendor lock-in. DriveNets Network Cloud utilizes the OCP DDC (Open Compute Project Distributed Disaggregated Chassis) architecture, which enables AI clusters to scale with very high performance while keeping JCT to a minimum (much lower than with standard Ethernet). We note that this architecture already runs the majority of AT&T's traffic in the US and scales beyond the current needs of AI in terms of node count. DriveNets brings this scale, along with impressive AI benchmarking results, to a market that is trying to find an optimal solution. We view their entry as a positive for the industry as vendors stake out their expertise in these new network designs.
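To illustrate how a DDC-style design scales, here is a hypothetical sizing sketch; the box counts and port counts are our assumptions for illustration, not DriveNets specifications.

```python
# Hypothetical sizing of a DDC-based back-end fabric. In a DDC,
# white-box packet forwarders (NCPs) act as distributed line cards and
# white-box fabric elements (NCFs) act as the chassis fabric, joined
# by a cell-based, scheduled fabric that keeps the network lossless.
# All counts below are assumptions, not DriveNets specifications.

ncp_count = 96          # assumed number of NCP white boxes
ports_per_ncp = 18      # assumed 400G GPU-facing ports per NCP
port_gbps = 400

gpu_ports = ncp_count * ports_per_ncp
total_tbps = gpu_ports * port_gbps / 1000
print(f"{gpu_ports} x {port_gbps}G GPU-facing ports "
      f"({total_tbps:.0f} Tb/s of back-end capacity)")
# 1728 x 400G ports (691 Tb/s)
```

The appeal of the model is that the whole cluster behaves like a single very large chassis: adding NCPs grows the port count without the loss and congestion behavior of a multi-hop Ethernet Clos.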
Second Half 2023 and Beyond
We expect AI to race beyond the largest Hyperscalers with a combination of on-premises and tier-2 cloud-based offerings. As we look toward early 2024, we expect next-generation designs from the Hyperscalers and first-generation designs in the enterprise to increase dramatically as each vertical and enterprise embraces an AI-led digitization/modernization effort.