Scale-out and scale-across – Ethernet is a done deal
For years, InfiniBand was the go-to networking technology for scale-out systems that required the highest-performance connectivity (such as AI backend networks). Indeed, it was the benchmark for scale-out performance.
But in the last couple of years, new Ethernet-based technologies have emerged as valid alternatives – and, as of this year’s OCP event, it appears to be a done deal. It’s no longer a matter of InfiniBand vs. Ethernet – it’s a matter of which Ethernet flavor is right for your use case (or budget).
Why is this the case? Ethernet, and specifically its high-performance variants (such as Fabric-Scheduled Ethernet), has not only closed the performance gap with InfiniBand but has even outperformed it in many use cases. If you have technology that performs better, is easier to source and operate, and is simpler to optimize, it’s a no-brainer for scale-out – not to mention for scale-across, for which only Ethernet-based systems are available.
Ethernet scale-out – Meta gives new names to existing solutions
Meta is the dominant voice at any OCP event (which is no surprise, as it is the driving force behind this organization). And Meta did not disappoint this time, as it was loud and clear in almost every discussion held at the event. When talking about an Ethernet backend network, it clearly defined the two main flavors of an Ethernet solution – DSF and NSF.
While we know DSF (Disaggregated Scheduled Fabric) and its synonyms (DDC, FSE, DES, etc.), NSF, which stands for Non-Scheduled Fabric, is a new term. At DriveNets, we tend to call this architecture ESE (Endpoint-Scheduled Ethernet), as opposed to FSE (Fabric-Scheduled Ethernet); I think our term is more accurate, as some scheduling is still done at the endpoints (i.e., at the NICs).
With those two flavors defined, Meta also hinted as to when to use what – DSF for small-to-medium-size clusters and NSF for large clusters. When thinking of this distinction, it’s important to realize that this is Meta-scale, where “medium-size” refers to tens of thousands of xPUs and “large” refers to hundreds of thousands of xPUs…
Scale-up – the next battleground
After winning the scale-out and scale-across use cases, Ethernet is now targeting scale-up. Traditionally, this very demanding domain has been a competition between proprietary solutions (e.g., NVLink) and standardization efforts (CXL, UAL, etc.).
Now, OCP is aiming to push Ethernet into this domain. Backed by multiple industry leaders, OCP recently announced ESUN (Ethernet for Scale-Up Networking). This new OCP networking project workstream aims to leverage the advances in Ethernet performance and capabilities by adapting Ethernet to the unique environment and requirements of scale-up.
OCP is going places
Beyond the highlights above, it’s worth mentioning that this year’s event was the busiest ever (with ~11K attendees). It seems that OCP, as an organization, is leveraging the momentum of AI. This should be interesting to follow…
Key Takeaways
Ethernet has officially surpassed InfiniBand as the preferred technology for scale-out and scale-across systems, offering better performance, easier sourcing, and simpler operation.
Meta introduced two Ethernet backend flavors—DSF (Disaggregated Scheduled Fabric) and NSF (Non-Scheduled Fabric)—clarifying their roles in small/medium vs. large-scale AI clusters.
The new ESUN (Ethernet for Scale-Up Networking) initiative marks Ethernet’s move into traditionally proprietary domains, signaling its growing dominance across all networking tiers.