Enterprises in pharma, biotech and other life science fields are at an infrastructure crossroads these days, as they require massive compute and storage infrastructure to run their research and analysis workloads. One implementation option is to rent compute/GPU power from cloud providers and/or GPU-as-a-service providers.
The other option is for enterprises to build their own infrastructure. In many cases, this is the option of choice for a couple of reasons. First, the need for these resources is ongoing rather than transient. Second, such infrastructure (either on-premises or at a datacenter) offers better availability and a better fit for the specific applications and workloads that the enterprise plans to run.
AI Infrastructure: It’s the network, stupid
When building such AI infrastructure in-house, multiple components need to be addressed:
- GPUs or other compute power sources – usually first in mind, as they account for a major piece of the entire project costs
- servers and compute peripherals
- on-server storage
- on-server networking (e.g., network interface controllers – NICs)
- storage, storage servers and storage networking
- networking infrastructure
- physical infrastructure, including racks, cabling, HVAC, etc.
Intuitively, the complexity of a component, or its effect on overall workload performance, is usually proportional (even linearly) to its cost. It turns out that there is one major exception – the networking component.
Networking is responsible for about 10% of the entire infrastructure cost. Yet the effort around network installation and, even more so, network fine-tuning to reach optimal workload performance (in terms of job completion time) can reach 80% of the overall time and effort.
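The outsized impact of network tuning on job completion time can be seen with a back-of-envelope model (the numbers below are hypothetical, not measurements): if each training iteration splits into a compute phase and a communication phase, congestion-induced tail latency stretches the communication phase and directly idles the GPUs.

```python
# Illustrative model (hypothetical numbers, not measurements): a training
# iteration = compute phase + communication phase; network tail latency
# stretches the communication phase while the GPUs sit idle.

def effective_gpu_utilization(compute_ms: float, comm_ms: float,
                              tail_stretch: float) -> float:
    """Fraction of wall-clock time the GPUs spend computing.

    tail_stretch: multiplier on the communication phase caused by
    congestion/packet loss (1.0 = perfectly clean fabric).
    """
    iteration_ms = compute_ms + comm_ms * tail_stretch
    return compute_ms / iteration_ms

# Hypothetical 70/30 compute/communication split per iteration:
clean = effective_gpu_utilization(70, 30, 1.0)      # well-tuned fabric
congested = effective_gpu_utilization(70, 30, 2.0)  # tail latency doubled

print(f"clean fabric:     {clean:.0%} GPU utilization")
print(f"congested fabric: {congested:.0%} GPU utilization")
```

Under these assumed numbers, doubling communication tails drops GPU utilization from 70% to roughly 54% – which is why a modest networking budget can dominate the tuning effort.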
That’s why when planning such an infrastructure, pharma, biotech and other life science enterprises should pay special attention to the networking part.
Selecting a networking fabric for AI Infrastructure
When it comes to the networking fabric, there are a few parameters to check before selecting the best one for your infrastructure:
- Performance: This should be measured in terms of workload performance, which reflects the ability to fully utilize the GPU resources. From the networking side, this calls for a scheduled solution that will eliminate congestion and packet loss, reduce tail latency, and recover quickly from any failure.
- Openness: This reflects the diversity of the supply chain, avoiding vendor lock-in and significantly shortening time to deployment. An open solution (e.g., Ethernet-based) also allows simplifying the network architecture by using the same network infrastructure for both the backend and the storage fabric.
- Robustness: In order to avoid surprises and stability issues, the selected solution needs to be robust and field-proven.
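These three parameters can be folded into a simple weighted decision matrix. The sketch below is a generic scaffold: the criterion weights and the per-candidate scores are hypothetical placeholders that an evaluating team would replace with its own assessments.

```python
# Generic weighted decision matrix for fabric selection.
# Weights and scores are hypothetical placeholders -- replace them
# with your own team's assessments of each candidate fabric.

CRITERIA = {"performance": 0.5, "openness": 0.3, "robustness": 0.2}

def rank(candidates: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return candidates sorted by weighted score (1-5 scale per criterion)."""
    scored = {
        name: sum(CRITERIA[c] * s for c, s in scores.items())
        for name, scores in candidates.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical example scores (illustration only):
fabrics = {
    "fabric A": {"performance": 5, "openness": 2, "robustness": 5},
    "fabric B": {"performance": 3, "openness": 5, "robustness": 4},
}
for name, score in rank(fabrics):
    print(f"{name}: {score:.1f}")
```

The weighting itself is a design choice: a team that prizes supply-chain diversity over raw job completion time would shift weight from performance to openness and may well arrive at a different ranking.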
Possible networking solutions include:
- the effectively proprietary InfiniBand solution from Nvidia
- plain-vanilla Ethernet solutions from multiple providers
- endpoint-scheduled networking solutions like Nvidia’s Spectrum-X and future Ultra Ethernet solutions
- fabric-scheduled networking solutions like DriveNets Network Cloud-AI, Arista’s DES, and others
The best networking fabric for AI Infrastructure
Out of the options mentioned above, InfiniBand fails on openness, while plain-vanilla Ethernet and, to some extent, endpoint-scheduled Ethernet fail on performance. Ultra Ethernet solutions are not yet field-proven. Fabric-scheduled solutions excel in performance; among them, when looking for an open and field-proven solution, at present there is a single solution that ticks all the boxes – DriveNets Network Cloud-AI.
Read more about DriveNets Network Cloud-AI Fabric-Scheduled Ethernet solution.