Enterprises in pharma, biotech and other life science fields are at an infrastructure crossroads these days, as they require massive compute and storage infrastructure to run their research and analysis workloads. One implementation option is to rent compute/GPU power from cloud providers and/or GPU-as-a-service providers.
The other option is for enterprises to build their own infrastructure. In many cases, this is the option of choice for a couple of reasons. First, the need for these resources is ongoing rather than transient. Second, such infrastructure (either on-premises or at a datacenter) offers better availability and a better fit for the specific applications and workloads that the enterprise plans to run.
AI Infrastructure: It’s the network, stupid
When building such AI infrastructure in-house, multiple components need to be addressed:
- GPUs or other compute power sources – usually first in mind, as they account for a major piece of the entire project costs
- servers and compute peripherals
- on-server storage
- on-server networking (e.g., network interface controllers – NICs)
- storage, storage servers and storage networking
- networking infrastructure
- physical infrastructure, including racks, cabling, HVAC, etc.
Intuitively, the complexity of a component, or its effect on overall workload performance, is usually proportional (even linearly) to its cost. It turns out that there is one major exception – the networking component.
Networking is responsible for about 10% of the entire infrastructure cost. Yet the effort around network installation and, even more so, network fine-tuning to reach optimal workload performance (in terms of job completion time) can reach 80% of the overall time and effort.
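The outsized impact of network tuning on job completion time can be seen with a back-of-envelope model (the numbers below are hypothetical, not measurements): if each training iteration splits into a compute phase and a communication phase, congestion-induced tail latency stretches the communication phase and directly idles the GPUs.

```python
# Illustrative model (hypothetical numbers, not measurements): a training
# iteration = compute phase + communication phase; network tail latency
# stretches the communication phase while the GPUs sit idle.

def effective_gpu_utilization(compute_ms: float, comm_ms: float,
                              tail_stretch: float) -> float:
    """Fraction of wall-clock time the GPUs spend computing.

    tail_stretch: multiplier on the communication phase caused by
    congestion/packet loss (1.0 = perfectly clean fabric).
    """
    iteration_ms = compute_ms + comm_ms * tail_stretch
    return compute_ms / iteration_ms

# Hypothetical 70/30 compute/communication split per iteration:
clean = effective_gpu_utilization(70, 30, 1.0)      # well-tuned fabric
congested = effective_gpu_utilization(70, 30, 2.0)  # tail latency doubled

print(f"clean fabric:     {clean:.0%} GPU utilization")
print(f"congested fabric: {congested:.0%} GPU utilization")
```

Under these assumed numbers, doubling communication tails drops GPU utilization from 70% to roughly 54% – which is why a modest networking budget can dominate the tuning effort.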
That’s why when planning such an infrastructure, pharma, biotech and other life science enterprises should pay special attention to the networking part.
Selecting a networking fabric for AI Infrastructure
When it comes to the networking fabric, there are a few parameters to check before selecting the best one for your infrastructure:
- Performance: This should be measured in terms of workload performance, which reflects the ability to fully utilize the GPU resources. From the networking side, this calls for a scheduled solution that will eliminate congestion and packet loss, reduce tail latency, and recover quickly from any failure.
- Openness: This reflects the diversity of the supply chain, avoiding vendor lock-in and significantly shortening time to deployment. An open solution (e.g., Ethernet-based) also allows simplifying the network architecture by using the same network infrastructure for both the backend and the storage fabric.
- Robustness: In order to avoid surprises and stability issues, the selected solution needs to be robust and field-proven.
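These three parameters can be folded into a simple weighted decision matrix. The sketch below is a generic scaffold: the criterion weights and the per-candidate scores are hypothetical placeholders that an evaluating team would replace with its own assessments.

```python
# Generic weighted decision matrix for fabric selection.
# Weights and scores are hypothetical placeholders -- replace them
# with your own team's assessments of each candidate fabric.

CRITERIA = {"performance": 0.5, "openness": 0.3, "robustness": 0.2}

def rank(candidates: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return candidates sorted by weighted score (1-5 scale per criterion)."""
    scored = {
        name: sum(CRITERIA[c] * s for c, s in scores.items())
        for name, scores in candidates.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical example scores (illustration only):
fabrics = {
    "fabric A": {"performance": 5, "openness": 2, "robustness": 5},
    "fabric B": {"performance": 3, "openness": 5, "robustness": 4},
}
for name, score in rank(fabrics):
    print(f"{name}: {score:.1f}")
```

The weighting itself is a design choice: a team that prizes supply-chain diversity over raw job completion time would shift weight from performance to openness and may well arrive at a different ranking.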
Possible networking solutions include:
- the effectively proprietary InfiniBand solution from Nvidia
- plain-vanilla Ethernet solutions from multiple providers
- endpoint-scheduled networking solutions like Nvidia’s Spectrum-X and future Ultra Ethernet solutions
- fabric-scheduled networking solutions like DriveNets Network Cloud-AI, Arista’s DES, and others
The best networking fabric for AI Infrastructure
Out of the options mentioned above, InfiniBand fails on openness, while plain-vanilla Ethernet and, to some extent, endpoint-scheduled Ethernet fail on performance. Ultra Ethernet solutions are not yet field-proven. Fabric-scheduled solutions excel in performance; among them, when looking for an open and field-proven solution, at present there is a single solution that ticks all the boxes – DriveNets Network Cloud-AI.
Read more about DriveNets Network Cloud-AI Fabric-Scheduled Ethernet solution.