Finance research, banking, trading and fintech companies are at an infrastructure crossroads these days, as they require massive compute and storage infrastructure to run their AI/ML workloads.
One implementation option is to rent compute/GPU power from cloud providers and/or GPU-as-a-service providers. The other option for finance companies is to build their own infrastructure. In many cases, this is the option of choice, for a couple of reasons. First, the need for these resources is ongoing rather than transient. Second, such infrastructure (either on-premises or at a datacenter) is more readily available and a better fit for the specific applications and workloads these companies plan to run.
It’s the network, stupid
When building such AI/ML infrastructure in-house, multiple components need to be addressed:
- GPUs or other compute resources – usually first in mind, as they account for a major share of overall project costs
- servers and compute peripherals
- on-server storage
- on-server networking (e.g., network interface controllers – NICs)
- storage, storage servers and storage networking
- networking infrastructure
- physical infrastructure, including racks, cabling, HVAC, etc.
Intuitively, one would expect the complexity of a component, or its effect on overall workload performance, to be roughly proportional (even linearly proportional) to its cost. It turns out there is one major exception – the networking component.
Networking is responsible for only about 10% of total infrastructure cost. Yet the effort of installing the network and, even more so, fine-tuning it to reach optimal workload performance (in terms of job completion time) can reach 80% of the overall time and effort.
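As a rough illustration of why a component that is only ~10% of the cost can dominate the economics, the sketch below estimates the effective cost of a useful GPU-hour as a function of network-induced GPU idle time. Apart from the 10% cost share cited above, all figures (budget, cluster size, utilization levels, one-year amortization) are illustrative assumptions, not measured data.

```python
# Illustrative back-of-the-envelope sketch: how network-induced GPU idle time
# inflates the effective cost of a useful GPU-hour.
# All numbers below are assumptions for illustration, not measured data.

CLUSTER_COST_USD = 100_000_000  # assumed total infrastructure budget
GPU_COUNT = 1_000               # assumed cluster size
HOURS_PER_YEAR = 8_760          # amortizing the budget over one year, for simplicity

def effective_cost_per_useful_gpu_hour(gpu_utilization: float) -> float:
    """Total cost divided by the GPU-hours that do useful work.

    gpu_utilization: fraction of GPU time not stalled on the network
    (congestion, packet loss, retransmissions, long tail latency).
    """
    useful_gpu_hours = GPU_COUNT * HOURS_PER_YEAR * gpu_utilization
    return CLUSTER_COST_USD / useful_gpu_hours

# A congested, lossy fabric vs. a well-tuned fabric (assumed utilization figures):
for label, utilization in [("congested fabric", 0.60), ("well-tuned fabric", 0.95)]:
    cost = effective_cost_per_useful_gpu_hour(utilization)
    print(f"{label}: ${cost:,.2f} per useful GPU-hour")
```

Under these assumed numbers, the gap in GPU utilization translates into roughly a 60% difference in the effective cost of every useful GPU-hour, which dwarfs the networking line item itself.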
That’s why when planning such an infrastructure, finance companies should pay special attention to the networking part.
Selecting a networking fabric
When it comes to the networking fabric, there are a few parameters to check before selecting the best one for your infrastructure:
- Performance: This should be measured in terms of workload performance, which reflects the ability to fully utilize the GPU resources. From the networking side, this calls for a scheduled solution that will eliminate congestion and packet loss, reduce tail latency, and recover quickly from any failure (see the sketch after this list for how these translate into workload-level metrics).
- Openness: This reflects the diversity of the supply chain, avoiding vendor lock-in and significantly shortening time to deployment. An open solution (e.g., Ethernet-based) also allows simplifying the network architecture by using the same infrastructure for the backend and the storage fabric.
- Robustness: In order to avoid surprises and stability issues, the selected solution needs to be robust and field-proven.
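To make the performance criterion concrete, one way to compare fabrics from the workload's point of view is to collect per-iteration completion times and look at the mean, the tail (e.g., p99), and the implied GPU utilization. The sketch below shows this idea; the sample timings and the compute-time figure are made up for illustration and are not tied to any specific vendor tool.

```python
# Illustrative sketch: evaluating a networking fabric by workload-level metrics
# (job completion time and tail latency) rather than raw link speed.
# The sample timings below are hypothetical.

import statistics

def summarize_iterations(iteration_times_s: list[float], compute_time_s: float) -> dict:
    """Summarize training-iteration timings measured on a given fabric.

    iteration_times_s: wall-clock time of each training iteration
    compute_time_s: pure compute time per iteration (no communication stalls)
    """
    times = sorted(iteration_times_s)
    p99 = times[min(len(times) - 1, int(0.99 * len(times)))]
    mean = statistics.mean(times)
    return {
        "mean_iteration_s": mean,
        "p99_iteration_s": p99,                     # tail latency seen by the job
        "gpu_utilization": compute_time_s / mean,   # fraction of time GPUs do useful work
    }

# Hypothetical comparison: a lossy fabric with long tails vs. a scheduled fabric.
lossy     = [1.00, 1.05, 1.10, 1.60, 2.40, 1.02, 1.15, 3.00, 1.08, 1.12]
scheduled = [1.00, 1.01, 1.02, 1.03, 1.02, 1.01, 1.02, 1.04, 1.02, 1.01]

for name, samples in [("lossy fabric", lossy), ("scheduled fabric", scheduled)]:
    print(name, summarize_iterations(samples, compute_time_s=0.95))
```

The point of such a comparison is that a fabric with occasional long-tail iterations drags down average GPU utilization even when its nominal bandwidth is identical.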
Possible networking solutions include:
- the practically proprietary InfiniBand solution from Nvidia
- plain-vanilla Ethernet solutions from multiple providers
- endpoint-scheduled networking solutions like Nvidia’s Spectrum-X and future Ultra Ethernet solutions
- fabric-scheduled networking solutions like DriveNets Network Cloud-AI, Arista’s DES, and others
The best networking fabric
Of the options mentioned above, InfiniBand falls short on openness, while plain-vanilla Ethernet and, to some extent, endpoint-scheduled Ethernet fall short on performance. Ultra Ethernet solutions are not yet field-proven. Fabric-scheduled solutions excel in performance, and when looking for one that is also open and field-proven, at present there is a single solution that ticks all the boxes – DriveNets Network Cloud-AI.
To learn more about this solution, click here.