Finance research, banking, trading and fintech companies are at an infrastructure crossroads these days, as they require massive compute and storage infrastructure to run their AI/ML workloads.
One implementation option is to rent compute/GPU power from cloud providers and/or GPU-as-a-service providers. The other option for finance companies is to build their own infrastructure. In many cases, this is the option of choice, for a couple of reasons. First, the need for these resources is ongoing rather than transient. Second, such infrastructure (either on-premises or at a datacenter) is more readily available and a better fit for the specific applications and workloads these companies plan to run.
It’s the network, stupid
When building such AI/ML infrastructure in-house, multiple components need to be addressed:
- GPUs or other compute resources – usually first in mind, as they account for a major share of overall project costs
- servers and compute peripherals
- on-server storage
- on-server networking (e.g., network interface controllers – NICs)
- storage, storage servers and storage networking
- networking infrastructure
- physical infrastructure, including racks, cabling, HVAC, etc.
Intuitively, one would expect the complexity of a component, or its effect on overall workload performance, to be roughly proportional (even linearly proportional) to its cost. It turns out there is one major exception – the networking component.
Networking is responsible for only about 10% of total infrastructure cost. Yet the effort of installing the network and, even more so, fine-tuning it to reach optimal workload performance (in terms of job completion time) can reach 80% of the overall time and effort.
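As a rough illustration of why a component that is only ~10% of the cost can dominate the economics, the sketch below estimates the effective cost of a useful GPU-hour as a function of network-induced GPU idle time. Apart from the 10% cost share cited above, all figures (budget, cluster size, utilization levels, one-year amortization) are illustrative assumptions, not measured data.

```python
# Illustrative back-of-the-envelope sketch: how network-induced GPU idle time
# inflates the effective cost of a useful GPU-hour.
# All numbers below are assumptions for illustration, not measured data.

CLUSTER_COST_USD = 100_000_000  # assumed total infrastructure budget
GPU_COUNT = 1_000               # assumed cluster size
HOURS_PER_YEAR = 8_760          # amortizing the budget over one year, for simplicity

def effective_cost_per_useful_gpu_hour(gpu_utilization: float) -> float:
    """Total cost divided by the GPU-hours that do useful work.

    gpu_utilization: fraction of GPU time not stalled on the network
    (congestion, packet loss, retransmissions, long tail latency).
    """
    useful_gpu_hours = GPU_COUNT * HOURS_PER_YEAR * gpu_utilization
    return CLUSTER_COST_USD / useful_gpu_hours

# A congested, lossy fabric vs. a well-tuned fabric (assumed utilization figures):
for label, utilization in [("congested fabric", 0.60), ("well-tuned fabric", 0.95)]:
    cost = effective_cost_per_useful_gpu_hour(utilization)
    print(f"{label}: ${cost:,.2f} per useful GPU-hour")
```

Under these assumed numbers, the gap in GPU utilization translates into roughly a 60% difference in the effective cost of every useful GPU-hour, which dwarfs the networking line item itself.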
That’s why when planning such an infrastructure, finance companies should pay special attention to the networking part.
Selecting a networking fabric
When it comes to the networking fabric, there are a few parameters to check before selecting the best one for your infrastructure:
- Performance: This should be measured in terms of workload performance, which reflects the ability to fully utilize the GPU resources. From the networking side, this calls for a scheduled solution that will eliminate congestion and packet loss, reduce tail latency, and recover quickly from any failure (see the sketch after this list for how these translate into workload-level metrics).
- Openness: This reflects the diversity of the supply chain, avoiding vendor lock-in and significantly shortening time to deployment. An open solution (e.g., Ethernet-based) also allows simplifying the network architecture by using the same infrastructure for the backend and the storage fabric.
- Robustness: In order to avoid surprises and stability issues, the selected solution needs to be robust and field-proven.
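To make the performance criterion concrete, one way to compare fabrics from the workload's point of view is to collect per-iteration completion times and look at the mean, the tail (e.g., p99), and the implied GPU utilization. The sketch below shows this idea; the sample timings and the compute-time figure are made up for illustration and are not tied to any specific vendor tool.

```python
# Illustrative sketch: evaluating a networking fabric by workload-level metrics
# (job completion time and tail latency) rather than raw link speed.
# The sample timings below are hypothetical.

import statistics

def summarize_iterations(iteration_times_s: list[float], compute_time_s: float) -> dict:
    """Summarize training-iteration timings measured on a given fabric.

    iteration_times_s: wall-clock time of each training iteration
    compute_time_s: pure compute time per iteration (no communication stalls)
    """
    times = sorted(iteration_times_s)
    p99 = times[min(len(times) - 1, int(0.99 * len(times)))]
    mean = statistics.mean(times)
    return {
        "mean_iteration_s": mean,
        "p99_iteration_s": p99,                     # tail latency seen by the job
        "gpu_utilization": compute_time_s / mean,   # fraction of time GPUs do useful work
    }

# Hypothetical comparison: a lossy fabric with long tails vs. a scheduled fabric.
lossy     = [1.00, 1.05, 1.10, 1.60, 2.40, 1.02, 1.15, 3.00, 1.08, 1.12]
scheduled = [1.00, 1.01, 1.02, 1.03, 1.02, 1.01, 1.02, 1.04, 1.02, 1.01]

for name, samples in [("lossy fabric", lossy), ("scheduled fabric", scheduled)]:
    print(name, summarize_iterations(samples, compute_time_s=0.95))
```

The point of such a comparison is that a fabric with occasional long-tail iterations drags down average GPU utilization even when its nominal bandwidth is identical.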
Possible networking solutions include:
- the practically proprietary InfiniBand solution from Nvidia
- plain-vanilla Ethernet solutions from multiple providers
- endpoint-scheduled networking solutions like Nvidia’s Spectrum-X and future Ultra Ethernet solutions
- fabric-scheduled networking solutions like DriveNets Network Cloud-AI, Arista’s DES, and others
The best networking fabric
Of the options mentioned above, InfiniBand falls short on openness, while plain-vanilla Ethernet and, to some extent, endpoint-scheduled Ethernet fall short on performance. Ultra Ethernet solutions are not yet field-proven. Fabric-scheduled solutions excel in performance, and when looking for one that is also open and field-proven, at present there is a single solution that ticks all the boxes – DriveNets Network Cloud-AI.
To learn more about this solution, click here.