When it comes to AI and telecom service providers (SPs), there are two main synergies to discuss, and they are very different from each other.
AI for networking – the bottom-line effect
The first, and more common, synergy between AI and telecom SPs has to do with how AI can change the operational structure of a telco. The introduction of AI-enabled tools can lead to significant automation in the entire operational process of the telco. This is actually a big step towards an autonomous network, which can lead to end-to-end operational automation and, in turn, redefine the OpEx model of any telco.
The main business effect of such an advancement is on the telco’s bottom line. By delivering a significant reduction in operational expenses, AI can substantially increase bottom-line profitability, even though, in this case, it does not generate new revenue streams.
Networking for AI – potential top-line upside
At the other end of a telco’s P&L statement lies another synergy. This time, the telco’s networking infrastructure can actually serve the AI boom. This potentially creates new revenue streams for SPs, which, let’s face it, are hungry for any type of new revenue.
SPs realize this potential by becoming neocloud providers of GPU as a service (GPUaaS), often referred to as a form of infrastructure as a service (IaaS) or AI as a service (AIaaS).
The idea is to leverage SPs’ infrastructure – including central office and edge sites, power, AC, and connectivity infrastructure – to propel the introduction of AI infrastructure services. Distributing GPUs at edge sites, for instance, can create a powerful AI inference infrastructure that can be leased to multiple customers, including enterprise customers, cloud providers, LLM lab foundations, and government entities.
This is even more appealing outside the US where multiple governments are seeking to secure domestic infrastructure for AI, and the “usual suspects” for such implementations are the local telcos.
GPUaaS – SP challenges
Though creating great potential for new revenues, this new line of business does not come without challenges for SPs wishing to pursue it:
- New type of networking: Building a neocloud infrastructure is very different from building a telco network. The networking requirements, in terms of quality of service (e.g., packet loss, jitter, tail latency, and failure recovery), are far stricter than what is usually required in a telecom network.
- New type of service: Service providers usually provide a connectivity service and, from time to time, a security service. Now they need to start providing a compute service (and a very specific and unique one), with new types of challenges and metrics such as “noisy neighbor,” “job completion time (JCT)” and “CCL bus bandwidth.” Offering this line of business therefore requires new expertise from SPs.
- New type of customers: Who are the potential customers for such a service? Target customers include cloud providers that offload workloads to external infrastructure, enterprises that prefer to run their workloads domestically but not build an on-prem infrastructure, and others. These are not typically the types of customers SP sales organizations are working with on a regular basis.
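To make one of these new metrics concrete, here is a minimal sketch of how collective-communication benchmarks in the style of NVIDIA’s nccl-tests derive a “bus bandwidth” figure from a measured all-reduce time. The message size, timing, and GPU count in the example are illustrative assumptions, not vendor figures:

```python
# Sketch: deriving "bus bandwidth" from a measured all-reduce,
# following the convention used by NCCL-style benchmarks.
# All numbers below are illustrative assumptions.

def allreduce_bus_bw(data_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth (bytes/s) for an all-reduce over n_ranks GPUs.

    algbw = data_bytes / time_s                 (algorithm bandwidth)
    busbw = algbw * 2 * (n - 1) / n             (ring all-reduce traffic factor)
    """
    algbw = data_bytes / time_s
    return algbw * 2 * (n_ranks - 1) / n_ranks

# Example: a 1 GiB all-reduce across 8 GPUs completing in 10 ms
bw = allreduce_bus_bw(1 << 30, 0.010, 8)
print(f"bus bandwidth: {bw / 1e9:.1f} GB/s")
```

The bus-bandwidth normalization makes results comparable across cluster sizes, which is exactly the kind of metric a GPUaaS operator will be asked to report in an SLA.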
To overcome these challenges, service providers typically partner with a vendor that is knowledgeable in both the service provider and AI infrastructure markets. Such a partnership can help SPs make the most of this business opportunity.
GPUaaS – SP checklist
To make SPs’ lives easier, here is a short checklist to follow when considering and building a GPUaaS activity.
Location
Identify an available space for the GPU infrastructure buildout. The site should offer three main assets:
- Power: Enough power to accommodate the long-term scale of GPUs. 1 MW of available power is enough for a cluster of roughly 500 to 1,500 GPUs (depending on GPU type and generation).
- HVAC (a must): A cooling system capable of managing the high-density heat generated by the planned infrastructure, preferably a liquid-cooling infrastructure.
- Connectivity: Dark fiber or free lambdas on a DWDM system are required to connect the AI infrastructure to its customers (in the case of inference services) or to other sites (in the case of a scale-across system).
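The power figure above can be sanity-checked with a back-of-envelope calculation. The per-GPU draw, server overhead factor, and PUE below are illustrative assumptions (actual values vary by GPU generation and cooling design), but the result lands inside the 500–1,500 range cited above:

```python
# Rough sizing sketch: how many GPUs fit into a given facility power budget.
# All per-GPU figures are illustrative assumptions, not vendor specifications.

def gpus_for_power(budget_mw: float, gpu_watts: float,
                   overhead_factor: float, pue: float) -> int:
    """Number of GPUs supportable by budget_mw megawatts of facility power.

    overhead_factor: per-GPU server overhead (CPUs, NICs, fans), e.g. 1.5x
    pue: power usage effectiveness of the site (cooling, conversion losses)
    """
    per_gpu_w = gpu_watts * overhead_factor * pue
    return int(budget_mw * 1_000_000 // per_gpu_w)

# 1 MW budget, ~700 W per GPU, 1.5x server overhead, PUE of 1.3
print(gpus_for_power(1.0, 700, 1.5, 1.3))
```

Running the same function with a denser, liquid-cooled design (lower PUE) or an older, lower-wattage GPU generation moves the answer toward the ends of the quoted range.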
Networking
Here, three networking domains need to be considered:
- Scale-out: This is the most important part of the GPU cluster, as it connects GPUs to each other across the entire datacenter. This connectivity needs to be predictable and lossless – specifically, very low tail latency, close-to-zero jitter, no packet loss, and very quick recovery from failures. InfiniBand is considered a benchmark for such performance, but Ethernet-based technologies that include scheduling mechanisms (either in the fabric or at the endpoints) can achieve similar, if not superior, performance figures.
- Scale-up: This is used for intra-rack connectivity and should show performance similar to scale-out, with higher throughput and lower latency. Here, NVIDIA’s NVLink is a leading technology, but other technologies, like OCP’s ESUN/SUE-T Ethernet-based alternative, are gaining traction.
- Scale-across: In case multiple sites are used to deploy a single GPU cluster (due to, for instance, lack of sufficient power in a single site), a network extension is required, with scale-out-like performance plus the capability of compensating for the latency induced by the distance between sites.
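Why does tail latency matter so much more here than in a typical telecom network? In synchronous distributed training, each iteration’s collective completes only when the slowest of its parallel flows finishes, so the effective latency is the maximum over flows, not the mean. A small simulation sketch (distribution parameters are illustrative assumptions) shows how a rare per-flow spike dominates at scale:

```python
# Sketch: why tail latency dominates job completion time (JCT) in
# synchronous distributed training. Each iteration finishes only when
# the slowest of n parallel flows completes (max over flows, not mean).
# Latency distribution parameters are illustrative assumptions.

import random

def effective_iter_latency(n_flows: int, base_ms: float,
                           tail_prob: float, tail_ms: float,
                           trials: int = 2_000) -> float:
    """Average over trials of the max-over-flows latency per iteration."""
    random.seed(0)
    total = 0.0
    for _ in range(trials):
        worst = 0.0
        for _ in range(n_flows):
            lat = base_ms + (tail_ms if random.random() < tail_prob else 0.0)
            worst = max(worst, lat)
        total += worst
    return total / trials

# A 1% chance of a 10x latency spike barely moves a single flow's average,
# but across 512 parallel flows almost every iteration hits the tail.
print(effective_iter_latency(n_flows=1,   base_ms=1.0, tail_prob=0.01, tail_ms=9.0))
print(effective_iter_latency(n_flows=512, base_ms=1.0, tail_prob=0.01, tail_ms=9.0))
```

This is the reason the scale-out fabric must be engineered for the p99 of the latency distribution, not the average, and why quick failure recovery is listed alongside jitter and packet loss.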
Leveraging existing assets
Beyond location assets, additional existing assets can be utilized to make an appealing business case:
- Connectivity infrastructure
- Networking know-how
- Supplier/supply-chain relationships and ownership
- Customer relationships
- OSS and BSS
Business case
Finally, a thorough business-case analysis should be performed, particularly in terms of demand, target markets, competitiveness, and cost structure.
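As a starting point for such an analysis, a back-of-envelope model ties the checklist together. Every figure below (CapEx per GPU, rental price, utilization, OpEx) is an illustrative assumption, not market data; the point is the structure of the calculation, not the numbers:

```python
# Back-of-envelope GPUaaS business-case sketch.
# All financial figures are illustrative assumptions, not market data.

HOURS_PER_YEAR = 8760

def payback_years(gpus: int, capex_per_gpu: float, price_per_gpu_hr: float,
                  utilization: float, opex_per_gpu_yr: float) -> float:
    """Years to recover CapEx from GPU rental margin."""
    capex = gpus * capex_per_gpu
    revenue_yr = gpus * price_per_gpu_hr * utilization * HOURS_PER_YEAR
    margin_yr = revenue_yr - gpus * opex_per_gpu_yr
    return capex / margin_yr

# 1,000 GPUs, $35k all-in CapEx each, $2.20/GPU-hour at 60% utilization,
# $4k/GPU/year OpEx (power, space, staff)
print(f"{payback_years(1000, 35_000, 2.20, 0.60, 4_000):.1f} years")
```

Note how sensitive the result is to utilization: this is where the demand and target-market questions above feed directly into the financial model, and where existing customer relationships can tip the case from marginal to attractive.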
