Heterogeneous AI – the next AI revolution

CPMT is the new TCO in the Token Economy

When validating the business case for a technology infrastructure, total cost of ownership (TCO) is usually the relevant metric, as it includes direct and indirect costs and creates a benchmark for a project of a given size. In the compute world, TCO is usually derived from the number of compute elements (e.g., the number of GPUs in a cluster) or from the rate of computation they support (e.g., floating-point operations per second, or FLOPS).

In AI infrastructure, this metric can be misleading or even flat-out wrong. Two seemingly equivalent setups, with the same number of GPUs and a similar FLOPS figure, can perform very differently in terms of what matters most: the number of tokens they can process in a given time. This difference is the outcome of networking performance, how well the GPUs are matched to the workload type, and other factors.

To benchmark an infrastructure accurately, a new measure is needed: cost per million tokens (CPMT). It takes the TCO figure and normalizes it to the actual number of tokens the infrastructure supports.
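
As a rough illustration of that normalization, the sketch below divides a TCO figure by the number of tokens processed over the same period, expressed in millions. The function names, the utilization factor, and every number in the example are assumptions made for illustration only; they are not measurements of any real cluster.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def lifetime_tokens(tokens_per_second: float, years: float, utilization: float = 1.0) -> float:
    """Tokens processed over the period, given a sustained throughput.

    `utilization` (0..1) is an assumed fraction of time the cluster is busy.
    """
    return tokens_per_second * utilization * years * SECONDS_PER_YEAR


def cost_per_million_tokens(tco_usd: float, tokens: float) -> float:
    """CPMT = TCO / (tokens / 1,000,000)."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return tco_usd / (tokens / 1_000_000)


if __name__ == "__main__":
    # Two hypothetical clusters with the same TCO (same GPU count, similar FLOPS)
    # but different sustained token throughput.
    tco = 10_000_000.0
    for name, tps in (("cluster_a", 200_000.0), ("cluster_b", 120_000.0)):
        tokens = lifetime_tokens(tps, years=3, utilization=0.6)
        print(f"{name}: CPMT = ${cost_per_million_tokens(tco, tokens):.4f}")
```

With identical TCO, the cluster that sustains the higher token throughput ends up with the lower CPMT, which is exactly the difference a GPU-count or FLOPS-based comparison would hide.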


Heterogeneous AI – no longer ‘one size fits all’

The world of AI is shifting from a training-dominant ecosystem to an inference-biased industry. This poses a significant challenge for AI infrastructure, because inference is a less resource-homogeneous process: its different stages (tokenization, prefill, decoding, detokenization) each require a very different mix of resources (e.g., compute and memory access).

This difference between the infrastructure requirements for training and inference has led several leading ASIC vendors to develop distinct products optimized for different stages of the AI workflow. Examples include AWS Trainium and Inferentia, Google’s TPU 8i and TPU 8t, and NVIDIA’s GPU and LPU offerings.
And these are just differentiated ASICs from the same vendor. A truly Heterogeneous AI architecture, optimized for any type of workload and, specifically, for any stage and substage within that workload, calls for a multi-vendor, multi-ASIC design. In such architectures, compute units from several vendors (those mentioned above as well as newer specialized ones, like Cerebras, Arm and others) are used in a single AI cluster.

Such clusters can yield a near-perfect CPMT, as they split the workload into stages and let each type of processing unit in the cluster handle the part of the workload it is best suited for.
This does not come without challenges, naturally. Running a single codebase over such a heterogeneous underlying infrastructure requires an abstraction layer that “translates” the code into the specific languages of the different compute units. Moreover, it requires more sophisticated scheduling and orchestration systems that assign the right resource to each part of the workload.
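
The stage-to-resource assignment described above can be sketched, in a deliberately simplified form, as picking the cheapest accelerator type for each stage. The accelerator names and per-stage cost figures below are hypothetical placeholders, and the sketch ignores capacity limits and the cost of moving data between accelerators, both of which a real scheduler would have to account for.

```python
from typing import Dict, Mapping


def assign_stages(stage_costs: Mapping[str, Mapping[str, float]]) -> Dict[str, str]:
    """Pick, for each stage, the accelerator type with the lowest cost.

    `stage_costs[stage][accelerator]` is an assumed cost per million tokens
    for running that stage on that accelerator type.
    """
    return {stage: min(costs, key=costs.get) for stage, costs in stage_costs.items()}


def total_cost(stage_costs: Mapping[str, Mapping[str, float]],
               assignment: Mapping[str, str]) -> float:
    """Sum the per-stage costs for a given stage-to-accelerator assignment."""
    return sum(stage_costs[stage][accel] for stage, accel in assignment.items())


if __name__ == "__main__":
    # Hypothetical numbers: accel_a is assumed compute-heavy (better at prefill),
    # accel_b is assumed memory-bandwidth-heavy (better at decoding).
    costs = {
        "prefill": {"accel_a": 0.4, "accel_b": 0.7},
        "decoding": {"accel_a": 0.9, "accel_b": 0.5},
    }
    mixed = assign_stages(costs)
    print("heterogeneous assignment:", mixed)
    print("heterogeneous cost:", total_cost(costs, mixed))
    for accel in ("accel_a", "accel_b"):
        single = {stage: accel for stage in costs}
        print(f"{accel} only:", total_cost(costs, single))
```

Under these assumed figures, the mixed assignment costs less than running every stage on either accelerator type alone, which is the economic argument for splitting the workload by stage.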

It is the networking…

And most of all, as usual, there is the networking part. We have already established that networking infrastructure is key to the performance of any AI cluster. With Heterogeneous AI, it becomes an even more important part of the architecture. To truly achieve the highest performance, the network needs to be not only robust (error-free, jitter-free, with low tail latency, etc.) but also full-stack optimized (i.e., optimized at the fabric, NIC, and CCL layers) for each and every type of GPU in the architecture, and capable of handling the uneven traffic patterns unique to Heterogeneous AI environments. Only when this is achieved can the economic promise of Heterogeneous AI be fulfilled, reaching the CPMT holy grail.
We believe in the value that Heterogeneous AI brings and are focusing our efforts on full-stack, end-to-end performance optimizations with multiple AI accelerators, as well as on multi-vendor heterogeneous clusters as a whole.

Key Takeaways

AI infrastructure economics must move beyond traditional TCO
In the token economy, the real benchmark is not just how many GPUs or FLOPS a cluster delivers, but how efficiently it processes tokens. CPMT – cost per million tokens – provides a more accurate measure of infrastructure value.
Inference is changing the infrastructure equation
As AI shifts from training-heavy workloads to inference-heavy deployment, infrastructure must support diverse processing needs across tokenization, prefill, decoding, and detokenization. A single accelerator type is no longer enough.
Heterogeneous AI is becoming essential for workload optimization
Multi-vendor, multi-ASIC architectures can assign each stage of the AI workflow to the most suitable processing unit, improving performance, efficiency, and CPMT.
Networking is the foundation that makes heterogeneous AI viable
To unlock the full economic promise of heterogeneous AI, the network must be robust, low-latency, and optimized across the full stack – fabric, NIC, and CCL – for different accelerator types and uneven traffic patterns.

Frequently Asked Questions

What is Cost Per Million Tokens (CPMT) in AI infrastructure benchmarking?

Cost per million tokens (CPMT) is a benchmarking metric that normalizes total cost of ownership (TCO) to the actual number of tokens supported by an AI infrastructure. This measure accurately evaluates technology infrastructure performance in a token economy, accounting for variables like networking performance and GPU workload adjustments that affect token processing efficiency.

Why is traditional Total Cost of Ownership (TCO) misleading for AI infrastructure?

Traditional Total Cost of Ownership (TCO) is misleading because two setups with identical GPU counts and FLOPS can deliver completely different token processing performance. AI infrastructure efficiency depends heavily on networking performance and workload-specific GPU adjustments rather than raw compute elements, making standard FLOPS benchmarks an unreliable indicator of actual operational capacity.

What are the infrastructure challenges of shifting from AI training to inference?

Shifting from AI training to inference is challenging because inference is less resource-homogeneous: its distinct stages require different resource mixes. Stages like tokenization, prefill, decoding, and detokenization vary drastically in their compute and memory access demands, which has led vendors to offer specialized ASICs such as AWS Trainium and Inferentia, Google's TPUs, and NVIDIA's GPU and LPU offerings.