Here come the Inferencing ASICs

The tidal wave of Generative AI (GenAI) has mostly consisted of training large language models (LLMs), like GPT-4, and the huge amount of compute needed to process these enormous datasets (GPT-4 reportedly has 1.76 trillion parameters).

This compute has mainly looked like NVIDIA's GPUs, but you also need...

  1. power

  2. networking

  3. capital, AND

  4. a nice cool place to host them (data center)


The looooooong tail of AI inferencing will dictate that compute is installed closer to where it's needed for latency-sensitive use cases, and that it becomes more cost-effective and more power-efficient.

GPUs are great at training and inferencing workloads; however, the demand for these chips means NVIDIA is able to price them high (an H100 is approximately $30k), with typical installations (10s to 100s of GPUs) running into the millions of dollars.
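
To put a rough number on that, here's a quick back-of-the-envelope sketch, using the ~$30k H100 figure above and some hypothetical cluster sizes; it covers GPU list price only, before networking, power, and the data center itself:

```python
# Rough GPU-only cost for a few illustrative cluster sizes,
# using the ~$30k-per-H100 figure quoted above.
h100_unit_price_usd = 30_000

for gpu_count in (16, 64, 256):
    print(f"{gpu_count:>3} GPUs -> ${gpu_count * h100_unit_price_usd:,}")

# 16 GPUs -> $480,000
# 64 GPUs -> $1,920,000
# 256 GPUs -> $7,680,000
```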

Given the high cost, it's no surprise that tech giants like Amazon, Google, Intel, Microsoft, Meta, and Tesla are developing their own silicon (based on ASICs) to enhance performance, efficiency, and scalability in AI applications.

Before exploring the specific ASIC innovations, it is helpful to understand how they compare to Graphics Processing Units (GPUs), which are commonly used for AI tasks:

  • Design Purpose: ASICs are custom-built for specific tasks, such as AI inferencing, making them highly efficient for those operations. In contrast, GPUs are more versatile and are designed to handle a variety of computational tasks, including graphics rendering and scientific computations.

  • Performance and Efficiency: ASICs often deliver higher performance and greater power efficiency for specific AI tasks compared to GPUs. This is because ASICs eliminate unnecessary functionality, optimizing their architecture for particular AI algorithms.

  • Flexibility: While GPUs offer greater flexibility and are suitable for a range of tasks, this comes at the cost of reduced efficiency for specific applications. ASICs, however, are not flexible but excel in the tasks they are designed for.

  • Cost and Development Time: Developing ASICs requires more time and resources upfront due to the need for custom design and fabrication. GPUs, being mass-produced for a broader market, often come with lower individual costs and are readily available.


Efficient inferencing solutions are crucial for real-time decision-making and advanced analytics in the realm of artificial intelligence (AI). Application-Specific Integrated Circuits (ASICs) represent a pivotal evolution in hardware, designed specifically to meet the intensive demands of AI inferencing tasks.


Amazon's Inferentia

Amazon Web Services (AWS) has made significant strides with its Inferentia2 processors, updated in late 2022 from the original Inferentia chips released four years earlier. Not to be confused with Trainium and Trainium2, which are positioned for (yes) training workloads, the Inferentia2 silicon is tailored for high-performance AI inferencing. By developing these processors in-house, AWS can provide better integration with its cloud services and offer more attractive pricing models to its customers. AWS's homegrown AI compute engines are part of a broader strategy to undercut competitors by leveraging custom silicon that reduces reliance on external suppliers like NVIDIA. This move not only enhances AWS's service offerings but also promises substantial cost savings and performance gains for end-users, particularly in large-scale AI deployments.
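
As a rough sketch of what that integration looks like in practice, the flow below uses AWS's Neuron SDK (the torch_neuronx package) to ahead-of-time compile a PyTorch model for the NeuronCores on an Inf2 instance. The model and shapes here are made up purely for illustration, and the exact API surface is whatever the current Neuron documentation says it is:

```python
import torch
import torch_neuronx  # AWS Neuron SDK bridge for PyTorch on Inferentia2

# Stand-in model purely for illustration; in practice this would be
# your trained network (e.g. a transformer loaded from a checkpoint).
model = torch.nn.Sequential(
    torch.nn.Linear(256, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
).eval()

example_input = torch.randn(1, 256)

# Ahead-of-time compile the model for NeuronCores (run this on an Inf2 instance).
neuron_model = torch_neuronx.trace(model, example_input)

# The compiled artifact can be saved and re-loaded like a TorchScript module.
torch.jit.save(neuron_model, "model_neuron.pt")
print(neuron_model(example_input).shape)
```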

Amazon is racing to catch up in generative A.I. with custom AWS chips

Google's Tensor Processing Units (TPUs)

In 2023, Google introduced the Cloud TPU v5e, its most cost-efficient and versatile TPU to date, designed to enhance efficiency and processing power for demanding AI and machine learning workloads. This new model aims to reduce computational times for both training and inferencing large AI models, thereby boosting productivity on Google Cloud. Following the TPU v5e, Google also announced the TPU v5p, which specifically targets training workloads. The release of these TPUs underscores Google's continuous efforts to advance its AI hardware capabilities, offering robust solutions for complex computing tasks. This development marks a significant advancement in high-performance AI computing resources, positioning Google Cloud as a compelling choice for organizations looking to scale their AI operations.
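
To give a feel for the developer experience, here's a minimal JAX sketch of the kind of code that runs unchanged on a Cloud TPU VM; the toy model and shapes are invented for illustration, and on a TPU host jax.devices() would list TPU cores rather than the CPU:

```python
import jax
import jax.numpy as jnp

# On a Cloud TPU VM this prints the available TPU cores; on a laptop, the CPU.
print(jax.devices())

@jax.jit
def predict(params, x):
    # Toy single-layer "model": jit-compiled via XLA and executed on
    # whatever accelerator backs the default JAX device.
    return jnp.tanh(x @ params["w"] + params["b"])

key = jax.random.PRNGKey(0)
params = {"w": jax.random.normal(key, (128, 64)), "b": jnp.zeros(64)}
x = jnp.ones((8, 128))

print(predict(params, x).shape)  # (8, 64)
```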

Inside a Google Cloud TPU Data Center - YouTube

Intel's Gaudi

Intel's latest release, the Gaudi 3 AI accelerator, revealed earlier this month, marks a significant advancement in its lineup and is poised to challenge NVIDIA's H100 GPU. Detailed in a recent Ars Technica article, the Gaudi 3 is engineered to significantly enhance both AI inferencing and training capabilities. It offers improved tensor processing and energy efficiency, making it ideal for data centers with intensive AI and machine learning workloads. This update reflects Intel's aim to boost performance while managing costs, appealing particularly to those upgrading their AI infrastructure.

The Gaudi 3 represents Intel's ongoing efforts to expand its influence in the high-performance computing sector. By optimizing for deep learning tasks, this chip demonstrates Intel’s focus on meeting the sophisticated demands of modern AI applications, positioning it as a competitive choice for enterprises focused on scalability and efficiency in their AI operations.
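
For a sense of how Gaudi slots into existing code, the sketch below follows the pattern in Intel/Habana's public PyTorch documentation, where the accelerator shows up as an "hpu" device. The model is a placeholder, and the import path and module names may differ between SynapseAI releases:

```python
import torch
# Habana's PyTorch bridge registers the "hpu" device type; the import path
# below follows the public SynapseAI docs and may vary by release.
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")

# Placeholder network purely for illustration.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).to(device).eval()

with torch.no_grad():
    x = torch.randn(32, 512, device=device)
    logits = model(x)
    htcore.mark_step()  # flush the lazily-built graph to the accelerator

print(logits.shape)
```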

Intel® Gaudi® 3 AI Accelerator

Here's a quick table with as much information as is publicly available.

An NVIDIA H100 GPU runs up to 700W and produces 1980 TOPS (INT8).
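
The figure that matters most for inferencing silicon is performance per watt, so here's the quick division implied by those numbers (peak dense INT8 throughput at maximum board power, so a best-case ceiling rather than a real-world measurement):

```python
# Back-of-the-envelope efficiency for the H100 figures quoted above.
h100_int8_tops = 1980   # peak dense INT8 TOPS
h100_power_watts = 700  # maximum board power

print(f"{h100_int8_tops / h100_power_watts:.2f} INT8 TOPS per watt")  # ~2.83
```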


Others

Microsoft's Maia 100 chip is part of its strategy to enhance AI inferencing within Azure. While performance details are not publicly available, the chip is designed to integrate tightly with Azure's infrastructure, improving the platform's AI capabilities. Meanwhile, Meta is developing custom AI chips to better support its real-time applications and reduce its dependency on external hardware suppliers. These efforts aim to improve the integration of Meta's hardware and software, enhancing user interactions across its platforms.

Tesla's in-house AI chips are key to its autonomous driving technology, processing data from multiple sensors to support real-time decision-making in its vehicles. This development reflects Tesla's approach of maintaining control over its technology, which is crucial to the functionality of its electric vehicles.



Wrapping it up

The advancements in ASIC technology for AI inferencing are reshaping the capabilities of AI applications, enabling faster, more efficient decision-making and analytics.

As the table above shows, ASIC silicon is relatively immature compared to NVIDIA's GPU journey; however, it is improving quickly (NVIDIA's original Tensor Core technology was introduced with the Volta architecture back in 2017).

The main challenge with ASICs is that if you want to roll your own infrastructure, you really only have one option, Intel; otherwise you'll be locked into the cloud ecosystem that Amazon, Google, or Microsoft provide, at least if you want to avoid the burden of wrangling low-level ASIC software yourself.

This isn't too different from the GPU route, where the obvious choice is NVIDIA, although AMD is at least trying to keep them honest with its MI300X and successor chips.

As companies like Amazon, Google, Intel, Microsoft, Meta, and Tesla continue to innovate, the landscape of AI hardware is set to evolve, driving forward the potential of AI across all facets of life. These custom chips represent not only a technical evolution but also a strategic imperative to harness the full potential of AI.
