AI: Essential Considerations for Hosting Your Own Models
Artificial Intelligence (AI) has become a pervasive buzzword in the industry, evoking various reactions, including an amusing supercut of Sundar Pichai at Google's 2023 I/O...
AI demands high-powered compute, which necessitates effective cooling, with immersion cooling being a popular choice. However, this article will focus on other aspects of AI.
Given the immense hype surrounding AI, it's essential to clarify that AI covers a broad spectrum. It is sometimes used interchangeably with Machine Learning and Visual Computing, which are their own distinct areas. Machine Learning encompasses applications like Amazon Alexa, self-driving cars, and email spam filtering, while Visual Computing includes Metaverse applications and Digital Twins like NVIDIA's Earth 2.
When people mention AI, they are usually referring to Large Language Models (LLMs), with ChatGPT and LLaMA being two prominent examples. I won't delve into their pros and cons or open/closed-source distinctions, but will use them as examples.
At a high level, people often seek answers to questions like:
How do I host my own PrivateGPT?
What's the difference between training and inferencing?
Do I need to upgrade my network?
Can I use my existing CPUs?
Let's delve into each of these areas.
How do I host my own PrivateGPT?
Many individuals are interested in creating their own ChatGPT with private access and sovereignty. Let's take the example of the largest GPT-3.5 model, gpt-3.5-turbo, which was trained on approximately 17TB of text data and contains around 175 billion parameters. However, running such a model can be expensive, as estimated by Tom Goldstein in December 2022.
Although model optimizations and faster hardware (such as the H100) have arrived since then, it's essential to consider the cost of running large models. Additionally, newer models like GPT-4, with roughly 10x more parameters (1.7 trillion) and data, continue to emerge.
For those looking to build out their own AI, using pre-trained, open-source models like Meta's LLaMA2 could be a cost-effective option. Although not an exact comparison, using Tom's estimation logic, a model like LLaMA2 with 40% of ChatGPT's size and 1% of its daily queries (e.g., 100k) might cost around $400 per day on Azure, assuming the same $3 per hour rate from December 2022.
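As a rough back-of-envelope sketch (not a quote), the scaling behind that number looks something like this. The ChatGPT baseline cost and query volume are assumptions backed out of the figures above, and the whole thing assumes cost scales roughly linearly with model size and query volume.

```python
# Back-of-envelope sketch of the LLaMA2 hosting estimate above.
# All figures are illustrative assumptions, not quoted prices.

chatgpt_daily_cost_usd = 100_000     # assumed baseline, implied by the $400/day figure above
chatgpt_daily_queries  = 10_000_000  # so that 1% is roughly 100k queries/day

relative_model_size = 0.40   # LLaMA2 at ~40% of ChatGPT's size
relative_queries    = 0.01   # 1% of ChatGPT's daily query volume

# Crude assumption: cost scales roughly linearly with model size and query volume.
llama2_daily_cost_usd = chatgpt_daily_cost_usd * relative_model_size * relative_queries
print(f"Estimated daily cost: ${llama2_daily_cost_usd:,.0f}")  # ~$400/day
```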
What's the difference between training and inferencing?
AI comprises two main aspects: training and inferencing.
Training involves using diverse data to create a model, while inferencing entails using the model to generate predictions or responses.
ChatGPT (GPT-3.5) is known to have used 10,000 GPUs (A100s) and several months of training. For small companies, using off-the-shelf, pre-trained models and running inferencing is the more cost-effective option. Medium and large companies that want to train their own models can consider renting compute resources from hyperscalers or GPU cloud companies.
Training typically requires significant GPU memory, while inferencing can be accomplished on more modest hardware, including CPUs. Larger or more real-time inferencing use cases might require higher-end or specialized hardware like GPUs.
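To make that distinction concrete, here is a minimal PyTorch sketch (the tiny model and random data are made up purely for illustration): a training step has to hold gradients and optimizer state in memory on top of the model weights, while inference only needs a forward pass, which is why it fits on far more modest hardware.

```python
import torch
import torch.nn as nn

# Toy model; real LLMs have billions of parameters, but the mechanics are the same.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

# --- Training step: forward + backward + optimizer update ---
# Gradients and optimizer state add a large memory overhead on top of the
# weights, which is why training usually lands on large-memory GPUs.
model.train()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()

# --- Inference: forward pass only, no gradients kept ---
# This is the part that can run on CPUs or smaller GPUs.
model.eval()
with torch.no_grad():
    prediction = model(x).argmax(dim=-1)
print(prediction)
```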
Do I need to upgrade my network?
NVIDIA promotes InfiniBand for networking, especially with its A100 and H100 chips, which work optimally with this technology. Combined with NVLink/NVSwitch inside a node, InfiniBand creates a fabric that pools GPUs across nodes, but it has some downsides, such as being proprietary and expensive.
When evaluating networking options, consider your use case and how your GPUs communicate with each other. For example, NVIDIA GPUs can be bought with PCIe or NVLink connectivity, and NVSwitch allows for interconnecting nodes.
To keep things simple, networking choices can range from:
10-100G for 1-16 GPUs using PCIe
100G/200G+ for 16-60 GPUs using NVLink
200/400G+ for 100+ GPUs requiring the fastest speed and lowest latency.
Note: Be sure to consider the availability of networking gear before deciding on a path for upgrading or augmenting your existing network, as the supply chain for high-demand networking components is still very lumpy.
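One rough way to size the network is to estimate the gradient traffic generated each training step when GPUs synchronise with an all-reduce. The sketch below is illustrative only; the parameter count, gradient precision, and step time are assumptions you would replace with your own.

```python
# Rough sketch: per-GPU network traffic for synchronising gradients
# with a ring all-reduce. All figures are illustrative assumptions.

params = 7e9            # e.g. a 7B-parameter model
bytes_per_grad = 2      # fp16/bf16 gradients
num_gpus = 16
step_time_s = 1.0       # assumed time per training step

grad_bytes = params * bytes_per_grad
# A ring all-reduce moves roughly 2 * (N-1)/N of the gradient volume per GPU.
traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * grad_bytes

gbit_per_s = traffic_per_gpu * 8 / step_time_s / 1e9
print(f"~{gbit_per_s:.0f} Gbit/s per GPU just for gradient sync")  # ~210 Gbit/s
```

Even this crude estimate shows why multi-node training quickly pushes past what a 10/25G network can handle.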
While InfiniBand has its keen followers and vocal opponents, there are ongoing efforts by companies like Arista, Cisco, Broadcom, AMD, and Intel to compete with InfiniBand through the Ultra Ethernet Consortium (UEC).
Expect products based on UEC standards to be available in 2024.
Can I use my existing CPUs?
While AI infrastructure primarily focuses on specialized hardware like GPUs, FPGAs, and ASICs, don't overlook the role of general-purpose CPUs. Existing CPUs can be utilized to support AI infrastructure, performing tasks such as dataset aggregation, querying, hosting value-added AI services, or running some inferencing.
Although CPUs are less efficient than GPUs, FPGAs, and ASICs in terms of performance per watt, they remain important components in an AI setup.
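For example, a small open-source model can be run for inferencing on an ordinary CPU. The sketch below uses the Hugging Face transformers pipeline as one common route; the model name is just an example, and larger models generally need quantisation (e.g., llama.cpp/GGUF) to run comfortably on CPUs.

```python
# Minimal sketch: text-generation inference on CPU only.
# The model name is illustrative; pick whatever fits your licence and hardware.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2", device=-1)  # -1 = CPU
print(generator("Hosting your own model means", max_new_tokens=30)[0]["generated_text"])
```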
Summing it up, hosting your own AI models involves careful consideration of factors such as cost, hardware capabilities, networking, and the role of CPUs. New and updated models are released frequently, as are efforts to optimize them for less costly, more readily available hardware. Qualcomm has announced it is working with Meta to optimize LLaMA2 for on-device AI, which will flow through to Android phones next year, and MLC-LLM can run natively on your iPhone today.
As AI technology progresses, staying informed and optimizing your approach will be crucial for a successful implementation.