OCP 2023 Wrap (1 of 2)

As a follow-up to my last article, the lovely folks at OCP published over 500 videos from the conference. FIVE HUNDRED.

I didn't anticipate the sheer number of sessions I'd be interested in, and after going down numerous rabbit holes, I plucked out roughly 50 relating to data center cooling, high-speed networking, and sustainability, with the hot topic and undercurrent, without question, being AI.

Another theme was the disaggregation of systems.

Essentially, this means breaking the shackles of traditional server design and moving components outside the chassis. Storage has done this for decades in the form of arrays and appliances; the same approach is now extending to other technologies, such as memory/RAM and GPUs, using the PCIe interface.

CXL is the underlying technology in OCP-land, and this open standard allows for some unique and super-dense designs - for example, a 2U shelf with 16 GPUs. It's a great innovation with lots of use cases, and interesting problems of its own to solve (tail latencies, for example).

CXL will be a topic for another day, and perhaps a future article.


On with the show...


In this article, I'll cover two high-level themes, Infrastructure and Networking, with Sustainability and Disaggregation to follow in tomorrow's article.


INFRASTRUCTURE

Hardware Architecture & Evolution - YouTube

Dermot shows what I've been harping on about for a while.

The top end is training: huge models and enormous centers of compute. Inferencing, though, is the long tail, and can be processed on far more modest systems.

If you want a good history lesson and to get up to speed quickly on current technologies from ARM, NVIDIA, AMD and Intel, watch this video - lots of interesting nuggets.

I took a snap of the AMD Instinct MI300X specs, as I'm keen to see how it performs compared to NVIDIA's H100 and recently launched H200, as well as Intel's roadmap, to provide some balance to the mostly NVIDIA-centric coverage.


The AI Datacenter - YouTube

Andy Bechtolsheim (co-founder of Sun Microsystems, co-founder of Arista, co-founder of OCP), puts it simply...

"The ultimate limit to scaling performance is power and cooling"

and

"The era of liquid cooled Datacenters has arrived with AI"

Liquid cooling shrinks a Grace Hopper-based DGX solution (256 GPUs) from 24 racks to 2.

This video is a deep dive (and broad area of discussion) into how the industry gets the next 10 to 100x growth, presenting likely paths to achieve those outcomes including 3000W chips!


HPC - The Next Generation - A Sustainable, Energy-Centric, Architecture-First Innovation - YouTube

Allan Cantle from Nallasway conceptualizes a reimagined datacenter - specifically, reimagining the 19" rack (a 100-year-old relic) - and shows how this can liberate hardware and allow for more simplified and sustainable designs. For example, a wall of compute, not a rack, with direct 48V DC power and liquid heat transfer.

Huge opportunity for small players to take this to market, in my opinion.


Grand Teton Systems Overview - YouTube

A look behind (and inside) the scenes of a Meta-produced SKU called Grand Teton, showing the open nature of their work and contributions to OCP. Interestingly, this video didn't cover liquid cooling, but it did discuss the huge engineering effort needed to make the airflow work as optimally as possible.


GenAI at UBER: Scaling Infrastructure - YouTube

Uber demonstrate how they use AI/ML in the app - unsurprisingly, it's EVERYWHERE.

The presenters talk about their usage of GPUs for GenAI, and how they optimize, monitor and operate them. They test on a very modest system (32x A100 GPUs across 8 nodes with 25G Ethernet) and show how models like LLaMA2 can leverage CXL to improve efficiency and reduce memory usage whilst using fewer network resources.

It's interesting to talk to folks about the use cases of AI when many aren't aware that they interact with AI every day, and have for years: Amazon's and Pinterest's recommendation engines, personalized feeds on Facebook, Instagram and Netflix, Siri/Alexa voice control, even spam filters.

Uber's been using machine learning since 2016, deep learning since 2019, and GenAI since 2023.

Accelerated Deployment of GPU-Based Systems Using DMTF Industry Standards - YouTube

OCP has an HGX-based design for a Universal GPU solution, based on NVIDIA's HGX specs, and is working on publishing a 1.0 in 2024. What's great about OCP is that NVIDIA, Microsoft and Google are all represented on stage, working collaboratively to solve a problem (with Meta and AMD joining the workstream too).



NETWORKING

NVIDIA Spectrum-X Network Platform Architecture - YouTube

Ethernet as an alternative to InfiniBand, by the company that will happily sell you InfiniBand.

The image above is very helpful in demonstrating that Ethernet can be (and is) used in HPC/AI, with NVIDIA releasing high-powered Ethernet switches in response to customer demand and resistance to adopting IB. NVIDIA's Spectrum-X supports 8,000 GPUs in a two-tier architecture and can run the open-source SONiC software.

As I've stated many times before, the Ultra Ethernet Consortium seeks to solve IB problems with trusty Ethernet, and I'm hoping to see tangible results next year as expectation continues to grow.



A Next-Gen DPU-Accelerated Petabyte-Scale Storage Solution to Build Future Data-Centric Datacenters

MangoBoost discuss the commercialization of their off-the-shelf DPU (as an alternative to NVIDIA's BlueField DPU), accelerating (or avoiding) the "datacenter tax" of slow interconnects to shared storage devices, using TCP/IP as an alternative to InfiniBand or RoCEv2.


SONiC

There were MANY talks about SONiC, which I'll likely cover in a future article, but the cliff notes from OCP 2023 are that SONiC's market share is expected to explode, the areas of focus are Campus/Edge/DC, Smart Switch acceleration, and SONiC for AI, and, importantly, it's running live production telco workloads today at Orange.

Stay tuned for part two of OCP 2023 tomorrow.
