TL;DR - 2023 OCP Global Summit

Whether you're a startup seeking to lay a strong tech foundation, a mid-size company aiming to optimize infrastructure, or a large corporation lacking a comprehensive technology strategy, my expertise can help. With broad experience as a technology executive, specializing in sustainable digital infrastructure and scaling out operations, Hume Consulting offers advisory services to design technology strategies, diagnose and resolve operational inefficiencies, and use a data-driven approach to add value across your business.

Let's collaborate to bring focus and strategic depth to your technology endeavors.

I was super thankful to be able to attend Open Compute Project's (OCP) Global Summit in San Jose last year and planned to head down again this year. Unfortunately, a client engagement took priority. However, the great team at OCP has started publishing this year's content online, starting with the kick-off keynote, and there's plenty to be excited about.

I plan to cover topics such as immersion cooling, data center, HPC, and networking, as more workshops, presentations and sessions are released online.

What is OCP?

Open Compute Project is exactly what it says on the tin.

It's a collaboration across the tech community, focused on redesigning technology to support the changing demands on infrastructure. Its roots are at Facebook circa 2009, where a team challenged with supporting an exploding social media platform designed at-scale infrastructure that was more efficient to build and run. This innovation was integral to the founding of the OCP Foundation in 2011.

In a historically proprietary industry, OCP has brought together the heavyweights from across the globe, encouraging open collaboration, community and standards. The board has members from Microsoft, Intel, Meta and Google.


2023 Global Summit review

I've embedded the full keynote below, but there are a few key things that interested me during the presentations.

#INTEL

"AI Training models are 10x YoY" - Zane Ball (Intel)

  • Global DC power usage expected to 2x in 5 years! (2022 to 2027)

  • Intel's Sierra Forest processor, due in 1H 2024, has 288 cores (importantly, efficiency cores, not performance cores) and could power a rack with up to 11,000 cores

My rough math: at 1 processor per RU and at least 400W per processor (it could be higher) on the Intel 3 fabrication process, that is about 15kW for the CPUs alone in the rack. Adding what is currently standard for air cooling, 1 GPU per RU, contributes another 26kW, taking the rack past 40kW.
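The rough math above can be written out as a quick sanity check. This is only a sketch: the 400W per processor and the 26kW GPU total are my estimates from the paragraph above, not vendor specifications, and the variable names are purely illustrative.

```python
# Back-of-the-envelope rack power, using the numbers quoted above.
CORES_PER_RACK = 11_000   # keynote figure (upper bound)
CORES_PER_PROC = 288      # Sierra Forest core count from the keynote
WATTS_PER_PROC = 400      # rough estimate from the text, could be higher
GPU_TOTAL_KW = 26.0       # rough estimate from the text for 1 GPU per RU

# Whole processors that fit under the 11,000-core ceiling.
procs_per_rack = CORES_PER_RACK // CORES_PER_PROC
cpu_kw = procs_per_rack * WATTS_PER_PROC / 1000

# Per-GPU power implied by the 26kW figure (derived here, not a quoted spec).
implied_gpu_watts = GPU_TOTAL_KW * 1000 / procs_per_rack
total_kw = cpu_kw + GPU_TOTAL_KW

print(f"{procs_per_rack} processors, CPU {cpu_kw:.1f}kW, total {total_kw:.1f}kW")
print(f"implied per-GPU power: {implied_gpu_watts:.0f}W")
```

With these inputs the rack lands at 38 processors, which is roughly one per RU in a standard rack, and the total comes out just over 40kW before counting memory, storage, networking or fans.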

I appreciate Zane (and Intel's) openness when talking about alternate cooling technologies like liquid and immersion. Importantly, in 2021 Intel partnered with Lubrizol to warranty their chips in immersion cooling.

Zane also covers an important topic: the practical implementation of smaller, more targeted AI models ("expert models"), for example Meta's LLaMA 2, to get similar outcomes using a fraction of the compute (and power).

#MARVELL

"Network is the new bottleneck"- Loi Nguyen (Marvell)

  • DCs today are ~32MW | New Builds are 1000MW (1GW)

  • This is the capacity of a typical nuclear power plant (100,000 homes)

  • Campus Regions today are ~1GW | New Builds are multi-GW!

  • AI accelerates bandwidth 6x in 3 years, at least!  1.6T interconnects are coming.
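To put those bullets in perspective, here is a small sketch that turns them into ratios. It assumes the 6x bandwidth growth compounds evenly across the three years, which is my simplification and not something stated in the talk.

```python
# Scaling figures quoted in the bullets above.
DC_TODAY_MW = 32     # "~32MW" data center today
DC_NEW_MW = 1000     # 1GW new build
BW_GROWTH = 6        # "6x in 3 years"
BW_YEARS = 3

# How much bigger a new build is than a data center today.
build_ratio = DC_NEW_MW / DC_TODAY_MW

# Equivalent yearly growth factor, assuming even compounding (my assumption).
annual_bw_factor = BW_GROWTH ** (1 / BW_YEARS)

print(f"new builds are ~{build_ratio:.0f}x today's data centers")
print(f"6x in 3 years is ~{annual_bw_factor:.2f}x per year, compounded")
```

In other words, a single new build is roughly thirty of today's data centers, and bandwidth demand is close to doubling every year.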

Even to produce a 32,000 GPU cluster, there is a minimum of 7:1 oversubscription as there isn't a big enough switch to connect without blocking.

One of my favorite presentations!

#BROADCOM

"The network is the compute" - Ram Velaga (Broadcom)

  • Ethernet is an open standard by default, with a large ecosystem

  • RDMA vs UltraEthernet; big improvements like better scaling, selective retransmit, and ease of configuration

  • Basically, every limitation of InfiniBand will be a feature/fix in UltraEthernet

Ethernet continues to evolve: even NVIDIA, who push InfiniBand, are creating their own high-speed (51.2T) Ethernet switch, given customer need.

I'm very excited for UltraEthernet - you can find out more here -> Ultra Ethernet Consortium

#PROMERSION

(the immersion project is the) "largest project in Open Compute...is no longer hypothetical" - Rolf Brink (Promersion)

  • Door HX, Cold Plate and Immersion are the three streams of next-gen cooling

  • Even storage is pushing power limits, current OCP Storage chassis at 2kW

  • Liquid, in any form, is here TODAY

  • Cold plate is becoming commodity, immersion is still emerging

Rolf gives a great overview of the cooling industry and presents a realistic position: no single cooling solution will be the only solution, and all datacenters will be dealing with some form of liquid cooling in the next 5 years.

#META

"...we are very, very far from having a single solution that could work for all of these different kinds of workloads" - Dan Rabinovitsj (Meta)

  • AI is pushing every infrastructure boundary

  • Optimizing infrastructure for 1-2 workloads means compromising others

This is by far my favorite chart during the keynote.

Whilst every chart was a hockey stick showing growth in spend, power, GPUs, heat, etc., Dan visualized that a one-size-fits-all strategy, at least for AI, is incredibly hard, expensive, and ultimately inefficient.

Wrap-up

OCP is a significant initiative that has reshaped the tech community's approach to infrastructure. Its collaboration and emphasis on open standards have drawn in major industry players, fostering innovation and cooperation.

Intel's focus on AI training models and the growth in global data center power usage highlights a noteworthy trend. Marvell's insights into the challenges of network capacity in data centers are also crucial, and Broadcom's evolution of Ethernet standards is an interesting move. The immersion cooling project's growth and the practicality of different cooling solutions, as presented by Promersion, are important in the context of energy-efficient data centers. Lastly, Meta's emphasis on the complexity of infrastructure optimization for AI workloads is a compelling perspective.



Have questions about digital infrastructure, future trends or AI? I'd love to connect and help.
