OCP 2023 Wrap (2 of 2)

If you are looking for the first half, check out my OCP 2023 Wrap (1 of 2) on LinkedIn.

For the second (and final) half, I'm concentrating on Immersion + Sustainability (which go hand-in-hand), and Disaggregation.

IMMERSION

Advancing Immersion Cooling: Project Updates, Industry Trends and Consensus Glossary - YouTube

If you are looking for updates on the BIGGEST workstream across OCP, get it from the trusted sources Rolf Brink and Rich Lappenbusch. As both cold plate and immersion become mainstream, and one could argue cold plate is there already, there are still mountains of work ahead.

Immersion Fluid Specification - YouTube

Jessica Gullbrand is a co-chair of the OCP Steering Committee and discusses the work being done to standardize specifications of immersion fluids. At OCP 2022, Intel announced, in collaboration with Lubrizol, that it would warranty its CPUs in immersion, marking an important milestone and validating the use of immersion solutions to reject heat from compute.

Without standards, every team does their own R&D, swimming by themselves in the ocean, testing different oils, getting different results, and wading through new roadblocks each time something changes. There's so much work that has gone into this, so congrats to Jessica and the folks in this workstream. Immersion is such a radical departure from 'how we have always done it' that material compatibility needs real consideration: thermal paste (which transfers heat from a processor to a heatsink) shouldn't go into a tank, and labels and stickers on equipment, cables, etc. will easily peel off and end up in your filters, risking contamination of the oil.

Love this.

Investigation of immersion cooling for data center hardware: progress, opportunities and challenges - YouTube

Cheng and Jayati from Meta demonstrate the advantages of immersion cooling, using OCP servers.  And this image below shows why.

And showing why the Immersion Fluid standard is so important, check out what happens to the fluid when materials have limited compatibility...


SUSTAINABILITY

Sustainability in Data Centers - Why Now?

Data centers use more power than the airline industry, yet it's rarely talked about.

I mean, you don't get to see and adjust your carbon emissions when you configure your AWS EC2 instances the way you do when you search on Google Flights.

Shruti and Frances present the claims that Microsoft, Google and others have made publicly; however, since we're basically in 2024, time is running out fast for companies that have committed to net-zero (or negative) emissions by 2030, just six years away. The biggest challenge is the scale of scope 3 emissions.

Personally, I'm not optimistic that the necessary design changes will land within a 6-year window for data centers, nor a 4-year window for hardware.

Sustainability Opportunities for Cloud Storage

I believe a huge focus for sustainability in data centers is redesigning the way they operate, primarily changing from inefficient air cooling to liquid so we can continue to grow without wasting so much power.  The heat reuse projects are important, creating a value chain.  Further, reimagining racks and chassis can also usher in large optimizations of power, heat and physical footprint, and I believe this is tethered to the liquid cooling movement.

In the second half of the presentation, one suggestion is to design hardware for a longer lifespan, which immersion-cooled solutions do natively.

Beyond PUE, an infrastructure metric for sustainability

PUE has long been held up as the metric for measuring data center efficiency: total facility energy divided by the energy used by the IT equipment, which captures how much extra power (mostly cooling) is needed on top of the compute itself.  Air-cooled facilities bounce around 1.2-1.5, while liquid and immersion cooling land circa 1.03 to 1.08.
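
To make the arithmetic concrete, here is a minimal sketch with made-up numbers (my own illustration, not from the talk):

    # PUE = total facility energy / IT equipment energy (hypothetical figures)
    def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
        return total_facility_kwh / it_equipment_kwh

    print(pue(13_500, 10_000))   # ~1.35: air-cooled, ~35% overhead for cooling and power losses
    print(pue(10_500, 10_000))   # ~1.05: liquid/immersion, only ~5% overhead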

Harish discusses applying financial methodology (depreciation) to IT assets and the embedded carbon associated with them.  A new metric I'm seeing for the first time is HUE (Hardware Usage Effectiveness): Total IT Power / Information Processing Power.
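
As a toy illustration (my numbers, not Harish's), HUE works the same way with a different denominator:

    # HUE = total IT power / information processing power (hypothetical figures)
    def hue(total_it_kw: float, information_processing_kw: float) -> float:
        return total_it_kw / information_processing_kw

    # If fans, PSU losses and idle overhead consume 2 MW of a 10 MW IT load,
    # only 8 MW actually goes into information processing:
    print(hue(10_000, 8_000))   # 1.25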

Very interesting, though I would have liked to have seen a more real-world example comparing the new measurements to PUE and other metrics.

PUE - Issues and Opportunities, and DCEM workstream work for enhancements

Extending from the previous video, Murugasamy (Intel) suggests many other ways to measure sustainability in the data center.

  • IUE (Infrastructure Utilization Efficiency)

  • NUE (Network Utilization Efficiency)

  • pPUE (Partial PUE)

Although, like current PUE measurements, these are difficult to compare as the devil is in the detail: what is in and out of scope?

You can game your PUE metric, for example, by cranking up water usage.
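
A rough, entirely hypothetical illustration of that gaming: swap electric chillers for evaporative cooling and PUE improves, even though water consumption balloons.

    # Hypothetical 10 MW IT load; trading electricity for water flatters PUE
    it_load_mw = 10.0
    chiller_cooling_mw = 4.0          # electric chillers, little water used
    evaporative_cooling_mw = 0.5      # evaporative towers, lots of water used

    pue_chillers = (it_load_mw + chiller_cooling_mw) / it_load_mw         # 1.40
    pue_evaporative = (it_load_mw + evaporative_cooling_mw) / it_load_mw  # 1.05
    # PUE looks far better, while WUE (litres per kWh of IT energy) gets far worse.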

Another complexity is power: there are losses at each conversion between AC and DC, and by delivering DC directly to the rack, efficiency can improve by more than 10%.  Today's PUE measurement wouldn't surface that improvement.
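
A back-of-envelope sketch of why PUE can't see this, using assumed conversion efficiencies (my assumptions, not figures from the talk): the conversion losses sit inside the "IT energy" meter, so eliminating them doesn't move the ratio.

    # Hypothetical: 1000 kW metered at the rack as "IT energy" in both cases
    rack_power_kw = 1000.0
    ac_path_efficiency = 0.88   # assumed per-server AC-to-DC supplies inside the rack
    dc_path_efficiency = 0.98   # assumed direct DC feed to a shared busbar

    useful_ac = rack_power_kw * ac_path_efficiency   # 880 kW reaching the boards
    useful_dc = rack_power_kw * dc_path_efficiency   # 980 kW reaching the boards
    print((useful_dc - useful_ac) / useful_ac)       # ~0.11, i.e. >10% more useful power
    # PUE is identical in both cases, because the full 1000 kW counts as IT load either way.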

Current WUE (Water Usage Effectiveness) doesn't distinguish brown (non-potable) water from drinking water, nor does it account for one-time liquid assets like immersion oil or glycol.

DC Sustainability: Liquid Cooling for Optical Networking Equipment

Behzad and Peter from Ciena cover liquid cooling, dropping cold plates directly onto the ASICs in their switches as power usage, and the heat produced, rise dramatically.

ASICs above 800 W can no longer be cooled sustainably with air, but cold plates can handle 2-2.5x that (1500-2300 W).

What's even more interesting is that pluggable optical modules are getting in on the action!  Air cooling gets challenging at 25 W, and cold plates can extend this to 33-55 W.  QSFP-DD (1.6 Tb/s) modules will pull 30-40 W. Next-gen designs remove the heatsink and integrate a cold plate on each plug, getting to 55-85 W, 2-3x the cooling capacity of air.

Make no mistake, we will see more and more networks with cold plate, and all-in with immersion - yes, switches in tanks.

DISAGGREGATION

AI/HPC: The Future of AI/ML Innovation is Row-Scale Disaggregation

Matthew Williams discusses Rockport Networks' (now Cerio) view of decoupling compute resources from the confines of the physical chassis.

GPUs (and other cards) are decoupled from the server chassis and mounted on a shelf of cards (called a SHFL) connected over a passive optical interconnect.  It's essentially a large chassis with only PCIe interfaces: you can mix and match, as well as upgrade, without having to re-design the architecture.  They leverage a PCIe switch to interconnect everything, almost like what RAID arrays did for storage back in the day.
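
A toy sketch of the composability idea (entirely my own illustration in Python, not Cerio's software or API): treat the SHFL as a pool of PCIe devices that can be attached to hosts on demand.

    # Toy model of composing GPUs from a disaggregated shelf onto hosts.
    # Purely illustrative - not Cerio's actual software.
    from dataclasses import dataclass, field

    @dataclass
    class Shelf:
        gpus: list[str]                                          # device IDs in the SHFL
        assigned: dict[str, str] = field(default_factory=dict)   # gpu -> host

        def compose(self, host: str, count: int) -> list[str]:
            free = [g for g in self.gpus if g not in self.assigned]
            if len(free) < count:
                raise RuntimeError("not enough free GPUs in the shelf")
            chosen = free[:count]
            for g in chosen:
                self.assigned[g] = host   # logically attach over the PCIe fabric
            return chosen

    shelf = Shelf(gpus=[f"gpu{i}" for i in range(8)])
    print(shelf.compose("host-a", 2))   # ['gpu0', 'gpu1']
    print(shelf.compose("host-b", 4))   # ['gpu2', 'gpu3', 'gpu4', 'gpu5']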

Very interesting. I love the re-purposing of PCIe, and my first thought was latency; funnily enough, that was the very first question asked in the Q&A, and according to Matthew, the impact of disaggregation is <1%.

My other concern is around pooling of resources: essentially it's a version of NVLink and NVSwitch, so accessing those resources optimally is a big challenge.  Super interested to watch this play out.

Broadcom Accelerates PCIe/CXL Roadmap to Enable the Open AI Ecosystem

Another CXL discussion and announcement: essentially Broadcom talking about its interest and commitment (a roadmap) to investing in PCIe switches, tracking from the current PCIe 5 to PCIe 6 next year, PCIe 7 a year later (aggressive), and beyond.

Scale-up and Scale-out challenges for disaggregated infrastructure

A very comprehensive overview (oxymoron?) by Siamak of what it takes to build a 4 TFLOP system - no, it's not just GPUs.  Interconnects, latency, bandwidth, and memory all matter, as does the configuration of each to ensure they work optimally.

A great comparison showing different ways to achieve high bandwidth, for example that CXL over next-gen PCIe 6 will offer slightly more bandwidth (1024 GB/s) than NVLink provides today (900 GB/s). One might assume that NVIDIA continues to add 300 GB/s with each generation of NVLink, and will do so next year when it brings Blackwell (Hopper Next) to market, which will leapfrog this again.
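
A hedged back-of-envelope on where those headline figures come from; the link counts are my own assumptions about how the numbers were aggregated, not something stated in the talk:

    # Rough arithmetic behind the bandwidth comparison (assumption-laden)
    pcie6_gts_per_lane = 64                 # PCIe 6.0 raw signalling rate per lane
    pcie6_x16_gbs_per_dir = 16 * 8          # ~128 GB/s per direction for a x16 link
    print(4 * pcie6_x16_gbs_per_dir * 2)    # 1024 GB/s if four x16 links are counted bidirectionally

    nvlink4_gbs_per_link = 50               # bidirectional, per link, on Hopper
    print(18 * nvlink4_gbs_per_link)        # 900 GB/s per H100 GPU today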

SUMMARY

There is so much more to OCP than what I've covered here - I would highly encourage an in-person experience, and if it really floats your boat, get involved in the community contributions. Lots of smart, passionate folks rowing in the same direction. The next conference is the Regional Summit in Lisbon in April.

2024 OCP Regional Summit » Open Compute Project
