
AMD MI355X: Strong Node-Level Inference, but Not Yet Rack-Scale


At a glance: 288 GB HBM3E memory · 8 TB/s bandwidth · 1.4 kW power draw

1. Overview

AMD's CDNA 4-based MI350X and MI355X offer clear gains over the MI300 generation. With 288 GB of HBM3E, 8 TB/s of bandwidth, and support for FP4/FP6/FP8, they deliver impressive specs for memory-heavy workloads.

The MI355X is AMD's top-end SKU: 1.4 kW, liquid cooled, and tuned for high sustained throughput. But despite these specs, it's not a competitor to Blackwell at rack scale—at least not yet.


2. MI350X vs MI355X – Performance and Power

Key Insight: The liquid-cooled MI355X draws ~40% more power than the MI350X (1.4 kW vs 1.0 kW) for only a 7–10% uplift in peak TFLOPS.

  • Peak numbers don't tell the full story: sustained throughput is significantly higher on the MI355X thanks to liquid cooling and fewer thermal throttles.
  • It delivers up to 10 PFLOPS of FP4 and 78.6 TFLOPS of FP64, solid for dense inference workloads (a rough perf-per-watt sketch follows below).
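
To put those numbers side by side, here is a minimal perf-per-watt sketch. The MI350X FP4 figure is back-derived from the quoted 7–10% uplift (midpoint ~8%); it is an assumption, not a published spec.

```python
# Back-of-envelope perf-per-watt from the figures above. The MI350X FP4
# number is inferred from the quoted 7-10% uplift (midpoint ~8%); it is
# an assumption, not a published spec.

mi355x = {"power_kw": 1.4, "fp4_pflops": 10.0}
mi350x = {"power_kw": 1.0, "fp4_pflops": 10.0 / 1.08}  # assumed ~8% below MI355X

for name, gpu in (("MI350X", mi350x), ("MI355X", mi355x)):
    print(f"{name}: {gpu['fp4_pflops'] / gpu['power_kw']:.2f} PFLOPS/kW peak")

# MI350X wins on peak efficiency (~9.3 vs ~7.1 PFLOPS/kW); the MI355X's
# case rests on sustained, throttle-free throughput, not peak numbers.
```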

3. Memory Capacity Advantage

Where MI355X truly shines is HBM3E capacity:

  • At 288 GB per GPU, it significantly surpasses the GB200 and B200 (192–225 GB).
  • That makes it ideal for:
    • Multi-model inference (e.g., LLM ensembles or per-tenant serving),
    • Long-context windows (e.g., 64K+ token models),
    • Pinned-memory RAG workflows, where transfer penalties are minimized.

This likely explains early adoption by AWS, GCP, Meta, and Oracle—where inference latency and flexibility drive ROI.
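
To make the capacity argument concrete, here is a minimal KV-cache sizing sketch for the long-context case. The model shape (a 70B-class network with 80 layers, 8 GQA KV heads, head_dim 128) and the FP8 cache format are illustrative assumptions, not vendor figures.

```python
# Rough KV-cache sizing for the 64K-context case above. Model shape and
# FP8 cache precision are illustrative assumptions, not vendor figures.

def kv_cache_gb(layers, kv_heads, head_dim, context_len, batch, bytes_per=1):
    # 2x for keys and values, per layer, per token, per sequence
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per / 1e9

weights_gb = 70  # ~70B parameters at 1 byte (FP8) each
cache_gb = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                       context_len=64_000, batch=16)
print(f"weights {weights_gb} GB + KV cache {cache_gb:.0f} GB "
      f"= {weights_gb + cache_gb:.0f} GB total")

# ~238 GB: inside a single 288 GB MI355X, but over the 192-225 GB
# quoted above for B200/GB200 at the same batch and context length.
```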


4. The Interconnect Bottleneck

But this is where MI355X hits a wall:

  • Infinity Fabric/XGMI provides ~153 GB/s of bidirectional bandwidth per link: essentially overclocked PCIe 5 links arranged in a flat mesh.
  • No switching, no domain expansion—just 8 GPUs per node.
  • NVIDIA's NVLink: up to 900 GB/s, with a switched mesh enabling 72–128 GPU rack-scale deployments (NVL72, GB200).

The gap is especially visible in MoE workloads:

Critical Performance Gap: MI355X collective throughput is up to 18× slower than NVL72 for all-to-all operations.
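
To see what that gap means per step, here is a toy timing model for a single MoE token dispatch, using the bandwidth figures quoted above. The token count and hidden size are illustrative assumptions; note that raw bandwidth explains only part of the quoted 18× figure.

```python
# Toy all-to-all timing for one MoE dispatch step, using the bandwidth
# figures quoted above. Token count (4096) and hidden size (7168, BF16)
# are illustrative assumptions, not measured values.

def dispatch_ms(payload_gb: float, bw_gb_s: float) -> float:
    return payload_gb / bw_gb_s * 1e3

payload_gb = 4096 * 7168 * 2 / 1e9  # ~0.06 GB of activations per GPU

print(f"XGMI   (~153 GB/s): {dispatch_ms(payload_gb, 153):.3f} ms")
print(f"NVLink (~900 GB/s): {dispatch_ms(payload_gb, 900):.3f} ms")

# Raw bandwidth alone gives ~6x; the quoted 18x gap also reflects the
# switched topology and the 72-GPU domain that NVL72 can span.
```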


5. Ethernet NICs and Scale-Out Gaps

MI355X nodes support 400 GbE, matching what NVIDIA ships today.

But the landscape is moving fast:

  • 800 GbE support is coming to NVIDIA NICs later this year.
  • AMD's next-gen NIC ("Vulcano") isn't expected until H2 2026, likely capping MI355X deployments at moderate cluster sizes; the sketch below shows why NIC speed sets that ceiling.
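
A hedged sketch of why NIC speed bounds cluster size: the time for one ring all-reduce of gradients across nodes. The node count, model size, and single-NIC-per-node simplification are all assumptions (real nodes typically bond one NIC per GPU), but the scaling relationship holds.

```python
# Toy model: seconds per full gradient sync over Ethernet using the
# standard ring all-reduce traffic factor of 2*(N-1)/N. Node count,
# model size, and one-NIC-per-node are simplifying assumptions.

def ring_allreduce_s(grad_gb, nodes, nic_gbe):
    nic_gb_s = nic_gbe / 8                          # gigabits -> GB/s
    traffic_gb = 2 * (nodes - 1) / nodes * grad_gb  # ring traffic factor
    return traffic_gb / nic_gb_s

grads_gb = 140  # ~70B parameters in BF16
for gbe in (400, 800):
    t = ring_allreduce_s(grads_gb, nodes=32, nic_gbe=gbe)
    print(f"{gbe} GbE: {t:.1f} s per gradient sync")

# Doubling NIC speed halves sync time, which is why the 800 GbE timeline
# (and AMD's Vulcano gap until H2 2026) matters for scale-out.
```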

6. Looking Ahead: MI400/Helios and MI500

AMD's roadmap offers promise—but not today:

  • MI400/Helios (2026): 72-GPU rack-scale with UALink (~800 GB/s) and true domain coupling.
  • MI500/UAL256 (2027): 256-GPU scale-out via advanced Ethernet mesh.
  • Helios racks will use double-wide OCP full-depth OAM boards—a far cry from the half-depth testbed clusters being deployed today.

7. Software and Ecosystem

To AMD's credit, its developer relations have entered a "wartime" stance:

  • PyTorch CI/CD support is landing (a quick ROCm sanity check follows this list).
  • Compiler tuning and model compatibility are improving.
  • But CUDA/NCCL remains dominant for advanced training and large-scale orchestration.
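
On that PyTorch point: ROCm builds of PyTorch set torch.version.hip and expose AMD GPUs through the familiar torch.cuda namespace, so most CUDA-style code paths run unchanged. This sketch uses only stock PyTorch APIs.

```python
# Check whether this PyTorch build targets ROCm. ROCm wheels report a
# HIP runtime version and surface AMD GPUs via the torch.cuda namespace.

import torch

print("HIP runtime :", torch.version.hip)   # None on CUDA-only builds
print("GPUs visible:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0    :", torch.cuda.get_device_name(0))
```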

8. Final Analysis

Factor                   MI355X                    NVIDIA (GB200/NVL72)
HBM capacity             288 GB                    192–225 GB
Interconnect bandwidth   153 GB/s (XGMI mesh)      900 GB/s (NVLink)
Node scale               8 GPUs                    72–128 GPUs (switched mesh)
Ethernet support         400 GbE                   400 → 800 GbE (H2 2025)
Rack-scale deployment    Not yet                   Mature
Next-gen systems         MI400/Helios in 2026      Already shipping

Strategic Conclusion: MI355X is not a GB200 killer—but it doesn't need to be, yet. It's a compelling inference node for workloads constrained by memory, not bandwidth.

But when it comes to large-model training, MoE, or rack-scale orchestration, the interconnect and NIC bottlenecks remain unsolved—until MI400 and UALink arrive in 2026.

For hyperscalers building clusters today, NVIDIA's vertical stack still wins. But AMD's roadmap has teeth—and this is the most serious challenger we've seen in a decade.