
AMD MI355X: Strong Node-Level Inference, but Not Yet Rack-Scale


At a glance: 288 GB HBM3E memory · 8 TB/s bandwidth · 1.4 kW power draw

1. Overview

AMD's CDNA 4-based MI350X and MI355X offer clear gains over the MI300 generation. With 288 GB of HBM3E, 8 TB/s of bandwidth, and support for FP4/FP6/FP8, they deliver impressive specs for memory-heavy workloads.

The MI355X is AMD's top-end SKU: 1.4 kW, liquid cooled, and tuned for high sustained throughput. But despite these specs, it's not a competitor to Blackwell at rack scale—at least not yet.


2. MI350X vs MI355X – Performance and Power

Key Insight: The liquid-cooled MI355X draws ~40% more power than the MI350X (1.4 kW vs 1.0 kW) for only a 7–10% uplift in peak TFLOPS.

  • Peak numbers don't tell the full story: sustained throughput is significantly higher on the MI355X thanks to liquid cooling and fewer thermal throttles.
  • It delivers up to 10 PFLOPS of FP4 and 78.6 TFLOPS of FP64, solid for dense inference workloads (a rough perf-per-watt sketch follows below).
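
To put those numbers side by side, here is a minimal perf-per-watt sketch. The MI350X FP4 figure is back-derived from the quoted 7–10% uplift (midpoint ~8%); it is an assumption, not a published spec.

```python
# Back-of-envelope perf-per-watt from the figures above. The MI350X FP4
# number is inferred from the quoted 7-10% uplift (midpoint ~8%); it is
# an assumption, not a published spec.

mi355x = {"power_kw": 1.4, "fp4_pflops": 10.0}
mi350x = {"power_kw": 1.0, "fp4_pflops": 10.0 / 1.08}  # assumed ~8% below MI355X

for name, gpu in (("MI350X", mi350x), ("MI355X", mi355x)):
    print(f"{name}: {gpu['fp4_pflops'] / gpu['power_kw']:.2f} PFLOPS/kW peak")

# MI350X wins on peak efficiency (~9.3 vs ~7.1 PFLOPS/kW); the MI355X's
# case rests on sustained, throttle-free throughput, not peak numbers.
```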

3. Memory Capacity Advantage

Where MI355X truly shines is HBM3E capacity:

  • At 288 GB per GPU, it significantly surpasses the GB200 and B200 (192–225 GB).
  • That makes it ideal for:
    • Multi-model inference (e.g., LLM ensembles or per-tenant serving),
    • Long-context windows (e.g., 64K+ token models),
    • Pinned-memory RAG workflows, where transfer penalties are minimized.

This likely explains early adoption by AWS, GCP, Meta, and Oracle—where inference latency and flexibility drive ROI.
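
To make the capacity argument concrete, here is a minimal KV-cache sizing sketch for the long-context case. The model shape (a 70B-class network with 80 layers, 8 GQA KV heads, head_dim 128) and the FP8 cache format are illustrative assumptions, not vendor figures.

```python
# Rough KV-cache sizing for the 64K-context case above. Model shape and
# FP8 cache precision are illustrative assumptions, not vendor figures.

def kv_cache_gb(layers, kv_heads, head_dim, context_len, batch, bytes_per=1):
    # 2x for keys and values, per layer, per token, per sequence
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per / 1e9

weights_gb = 70  # ~70B parameters at 1 byte (FP8) each
cache_gb = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                       context_len=64_000, batch=16)
print(f"weights {weights_gb} GB + KV cache {cache_gb:.0f} GB "
      f"= {weights_gb + cache_gb:.0f} GB total")

# ~238 GB: inside a single 288 GB MI355X, but over the 192-225 GB
# quoted above for B200/GB200 at the same batch and context length.
```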


4. The Interconnect Bottleneck

But this is where MI355X hits a wall:

  • Infinity Fabric/XGMI provides ~153 GB/s of bidirectional bandwidth per link: essentially overclocked PCIe 5 links arranged in a flat mesh.
  • No switching, no domain expansion—just 8 GPUs per node.
  • NVIDIA's NVLink: up to 900 GB/s, with a switched mesh enabling 72–128 GPU rack-scale deployments (NVL72, GB200).

The gap is especially visible in MoE workloads:

Critical Performance Gap: MI355X collective throughput is up to 18× slower than NVL72 for all-to-all operations.
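
To see what that gap means per step, here is a toy timing model for a single MoE token dispatch, using the bandwidth figures quoted above. The token count and hidden size are illustrative assumptions; note that raw bandwidth explains only part of the quoted 18× figure.

```python
# Toy all-to-all timing for one MoE dispatch step, using the bandwidth
# figures quoted above. Token count (4096) and hidden size (7168, BF16)
# are illustrative assumptions, not measured values.

def dispatch_ms(payload_gb: float, bw_gb_s: float) -> float:
    return payload_gb / bw_gb_s * 1e3

payload_gb = 4096 * 7168 * 2 / 1e9  # ~0.06 GB of activations per GPU

print(f"XGMI   (~153 GB/s): {dispatch_ms(payload_gb, 153):.3f} ms")
print(f"NVLink (~900 GB/s): {dispatch_ms(payload_gb, 900):.3f} ms")

# Raw bandwidth alone gives ~6x; the quoted 18x gap also reflects the
# switched topology and the 72-GPU domain that NVL72 can span.
```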


5. Ethernet NICs and Scale-Out Gaps

MI355X nodes support 400 GbE, matching what NVIDIA ships today.

But the landscape is moving fast:

  • 800 GbE support is coming to NVIDIA NICs later this year.
  • AMD's next-gen NIC ("Vulcano") isn't expected until H2 2026, likely capping MI355X deployments at moderate cluster sizes; the sketch below shows why NIC speed sets that ceiling.
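
A hedged sketch of why NIC speed bounds cluster size: the time for one ring all-reduce of gradients across nodes. The node count, model size, and single-NIC-per-node simplification are all assumptions (real nodes typically bond one NIC per GPU), but the scaling relationship holds.

```python
# Toy model: seconds per full gradient sync over Ethernet using the
# standard ring all-reduce traffic factor of 2*(N-1)/N. Node count,
# model size, and one-NIC-per-node are simplifying assumptions.

def ring_allreduce_s(grad_gb, nodes, nic_gbe):
    nic_gb_s = nic_gbe / 8                          # gigabits -> GB/s
    traffic_gb = 2 * (nodes - 1) / nodes * grad_gb  # ring traffic factor
    return traffic_gb / nic_gb_s

grads_gb = 140  # ~70B parameters in BF16
for gbe in (400, 800):
    t = ring_allreduce_s(grads_gb, nodes=32, nic_gbe=gbe)
    print(f"{gbe} GbE: {t:.1f} s per gradient sync")

# Doubling NIC speed halves sync time, which is why the 800 GbE timeline
# (and AMD's Vulcano gap until H2 2026) matters for scale-out.
```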

6. Looking Ahead: MI400/Helios and MI500

AMD's roadmap offers promise—but not today:

  • MI400/Helios (2026): 72-GPU rack-scale with UALink (~800 GB/s) and true domain coupling.
  • MI500/UAL256 (2027): 256-GPU scale-out via advanced Ethernet mesh.
  • Helios racks will use double-wide OCP full-depth OAM boards—a far cry from the half-depth testbed clusters being deployed today.

7. Software and Ecosystem

To AMD's credit, its developer relations have entered a "wartime" stance:

  • PyTorch CI/CD support is landing (a quick ROCm sanity check follows this list).
  • Compiler tuning and model compatibility are improving.
  • But CUDA/NCCL remains dominant for advanced training and large-scale orchestration.
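
On that PyTorch point: ROCm builds of PyTorch set torch.version.hip and expose AMD GPUs through the familiar torch.cuda namespace, so most CUDA-style code paths run unchanged. This sketch uses only stock PyTorch APIs.

```python
# Check whether this PyTorch build targets ROCm. ROCm wheels report a
# HIP runtime version and surface AMD GPUs via the torch.cuda namespace.

import torch

print("HIP runtime :", torch.version.hip)   # None on CUDA-only builds
print("GPUs visible:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0    :", torch.cuda.get_device_name(0))
```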

8. Final Analysis

Factor                   MI355X                    NVIDIA (GB200/NVL72)
HBM capacity             288 GB                    192–225 GB
Interconnect bandwidth   153 GB/s (XGMI mesh)      900 GB/s (NVLink)
Node scale               8 GPUs                    72–128 GPUs (switched mesh)
Ethernet support         400 GbE                   400 → 800 GbE (H2 2025)
Rack-scale deployment    Not yet                   Mature
Next-gen systems         MI400/Helios in 2026      Already shipping

Strategic Conclusion: MI355X is not a GB200 killer—but it doesn't need to be, yet. It's a compelling inference node for workloads constrained by memory, not bandwidth.

But when it comes to large-model training, MoE, or rack-scale orchestration, the interconnect and NIC bottlenecks remain unsolved—until MI400 and UALink arrive in 2026.

For hyperscalers building clusters today, NVIDIA's vertical stack still wins. But AMD's roadmap has teeth—and this is the most serious challenger we've seen in a decade.