The world of large‑language‑model inference moves fast. Meta’s Llama 4 and DeepSeek’s range of models turn yesterday’s “good enough” hardware into today’s bottleneck, so picking the right platform is more strategic than ever.
I compared eight options that keep popping up in various engineering and sales conversations, including consumer RTX GPUs, Apple Silicon, NVIDIA’s H‑series, Groq’s purpose‑built LPU, Cerebras’ wafer‑scale engine, and turnkey DGX workstations.
Each proves valuable in the right niche yet painful in the wrong one.
What matters most?
• Latency targets and peak tokens per second.
• Total cost at the scale you actually expect to hit, not the scale on a slide.
• Memory ceilings that decide whether you spend weekends wrestling with quantization scripts (a quick sizing sketch follows this list).
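As a gut‑check on that last bullet, here is a minimal sizing sketch in Python; the 20 % overhead factor for KV cache and runtime bookkeeping is an assumption, not a measured figure.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: int,
                       overhead: float = 1.2) -> float:
    """Weights-only estimate plus an assumed ~20% for KV cache and runtime overhead."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for params in (8, 32, 70):
    for bits in (16, 8, 4):
        print(f"{params:>3} B @ {bits:>2}-bit ≈ {estimate_memory_gb(params, bits):6.1f} GB")
```

Even at 4‑bit, a 70 B model lands north of 40 GB under these assumptions, which is why the memory ceilings below matter as much as the headline tokens per second.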
Let’s dig in!
Desktop NVIDIA – familiar but fenced in
The RTX 4090 continues to punch above its price on models up to roughly 20 B parameters, clocking ~150 t/s when llama.cpp is tuned well. Once you cross that memory wall, the fun stops quickly; 24 GB simply can’t feed 70 B models without shuttling tensors over PCIe. The next‑gen RTX 5090 should lift the ceiling to 32 GB and add FP4 support, which could roughly halve inference time, but do we really want to bet strategic workloads on a single‑GPU desktop box?
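If you want to feel the memory wall for yourself, a minimal sketch with the llama-cpp-python bindings is below; the GGUF path and layer count are placeholders you would tune to your own checkpoint and VRAM budget.

```python
from llama_cpp import Llama

# Placeholder GGUF path; n_gpu_layers decides how many transformer layers live in
# VRAM -- whatever doesn't fit stays in system RAM and crosses PCIe on every token.
llm = Llama(
    model_path="models/deepseek-70b-q4_k_m.gguf",  # hypothetical quantized checkpoint
    n_gpu_layers=40,   # as many as 24 GB allows; -1 means "offload everything"
    n_ctx=4096,
)

out = llm("Explain the KV cache in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```

The layers left on the CPU are exactly the PCIe shuttle mentioned above, and they are what drags a 70 B run from pleasant to painful.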
Apple M3 Ultra – memory plenty, watts few
Unified memory up to 512 GB is the M3 Ultra’s ace. Early numbers suggest 150–200 t/s on 8 B models and perfectly usable speeds on 32 B once Metal kernels mature. Power draw is a gentle sip next to RTX rigs, so continuous workloads look attractive. The trade‑off is raw GPU grunt: Apple lags NVIDIA’s top tier on embarrassingly parallel kernels. If your success metric is “keep it on the desk, keep it quiet, keep it green,” the Ultra wins hearts.
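For a taste of what that unified memory feels like in practice, here is a minimal sketch assuming the community mlx-lm package; the model repo is a placeholder and the generate() options may differ slightly between releases.

```python
from mlx_lm import load, generate

# Placeholder repo id; any MLX-converted checkpoint that fits in unified memory works.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Summarise unified memory in two sentences.",
    max_tokens=128,
)
print(text)
```

No offload flags, no tensor shuffling: the whole model sits in the same memory pool the CPU uses.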
NVIDIA H200 – the gold standard
A single H200 pushes ~375 t/s on DeepSeek 8B, and eight of them inside a DGX will happily juggle multiple 32 B conversations. FP8 plus TensorRT‑LLM keeps utilisation high, though the bill of materials reminds you these cards trace their lineage to data‑center budgets, not lab PCs.
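Serving on H200s usually goes through TensorRT‑LLM; a minimal sketch assuming its high‑level LLM API looks roughly like this (the model id is a placeholder, and the FP8 checkpoint preparation is omitted).

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder model id; in practice you would point this at an FP8-quantized checkpoint.
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
sampling = SamplingParams(temperature=0.7, top_p=0.95)

outputs = llm.generate(
    ["Draft a two-sentence status update on our inference migration."],
    sampling,
)
print(outputs[0].outputs[0].text)
```

Batching many such requests at once is where the eight‑GPU DGX configuration earns its keep.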
Groq LPU – when milliseconds matter
Groq’s claim to fame is predictable, blazing throughput: ~600 t/s on DeepSeek’s 32 B model and north of 900 t/s on the 8 B. If you’re serving an interactive chat fleet and rock‑bottom latency is a KPI, Groq deserves a pilot. The flip side? Training isn’t a focus, and model availability varies.
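A pilot is mostly an API exercise; here is a minimal sketch using Groq’s Python SDK and its OpenAI‑style chat interface, with a placeholder model id since availability varies.

```python
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# Placeholder model id -- check Groq's current model list before committing.
stream = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",
    messages=[{"role": "user", "content": "Reply with a one-line status check."}],
    stream=True,  # streaming is where the low time-to-first-token shows up
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```

Because the SDK mirrors the OpenAI client, most chat fleets need little more than a model‑name swap to run the experiment.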
Cerebras – wafer scale, warp speed
Cerebras posts the current headline number: ~1500 t/s on DeepSeek 70B. One CS‑3 can host a 20 B model outright; four boxes team up for a 70 B giant. That brute‑force memory bandwidth turns token storms into breezes. Access, however, is typically through their cloud or an on‑prem investment that rivals a regional colo build‑out.
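To put that headline figure in context, a quick back‑of‑the‑envelope using the throughputs quoted in this post shows how a long answer feels on each:

```python
# Seconds to stream a 2,000-token answer at the rates quoted above
# (note the figures refer to different model sizes).
for name, tps in [("RTX 4090", 150), ("H200", 375), ("Groq LPU", 900), ("Cerebras CS-3", 1500)]:
    print(f"{name:<14} {2000 / tps:5.1f} s")
```

Thirteen seconds versus a second and change is the difference between a progress spinner and an instant page.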
DGX Station & Spark – the lab favourite
Think of the DGX Station as NVIDIA’s kitchen‑sink workstation: masses of coherent memory, quiet enough for an office, priced for teams that live in notebooks full of CUDA kernels. Spark sips only 170 W yet still lands respectable numbers on midsize models. Great for iteration; less great if every penny per token is scrutinized.
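For the penny‑per‑token crowd, the electricity side at least is easy to sanity‑check; only the 170 W figure comes from above, while the throughput and power price below are illustrative assumptions.

```python
watts = 170            # Spark power draw quoted above
tokens_per_sec = 100   # assumed midsize-model throughput (placeholder)
usd_per_kwh = 0.30     # assumed electricity price (placeholder)

kwh_per_million_tokens = (watts / 1000) * (1_000_000 / tokens_per_sec) / 3600
print(f"~{kwh_per_million_tokens:.2f} kWh, ~${kwh_per_million_tokens * usd_per_kwh:.2f} per million tokens")
```

At assumptions like these, electricity is pennies per million tokens; hardware amortisation is where the per‑token scrutiny usually lands.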
Running the numbers – rough guide
Below is a normalized view of DeepSeek performance across eight very different silicon accelerators.
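The normalization itself is nothing exotic: each platform’s reported throughput divided by the fastest entry. A sketch using only the figures quoted earlier in this post (the remaining platforms slot in the same way, and the quoted numbers come from different model sizes, so treat the ratios as directional):

```python
# Peak tokens/sec quoted earlier in this post; other platforms would be added
# from the chart's underlying data.
reported_tps = {
    "RTX 4090": 150,
    "Apple M3 Ultra": 200,
    "NVIDIA H200": 375,
    "Groq LPU": 900,
    "Cerebras CS-3": 1500,
}

fastest = max(reported_tps.values())
for platform, tps in sorted(reported_tps.items(), key=lambda kv: -kv[1]):
    print(f"{platform:<15} {tps:>5} t/s  {tps / fastest:4.2f}x")
```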