The GB10 Grace Blackwell: 120 GB of Unified Memory for Local LLMs
The NVIDIA GB10 Grace Blackwell Superchip is a strange beast in the local LLM space. It's a server-grade design, built for the DGX Spark, running in what NVIDIA is positioning as a "personal AI supercomputer." With 120+ GB of unified LPDDR5X memory shared between the Grace ARM CPU and the Blackwell GPU, it sits in a different category from every other machine in our benchmark fleet.
We ran 8,860 benchmark rows on it. Here's what we found.
The Memory Architecture Matters More Than You Think
On the RTX 5090, the GPU has 32 GB of GDDR7. On the Mac mini M4 Pro, the GPU has 64 GB of unified LPDDR5X. On the GB10, the GPU has access to 120+ GB of unified LPDDR5X, nearly four times the RTX 5090's capacity and about twice the Mac mini's.
The practical difference shows up in three places:
- Model capacity: You can run Q8_0 (near-lossless quality) for any model up to ~90B parameters without quantizing further. Qwen3.5-27B-Q8_0 fits with 70+ GB to spare.
- Context length: At 32 concurrent users, you can maintain 130,000-token contexts, something no other machine in our fleet can sustain.
- Throughput at scale: Because the CPU and GPU share one memory pool, weights and KV cache never have to be copied across a PCIe link, so that transfer overhead disappears.
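The capacity and context claims above come down to simple arithmetic, which a rough sketch can make concrete. The bits-per-weight values reflect the GGUF block layouts for Q8_0 and Q4_0 (32 weights plus one fp16 scale per block). The function names are ours, and the layer count, KV-head count, and head dimension in the KV-cache example are hypothetical placeholders, not the real configuration of any Qwen model.

```python
# Bits per weight for two GGUF block formats: each block of 32 weights
# stores one fp16 scale next to the quantized values.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_0": 4.5}

POOL_GB = 120.0  # unified memory pool size quoted in this article


def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate weight footprint in decimal GB."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int = 2) -> int:
    """Keys plus values for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


for params in (27, 90):
    w = weight_gb(params, "Q8_0")
    print(f"{params}B Q8_0: ~{w:.1f} GB of weights, ~{POOL_GB - w:.1f} GB left")

# Hypothetical architecture (NOT a real model config): 64 layers,
# 8 KV heads, head_dim 128, fp16 cache entries.
kv = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, tokens=130_000)
print(f"KV cache for 130,000 tokens: ~{kv / 1e9:.1f} GB")
```

Real GGUF files keep some tensors at higher precision, so actual sizes run slightly above these estimates. Whatever is left after the weights has to cover the KV cache, activations, and the operating system.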
Peak Throughput by Model Size
| Model | Quant | Single User tok/s | 32 User tok/s |
|---|---|---|---|
| Qwen3.5-0.8B | Q4_0 | ~200 | 395.6 |
| Qwen3.5-9B | Q4_K_M | ~80 | ~210 |
| Qwen3.5-27B | Q4_K_M | ~55 | ~150 |
| Qwen3.6-35B-A3B | MXFP4_MOE | 61.9 | 172.7 |
| Qwen3.6-35B-A3B | Q8_0 | 53.6 | 135.4 |
For large models (27B+), the GB10 keeps throughput competitive even at 32 concurrent users — something the RTX 5090 struggles with at context lengths above 32K.
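To read the table in per-user terms, here is a small sketch that uses only the two rows with exact figures. It assumes the "32 User tok/s" column is aggregate throughput summed across all 32 users; the helper name `scaling` is ours.

```python
# (model, quant, single-user tok/s, 32-user tok/s) from the table above
ROWS = [
    ("Qwen3.6-35B-A3B", "MXFP4_MOE", 61.9, 172.7),
    ("Qwen3.6-35B-A3B", "Q8_0", 53.6, 135.4),
]
USERS = 32


def scaling(single_tps: float, multi_tps: float, users: int = USERS):
    """Return (aggregate speedup over one user, tok/s each user sees).

    Assumes multi_tps is the total across all concurrent users.
    """
    return multi_tps / single_tps, multi_tps / users


for model, quant, single, multi in ROWS:
    speedup, per_user = scaling(single, multi)
    print(f"{model} {quant}: {speedup:.2f}x aggregate, "
          f"{per_user:.1f} tok/s per user at {USERS} users")
```

Under that assumption, batching 32 users raises total throughput by roughly 2.5x to 2.8x, while each individual stream slows to a few tokens per second.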
Where the GB10 Loses
Single-user, small-model throughput is not where the GB10 shines. On Qwen3.5-0.8B, it peaks at 395.6 tok/s (the 32-user figure; a single user sees about 200 tok/s), behind the RTX 5060 Ti (768 tok/s) and further behind the RTX 5090.
The reason: decoding is normally bound by the memory bandwidth needed to fetch weights, but a 0.8B model has so few weights to fetch per token that raw CUDA core throughput starts to matter. The GB10's GPU die, while powerful, doesn't have the same density of FP16 CUDA cores as the RTX 5090's full consumer Blackwell die.
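The bandwidth half of that argument can be sanity-checked with a back-of-the-envelope ceiling: if every generated token has to read all the weights once, single-stream decode speed cannot exceed bandwidth divided by weight size. This is a rough sketch, not something our benchmark tool computes. The 273 GB/s figure is an assumed spec-sheet bandwidth for the GB10, not a measurement from this dataset, so substitute your own number.

```python
def decode_ceiling_tps(params_billions: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode tok/s for a dense model.

    Assumes each token reads every weight exactly once and ignores
    KV-cache reads, activations, and kernel launch overhead.
    """
    weight_gb = params_billions * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb


# 0.8B parameters at Q4_0 (4.5 bits per weight in GGUF).
# 273.0 GB/s is an ASSUMED bandwidth figure, not benchmark data.
ceiling = decode_ceiling_tps(0.8, 4.5, bandwidth_gb_s=273.0)
print(f"Bandwidth ceiling: ~{ceiling:.0f} tok/s")
```

When a measured single-user figure sits well below this ceiling, something other than weight bandwidth is limiting throughput, which is consistent with compute and per-token overhead mattering more on very small models.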
The Unified Memory Gotcha
Our early benchmark data from the GB10 had a bug: pynvml (the Python bindings for NVML, NVIDIA's GPU management library) reported mem.total = 0 for this chip. That's because the GB10 is not a discrete GPU with dedicated VRAM. Its memory is unified, and the pool the GPU uses is ordinary system memory managed by the Grace CPU's memory controller, so the NVML device query has no dedicated VRAM to report.
The fix: when pynvml reports zero VRAM, fall back to system RAM as the VRAM budget. We've patched our benchmark tool and re-migrated all historical GB10 data. You'll now see unified_memory: true and gpu_vram_gb: 120.4 on all GB10 rows in the dataset.
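A minimal sketch of that fallback logic follows. This is not the actual patch in our tool: the function names are hypothetical, the output keys mirror the unified_memory and gpu_vram_gb fields mentioned above, and the GiB divisor and the Linux-only sysconf call are assumptions.

```python
import os

GIB = 1024 ** 3  # assumption: sizes are reported in binary gigabytes


def query_nvml_total_bytes(index: int = 0) -> int:
    """Return the total memory NVML reports for a GPU, or 0 if unavailable."""
    try:
        import pynvml
    except ImportError:
        return 0
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return 0
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return int(pynvml.nvmlDeviceGetMemoryInfo(handle).total)
    except pynvml.NVMLError:
        # Some unified-memory parts may raise instead of returning zero.
        return 0
    finally:
        pynvml.nvmlShutdown()


def system_ram_bytes() -> int:
    """Physical RAM via POSIX sysconf (works on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def resolve_vram(nvml_total_bytes: int, ram_bytes: int) -> dict:
    """Use system RAM as the VRAM budget when NVML reports zero."""
    unified = nvml_total_bytes <= 0
    budget = ram_bytes if unified else nvml_total_bytes
    return {"unified_memory": unified, "gpu_vram_gb": round(budget / GIB, 1)}
```

On a benchmark host the two queries feed the resolver directly: resolve_vram(query_nvml_total_bytes(), system_ram_bytes()). One caveat of this simple version is that a machine with no NVIDIA GPU at all would also be flagged as unified, so a real tool should check that a device was detected first.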
Who Needs a GB10?
The honest answer: not most homelabbers. At $3,000+ for the DGX Spark, you're paying a significant premium over a $2,000 RTX 5090. The memory advantage is real but only matters in specific scenarios:
- You need to run 70B+ models locally
- You're hosting inference for 8+ concurrent users with long contexts
- You want to avoid quantization-induced quality loss on 27B models
If you're a single user running 9B or smaller models interactively, the RTX 5060 Ti at $329 delivers a faster experience at a fraction of the cost.
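One way to frame the trade-off is price per gigabyte of model-addressable memory. The sketch below uses only the prices and capacities quoted in this article. Treat it as rough: $3,000 is a floor for the DGX Spark, and the RTX 5090 figure excludes the host PC it needs.

```python
# (price in USD, memory in GB) as quoted in this article
MACHINES = {
    "DGX Spark (GB10)": (3000, 120.4),
    "RTX 5090": (2000, 32.0),
}


def dollars_per_gb(price_usd: float, memory_gb: float) -> float:
    """Purchase price divided by memory the GPU can address."""
    return price_usd / memory_gb


for name, (price, mem) in MACHINES.items():
    print(f"{name}: ${dollars_per_gb(price, mem):.2f} per GB")
```

By this measure the GB10 is the cheaper way to buy capacity, even though it is the more expensive machine outright.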
The GB10 is the right tool when your primary constraint is memory capacity, not purchase price.