Three GPUs, One Model: Qwen3.5-0.8B Across RTX 5090, GB10, and M4 Pro
The 0.8B parameter model is the lightest in the Qwen3.5 family — small enough to fit on anything, fast enough to stress-test memory bandwidth. We ran it across three very different platforms to see how architecture shapes performance when the model itself isn't the bottleneck.
The Hardware
| Platform | Memory | Architecture | Power Envelope |
|---|---|---|---|
| NVIDIA GeForce RTX 5090 | 31.8 GB dedicated VRAM | Blackwell, CUDA 13.1 | ~450W |
| NVIDIA GB10 (Project DIGITS) | 119.6 GB shared | Grace Blackwell, aarch64 | ~100W |
| Apple M4 Pro | 64 GB unified | Metal 4, arm64 | ~30W |
Three different memory architectures, three different power budgets, three different design philosophies. Let's see what the numbers say.
Single-User Performance
For the interactive chat use case — one person, one conversation — here's how each platform handles 0.8B Q8_0:
| Metric | RTX 5090 | GB10 | M4 Pro |
|---|---|---|---|
| Throughput | 395 tok/s | 172 tok/s | 92 tok/s |
| Avg TTFT | 20 ms | 27 ms | 40 ms |
| Avg ITL | 2.5 ms | 5.7 ms | 10.7 ms |
| p99 TTFT | 31 ms | 50 ms | 127 ms |
| p99 ITL | 3.2 ms | 7.0 ms | 11.2 ms |
At single user, all three platforms are "fast enough" — a 10.7 ms ITL on the M4 Pro means tokens arrive faster than you can read them. But the RTX 5090 is running at 4.3× the throughput of the M4 Pro and 2.3× the GB10.
That 2.5 ms ITL on the RTX 5090 is essentially instantaneous. You'd need specialized equipment to perceive the difference between 2.5 ms and 5.7 ms.
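As a quick sanity check, the ratios quoted above can be recomputed from the single-user table, and each average ITL can be converted into the decode rate it implies. This is only a sketch: the dictionary layout and function names are illustrative, and the numbers are copied by hand from the table.

```python
# Single-user Q8_0 figures copied from the table above.
single_user = {
    "RTX 5090": {"tok_s": 395, "itl_ms": 2.5},
    "GB10": {"tok_s": 172, "itl_ms": 5.7},
    "M4 Pro": {"tok_s": 92, "itl_ms": 10.7},
}


def itl_to_tok_s(itl_ms: float) -> float:
    """Decode rate implied by an average inter-token latency in milliseconds."""
    return 1000.0 / itl_ms


def speedup(fast: str, slow: str) -> float:
    """Ratio of reported single-user throughput between two platforms."""
    return single_user[fast]["tok_s"] / single_user[slow]["tok_s"]


if __name__ == "__main__":
    for name, row in single_user.items():
        implied = itl_to_tok_s(row["itl_ms"])
        print(f"{name}: reported {row['tok_s']} tok/s, ITL implies {implied:.0f} tok/s")
    print(f"RTX 5090 vs M4 Pro: {speedup('RTX 5090', 'M4 Pro'):.1f}x")
    print(f"RTX 5090 vs GB10:   {speedup('RTX 5090', 'GB10'):.1f}x")
```

The ITL-implied rate comes out slightly above the reported throughput on every platform. One plausible reason (an assumption, not something the benchmark states) is that reported throughput also includes time-to-first-token.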
Peak Throughput (Any Configuration)
When we remove the single-user constraint and let each platform find its best concurrency, context, and batch settings:
| Quant | RTX 5090 | GB10 | M4 Pro |
|---|---|---|---|
| BF16 | 730.0 tok/s | 269.9 tok/s | 124.0 tok/s |
| Q8_0 | 777.3 tok/s | 335.4 tok/s | 151.5 tok/s |
| Q4_1 | 808.5 tok/s | 393.0 tok/s | 144.8 tok/s |
| Q4_0 | 788.6 tok/s | 395.6 tok/s | 144.0 tok/s |
| IQ4_NL | 786.2 tok/s | 386.2 tok/s | 146.0 tok/s |
| Q3_K_S | 766.9 tok/s | 385.0 tok/s | 136.0 tok/s |
| Q6_K | 775.4 tok/s | 341.1 tok/s | 134.2 tok/s |
The RTX 5090 peaks at 808.5 tok/s with Q4_1. The GB10 reaches 395.6 tok/s with Q4_0. The M4 Pro maxes out at 151.5 tok/s with Q8_0.
A few things jump out:
- The RTX 5090 is 2× the GB10 and 5× the M4 Pro on peak throughput. Dedicated VRAM bandwidth is king.
- The GB10 benefits more from quantization than the other platforms. Its throughput jumps from 270 tok/s (BF16) to 396 tok/s (Q4_0) — a 47% improvement. The smaller quants reduce memory traffic on the shared bus.
- The M4 Pro gains nothing from going below 8 bits. Q8_0 at 152 tok/s is 22% above BF16 at 124 tok/s, but the 4-bit quants land slightly lower, at 144–146 tok/s. The unified memory bandwidth appears to be the fixed ceiling.
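The percentages in the list above can be reproduced from the peak-throughput table. The sketch below finds each platform's best quant and its gain over BF16. The data structure and helper names are illustrative; the values are transcribed from the table.

```python
# Peak throughput in tok/s, copied from the table above.
peak = {
    "RTX 5090": {"BF16": 730.0, "Q8_0": 777.3, "Q4_1": 808.5, "Q4_0": 788.6,
                 "IQ4_NL": 786.2, "Q3_K_S": 766.9, "Q6_K": 775.4},
    "GB10": {"BF16": 269.9, "Q8_0": 335.4, "Q4_1": 393.0, "Q4_0": 395.6,
             "IQ4_NL": 386.2, "Q3_K_S": 385.0, "Q6_K": 341.1},
    "M4 Pro": {"BF16": 124.0, "Q8_0": 151.5, "Q4_1": 144.8, "Q4_0": 144.0,
               "IQ4_NL": 146.0, "Q3_K_S": 136.0, "Q6_K": 134.2},
}


def best_quant(platform: str) -> str:
    """Name of the quant with the highest peak throughput on a platform."""
    quants = peak[platform]
    return max(quants, key=quants.get)


def gain_over_bf16(platform: str) -> float:
    """Percentage gain of the best quant over BF16 on the same platform."""
    quants = peak[platform]
    return (max(quants.values()) / quants["BF16"] - 1.0) * 100.0


if __name__ == "__main__":
    for name in peak:
        print(f"{name}: best={best_quant(name)}, gain over BF16={gain_over_bf16(name):.1f}%")
```

Seen this way, the RTX 5090 gains the least from quantization (about 11%), the M4 Pro sits in between (about 22%), and the GB10 gains the most (about 47%).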
Concurrency Scaling
This is where the architectures truly diverge. Here's how throughput and TTFT scale as we add concurrent users (0.8B, best quant, 8k context):
Throughput Under Load
| Concurrent Users | RTX 5090 (tok/s) | GB10 (tok/s) | M4 Pro (tok/s) |
|---|---|---|---|
| 1 | ~395 | ~172 | ~93 |
| 4 | ~750+ | ~386+ | ~146 |
| 8 | ~780+ | ~360 | ~144 |
| 32 | ~700 | ~270 | ~152 |
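The RTX 5090 scales beautifully from 1 to 4 users, nearly doubling throughput, and holds that level at 8 before dipping to ~700 tok/s at 32. The GB10 more than doubles through 4 users, then falls back to ~270 tok/s at 32. The M4 Pro reaches its ceiling of roughly 150 tok/s by 4 users and stays flat.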
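Aggregate throughput hides what each individual user experiences. As a rough sketch, dividing the aggregate figure by the number of concurrent users gives an average per-user rate. This assumes the load is shared evenly, which the benchmark does not state, and it treats the "~750+" style entries in the table as plain lower bounds.

```python
# Approximate aggregate tok/s from the table above ("+" entries used as-is).
load = {
    1: {"RTX 5090": 395, "GB10": 172, "M4 Pro": 93},
    4: {"RTX 5090": 750, "GB10": 386, "M4 Pro": 146},
    8: {"RTX 5090": 780, "GB10": 360, "M4 Pro": 144},
    32: {"RTX 5090": 700, "GB10": 270, "M4 Pro": 152},
}


def scaling(platform: str, users: int) -> float:
    """Aggregate throughput at `users` relative to the single-user figure."""
    return load[users][platform] / load[1][platform]


def per_user(platform: str, users: int) -> float:
    """Average tok/s per user, assuming the load is split evenly."""
    return load[users][platform] / users


if __name__ == "__main__":
    for users in load:
        cells = ", ".join(
            f"{name}: {scaling(name, users):.2f}x total, {per_user(name, users):.1f}/user"
            for name in load[users]
        )
        print(f"{users:>2} users -> {cells}")
```

Under that even-split assumption, each of 32 users on the M4 Pro would average under 5 tok/s, against roughly 22 tok/s on the RTX 5090.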
TTFT Under Load
Here's the real differentiator — how long users wait for the first token:
| Concurrent Users | RTX 5090 TTFT | GB10 TTFT | M4 Pro TTFT |
|---|---|---|---|
| 1 | 20 ms | 27 ms | 40 ms |
| 4 | ~115 ms | ~125 ms | ~117 ms |
| 8 | ~1,300 ms | ~2,600 ms | ~6,900 ms |
| 32 | ~5,200 ms | ~24,000 ms | ~42,500 ms |
At 32 users, the M4 Pro's TTFT is over 42 seconds. That's not a typo. The GB10 isn't great either at 24 seconds. The RTX 5090 holds it to roughly 5 seconds, which is still noticeable but short enough that a loading spinner feels reasonable.
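One way to turn the TTFT table into a sizing decision is to pick a latency budget and look up the highest measured concurrency that stays within it. The sketch below does only that: it uses the four measured levels, does not interpolate between them, and the 2-second budget in the usage example is an arbitrary illustration rather than a recommendation from the benchmark.

```python
from typing import Optional

# Approximate average TTFT in milliseconds, copied from the table above.
ttft_ms = {
    "RTX 5090": {1: 20, 4: 115, 8: 1300, 32: 5200},
    "GB10": {1: 27, 4: 125, 8: 2600, 32: 24000},
    "M4 Pro": {1: 40, 4: 117, 8: 6900, 32: 42500},
}


def max_users_within(platform: str, budget_ms: float) -> Optional[int]:
    """Largest measured concurrency whose TTFT fits the budget, or None."""
    ok = [users for users, ttft in ttft_ms[platform].items() if ttft <= budget_ms]
    return max(ok) if ok else None


if __name__ == "__main__":
    budget = 2000  # hypothetical 2-second budget for the first token
    for name in ttft_ms:
        print(f"{name}: up to {max_users_within(name, budget)} users within {budget} ms")
```

With a 2-second budget, the RTX 5090 is the only platform that still qualifies at 8 users. The other two top out at the 4-user measurement.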
Context Length Scaling
For a model this small, context length barely matters on any of the three platforms.
At single user with 0.8B BF16, throughput is essentially flat from 8k to 130k context on the RTX 5090 (393–399 tok/s). Same story on the GB10 (~171 tok/s) and M4 Pro (~92 tok/s). The KV cache for 0.8B is small enough that even 130k tokens fit comfortably in all three memory pools.
This changes with larger models, but for 0.8B, you can set n_ctx=131072 on any platform without meaningful throughput penalty.
The Efficiency Angle
Raw throughput doesn't account for the fact that these platforms draw very different amounts of power:
| Platform | Peak 0.8B tok/s | Approx. Power Draw | Rough tok/s per Watt |
|---|---|---|---|
| RTX 5090 | 808 | ~450W (system) | ~1.8 |
| GB10 | 396 | ~100W (system) | ~4.0 |
| M4 Pro | 152 | ~30W (package) | ~5.1 |
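The M4 Pro delivers 5.1 tok/s per watt, nearly 3× the RTX 5090's efficiency. If you're running inference 24/7 on a small model and paying for electricity, that math matters. The GB10 sits in the middle at 4.0 tok/s per watt, which is impressive given its much larger memory pool. Keep in mind that the M4 Pro figure is package power while the other two are system power, so the comparison is rough and somewhat favors the M4 Pro.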
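The efficiency column is simple division, and the same numbers can be flipped into energy per token, which is sometimes easier to reason about for always-on workloads. This is a sketch using the approximate power figures from the table. The joules-per-token view is an added derivation, not something the benchmark reports.

```python
# (peak tok/s, approximate watts) copied from the table above.
platforms = {
    "RTX 5090": (808, 450),  # system power
    "GB10": (396, 100),      # system power
    "M4 Pro": (152, 30),     # package power, so not directly comparable
}


def tok_per_watt(name: str) -> float:
    """Peak tokens per second divided by approximate power draw."""
    tok_s, watts = platforms[name]
    return tok_s / watts


def joules_per_token(name: str) -> float:
    """Energy per generated token at peak throughput (watts are joules/second)."""
    tok_s, watts = platforms[name]
    return watts / tok_s


if __name__ == "__main__":
    for name in platforms:
        print(f"{name}: {tok_per_watt(name):.1f} tok/s per W, "
              f"{joules_per_token(name):.2f} J per token")
    ratio = tok_per_watt("M4 Pro") / tok_per_watt("RTX 5090")
    print(f"M4 Pro vs RTX 5090 efficiency: {ratio:.1f}x")
```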
Recommendations
Go with the RTX 5090 if: You need to serve multiple concurrent users, you prioritize TTFT under load, or you plan to scale up to larger models later. The 5090 has throughput headroom for 9B+ models that the M4 Pro can't match.
Go with the GB10 if: You want to run 27B models (or larger) without quantization headaches. The 119.6 GB of shared memory is a unique advantage. The throughput is moderate (396 tok/s peak for 0.8B), but it handles models that would choke the RTX 5090's VRAM.
Go with the M4 Pro if: You're running single-user inference on a laptop or desktop and want a silent, energy-efficient experience. 92 tok/s is plenty fast for one person, and you're not paying for a power-hungry GPU when you step away from the keyboard.
For 0.8B specifically, all three platforms are overkill. The model is so small that the real question is whether you need concurrency — and if you do, the RTX 5090 is the only serious option.
All data from Poor Paul's Benchmark running llama-server. Explore the full dataset on the Leaderboard or chart your own comparisons on the Explore page.