Three GPUs, One Model: Qwen3.5-0.8B Across RTX 5090, GB10, and M4 Pro

The 0.8B parameter model is the lightest in the Qwen3.5 family — small enough to fit on anything, fast enough to stress-test memory bandwidth. We ran it across three very different platforms to see how architecture shapes performance when the model itself isn't the bottleneck.

The Hardware

Platform	Memory	Architecture	Power Envelope
NVIDIA GeForce RTX 5090	31.8 GB dedicated VRAM	Blackwell, CUDA 13.1	~450W
NVIDIA GB10 (Project DIGITS)	119.6 GB shared	Grace Blackwell, aarch64	~100W
Apple M4 Pro	64 GB unified	Metal 4, arm64	~30W

Three different memory architectures, three different power budgets, three different design philosophies. Let's see what the numbers say.

Single-User Performance

For the interactive chat use case — one person, one conversation — here's how each platform handles 0.8B Q8_0:

Metric	RTX 5090	GB10	M4 Pro
Throughput	395 tok/s	172 tok/s	92 tok/s
Avg TTFT	20 ms	27 ms	40 ms
Avg ITL	2.5 ms	5.7 ms	10.7 ms
p99 TTFT	31 ms	50 ms	127 ms
p99 ITL	3.2 ms	7.0 ms	11.2 ms

With a single user, all three platforms are "fast enough": a 10.7 ms ITL on the M4 Pro means tokens arrive faster than you can read them. Still, the RTX 5090 delivers 4.3× the throughput of the M4 Pro and 2.3× that of the GB10.
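As a quick sanity check on the table, the average ITL should roughly predict single-stream throughput, since a stream that emits one token every N milliseconds produces about 1000/N tokens per second. The sketch below only re-derives numbers from the table above; the helper names are made up for illustration.

```python
# Single-user Q8_0 figures copied from the table above.
single_user = {
    "RTX 5090": {"tok_s": 395, "itl_ms": 2.5},
    "GB10": {"tok_s": 172, "itl_ms": 5.7},
    "M4 Pro": {"tok_s": 92, "itl_ms": 10.7},
}


def tok_s_from_itl(itl_ms: float) -> float:
    """Decode rate implied by an average inter-token latency in milliseconds."""
    return 1000.0 / itl_ms


def speedup(a: str, b: str) -> float:
    """Ratio of measured single-user throughput, platform a over platform b."""
    return single_user[a]["tok_s"] / single_user[b]["tok_s"]


for name, row in single_user.items():
    implied = tok_s_from_itl(row["itl_ms"])
    print(f"{name}: measured {row['tok_s']} tok/s, implied by ITL {implied:.0f} tok/s")

print(f"RTX 5090 vs M4 Pro: {speedup('RTX 5090', 'M4 Pro'):.1f}x")
print(f"RTX 5090 vs GB10:   {speedup('RTX 5090', 'GB10'):.1f}x")
```

The implied rates land within a few tok/s of the measured ones, which is what you'd expect when time-to-first-token is a small share of each response.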

That 2.5 ms ITL on the RTX 5090 is essentially instantaneous. You'd need specialized equipment to perceive the difference between 2.5 ms and 5.7 ms.

Peak Throughput (Any Configuration)

When we remove the single-user constraint and let each platform find its best concurrency, context, and batch settings:

Quant	RTX 5090	GB10	M4 Pro
BF16	730.0 tok/s	269.9 tok/s	124.0 tok/s
Q8_0	777.3 tok/s	335.4 tok/s	151.5 tok/s
Q4_1	808.5 tok/s	393.0 tok/s	144.8 tok/s
Q4_0	788.6 tok/s	395.6 tok/s	144.0 tok/s
IQ4_NL	786.2 tok/s	386.2 tok/s	146.0 tok/s
Q3_K_S	766.9 tok/s	385.0 tok/s	136.0 tok/s
Q6_K	775.4 tok/s	341.1 tok/s	134.2 tok/s

The RTX 5090 peaks at 808.5 tok/s with Q4_1. The GB10 reaches 395.6 tok/s with Q4_0. The M4 Pro maxes out at 151.5 tok/s with Q8_0.

A few things jump out:

The RTX 5090 delivers roughly 2× the GB10's peak throughput and more than 5× the M4 Pro's. Dedicated VRAM bandwidth is king.
The GB10 benefits more from quantization than the other platforms. Its throughput jumps from 270 tok/s (BF16) to 396 tok/s (Q4_0) — a 47% improvement. The smaller quants reduce memory traffic on the shared bus.
The M4 Pro gains nothing from going below 8 bits. BF16 at 124 tok/s vs Q8_0 at 152 tok/s is a 22% spread, and every smaller quant lands below Q8_0. The unified memory bandwidth appears to be the fixed ceiling.
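The percentages above are easy to reproduce. Here's a minimal sketch that takes the peak-throughput table as a dictionary, finds each platform's best quant, and computes the gain over BF16. The function names are just for illustration.

```python
# Peak throughput (tok/s) copied from the table above.
peak = {
    "RTX 5090": {"BF16": 730.0, "Q8_0": 777.3, "Q4_1": 808.5, "Q4_0": 788.6,
                 "IQ4_NL": 786.2, "Q3_K_S": 766.9, "Q6_K": 775.4},
    "GB10": {"BF16": 269.9, "Q8_0": 335.4, "Q4_1": 393.0, "Q4_0": 395.6,
             "IQ4_NL": 386.2, "Q3_K_S": 385.0, "Q6_K": 341.1},
    "M4 Pro": {"BF16": 124.0, "Q8_0": 151.5, "Q4_1": 144.8, "Q4_0": 144.0,
               "IQ4_NL": 146.0, "Q3_K_S": 136.0, "Q6_K": 134.2},
}


def best_quant(platform: str) -> tuple[str, float]:
    """Return (quant name, tok/s) for the fastest quant on a platform."""
    return max(peak[platform].items(), key=lambda item: item[1])


def gain_over_bf16(platform: str) -> float:
    """Percent improvement of the fastest quant over BF16."""
    _, best = best_quant(platform)
    return (best / peak[platform]["BF16"] - 1.0) * 100.0


for platform in peak:
    quant, tok_s = best_quant(platform)
    print(f"{platform}: best {quant} at {tok_s} tok/s, "
          f"{gain_over_bf16(platform):.0f}% over BF16")
```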

Concurrency Scaling

This is where the architectures truly diverge. Here's how throughput and TTFT scale as we add concurrent users (0.8B, best quant, 8k context):

Throughput Under Load

Concurrent Users	RTX 5090 (tok/s)	GB10 (tok/s)	M4 Pro (tok/s)
1	~395	~172	~93
4	~750+	~386+	~146
8	~780+	~360	~144
32	~700	~270	~152

The RTX 5090 scales beautifully from 1 to 4 users, nearly doubling throughput. The GB10 more than doubles over the same range (172 to 386 tok/s), though both NVIDIA platforms give some of that back by 32 users. The M4 Pro? It reaches its ceiling almost immediately and stays flat.
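Aggregate throughput hides what each individual user experiences. A rough sketch, using the approximate values from the table above (with the "~" and "+" markers dropped, so treat the outputs as ballpark figures), computes the scaling factor relative to one user and the share of throughput each user would get if it were split evenly:

```python
# Approximate aggregate throughput (tok/s) from the table above.
users = [1, 4, 8, 32]
throughput = {
    "RTX 5090": [395, 750, 780, 700],
    "GB10": [172, 386, 360, 270],
    "M4 Pro": [93, 146, 144, 152],
}


def scaling(platform: str, n: int) -> float:
    """Aggregate throughput at n users relative to the single-user figure."""
    row = throughput[platform]
    return row[users.index(n)] / row[0]


def per_user(platform: str, n: int) -> float:
    """Even split of aggregate throughput across n users (an assumption)."""
    return throughput[platform][users.index(n)] / n


for platform in throughput:
    for n in users:
        print(f"{platform} @ {n:>2} users: {scaling(platform, n):.2f}x aggregate, "
              f"{per_user(platform, n):.1f} tok/s per user")
```

The even split is a simplification, since a real server may not schedule requests fairly, but it shows why flat aggregate throughput on the M4 Pro means each added user directly shrinks everyone's share.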

TTFT Under Load

Here's the real differentiator — how long users wait for the first token:

Concurrent Users	RTX 5090 TTFT	GB10 TTFT	M4 Pro TTFT
1	20 ms	27 ms	40 ms
4	~115 ms	~125 ms	~117 ms
8	~1,300 ms	~2,600 ms	~6,900 ms
32	~5,200 ms	~24,000 ms	~42,500 ms

At 32 users, the M4 Pro's TTFT is roughly 42 seconds. That's not a typo. The GB10 isn't great either at 24 seconds. The RTX 5090 keeps it at roughly 5 seconds, which is still noticeable but at least in the range where a loading spinner in the UI feels reasonable.
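One way to turn this table into a sizing decision is to pick a TTFT budget and ask how many users each platform can carry within it. The sketch below does that over the tested concurrency levels only. The budgets are hypothetical values chosen for illustration, and the real limit will sit somewhere between tested points.

```python
from typing import Optional

# Approximate TTFT values (ms) copied from the table above.
ttft_ms = {
    "RTX 5090": {1: 20, 4: 115, 8: 1300, 32: 5200},
    "GB10": {1: 27, 4: 125, 8: 2600, 32: 24000},
    "M4 Pro": {1: 40, 4: 117, 8: 6900, 32: 42500},
}


def max_users_within(platform: str, budget_ms: float) -> Optional[int]:
    """Largest tested concurrency whose TTFT fits the budget, or None."""
    ok = [n for n, ttft in ttft_ms[platform].items() if ttft <= budget_ms]
    return max(ok) if ok else None


# Hypothetical budgets: a snappy chat UI and a patient batch-style client.
for budget in (500, 6000):
    for platform in ttft_ms:
        print(f"{budget} ms budget, {platform}: "
              f"{max_users_within(platform, budget)} users")
```

With a tight budget all three platforms top out at the same tested level of 4 users, which matches the nearly identical TTFT row at that concurrency. The gap only opens up once the budget is loose enough to admit 8 or more.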

Context Length Scaling

One quiet advantage of a model this small: context length barely matters, on any of the three platforms.

With a single user and 0.8B BF16, throughput is essentially flat from 8k to 130k context on the RTX 5090 (393–399 tok/s). Same story on the GB10 (~171 tok/s) and M4 Pro (~92 tok/s). The KV cache for 0.8B is small enough that even 130k tokens fit comfortably in all three memory pools.

This changes with larger models, but for 0.8B, you can set n_ctx=131072 on any platform without meaningful throughput penalty.
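For a feel of why the cache stays manageable, here is the usual back-of-the-envelope formula for a standard attention KV cache: two tensors (K and V) per layer, each holding one vector per KV head per token. The layer count, KV head count, and head dimension below are placeholders, not the published Qwen3.5-0.8B configuration, and the formula assumes every layer keeps a full 16-bit cache, which may not match the model's actual attention design.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size for a standard attention stack.

    Counts K and V (factor of 2) for every layer, head, and token position.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# PLACEHOLDER shape for illustration only; substitute the real model config.
size = kv_cache_bytes(n_ctx=131072, n_layers=24, n_kv_heads=2, head_dim=128)
print(f"Estimated KV cache: {size / 2**30:.1f} GiB")
```

Plug in the real values from the model's config to get a meaningful number. The point of the formula is that the cache grows linearly with context length and with layer count, which is why the picture changes for larger models.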

The Efficiency Angle

Raw throughput doesn't account for the fact that these platforms draw very different amounts of power:

Platform	Peak 0.8B tok/s	Approx. Power Draw	Rough tok/s per Watt
RTX 5090	808	~450W (system)	~1.8
GB10	396	~100W (system)	~4.0
M4 Pro	152	~30W (package)	~5.1

The M4 Pro delivers 5.1 tok/s per watt — nearly 3× the RTX 5090's efficiency. If you're running inference 24/7 on a small model and paying for electricity, that math matters. The GB10 sits in the middle at 4.0 tok/s per watt, which is impressive given its much larger memory pool.
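The efficiency column is just peak throughput divided by approximate power draw. A small sketch using the table's rounded figures makes the ratios explicit. Keep in mind the power numbers mix system-level and package-level estimates, so this is a rough comparison rather than a measured one.

```python
# Rounded peak throughput and approximate power draw from the table above.
platforms = {
    "RTX 5090": {"tok_s": 808, "watts": 450},  # system power
    "GB10": {"tok_s": 396, "watts": 100},      # system power
    "M4 Pro": {"tok_s": 152, "watts": 30},     # package power
}


def tok_per_watt(platform: str) -> float:
    """Peak tokens per second divided by approximate power draw."""
    row = platforms[platform]
    return row["tok_s"] / row["watts"]


baseline = tok_per_watt("RTX 5090")
for name in platforms:
    eff = tok_per_watt(name)
    print(f"{name}: {eff:.1f} tok/s per watt "
          f"({eff / baseline:.1f}x the RTX 5090)")
```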

Recommendations

Go with the RTX 5090 if: You need to serve multiple concurrent users, you prioritize TTFT under load, or you plan to scale up to larger models later. The 5090's throughput lead gives it headroom for 9B+ models that the M4 Pro can't match.

Go with the GB10 if: You want to run 27B models (or larger) without quantization headaches. The 119.6 GB of shared memory is a unique advantage. The throughput is moderate (396 tok/s peak for 0.8B), but it handles models that would choke the RTX 5090's VRAM.

Go with the M4 Pro if: You're running single-user inference on a laptop or desktop and want a silent, energy-efficient experience. 92 tok/s is plenty fast for one person, and you're not paying for a power-hungry GPU when you step away from the keyboard.

For 0.8B specifically, all three platforms are overkill. The model is so small that the real question is whether you need concurrency — and if you do, the RTX 5090 is the only serious option.

All data from Poor Paul's Benchmark running llama-server. Explore the full dataset on the Leaderboard or chart your own comparisons on the Explore page.