Heavy Lifting: Qwen3.5-9B and 27B — Where Architecture Really Matters

Small models are a democracy — any hardware can run them. Large models are a meritocracy — memory bandwidth, VRAM capacity, and compute throughput separate the contenders from the pretenders. At 9B and 27B parameters, the Qwen3.5 family starts making real demands on hardware.

Qwen3.5-9B: The Last Model Everyone Can Run

9B is the sweet spot for capability vs. resource cost — large enough to be genuinely useful, small enough that all three of our test platforms handle it without VRAM spilling. But "handle" covers a wide range.

Peak Throughput

Quant	RTX 5090 (tok/s)	GB10 (tok/s)	M4 Pro (tok/s)
BF16	482.0	82.7	95.2
Q8_0	529.6	123.9	102.0
Q4_0	552.5	140.4	97.0
IQ4_XS	553.5	140.9	95.8
IQ2_XXS-UD	557.7	143.4	102.7
Q3_K_M	546.8	139.8	100.0
Q4_K_M	536.3	136.2	92.8

The RTX 5090 at 557.7 tok/s (IQ2XXS-UD, 32 users) is about 4× the GB10 and 5.4× the M4 Pro. Interestingly, the M4 Pro at 9B (_95–103 tok/s) is not dramatically slower than the GB10 (82–143 tok/s) — the M4 Pro's peak of 102.7 tok/s is within striking distance of the GB10's best.

Wait — the M4 Pro Beats the GB10 at BF16?

Yes. The M4 Pro's 95.2 tok/s at BF16 edges the GB10's 82.7 tok/s. At full precision, the M4 Pro's unified memory bandwidth actually keeps pace with the GB10's shared memory architecture. The GB10 only pulls ahead when quantization reduces memory traffic enough to uncork its compute advantage (143 tok/s with IQ2_XXS-UD).

This tells us something important: the GB10's advantage for 9B models relies on quantization. If you need BF16 quality (for research or fine-tuning purposes), the M4 Pro is actually the better choice between these two.

Single-User Latency (9B Q8_0)

At single user, where responsiveness matters most for chat:

Metric	RTX 5090	GB10	M4 Pro
Throughput	~130 tok/s	~47 tok/s	~30 tok/s
Avg ITL	~7 ms	~21 ms	~32 ms

The RTX 5090 at 7 ms ITL is imperceptible. The M4 Pro at 32 ms ITL is noticeable if you're watching closely — tokens arrive roughly 30 per second, which is readable but not instant. For a single-user chat experience, all three are adequate.

Concurrency: Where 9B Gets Real

At 32 concurrent users:

Platform	Peak 9B tok/s	Avg TTFT
RTX 5090	557.7 tok/s	~265 ms
GB10	143.4 tok/s	~608 ms
M4 Pro	102.7 tok/s	~1,501 ms

The RTX 5090 at 32 concurrent users achieves sub-300ms TTFT — genuinely impressive for a 9B model. The GB10 at 608 ms is reasonable for a multi-user deployment. The M4 Pro's 1.5 seconds is noticeable but still usable.

What's remarkable here is that the TTFT penalty for 9B at 32 users is much lower than for 4B at 32 users on the M4 Pro (1.5s vs 107s). This is likely because the peak throughput configs differ — the 9B numbers are probably using more aggressive quants that genuinely reduce computational overhead.

Qwen3.5-27B: The VRAM Reality Check

27B parameters at BF16 requires roughly 54 GB of memory. That's a problem:

Platform	Available Memory	27B BF16 Fits?	27B Q4 Fits?
RTX 5090	31.8 GB VRAM	No	Yes (barely)
GB10	119.6 GB shared	Yes	Yes
M4 Pro	64 GB unified	Yes	Yes

RTX 5090: Speed vs. Memory

Quant	tok/s	Notes
IQ4_XS	218.0	Best overall
IQ4_NL	217.4
Q4_K_XL-UD	217.2
Q4_0	214.5
Q3_K_S	211.8
Q6_K	186.1	Starting to struggle
Q6_K_XL-UD	55.3	VRAM spill detected
Q8_0	45.9	Spilling hard
Q8_K_XL-UD	22.9	Unusable

There's a cliff at Q6_K: everything below 4-bit quantization fits in the RTX 5090's 31.8 GB and runs at 186–218 tok/s. Anything above spills to system RAM and performance craters by 75–90%. Q8_K_XL-UD at 22.9 tok/s is actually slower than some configurations on the GB10 and M4 Pro.

This is the RTX 5090's Achilles' heel: 31.8 GB of VRAM is a hard wall, and 27B models slam into it.

GB10: Memory for Days, Bandwidth for... Patience

Quant	tok/s	Notes
IQ4_NL	48.1	Best 27B config
IQ4_XS	29.9
Q8_K_XL-UD	16.3	Fits! But slowly.

The GB10's 119.6 GB of shared memory means 27B loads at any quantization level without spilling. But the throughput tells a different story: 48.1 tok/s at best. The shared memory bus simply cannot feed the GPU fast enough for a model this large.

Running 27B at Q8_K_XL-UD on the GB10 is a flex — it's one of the few consumer-ish platforms where this is possible — but at 16.3 tok/s, you'll be waiting a while.

M4 Pro: Surprisingly Present

The M4 Pro's only 27B data point is IQ4_NL at 31.2 tok/s peak (32 concurrent users) with a 7.5-second average TTFT. That's not fast, but it's not nothing either. For batch inference or tolerant applications, 31 tok/s on a laptop chip is noteworthy.

27B Cross-Platform Comparison

Platform	Best 27B tok/s	Best Quant	Can Run Q8?
RTX 5090	218.0	IQ4_XS	Yes, but 45.9 tok/s (spill)
GB10	48.1	IQ4_NL	Yes, 16.3 tok/s
M4 Pro	31.2	IQ4_NL	Limited data

The RTX 5090's advantage at 27B IQ4_XS (218 tok/s) is 4.5× the GB10 — but only because the model fits in VRAM at that quantization. The moment you need Q6+ quality, the GB10's unlimited memory wins by default.

The Architecture Story

Three platforms, three philosophies:

RTX 5090: "Give me a model that fits in 31.8 GB and I'll run it at absurd speed." ≤9B models are a playground; 9B is ~558 tok/s. 27B at IQ4 is still a formidable 218 tok/s. But go above Q4 quantization at 27B and performance collapses.

GB10: "I'll run anything, at a moderate pace." The 119.6 GB memory pool means no model in the Qwen3.5 family causes VRAM pressure. The trade-off is throughput — the shared memory bus limits peak performance to roughly 1/4 the RTX 5090's speed at comparable quants.

M4 Pro: "I'm efficient, quiet, and good enough for one person." 64 GB of unified memory handles 9B comfortably and 27B in a pinch. Throughput is the lowest of the three but adequate for single-user interactive use through 9B. The power efficiency per tok/s is unmatched.

Recommendations for Large Models

9B deployment: If you have an RTX 5090, use it — 558 tok/s at 32 concurrent users is a production-grade number. The GB10 and M4 Pro are fine for single-user interactive inference but can't scale to multi-user serving.

27B on a budget: The GB10 is the only platform that can run 27B at high quant levels without spilling. If model quality matters more than speed, 48 tok/s on the GB10 beats the RTX 5090's spill-degraded 22.9 tok/s at Q8_K_XL.

27B for max speed: The RTX 5090 at IQ4_XS (218 tok/s) is the throughput king — if you accept the quality trade-off of 4-bit quantization. For many use cases (summarization, classification, structured extraction), this is a perfectly acceptable trade.

Don't run 27B on the M4 Pro for multi-user workloads. It technically works, but 31 tok/s with 7.5-second TTFT at 32 users is a patience test.

Benchmarks from Poor Paul's Benchmark using llama-server. Check the Leaderboard for the latest community results.