Heavy Lifting: Qwen3.5-9B and 27B — Where Architecture Really Matters
Small models are a democracy — any hardware can run them. Large models are a meritocracy — memory bandwidth, VRAM capacity, and compute throughput separate the contenders from the pretenders. At 9B and 27B parameters, the Qwen3.5 family starts making real demands on hardware.
Qwen3.5-9B: The Last Model Everyone Can Run
9B is the sweet spot for capability vs. resource cost — large enough to be genuinely useful, small enough that all three of our test platforms handle it without VRAM spilling. But "handle" covers a wide range.
Peak Throughput
| Quant | RTX 5090 (tok/s) | GB10 (tok/s) | M4 Pro (tok/s) |
|---|---|---|---|
| BF16 | 482.0 | 82.7 | 95.2 |
| Q8_0 | 529.6 | 123.9 | 102.0 |
| Q4_0 | 552.5 | 140.4 | 97.0 |
| IQ4_XS | 553.5 | 140.9 | 95.8 |
| IQ2_XXS-UD | 557.7 | 143.4 | 102.7 |
| Q3_K_M | 546.8 | 139.8 | 100.0 |
| Q4_K_M | 536.3 | 136.2 | 92.8 |
The RTX 5090 at 557.7 tok/s (IQ2_XXS-UD, 32 users) is about 3.9× the GB10's best and 5.4× the M4 Pro's. Interestingly, the M4 Pro at 9B (~93–103 tok/s) is not dramatically slower than the GB10 (83–143 tok/s): it wins at BF16, although its peak of 102.7 tok/s still trails the GB10's best of 143.4 tok/s by about 28%.
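The ratios quoted above can be recomputed directly from the table. The sketch below copies the table values into a dictionary; the names `PEAK_TOK_S`, `best`, and `speedup` are illustrative and not part of any benchmark tooling.

```python
# Peak throughput (tok/s) per quant, copied from the 9B table above.
PEAK_TOK_S = {
    "BF16":       {"RTX 5090": 482.0, "GB10": 82.7,  "M4 Pro": 95.2},
    "Q8_0":       {"RTX 5090": 529.6, "GB10": 123.9, "M4 Pro": 102.0},
    "Q4_0":       {"RTX 5090": 552.5, "GB10": 140.4, "M4 Pro": 97.0},
    "IQ4_XS":     {"RTX 5090": 553.5, "GB10": 140.9, "M4 Pro": 95.8},
    "IQ2_XXS-UD": {"RTX 5090": 557.7, "GB10": 143.4, "M4 Pro": 102.7},
    "Q3_K_M":     {"RTX 5090": 546.8, "GB10": 139.8, "M4 Pro": 100.0},
    "Q4_K_M":     {"RTX 5090": 536.3, "GB10": 136.2, "M4 Pro": 92.8},
}


def best(platform):
    """Return (quant, tok/s) for the platform's fastest configuration."""
    quant = max(PEAK_TOK_S, key=lambda q: PEAK_TOK_S[q][platform])
    return quant, PEAK_TOK_S[quant][platform]


def speedup(fast, slow):
    """Ratio of the two platforms' best-case throughput."""
    return best(fast)[1] / best(slow)[1]


for other in ("GB10", "M4 Pro"):
    print(f"RTX 5090 vs {other}: {speedup('RTX 5090', other):.2f}x")
```

Comparing best-case against best-case is a deliberate simplification: each platform may hit its peak at a different quant, so the ratio describes the ceiling, not a like-for-like quant comparison.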
Wait — the M4 Pro Beats the GB10 at BF16?
Yes. The M4 Pro's 95.2 tok/s at BF16 edges the GB10's 82.7 tok/s. At full precision, the M4 Pro's unified memory bandwidth actually keeps pace with the GB10's shared memory architecture. The GB10 only pulls ahead when quantization reduces memory traffic enough to uncork its compute advantage (143 tok/s with IQ2_XXS-UD).
This tells us something important: the GB10's advantage for 9B models relies on quantization. If you need BF16 quality (for research or fine-tuning purposes), the M4 Pro is actually the better choice between these two.
Single-User Latency (9B Q8_0)
At single user, where responsiveness matters most for chat:
| Metric | RTX 5090 | GB10 | M4 Pro |
|---|---|---|---|
| Throughput | ~130 tok/s | ~47 tok/s | ~30 tok/s |
| Avg ITL | ~7 ms | ~21 ms | ~32 ms |
The RTX 5090 at 7 ms ITL is imperceptible. The M4 Pro at 32 ms ITL is noticeable if you're watching closely — tokens arrive roughly 30 per second, which is readable but not instant. For a single-user chat experience, all three are adequate.
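Inter-token latency and decode rate are two views of the same quantity, which is where the "roughly 30 per second" figure comes from. A minimal sketch of the conversion, using the rounded ITL values from the table (the function name is illustrative):

```python
def itl_ms_to_tok_s(itl_ms):
    """Convert average inter-token latency (ms) to a steady decode rate (tok/s)."""
    return 1000.0 / itl_ms


# Rounded average ITL values from the single-user table above.
AVG_ITL_MS = {"RTX 5090": 7, "GB10": 21, "M4 Pro": 32}

for platform, itl in AVG_ITL_MS.items():
    print(f"{platform}: {itl} ms/token is about {itl_ms_to_tok_s(itl):.1f} tok/s")
```

The results will not match the throughput row exactly. Both rows are rounded, and a measured end-to-end throughput may also include the time to first token, which a pure ITL conversion ignores.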
Concurrency: Where 9B Gets Real
At 32 concurrent users:
| Platform | Peak 9B tok/s | Avg TTFT |
|---|---|---|
| RTX 5090 | 557.7 tok/s | ~265 ms |
| GB10 | 143.4 tok/s | ~608 ms |
| M4 Pro | 102.7 tok/s | ~1,501 ms |
The RTX 5090 at 32 concurrent users achieves sub-300 ms TTFT, which is genuinely impressive for a 9B model. The GB10 at 608 ms is reasonable for a multi-user deployment. The M4 Pro's 1.5 seconds is noticeable but still usable.
What's remarkable is that the M4 Pro's TTFT penalty at 32 users is far lower for 9B than for 4B (1.5 s vs. 107 s). The likely explanation is that the peak-throughput configurations differ: the 9B peak probably comes from a more aggressive quant, which reduces the work done per token.
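The 32-user throughput figures are aggregates, so it helps to see what each user actually receives. The sketch below assumes the aggregate is split evenly across users, which is an idealization; a real scheduler will not divide capacity perfectly.

```python
# Aggregate peak throughput (tok/s) at 32 concurrent users, from the table above.
PEAK_AT_32_USERS = {"RTX 5090": 557.7, "GB10": 143.4, "M4 Pro": 102.7}
USERS = 32


def per_user_tok_s(aggregate_tok_s, users):
    """Even-split estimate of each user's decode rate (an assumption, not a measurement)."""
    return aggregate_tok_s / users


for platform, aggregate in PEAK_AT_32_USERS.items():
    share = per_user_tok_s(aggregate, USERS)
    print(f"{platform}: about {share:.1f} tok/s per user")
```

Under this assumption each user on the RTX 5090 still sees a readable stream, while users on the GB10 and M4 Pro receive only a few tokens per second each.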
Qwen3.5-27B: The VRAM Reality Check
27B parameters at BF16 requires roughly 54 GB of memory. That's a problem:
| Platform | Available Memory | 27B BF16 Fits? | 27B Q4 Fits? |
|---|---|---|---|
| RTX 5090 | 31.8 GB VRAM | No | Yes |
| GB10 | 119.6 GB shared | Yes | Yes |
| M4 Pro | 64 GB unified | Yes | Yes |
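The 54 GB figure is simply parameters times bytes per weight. The sketch below applies the same arithmetic to a few quants, using the nominal bits per weight of llama.cpp's block formats. It covers weights only: real GGUF files mix tensor types, and the `overhead_gb` allowance for KV cache and buffers is a hypothetical value chosen for illustration, not a measurement.

```python
# Nominal bits per weight for some llama.cpp formats (block scales included).
BITS_PER_WEIGHT = {"BF16": 16.0, "Q8_0": 8.5, "Q6_K": 6.5625, "Q4_0": 4.5}

# Available memory (GB) per platform, from the table above.
MEMORY_GB = {"RTX 5090": 31.8, "GB10": 119.6, "M4 Pro": 64.0}


def weights_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params, quant, platform, overhead_gb=0.0):
    """True if weights plus an assumed overhead fit in the platform's memory."""
    need = weights_gb(n_params, BITS_PER_WEIGHT[quant]) + overhead_gb
    return need <= MEMORY_GB[platform]


N_PARAMS = 27e9  # nominal; the exact parameter count may differ
for quant, bpw in BITS_PER_WEIGHT.items():
    print(f"{quant}: about {weights_gb(N_PARAMS, bpw):.1f} GB of weights")
```

For example, `fits(27e9, "Q8_0", "RTX 5090")` is true on weights alone (about 28.7 GB), but `fits(27e9, "Q8_0", "RTX 5090", overhead_gb=4.0)` is false. That is one plausible reading of why a quant that nominally fits can still spill once serving overhead is counted.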
RTX 5090: Speed vs. Memory
| Quant | tok/s | Notes |
|---|---|---|
| IQ4_XS | 218.0 | Best overall |
| IQ4_NL | 217.4 | |
| Q4_K_XL-UD | 217.2 | |
| Q4_0 | 214.5 | |
| Q3_K_S | 211.8 | |
| Q6_K | 186.1 | Starting to struggle |
| Q6_K_XL-UD | 55.3 | VRAM spill detected |
| Q8_0 | 45.9 | Spilling hard |
| Q8_K_XL-UD | 22.9 | Unusable |
There's a cliff just past Q6_K: every quant up to and including Q6_K fits in the RTX 5090's 31.8 GB and runs at 186–218 tok/s. Anything larger spills to system RAM, and throughput craters by 75–90% relative to the IQ4_XS peak. Q8_K_XL-UD at 22.9 tok/s is actually slower than the best configurations on the GB10 (48.1 tok/s) and the M4 Pro (31.2 tok/s).
This is the RTX 5090's Achilles' heel: 31.8 GB of VRAM is a hard wall, and 27B models slam into it.
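The cliff is easy to locate programmatically by measuring each quant's drop from the best result. In this sketch the 50% threshold for flagging a probable spill is an arbitrary heuristic chosen for illustration, not something the benchmark defines.

```python
# RTX 5090 27B results (quant, tok/s), copied from the table above.
RTX5090_27B = [
    ("IQ4_XS", 218.0), ("IQ4_NL", 217.4), ("Q4_K_XL-UD", 217.2),
    ("Q4_0", 214.5), ("Q3_K_S", 211.8), ("Q6_K", 186.1),
    ("Q6_K_XL-UD", 55.3), ("Q8_0", 45.9), ("Q8_K_XL-UD", 22.9),
]


def drop_vs_best(results):
    """Map each quant to its fractional throughput loss relative to the fastest quant."""
    peak = max(tok_s for _, tok_s in results)
    return {quant: 1.0 - tok_s / peak for quant, tok_s in results}


def likely_spilled(results, threshold=0.5):
    """Quants whose loss exceeds the (hypothetical) threshold, in table order."""
    drops = drop_vs_best(results)
    return [quant for quant, _ in results if drops[quant] > threshold]


for quant, drop in drop_vs_best(RTX5090_27B).items():
    print(f"{quant:12s} {drop * 100:5.1f}% below peak")
print("Probable spill:", likely_spilled(RTX5090_27B))
```

Q6_K lands about 15% below the peak and stays under the threshold, while the three larger quants lose roughly 75–90% and are flagged.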
GB10: Memory for Days, Bandwidth for... Patience
| Quant | tok/s | Notes |
|---|---|---|
| IQ4_NL | 48.1 | Best 27B config |
| IQ4_XS | 29.9 | |
| Q8_K_XL-UD | 16.3 | Fits! But slowly. |
The GB10's 119.6 GB of shared memory means 27B loads at any quantization level without spilling. But the throughput tells a different story: 48.1 tok/s at best. The shared memory bus simply cannot feed the GPU fast enough for a model this large.
Running 27B at Q8_K_XL-UD on the GB10 is a flex — it's one of the few consumer-ish platforms where this is possible — but at 16.3 tok/s, you'll be waiting a while.
M4 Pro: Surprisingly Present
The M4 Pro's only 27B data point is IQ4_NL at 31.2 tok/s peak (32 concurrent users) with a 7.5-second average TTFT. That's not fast, but it's not nothing either. For batch inference or tolerant applications, 31 tok/s on a laptop chip is noteworthy.
27B Cross-Platform Comparison
| Platform | Best 27B tok/s | Best Quant | Can Run Q8? |
|---|---|---|---|
| RTX 5090 | 218.0 | IQ4_XS | Yes, but 45.9 tok/s (spill) |
| GB10 | 48.1 | IQ4_NL | Yes, 16.3 tok/s |
| M4 Pro | 31.2 | IQ4_NL | Limited data |
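The RTX 5090's advantage at 27B IQ4_XS (218 tok/s) is 4.5× the GB10's best, but only because the model fits in VRAM at that quantization. Above Q6_K the model spills and the lead shrinks sharply: at Q8_K_XL-UD the RTX 5090 manages 22.9 tok/s against the GB10's 16.3 tok/s, while the GB10 holds the whole model in memory without spilling.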
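A like-for-like view compares the two NVIDIA platforms only at quants where both tables report a result. The sketch below uses the three quants the RTX 5090 and GB10 tables have in common; the variable and function names are illustrative.

```python
# 27B throughput (tok/s) for quants present in both tables above.
RTX5090 = {"IQ4_XS": 218.0, "IQ4_NL": 217.4, "Q8_K_XL-UD": 22.9}
GB10 = {"IQ4_NL": 48.1, "IQ4_XS": 29.9, "Q8_K_XL-UD": 16.3}


def matched_ratios(fast, slow):
    """Throughput ratio fast/slow for every quant both platforms report."""
    shared = sorted(set(fast) & set(slow))
    return {quant: fast[quant] / slow[quant] for quant in shared}


for quant, ratio in matched_ratios(RTX5090, GB10).items():
    print(f"{quant:12s} RTX 5090 is {ratio:.2f}x the GB10")
```

With these numbers the gap ranges from roughly 7.3× at IQ4_XS down to about 1.4× at Q8_K_XL-UD, where the RTX 5090 is spilling to system RAM.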
The Architecture Story
Three platforms, three philosophies:
RTX 5090: "Give me a model that fits in 31.8 GB and I'll run it at absurd speed." Models up to 9B are a playground, with 9B peaking at ~558 tok/s. 27B at IQ4 is still a formidable 218 tok/s. But go above Q6_K at 27B and performance collapses.
GB10: "I'll run anything, at a moderate pace." The 119.6 GB memory pool means no model in the Qwen3.5 family causes VRAM pressure. The trade-off is throughput — the shared memory bus limits peak performance to roughly 1/4 the RTX 5090's speed at comparable quants.
M4 Pro: "I'm efficient, quiet, and good enough for one person." 64 GB of unified memory handles 9B comfortably and 27B in a pinch. Throughput is the lowest of the three but adequate for single-user interactive use through 9B. The power efficiency per tok/s is unmatched.
Recommendations for Large Models
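9B deployment: If you have an RTX 5090, use it; 558 tok/s at 32 concurrent users is a production-grade number. The GB10 and M4 Pro are fine for single-user interactive inference, but at 32 users they deliver only 143 and 103 tok/s in aggregate, with average TTFT of ~608 ms and ~1.5 s respectively.
27B on a budget: In this data, the GB10 is the only platform shown running 27B at high quant levels without spilling. If model quality matters more than speed, that is its case. Keep the quants straight when comparing, though: the GB10's best 27B result (48.1 tok/s at IQ4_NL) beats the RTX 5090's spill-degraded Q8_K_XL-UD result (22.9 tok/s), but at Q8_K_XL-UD itself the GB10 reaches 16.3 tok/s.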
27B for max speed: The RTX 5090 at IQ4_XS (218 tok/s) is the throughput king — if you accept the quality trade-off of 4-bit quantization. For many use cases (summarization, classification, structured extraction), this is a perfectly acceptable trade.
Don't run 27B on the M4 Pro for multi-user workloads. It technically works, but 31 tok/s with 7.5-second TTFT at 32 users is a patience test.
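To put the "patience test" in concrete terms, a rough wait-time estimate adds the time to first token to the time needed to stream the rest of the reply. The sketch assumes the aggregate throughput is shared evenly among users, and the 200-token reply length is a hypothetical example, not a benchmark parameter.

```python
def response_seconds(ttft_s, n_tokens, aggregate_tok_s, users):
    """Rough end-to-end wait: TTFT plus decode time at an even per-user share."""
    per_user_tok_s = aggregate_tok_s / users
    return ttft_s + n_tokens / per_user_tok_s


# M4 Pro, 27B IQ4_NL at 32 users: 31.2 tok/s aggregate, 7.5 s average TTFT.
wait = response_seconds(ttft_s=7.5, n_tokens=200, aggregate_tok_s=31.2, users=32)
print(f"Estimated wait for a 200-token reply: {wait:.0f} s")
```

Under these assumptions each user receives just under 1 tok/s, so a 200-token reply takes about three and a half minutes.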
Benchmarks from Poor Paul's Benchmark using llama-server. Check the Leaderboard for the latest community results.