Scaling Up: Qwen3.5-2B and 4B Across Three Architectures

The 0.8B model was easy — every platform handled it without breaking a sweat. But bump up to 2B and 4B parameters and the hardware differences start to matter. Memory bandwidth, VRAM capacity, and compute efficiency all come into sharper focus.

Qwen3.5-2B: The Sweet Spot

Peak Throughput

Quant	RTX 5090 (tok/s)	GB10 (tok/s)	M4 Pro (tok/s)
BF16	518.8	160.3	264.5
Q8_0	678.0	229.6	276.3
Q4_0	744.2	293.1	277.4
IQ4_NL	759.5	283.4	277.7
IQ2_XXS-UD	679.8	320.1	309.0
Q3_K_S	686.7	283.6	283.3

Something interesting happens at 2B: the M4 Pro closes the gap with the GB10. The M4 Pro's best quant (IQ2_XXS-UD at 309 tok/s) is within 3.5% of the GB10's best (320.1 tok/s). That's remarkable for a platform with half the memory and a fraction of the power budget.

The RTX 5090 still leads by a wide margin at 759.5 tok/s (IQ4_NL), but the 2B model doesn't need the 5090's firepower the way larger models will.

Why IQ2_XXS Dominates on GB10 and M4 Pro

The aggressive IQ2_XXS quantization (from Unsloth Dynamic) shrinks the 2B model small enough that memory bus traffic drops dramatically. On the GB10's shared memory bus and the M4 Pro's unified memory, less traffic means more bandwidth available for actual computation. The RTX 5090 doesn't benefit in the same way: its IQ2_XXS-UD result (679.8 tok/s) is actually slower than its IQ4_NL result (759.5 tok/s), which suggests its dedicated VRAM bandwidth was never the bottleneck at 2B.

This is a hint at a broader pattern: on bandwidth-constrained platforms, aggressive quantization is a performance optimization, not just a size optimization.
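One way to make that pattern concrete is to express each quant as a multiple of the same platform's BF16 result. The sketch below uses only the 2B peak numbers from the table above; the names `PEAK_2B` and `speedup_over_bf16` are illustrative and not part of any benchmark tooling.

```python
# Peak 2B throughput in tok/s, copied from the table above.
PEAK_2B = {
    "RTX 5090": {"BF16": 518.8, "Q8_0": 678.0, "Q4_0": 744.2,
                 "IQ4_NL": 759.5, "IQ2_XXS-UD": 679.8, "Q3_K_S": 686.7},
    "GB10": {"BF16": 160.3, "Q8_0": 229.6, "Q4_0": 293.1,
             "IQ4_NL": 283.4, "IQ2_XXS-UD": 320.1, "Q3_K_S": 283.6},
    "M4 Pro": {"BF16": 264.5, "Q8_0": 276.3, "Q4_0": 277.4,
               "IQ4_NL": 277.7, "IQ2_XXS-UD": 309.0, "Q3_K_S": 283.3},
}


def speedup_over_bf16(results):
    """Return {quant: throughput / BF16 throughput} for one platform."""
    base = results["BF16"]
    return {quant: value / base for quant, value in results.items() if quant != "BF16"}


for platform, results in PEAK_2B.items():
    gains = speedup_over_bf16(results)
    best = max(gains, key=gains.get)
    print(f"{platform}: best quant {best} at {gains[best]:.2f}x BF16")
```

Reading the gains per platform, rather than comparing absolute tok/s, shows how much of each platform's peak comes from quantization itself.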

Single-User Latency (2B Q8_0)

Metric	RTX 5090	GB10	M4 Pro
Throughput	~280 tok/s	~110 tok/s	~47 tok/s
Avg TTFT	~35 ms	~50 ms	~100 ms
Avg ITL	~3.5 ms	~9 ms	~21 ms

With a single user, the RTX 5090 delivers responsiveness that's hard to distinguish from instant. The M4 Pro's 21 ms ITL works out to roughly 47 tokens per second, which is still smooth for interactive use: tokens appear faster than you can read.
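The throughput and ITL rows are two views of the same quantity, and together with TTFT they give a rough response time. The sketch below assumes a simple model in which the first token arrives after TTFT and every later token arrives one ITL apart; the function names and the 200-token reply length are made up for illustration.

```python
def tokens_per_second(itl_ms):
    """Convert average inter-token latency (ms) into decode throughput (tok/s)."""
    return 1000.0 / itl_ms


def response_time_s(ttft_ms, itl_ms, n_tokens):
    """Approximate seconds to receive n_tokens: TTFT, then one ITL per further token."""
    return (ttft_ms + (n_tokens - 1) * itl_ms) / 1000.0


# Approximate single-user figures from the table above: (TTFT ms, ITL ms).
SINGLE_USER = {"RTX 5090": (35, 3.5), "GB10": (50, 9), "M4 Pro": (100, 21)}

for platform, (ttft_ms, itl_ms) in SINGLE_USER.items():
    rate = tokens_per_second(itl_ms)
    total = response_time_s(ttft_ms, itl_ms, n_tokens=200)  # 200 tokens is an arbitrary example
    print(f"{platform}: ~{rate:.0f} tok/s, ~{total:.1f} s for a 200-token reply")
```

The throughput row is close to 1000 divided by the ITL row for all three platforms, so the table is internally consistent.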

Qwen3.5-4B: Where Things Get Interesting

At 4B parameters, the M4 Pro starts to show real strain.

Peak Throughput

Quant	RTX 5090 (tok/s)	GB10 (tok/s)	M4 Pro (tok/s)
BF16	313.2	72.5	45.0
Q8_0	393.8	105.0	70.8
Q4_0	446.6	133.8	66.3
IQ4_NL	434.5	132.9	50.5
IQ2_XXS-UD	428.7	150.1	67.2
Q4_K_M	360.2	122.5	60.1

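The RTX 5090's best result, 446.6 tok/s (Q4_0), is now roughly 3× the GB10's best (150.1 tok/s, IQ2_XXS-UD) and 6.3× the M4 Pro's best (70.8 tok/s, Q8_0). The performance gap has widened considerably from what we saw at 0.8B and from 2B, where the same peak-to-peak ratios were about 2.4× and 2.5×.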
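Comparing peak against peak mixes different quants, so it can help to also compare the platforms quant by quant. The sketch below computes the RTX 5090 to M4 Pro ratio for each row of the 4B table; `PEAK_4B` and `ratio_by_quant` are illustrative names.

```python
# 4B peak throughput in tok/s from the table above: (RTX 5090, GB10, M4 Pro).
PEAK_4B = {
    "BF16": (313.2, 72.5, 45.0),
    "Q8_0": (393.8, 105.0, 70.8),
    "Q4_0": (446.6, 133.8, 66.3),
    "IQ4_NL": (434.5, 132.9, 50.5),
    "IQ2_XXS-UD": (428.7, 150.1, 67.2),
    "Q4_K_M": (360.2, 122.5, 60.1),
}


def ratio_by_quant(table, numerator=0, denominator=2):
    """Return {quant: throughput[numerator] / throughput[denominator]}."""
    return {quant: row[numerator] / row[denominator] for quant, row in table.items()}


ratios = ratio_by_quant(PEAK_4B)  # RTX 5090 relative to M4 Pro
narrowest = min(ratios, key=ratios.get)
widest = max(ratios, key=ratios.get)
print(f"narrowest gap: {narrowest} at {ratios[narrowest]:.1f}x")
print(f"widest gap: {widest} at {ratios[widest]:.1f}x")
```

Passing `denominator=1` gives the same comparison against the GB10.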

The M4 Pro's TTFT Problem

Where the 4B model really punishes the M4 Pro is under concurrency. At 32 users:

Platform	Peak 4B tok/s @ 32 users	Avg TTFT @ 32 users
RTX 5090	~386 tok/s	~16,753 ms
GB10	~123 tok/s	~22,965 ms
M4 Pro	~60 tok/s	~107,141 ms

That's right: the M4 Pro's average TTFT at 32 users with the 4B model is 107 seconds — nearly two minutes. Even the RTX 5090 climbs to 17 seconds, which is borderline for interactive use. The takeaway: 4B at 32 concurrent users is a heavy workload and none of these platforms handle it gracefully, but the M4 Pro is in a different universe of pain.
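To put those numbers in per-user terms, the sketch below divides the throughput column by the user count and converts TTFT to seconds. It assumes the tok/s figure is an aggregate across all 32 users and that it is shared evenly, which the table does not state; `per_user_view` is a hypothetical helper.

```python
# Approximate 32-user figures for the 4B model, from the table above.
AT_32_USERS = {
    "RTX 5090": {"tok_s": 386, "ttft_ms": 16753},
    "GB10": {"tok_s": 123, "ttft_ms": 22965},
    "M4 Pro": {"tok_s": 60, "ttft_ms": 107141},
}
USERS = 32


def per_user_view(tok_s, ttft_ms, users):
    """Return (tok/s per user, TTFT in seconds), assuming an even split across users."""
    return tok_s / users, ttft_ms / 1000.0


for platform, m in AT_32_USERS.items():
    rate, ttft_s = per_user_view(m["tok_s"], m["ttft_ms"], USERS)
    print(f"{platform}: ~{rate:.1f} tok/s per user, first token after ~{ttft_s:.0f} s")
```

Under that even-split assumption, each user sees only a small slice of the headline throughput in addition to the long wait for the first token.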

GB10's BF16 Disadvantage

The GB10 shows a stark BF16-to-quant gap at 4B: 72.5 tok/s with BF16 vs 150.1 tok/s with IQ2_XXS-UD. That's a 2.1× improvement from quantization alone. The shared memory bus clearly benefits from the reduced data movement. If you're running 4B on the GB10, quantize aggressively — the throughput gains are too large to ignore.

Context Window Handling

At 4B, context length starts to matter more:

Platform	Best 4B n_ctx	Notes
RTX 5090	130,064	No degradation through full context
GB10	130,064	Handles it, but memory sharing impacts concurrency
M4 Pro	130,064	Loads, but combines with concurrency to cause extreme TTFT

All three platforms can technically handle 130k context with 4B, but "can handle" and "should use" are different questions. On the M4 Pro, 130k context combined with even moderate concurrency produces unusable TTFT.

The 2B-to-4B Cliff

The jump from 2B to 4B creates a dramatic shift in the competitive landscape:

Model	RTX 5090 Peak	GB10 Peak	M4 Pro Peak	M4 Pro as % of RTX
2B	759 tok/s	320 tok/s	309 tok/s	41%
4B	447 tok/s	150 tok/s	71 tok/s	16%

The M4 Pro goes from 41% of RTX 5090 throughput at 2B to just 16% at 4B. This is the inflection point where Apple's unified memory architecture hits its bandwidth ceiling for LLM inference.
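The percentage column can be reproduced from the unrounded peak values in the two throughput tables. This is a small sketch with illustrative names (`PEAKS`, `share_of_reference`); it also prints the GB10's share, which the summary table omits.

```python
# Best peak throughput per platform in tok/s, taken from the 2B and 4B tables.
PEAKS = {
    "2B": {"RTX 5090": 759.5, "GB10": 320.1, "M4 Pro": 309.0},
    "4B": {"RTX 5090": 446.6, "GB10": 150.1, "M4 Pro": 70.8},
}


def share_of_reference(peaks, platform, reference="RTX 5090"):
    """Return platform throughput as a percentage of the reference platform."""
    return 100.0 * peaks[platform] / peaks[reference]


for model, peaks in PEAKS.items():
    for platform in ("GB10", "M4 Pro"):
        pct = share_of_reference(peaks, platform)
        print(f"{model} {platform}: {pct:.0f}% of RTX 5090")
```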

Recommendations

For 2B: All three platforms are viable. The RTX 5090 is overkill unless you need 8+ concurrent users. The M4 Pro is a legitimate choice: its 309 tok/s peak is within 3.5% of the GB10's, it beats the GB10 outright at BF16 and Q8_0, and it draws a fraction of the power.

For 4B: The RTX 5090 pulls ahead decisively. The GB10 is a reasonable second choice if you value its memory headroom for larger models. The M4 Pro still works for single-user interactive chat at 4B, but don't try to serve multiple users — TTFT will make them very unhappy.

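The quantization lesson: On bandwidth-limited platforms (GB10, M4 Pro), aggressive quantization isn't just about saving memory. It can be a direct throughput multiplier, most clearly on the GB10, where IQ2_XXS-UD roughly doubles BF16 throughput at both 2B and 4B. Don't default to Q8_0 "for quality" without checking, but do measure: on the M4 Pro at 4B, Q8_0 (70.8 tok/s) was the fastest quant in the table.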
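Since the fastest quant differs by platform, it is safer to read it off the measurements than to assume. The sketch below parses the 4B peak-throughput table as tab-separated text with Python's standard `csv` module and picks the top row per column; `TABLE_4B` and `fastest_quant_per_platform` are illustrative names, and the data is copied from the table earlier in this post.

```python
import csv
import io

# The 4B peak-throughput table from this post, as tab-separated text.
TABLE_4B = (
    "Quant\tRTX 5090 (tok/s)\tGB10 (tok/s)\tM4 Pro (tok/s)\n"
    "BF16\t313.2\t72.5\t45.0\n"
    "Q8_0\t393.8\t105.0\t70.8\n"
    "Q4_0\t446.6\t133.8\t66.3\n"
    "IQ4_NL\t434.5\t132.9\t50.5\n"
    "IQ2_XXS-UD\t428.7\t150.1\t67.2\n"
    "Q4_K_M\t360.2\t122.5\t60.1\n"
)


def fastest_quant_per_platform(tsv_text):
    """Return {column header: (quant, tok/s)} for the fastest row in each platform column."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    platforms = [name for name in rows[0] if name != "Quant"]
    best = {}
    for platform in platforms:
        top = max(rows, key=lambda row: float(row[platform]))
        best[platform] = (top["Quant"], float(top[platform]))
    return best


for platform, (quant, tok_s) in fastest_quant_per_platform(TABLE_4B).items():
    print(f"{platform}: {quant} at {tok_s} tok/s")
```

The same function works on the 2B table if it is pasted in the same tab-separated form.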

Data from Poor Paul's Benchmark. See more on the Leaderboard or explore the raw data on the Explore page.