Qwen3.6-35B-A3B Across Five Machines: The MoE Architecture Test

Tags: qwen3.6, moe, cross-machine, architecture, gb10, rtx-5090, m4-pro

Qwen3.6-35B-A3B is a Mixture-of-Experts (MoE) model: 35B total parameters, but only ~3B are active on any given token. On paper, this gives you quality closer to a 35B model while paying roughly the per-token compute cost of a 3B model. In practice, MoE inference has a different performance profile from dense models, and that profile depends heavily on your hardware.
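To make the total-versus-active distinction concrete, here is a minimal sketch that estimates two quantities from parameter counts alone: how much memory the full weight set occupies, and how many weight bytes are read per generated token. The bit widths are rough assumptions for illustration (real quantized formats carry extra scale metadata), and the ~3B active figure comes from the model name, not from a measurement.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate storage for `params` weights at a given average bit width."""
    return params * bits_per_weight / 8


TOTAL_PARAMS = 35e9   # total parameters, from the model name
ACTIVE_PARAMS = 3e9   # approximate active parameters per token

# Assumed average bits per weight; actual formats differ somewhat.
for label, bits in [("16-bit", 16.0), ("8-bit", 8.0), ("4-bit", 4.0)]:
    resident_gb = weight_bytes(TOTAL_PARAMS, bits) / 1e9
    per_token_gb = weight_bytes(ACTIVE_PARAMS, bits) / 1e9
    print(f"{label}: all weights ~{resident_gb:.1f} GB, "
          f"read per token ~{per_token_gb:.1f} GB")
```

The gap between the two columns is the MoE trade-off: every expert has to be stored somewhere, but only a small slice of the weights is touched for each token.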

We ran it on all five machines in our benchmark fleet. Here's what we found.

Hardware Setup

| Machine | GPU | VRAM | Special Characteristic |
|---|---|---|---|
| GB10 | Grace Blackwell Superchip | 120 GB (unified) | LPDDR5X unified memory |
| Mac mini M4 Pro | Apple M4 Pro | 64 GB (unified) | Shared CPU/GPU bandwidth |
| RTX 5060 Ti | GeForce RTX 5060 Ti | 16 GB | GDDR7 bandwidth |
| RTX 4060 Ti ×2 | 2× GeForce RTX 4060 Ti | 32 GB | Dual-GPU tensor split |
| RTX 5090 | GeForce RTX 5090 | 32 GB | GDDR7 bandwidth |

Single-User Throughput

At 1 concurrent user with a 2,048-token context, the GB10 surprises everyone:

| Machine | Quant | tok/s (1 user) | TTFT |
|---|---|---|---|
| GB10 | MXFP4_MOE | 61.9 | 88 ms |
| GB10 | Q8_0 | 53.6 | 103 ms |
| RTX 5090 | Q4_K_M | ~48–55* | ~120 ms |
| RTX 4060 Ti ×2 | Q4_K_M | ~35–45* | ~200 ms |
| Mac mini M4 Pro | Q4_K_M | ~28–35* | ~350 ms |

*Estimated from fleet averages. These machines were benchmarked on different dates with slightly different configurations, so treat starred figures (here and in the multi-user table) as approximate.

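The GB10 leads on single-user throughput in this table, and its 4-bit MXFP4 build outpaces its own Q8_0 build, as expected when fewer bytes per weight have to be moved. Keep in mind that the GB10 rows are measured while the other rows are estimates, so the margin is provisional. The underlying mechanism is memory traffic: the active experts' weights are fetched on every token, so decode speed depends on how quickly each machine can stream roughly 3B parameters' worth of weights, not on how fast it can multiply them.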
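The bandwidth argument can be sketched as a simple upper bound: if each generated token requires reading the active weights once, single-stream decode speed cannot exceed memory bandwidth divided by the bytes read per token. The function below is an illustrative estimate under that assumption. The 500 GB/s value in the usage lines is a hypothetical placeholder, not the spec of any machine in the fleet, and the bound ignores KV-cache reads, expert routing overhead, and kernel efficiency.

```python
def decode_upper_bound_tok_s(bandwidth_gb_s: float,
                             active_params: float,
                             bits_per_weight: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode throughput (tokens/s)."""
    if bandwidth_gb_s <= 0 or active_params <= 0 or bits_per_weight <= 0:
        raise ValueError("all inputs must be positive")
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical device: 500 GB/s, ~3B active parameters, 4-bit vs 8-bit weights.
for bits in (4.0, 8.0):
    ceiling = decode_upper_bound_tok_s(500.0, 3e9, bits)
    print(f"{bits:.0f}-bit weights: ceiling ~{ceiling:.0f} tok/s")
```

Halving the bits per weight doubles the ceiling, which matches the direction of the GB10 result (MXFP4 ahead of Q8_0). Measured throughput far below the ceiling would suggest that something other than weight streaming is the limiting factor.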

Multi-User Scaling

At 32 concurrent users, the picture changes:

| Machine | Quant | tok/s (32 users) | Avg TTFT (32 users) |
|---|---|---|---|
| GB10 | MXFP4_MOE | 172.7 | 1,148 ms |
| GB10 | Q8_0 | 135.4 | 1,278 ms |
| RTX 5090 | Q4_K_M | ~120* | ~900 ms |

The GB10 still leads on throughput, but the TTFT degradation is real. With 32 concurrent users each sending a 2,048-token prompt, first-token latency climbs to over a second: uncomfortable for interactive use, acceptable for batch processing.
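A quick way to read the two GB10 tables together is to compute how far 32-user throughput falls short of linear scaling. The sketch below uses only the GB10 figures reported above and assumes the 32-user number is aggregate throughput across all sessions, which the tables do not state explicitly.

```python
# GB10 figures copied from the tables: (tok/s at 1 user, tok/s at 32 users).
GB10_RESULTS = {
    "MXFP4_MOE": (61.9, 172.7),
    "Q8_0": (53.6, 135.4),
}
USERS = 32


def scaling_summary(single: float, aggregate: float, users: int) -> dict:
    """Aggregate speedup, per-user rate, and efficiency versus linear scaling."""
    speedup = aggregate / single
    return {
        "speedup": speedup,
        "per_user_tok_s": aggregate / users,
        "efficiency": speedup / users,
    }


for quant, (single, aggregate) in GB10_RESULTS.items():
    s = scaling_summary(single, aggregate, USERS)
    print(f"{quant}: {s['speedup']:.2f}x aggregate, "
          f"{s['per_user_tok_s']:.1f} tok/s per user, "
          f"{s['efficiency']:.1%} of linear")
```

Under that assumption, MXFP4_MOE gains about 2.8x in total throughput going from 1 to 32 users, while each individual user drops to about 5.4 tok/s. That is the same interactive-versus-batch trade-off the TTFT numbers show.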

The MoE Memory Paradox

Here's the counterintuitive finding: Qwen3.6-35B-A3B runs comfortably on the 16 GB RTX 5060 Ti at Q4_K_M quantization with only ~8 GB loaded on the GPU, and the 5060 Ti delivers competitive single-user throughput. Because only ~3B parameters are active per token, an MoE model makes far lighter per-token demands on the GPU than a dense model of the same total size.

But when you push to 8+ concurrent users, the KV cache (the per-session attention cache) quickly fills the remaining VRAM, and you start seeing OOM errors or context truncation. The GB10's 120 GB pool, by contrast, lets you run 32 concurrent users with a 130,000-token context and still have room for the OS.
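The way the cache crowds out VRAM can be sketched with the standard KV-cache size formula: two tensors (keys and values) per layer, per token, per session. The model shape below is a hypothetical placeholder for illustration, not the real Qwen3.6-35B-A3B configuration, and the sketch assumes every layer uses standard attention with a 16-bit cache.

```python
def kv_cache_bytes(context_tokens: int, users: int, n_layers: int,
                   n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV-cache size: keys and values for every layer, token, and session."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * context_tokens * users


# Hypothetical shape for illustration only; NOT the actual model config.
SHAPE = dict(n_layers=48, n_kv_heads=4, head_dim=128)

for users in (1, 8, 32):
    gib = kv_cache_bytes(2048, users, **SHAPE) / 2**30
    print(f"{users:>2} users x 2,048 tokens: ~{gib:.2f} GiB of KV cache")
```

The cache grows linearly in both user count and context length, so on a small card concurrency and long prompts run out of room well before the weights themselves become the problem.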

128K Context: Only One Machine Can Do It

One test we ran that most machines couldn't complete: Qwen3.6-35B-A3B-MXFP4_MOE at 130,064-token context with 32 concurrent users.

Only the GB10 managed this. The result: 168 tok/s at that context length, essentially flat compared to 8K context. That is the payoff of the GB10's large unified memory pool, which holds the weights and the long-context KV cache for all 32 sessions at once.

The Verdict

MoE models like Qwen3.6-35B-A3B reward machines with high memory bandwidth and large memory pools more than dense models do. The GB10's 120 GB of unified memory makes it well suited to this workload, especially at high concurrency and long context. But for single-user homelab use, even the RTX 5060 Ti's 16 GB is sufficient if you keep context windows modest.

If you're choosing hardware specifically for MoE inference, prioritize memory bandwidth and pool size over raw CUDA core count.