The cmoe Trap: What Actually Happens When You Enable MoE Flags on RTX 5090
The internet is full of advice about tuning -cmoe and -ncmoe to unlock faster MoE routing for Qwen3.6-35B-A3B on local hardware. The PPB dataset — 88 benchmark rows covering both baseline and flag-sweep runs on an RTX 5090 — tells a different story.
The short answer: don't touch either flag. Your baseline is already your best.
The Setup
All results come from the PPB HuggingFace dataset, queried via mcp.poorpaul.dev. Hardware is an NVIDIA GeForce RTX 5090 (31.8 GB GDDR7), running unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-MXFP4_MOE.gguf via llama-server, 1 concurrent user, across four context lengths: 8K, 16K, 32K, and 130K tokens.
The dataset covers:
- Baseline (no extra flags): 16 runs, April 2026
- -cmoe only: 8 runs, May 2026
- -ncmoe 20 through -ncmoe 90 (in steps of 10): 64 runs, May 2026
The Full Picture
| Flag | Avg tok/s | Avg TTFT (ms) | Avg ITL (ms) | Avg power | vs Baseline |
|---|---|---|---|---|---|
| Baseline (no flags) | 203.4 | 57 | 4.69 | 293 W | — |
| -ncmoe 20 | 86.3 | 217 | 10.7 | 152 W | −58% |
| -ncmoe 30 | 66.7 | 281 | 13.8 | 136 W | −67% |
| -cmoe | 53.9 | 341 | 17.1 | 123 W | −74% |
| -ncmoe 40–90 | ~54 | ~342 | ~17.1 | ~122 W | −74% |
Every single flag variant makes performance dramatically worse. The best flag configuration (-ncmoe 20) still delivers less than half the throughput of the baseline.
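The "vs Baseline" column can be re-derived from the average throughput figures. The following is a minimal sketch that only uses the averages quoted in the table; the dictionary layout and the helper name `pct_vs_baseline` are illustrative choices, not part of the PPB tooling.

```python
# Average throughput (tok/s) copied from the table above.
avg_tok_s = {
    "baseline": 203.4,
    "-ncmoe 20": 86.3,
    "-ncmoe 30": 66.7,
    "-cmoe": 53.9,
}


def pct_vs_baseline(tok_s, baseline):
    """Relative throughput change against the baseline, in percent."""
    return (tok_s / baseline - 1.0) * 100.0


baseline = avg_tok_s["baseline"]
deltas = {
    flag: pct_vs_baseline(tok_s, baseline)
    for flag, tok_s in avg_tok_s.items()
    if flag != "baseline"
}

for flag, delta in deltas.items():
    print(f"{flag:>10}: {delta:+.1f}% vs baseline")
```

Rounded to whole percentages, these match the table: even the mildest setting gives up more than half of the baseline throughput.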
What's Actually Happening
The results reveal something architecturally important about how MXFP4_MOE sits on a 32 GB GPU.
With no flags, Qwen3.6-35B-A3B at MXFP4_MOE fits entirely in the RTX 5090's 31.8 GB of GDDR7. All 128 experts live on-GPU. Routing is essentially free — the GPU already owns every expert, so there is no fetch latency. Result: 203 tok/s with a 57ms TTFT and 4.69ms inter-token latency.
With -cmoe (--cpu-moe), llama.cpp keeps all MoE expert weights in CPU RAM instead of GPU VRAM, while the attention weights and KV cache stay on the GPU. Every generated token then has to be served from expert weights held in system memory, which has far lower bandwidth than VRAM. On a GPU where everything fits, this is pure overhead: you're moving experts out of fast memory to solve a capacity problem that doesn't exist. Result: a 74% throughput collapse and TTFT jumping from 57 ms to 341 ms.
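With -ncmoe 20 (--n-cpu-moe 20), only the expert weights of the first 20 layers are kept on the CPU; the experts of the remaining layers stay in VRAM. That is why it beats -cmoe (86 tok/s vs 54 tok/s): fewer layers pay the CPU penalty. But it still moves experts out of VRAM that would fit there comfortably, so you get 86 tok/s when the baseline gets 203 tok/s. It's the best of a bad set of options. The plateau from -ncmoe 40 upward, where the numbers match -cmoe, is consistent with every MoE layer already sitting on the CPU at that point.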
The power numbers tell the story clearly too: the baseline draws 293 W because it's doing real GPU work. The flagged configurations drop to 122–152 W because they're CPU-bottlenecked, leaving the GPU waiting.
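Lower power draw does not mean cheaper tokens. As a rough sketch, dividing average power by average throughput gives an approximate energy cost per generated token. This assumes the ratio of the two averages is representative, and it only counts whatever the dataset's power figure covers; if that is GPU power alone, the extra CPU work in the flagged runs is not included. The helper name `joules_per_token` is illustrative.

```python
# Average power (W) and throughput (tok/s) from the summary table.
runs = {
    "baseline": {"watts": 293.0, "tok_s": 203.4},
    "-ncmoe 20": {"watts": 152.0, "tok_s": 86.3},
    "-cmoe": {"watts": 123.0, "tok_s": 53.9},
}


def joules_per_token(watts, tok_s):
    """Approximate energy per token: 1 W = 1 J/s, so W / (tok/s) = J/token."""
    return watts / tok_s


energy = {
    name: joules_per_token(r["watts"], r["tok_s"]) for name, r in runs.items()
}

for name, joules in energy.items():
    print(f"{name:>10}: {joules:.2f} J/token")
```

Under these assumptions the baseline is the cheapest per token despite the highest wattage, and -cmoe is the most expensive.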
Context Length Is Not the Issue
One commonly held belief is that -cmoe/-ncmoe help at long context by keeping the right experts warm. The PPB data refutes this:
| Flag | 8K tok/s | 16K tok/s | 32K tok/s | 130K tok/s |
|---|---|---|---|---|
| Baseline | 203 | 203 | 203 | 202 |
| -cmoe | 53.9 | 54.2 | 54.2 | 54.2 |
| -ncmoe 20 | 85.9 | 86.6 | 86.7 | 86.5 |
Throughput is essentially flat across all four context lengths for every configuration. The RTX 5090's 1.79 TB/s of memory bandwidth handles extended KV caches without meaningful cost — and it handles the full expert pool without needing any special routing flags.
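To put a number on "essentially flat", the sketch below computes the spread between the fastest and slowest context length for each configuration, relative to its mean. It uses only the per-context values from the table above; `relative_spread` is an illustrative helper, not a PPB metric.

```python
# Throughput (tok/s) at 8K, 16K, 32K and 130K context, from the table above.
tok_s_by_context = {
    "baseline": [203, 203, 203, 202],
    "-cmoe": [53.9, 54.2, 54.2, 54.2],
    "-ncmoe 20": [85.9, 86.6, 86.7, 86.5],
}


def relative_spread(values):
    """(max - min) as a percentage of the mean."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean * 100.0


spread = {name: relative_spread(v) for name, v in tok_s_by_context.items()}

for name, pct in spread.items():
    print(f"{name:>10}: {pct:.2f}% spread across context lengths")
```

Every configuration varies by less than 1% across context lengths, so the choice of flag dominates the choice of context size by a wide margin.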
When Would -cmoe / -ncmoe Actually Help?
These flags were designed for a specific scenario: a MoE model whose weights don't fit in GPU VRAM. In that situation, keeping the expert tensors in CPU RAM (all of them with -cmoe, or those of the first N layers with -ncmoe N) frees VRAM for the attention weights and the KV cache, which is typically a better trade than offloading whole layers to the CPU.
On an RTX 5090 with MXFP4_MOE, you're not in that situation. The model fits. The flags solve a problem you don't have and introduce overhead you can't afford. The correct use case for -ncmoe on this GPU would be running a larger model or a larger quantization that exceeds VRAM, neither of which applies here.
A Note on How I Got This Wrong
An earlier version of this article concluded that -ncmoe 20 was a "61% speedup" and recommended it as the default. That was wrong, and worth explaining.
The initial query filtered results to runs after May 9, 2026 — which excluded the April 2026 baseline runs entirely. Within the flag-only dataset, ncmoe=20 at 86 tok/s was the fastest option, so the analysis looked internally consistent. But it was missing the baseline that makes the comparison meaningful.
The corrected query returns all 88 rows for this model/GPU combination. The baseline at 203 tok/s shows up in the April data (llm_flags: null). When you include it, the conclusion flips entirely: no flags is the right call.
This is a good reminder that benchmark analysis is only as complete as the dataset it draws from. The PPB infrastructure correctly stores all the data — the error was in the query, not the data.
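The filtering mistake is easy to reproduce in miniature. The sketch below uses a handful of hypothetical rows that stand in for the dataset: the `llm_flags` field and the May 9, 2026 cutoff come from the description above, while the `run_date` and `tok_s` field names and the exact run dates are assumptions made for illustration. The throughput values are the averages quoted earlier.

```python
from datetime import date

# Hypothetical stand-in rows; only llm_flags and the averages come from the article.
rows = [
    {"llm_flags": None, "run_date": date(2026, 4, 15), "tok_s": 203.4},
    {"llm_flags": "-cmoe", "run_date": date(2026, 5, 12), "tok_s": 53.9},
    {"llm_flags": "-ncmoe 20", "run_date": date(2026, 5, 12), "tok_s": 86.3},
    {"llm_flags": "-ncmoe 30", "run_date": date(2026, 5, 12), "tok_s": 66.7},
]


def fastest(candidates):
    """Row with the highest throughput."""
    return max(candidates, key=lambda r: r["tok_s"])


def label(row):
    """Human-readable name; a null llm_flags means the baseline."""
    return row["llm_flags"] or "baseline (no flags)"


# The original, flawed query: only runs after May 9, 2026.
cutoff = date(2026, 5, 9)
filtered = [r for r in rows if r["run_date"] > cutoff]

best_filtered = fastest(filtered)
best_all = fastest(rows)
has_baseline = any(r["llm_flags"] is None for r in filtered)

print("date-filtered winner:", label(best_filtered))
print("full-dataset winner: ", label(best_all))
print("baseline present in filtered set:", has_baseline)
```

A cheap guard against this class of error is to check that at least one row with a null `llm_flags` survives the filter before ranking anything.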
The Recommended Config
Based on the complete PPB dataset, the optimal single-user setup on an RTX 5090 is:
```
llama-server \
  -m ./Qwen3.6-35B-A3B-MXFP4_MOE.gguf \
  -c 131072 \
  -ngl 99 \
  -fa \
  --parallel 1 \
  --host 0.0.0.0 --port 8080
```
No -cmoe. No -ncmoe. Just the model, full GPU offload, flash attention, and a generous context window. You get roughly 202 tok/s at 130K context, and neither -cmoe nor any -ncmoe value in the dataset improves on that on this hardware.
Summary
| Goal | Recommendation |
|---|---|
| Maximum speed, single user | No -cmoe or -ncmoe |
| If forced to use a flag | -ncmoe 20 (58% slower than no flags) |
| -cmoe or -ncmoe ≥ 40 | 74% slower than baseline; avoid |
The data is clear: on an RTX 5090 where MXFP4_MOE fits entirely in VRAM, the MoE CPU-offload flags are uniformly detrimental. Leave them off.
Data source: Poor Paul's Benchmark — 88 runs of unsloth/Qwen3.6-35B-A3B-MXFP4_MOE on RTX 5090, submitted to HuggingFace paulplee/ppb-results. Benchmark methodology: llama-server runner, ShareGPT-V3 dataset, CUDA 13.1, RTX 5090 31.8 GB.