The cmoe Trap: What Actually Happens When You Enable MoE Flags on RTX 5090

Tags: analysis, qwen, rtx-5090, moe, llama-cpp, flags

The internet is full of advice about tuning -cmoe and -ncmoe to unlock faster MoE routing for Qwen3.6-35B-A3B on local hardware. The PPB dataset — 88 benchmark rows covering both baseline and flag-sweep runs on an RTX 5090 — tells a different story.

The short answer: don't touch either flag. Your baseline is already your best.


The Setup

All results come from the PPB HuggingFace dataset, queried via mcp.poorpaul.dev. Hardware is an NVIDIA GeForce RTX 5090 (31.8 GB GDDR7), running unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-MXFP4_MOE.gguf via llama-server, 1 concurrent user, across four context lengths: 8K, 16K, 32K, and 130K tokens.

The dataset covers:

  • Baseline (no extra flags) — 16 runs, April 2026
  • -cmoe only — 8 runs, May 2026
  • -ncmoe 20 through -ncmoe 90 (in steps of 10) — 64 runs, May 2026

The Full Picture

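| Flag | Avg tok/s | Avg TTFT (ms) | Avg ITL (ms) | Avg power | vs Baseline |
|---|---|---|---|---|---|
| Baseline (no flags) | 203.4 | 57 | 4.69 | 293 W | |
| -ncmoe 20 | 86.3 | 217 | 10.7 | 152 W | −58% |
| -ncmoe 30 | 66.7 | 281 | 13.8 | 136 W | −67% |
| -cmoe | 53.9 | 341 | 17.1 | 123 W | −74% |
| -ncmoe 40–90 | ~54 | ~342 | ~17.1 | ~122 W | −74% |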

Every single flag variant makes performance dramatically worse. The best flag configuration (-ncmoe 20) still delivers less than half the throughput of the baseline.
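The "vs Baseline" column is a plain relative change in average throughput. A minimal Python sketch of that arithmetic, using the averages from the table above (the variable and function names are illustrative and not part of PPB):

```python
# Average throughput (tok/s) copied from the table above.
avg_tok_s = {
    "baseline": 203.4,
    "-ncmoe 20": 86.3,
    "-ncmoe 30": 66.7,
    "-cmoe": 53.9,
}


def pct_vs_baseline(tok_s, baseline_tok_s):
    """Relative change in throughput versus the baseline, in percent."""
    return (tok_s / baseline_tok_s - 1.0) * 100.0


baseline = avg_tok_s["baseline"]
deltas = {
    flag: pct_vs_baseline(value, baseline)
    for flag, value in avg_tok_s.items()
    if flag != "baseline"
}

for flag, delta in deltas.items():
    print(f"{flag:>10}: {delta:+.1f}%")
```

Rounded to whole percentages, these reproduce the −58%, −67%, and −74% figures in the table.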


What's Actually Happening

The results reveal something architecturally important about how MXFP4_MOE sits on a 32 GB GPU.

With no flags, Qwen3.6-35B-A3B at MXFP4_MOE fits entirely in the RTX 5090's 31.8 GB of GDDR7. All 128 experts live on-GPU. Routing is essentially free — the GPU already owns every expert, so there is no fetch latency. Result: 203 tok/s with a 57ms TTFT and 4.69ms inter-token latency.

With -cmoe (long form --cpu-moe), llama.cpp keeps all MoE expert weights in CPU memory instead of VRAM. The flag exists for models that do not fit on the GPU. On a GPU where everything fits, it is pure overhead: expert weights that could sit in GDDR7 now depend on system RAM and the CPU for every token. Result: a 74% throughput collapse and TTFT jumping from 57 ms to 341 ms.

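With -ncmoe 20 (long form --n-cpu-moe 20), only the expert weights of the first 20 layers are kept on the CPU; the rest stay in VRAM. That is why it beats -cmoe (86 tok/s vs 54 tok/s): fewer layers pay the CPU penalty. But it still moves work off the GPU that the GPU could have handled, so you get 86 tok/s where the baseline gets 203 tok/s. The sweep also plateaus: -ncmoe 40 through 90 match -cmoe at about 54 tok/s, which is what you would expect once N reaches the model's layer count and every expert layer is already on the CPU. -ncmoe 20 is the best of a bad set of options.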

The power numbers tell the story clearly too: the baseline draws 293 W because it's doing real GPU work. The flagged configurations drop to 122–152 W because they're CPU-bottlenecked, leaving the GPU waiting.
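For reference, the llama.cpp help text describes --cpu-moe (-cmoe) as keeping all MoE weights in the CPU, and --n-cpu-moe N (-ncmoe N) as keeping the MoE weights of the first N layers in the CPU. A toy Python model of that placement rule follows. It is a sketch, not llama.cpp code, and `N_LAYERS = 40` is an assumed layer count chosen only because it would match the plateau from -ncmoe 40 onward; it was not read from the GGUF.

```python
N_LAYERS = 40  # ASSUMPTION: illustrative layer count, not read from the model file


def expert_placement(n_layers, n_cpu_moe):
    """Where each layer's expert weights live under `-ncmoe n_cpu_moe`."""
    # N cannot place more layers on the CPU than the model has.
    n_cpu = min(max(n_cpu_moe, 0), n_layers)
    return ["cpu"] * n_cpu + ["gpu"] * (n_layers - n_cpu)


def cpu_fraction(n_layers, n_cpu_moe):
    """Fraction of expert layers that end up on the CPU."""
    placement = expert_placement(n_layers, n_cpu_moe)
    return placement.count("cpu") / n_layers


for n in (0, 20, 30, 40, 90):
    share = cpu_fraction(N_LAYERS, n)
    print(f"-ncmoe {n:>2}: {share:.0%} of expert layers on CPU")
```

Under that assumption, every value from 40 upward puts all expert layers on the CPU, which is the same placement as -cmoe and would explain the identical ~54 tok/s and ~122 W readings.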


Context Length Is Not the Issue

One commonly held belief is that -cmoe/-ncmoe help at long context by keeping the right experts warm. The PPB data refutes this:

| Flag | 8K tok/s | 16K tok/s | 32K tok/s | 130K tok/s |
|---|---|---|---|---|
| Baseline | 203 | 203 | 203 | 202 |
| -cmoe | 53.9 | 54.2 | 54.2 | 54.2 |
| -ncmoe 20 | 85.9 | 86.6 | 86.7 | 86.5 |

Throughput is essentially flat across all four context lengths for every configuration. The RTX 5090's 1.79 TB/s of memory bandwidth handles extended KV caches without meaningful cost in these runs, and the full expert pool stays in VRAM without any CPU-offload flags.
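One way to put a number on "flat" is the spread between the fastest and slowest context length, relative to the mean. A short Python sketch using the values from the table above (the helper name `relative_spread` is illustrative):

```python
# Throughput (tok/s) per context length, copied from the table above.
by_context = {
    "baseline": {"8K": 203, "16K": 203, "32K": 203, "130K": 202},
    "-cmoe": {"8K": 53.9, "16K": 54.2, "32K": 54.2, "130K": 54.2},
    "-ncmoe 20": {"8K": 85.9, "16K": 86.6, "32K": 86.7, "130K": 86.5},
}


def relative_spread(values):
    """(max - min) / mean of the values, in percent."""
    values = list(values)
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean * 100.0


spread = {flag: relative_spread(row.values()) for flag, row in by_context.items()}

for flag, pct in spread.items():
    print(f"{flag:>10}: {pct:.2f}% spread across context lengths")
```

Each configuration varies by less than 1% across the four context lengths, far smaller than the gap between configurations.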


When Would -cmoe / -ncmoe Actually Help?

These flags were designed for a specific scenario: a MoE model whose weights don't fit in GPU VRAM. Expert weights make up most of a MoE model's size, but only a few experts are active per token, so keeping some or all of them in CPU RAM frees VRAM for the attention weights and KV cache while the model still runs. In that situation, -ncmoe N lets you keep as many expert layers on the GPU as will fit.

On an RTX 5090 with MXFP4_MOE, you're not in that situation. The model fits. The flags solve a problem you don't have and introduce overhead you can't afford. The correct use case for -ncmoe on this GPU would be running a larger quantization that exceeds VRAM, which does not apply here.


A Note on How I Got This Wrong

An earlier version of this article concluded that -ncmoe 20 was a "61% speedup" and recommended it as the default. That was wrong, and worth explaining.

The initial query filtered results to runs after May 9, 2026 — which excluded the April 2026 baseline runs entirely. Within the flag-only dataset, ncmoe=20 at 86 tok/s was the fastest option, so the analysis looked internally consistent. But it was missing the baseline that makes the comparison meaningful.

The corrected query returns all 88 rows for this model/GPU combination. The baseline at 203 tok/s shows up in the April data (llm_flags: null). When you include it, the conclusion flips entirely: no flags is the right call.

This is a good reminder that benchmark analysis is only as complete as the dataset it draws from. The PPB infrastructure correctly stores all the data — the error was in the query, not the data.
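The failure mode is easy to reproduce. The Python sketch below uses a handful of hypothetical rows: only `llm_flags` is a field name taken from the dataset as described above, while `run_date`, `tok_s`, the exact days, and the helper functions are made up for illustration.

```python
from datetime import date

# Hypothetical rows shaped like the PPB results. Only `llm_flags` is a real
# field name from the article; the other keys and the exact days are invented.
rows = [
    {"run_date": date(2026, 4, 20), "llm_flags": None, "tok_s": 203.4},
    {"run_date": date(2026, 5, 12), "llm_flags": "-cmoe", "tok_s": 53.9},
    {"run_date": date(2026, 5, 12), "llm_flags": "-ncmoe 20", "tok_s": 86.3},
    {"run_date": date(2026, 5, 12), "llm_flags": "-ncmoe 30", "tok_s": 66.7},
]


def fastest(candidates):
    """Return the row with the highest throughput."""
    return max(candidates, key=lambda r: r["tok_s"])


def has_baseline(candidates):
    """A flag comparison is only meaningful if a no-flag run is present."""
    return any(r["llm_flags"] is None for r in candidates)


# The flawed query: a date filter that silently drops the April baseline.
after_may_9 = [r for r in rows if r["run_date"] > date(2026, 5, 9)]

print("filtered:", fastest(after_may_9)["llm_flags"], has_baseline(after_may_9))
print("all rows:", fastest(rows)["llm_flags"], has_baseline(rows))
```

A guard like `has_baseline` before ranking configurations would have flagged the filtered result as incomplete.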


The Recommended Config

Based on the complete PPB dataset, the optimal single-user setup on an RTX 5090 is:

llama-server \
  -m ./Qwen3.6-35B-A3B-MXFP4_MOE.gguf \
  -c 131072 \
  -ngl 99 \
  -fa on \
  --parallel 1 \
  --host 0.0.0.0 --port 8080

No -cmoe. No -ncmoe. Just the model, full GPU offload, flash attention, and a generous context window. You get 203 tok/s at 130K context — and no flag tuning will improve on that on this hardware.


Summary

| Goal | Recommendation |
|---|---|
| Maximum speed, single user | No -cmoe or -ncmoe |
| If forced to use a flag | -ncmoe 20 (58% slower than no flags) |
| -cmoe or -ncmoe ≥ 40 | 74% slower than baseline; avoid |

The data is clear: on an RTX 5090 where MXFP4_MOE fits entirely in VRAM, the MoE CPU-offload flags are uniformly detrimental. Leave them off.


Data source: Poor Paul's Benchmark — 88 runs of unsloth/Qwen3.6-35B-A3B-MXFP4_MOE on RTX 5090, submitted to HuggingFace paulplee/ppb-results. Benchmark methodology: llama-server runner, ShareGPT-V3 dataset, CUDA 13.1, RTX 5090 31.8 GB.