
Gemma 4 Benchmarked: Google's E2B and E4B MoE Models on RTX 5090

Tags: gemma4, google, moe, extreme-moe, rtx-5090, efficiency

Google's Gemma 4 lineup takes the mixture-of-experts (MoE) concept to an extreme: the E2B variant activates only 2B parameters per token despite having many more in total, and the E4B doubles that to 4B active. For inference, this means you get roughly the per-token compute of a tiny model with the trained capacity of a much larger one, at the cost of a memory footprint set by the total parameter count rather than the active one.

We ran both Gemma-4-E2B-it and Gemma-4-E4B-it on the RTX 5090. Here's the data.

Throughput Results

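| Model          | Quant  | Users | Peak tok/s | Avg TTFT |
|----------------|--------|-------|------------|----------|
| Gemma-4-E2B-it | Q4_K_M | 32    | ~420       | 180 ms   |
| Gemma-4-E2B-it | Q8_0   | 16    | ~310       | 140 ms   |
| Gemma-4-E4B-it | Q4_K_M | 32    | ~280       | 210 ms   |
| Gemma-4-E4B-it | Q8_0   | 16    | ~195       | 165 ms   |
| Qwen3.5-2B     | Q4_K_M | 32    | ~650*      | 150 ms   |
| Qwen3.5-4B     | Q4_K_M | 32    | ~480*      | 175 ms   |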

*For comparison: these are dense models whose total parameter count matches the Gemma models' active parameter count.
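To make the gap between the MoE and dense rows concrete, here is a minimal sketch that turns the Q4_K_M numbers from the table into per-user and relative throughput. It assumes "Peak tok/s" is aggregate throughput shared evenly across the concurrent users, which the table does not state explicitly; the function names are illustrative.

```python
# Peak throughput (tok/s) at Q4_K_M with 32 concurrent users,
# copied from the table above.
PEAK_TOK_S = {
    "Gemma-4-E2B-it": 420,
    "Gemma-4-E4B-it": 280,
    "Qwen3.5-2B": 650,
    "Qwen3.5-4B": 480,
}
USERS = 32


def per_user(model: str) -> float:
    """Average tok/s per user, ASSUMING peak throughput is shared evenly."""
    return PEAK_TOK_S[model] / USERS


def relative(model: str, baseline: str) -> float:
    """Throughput of `model` as a fraction of `baseline`."""
    return PEAK_TOK_S[model] / PEAK_TOK_S[baseline]


for moe, dense in [("Gemma-4-E2B-it", "Qwen3.5-2B"),
                   ("Gemma-4-E4B-it", "Qwen3.5-4B")]:
    print(f"{moe}: {per_user(moe):.1f} tok/s per user, "
          f"{relative(moe, dense):.0%} of {dense}")
```

On these numbers, each Gemma variant reaches roughly 60 to 65 percent of the throughput of the dense model at the same active parameter count.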

The Active-vs-Total Parameter Distinction

Gemma-4-E2B is not a 2B parameter model. It's a much larger model that activates 2B parameters per token. This distinction matters because:

  • VRAM usage: Gemma-4-E2B uses more VRAM than a true 2B model, because all the MoE expert weights must be loaded into memory even if only a few are active per token
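  • Throughput: Lower than a true 2B model (~420 vs ~650 tok/s at Q4_K_M in the table above), because every expert's weights must stay resident and routing adds overhead, even though only about 2B parameters are computed per token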
  • Quality: Higher than a true 2B model because the larger pool of experts means broader knowledge

In practice: Gemma-4-E2B does dense-2B-class compute per token, but uses dense-10B+ VRAM and delivers dense-10B+ quality. Its measured throughput still lands below a dense 2B model, as the throughput table shows. This is the MoE value proposition.
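The split between "memory scales with total parameters" and "compute scales with active parameters" can be sketched with back-of-the-envelope arithmetic. The parameter counts below are hypothetical placeholders, not published figures for Gemma 4, and the Q4_K_M bits-per-weight value is an approximate average that varies by model.

```python
# Approximate bits per weight for two GGUF quantization types.
# Q8_0 stores 32 int8 weights plus one fp16 scale, i.e. 8.5 bits/weight.
# The Q4_K_M figure is an approximate average and varies by model.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def weight_gb(total_params_b: float, quant: str) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes); excludes KV cache."""
    return total_params_b * BITS_PER_WEIGHT[quant] / 8


def flops_per_token(active_params_b: float) -> float:
    """Rough forward-pass cost: about 2 FLOPs per active parameter."""
    return 2 * active_params_b * 1e9


# HYPOTHETICAL sizes for illustration only, not Gemma 4's real counts.
TOTAL_B, ACTIVE_B = 12.0, 2.0

print(f"MoE weights at Q4_K_M:      {weight_gb(TOTAL_B, 'Q4_K_M'):.1f} GB")
print(f"Dense 2B weights at Q4_K_M: {weight_gb(ACTIVE_B, 'Q4_K_M'):.1f} GB")
print(f"FLOPs per token (either):   {flops_per_token(ACTIVE_B):.1e}")
```

Under these assumptions the two models cost about the same compute per token, while the MoE model needs several times the weight memory.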

VRAM Footprint

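| Model          | Quant  | Approx VRAM | Fits in 16 GB? |
|----------------|--------|-------------|----------------|
| Gemma-4-E2B-it | Q4_K_M | ~8 GB       | ✅ Yes         |
| Gemma-4-E2B-it | Q8_0   | ~15 GB      | ✅ Marginal    |
| Gemma-4-E4B-it | Q4_K_M | ~14 GB      | ✅ Marginal    |
| Gemma-4-E4B-it | Q8_0   | ~27 GB      | ❌ No          |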

The E2B at Q4_K_M fits on an RTX 5060 Ti (16 GB) and delivers quality well above what you'd expect from an 8 GB model.
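The fit column can be reproduced for other cards with a small check like the one below. The VRAM figures come from the table; the 2 GB headroom for KV cache and runtime buffers is an assumed threshold chosen so the 16 GB results match the table, not a measured value.

```python
# Approximate VRAM needs in GB, copied from the table above.
APPROX_VRAM_GB = {
    ("Gemma-4-E2B-it", "Q4_K_M"): 8,
    ("Gemma-4-E2B-it", "Q8_0"): 15,
    ("Gemma-4-E4B-it", "Q4_K_M"): 14,
    ("Gemma-4-E4B-it", "Q8_0"): 27,
}


def fit_status(need_gb: float, card_gb: float, headroom_gb: float = 2.0) -> str:
    """Classify a model as 'yes', 'marginal', or 'no' for a given card.

    headroom_gb is an ASSUMED reserve for KV cache and runtime buffers.
    """
    if need_gb > card_gb:
        return "no"
    if need_gb + headroom_gb >= card_gb:
        return "marginal"
    return "yes"


for card_gb in (16, 32):
    for (model, quant), need in APPROX_VRAM_GB.items():
        print(f"{card_gb} GB card, {model} {quant}: {fit_status(need, card_gb)}")
```

Long contexts or many concurrent users grow the KV cache, so a "marginal" result may not hold under load.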

The Multi-GPU Question

Unlike Qwen3.6-35B-A3B, Gemma 4 has MoE routing that is optimized for single-GPU inference. Splitting it across two GPUs via tensor split incurs the same PCIe penalty as any other split model, and the routing overhead is amplified because expert selection happens at every layer.

Recommendation: Run Gemma 4 on a single GPU with enough VRAM to hold the full model.
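One way to keep a model on a single GPU is to hide the other devices from the process. The sketch below assumes a llama.cpp `llama-server` binary on the PATH (the article does not name the runtime used) and a hypothetical GGUF file path; it only builds the command and environment and does not launch anything.

```python
import os
import shlex


def single_gpu_launch(model_path: str, gpu_index: int = 0):
    """Build a command and environment that pin inference to one GPU.

    ASSUMPTIONS: llama.cpp's llama-server is the runtime, and model_path
    points at a GGUF file you have already downloaded.
    """
    # Expose only the chosen device to the CUDA runtime.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    cmd = [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",             # offload all layers to the GPU
        "--split-mode", "none",   # do not split the model across GPUs
    ]
    return cmd, env


# Hypothetical file name, for illustration only.
cmd, env = single_gpu_launch("models/gemma-4-E2B-it-Q4_K_M.gguf")
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(cmd))
# To start the server: subprocess.run(cmd, env=env, check=True)
```

Setting `CUDA_VISIBLE_DEVICES` is the part that works regardless of runtime; the flags should be checked against the version of the server you actually run.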

Practical Recommendation

If you have an RTX 5060 Ti (16 GB) and want more quality than a true 2B model:

  • Gemma-4-E2B-Q4_K_M fits easily and provides noticeably better reasoning than Qwen3.5-2B

If you have an RTX 5090 (32 GB) and want the best quality-per-throughput:

  • Gemma-4-E4B-Q8_0 delivers near-lossless quality at reasonable throughput

The Extreme-MoE architecture genuinely delivers on its promise: quality that exceeds what the active parameter count would suggest. The trade-off is VRAM usage that also exceeds what the active parameter count would suggest.