Articles

Benchmark reports, analysis posts, and project updates.

The cmoe Trap: What Actually Happens When You Enable MoE Flags on RTX 5090

Enabling -cmoe or -ncmoe on Qwen3.6-35B-A3B-MXFP4_MOE tanks throughput by 57–74% on the RTX 5090. The PPB dataset, including baseline runs, shows the correct recommendation: use no flags at all.

analysis, qwen, rtx-5090, moe, llama-cpp, flags

RTX 5060 Ti: The New Throughput King for Small Models

The RTX 5060 Ti hits 768 tok/s on Qwen3.5-0.8B — outpacing the dual RTX 4060 Ti and matching last-gen high-end cards. But 16 GB of VRAM is a real ceiling.

rtx-5060-ti, qwen, throughput, budget-gpu, blackwell

RTX 5060 Ti vs RTX 5090: The Real Price-Performance Ratio for LLMs

The RTX 5090 costs 6× more than the RTX 5060 Ti. Is it 6× better for local LLM inference? Our benchmark data gives a definitive answer — and it depends entirely on your model size.

rtx-5060-ti, rtx-5090, price-performance, comparison, blackwell

Qwen3.6-35B-A3B Across Five Machines: The MoE Architecture Test

We ran Qwen3.6-35B-A3B on every machine in our fleet — GB10, Mac mini M4 Pro, dual RTX 4060 Ti, RTX 5060 Ti, and RTX 5090. The results reveal something counterintuitive about Mixture-of-Experts inference.

qwen3.6, moe, cross-machine, architecture, gb10, rtx-5090, m4-pro

The Quantization Ladder: Every Quant Level Benchmarked on RTX 5090 for Qwen3.5-27B

From BF16 (near-lossless) to IQ2_XS (2-bit imatrix), we benchmarked every practical quantization level for Qwen3.5-27B on the RTX 5090. The throughput curve has a surprise: Q4 beats Q5 and Q6 in real-world server workloads.

quantization, qwen3.5-27b, rtx-5090, quant-comparison, q4-vs-q8

Mac mini M4 Pro: 309 tok/s on 2B Models, and the Limits of Unified Memory

The Mac mini M4 Pro delivers 64 GB of unified memory in a $1,400 package. Our 6,854-row benchmark dataset reveals where it excels and where its shared-bandwidth architecture becomes a ceiling.

mac-mini, m4-pro, apple-silicon, unified-memory, efficiency

Feature Wishlist: What Would Make PPB + ppb-mcp Actually Great?

An inside look at the software gaps in the PPB benchmarking stack — from the runner itself, through the dataset pipeline, to the ppb-mcp query interface. Not about missing data, but about missing features that could be built right now.

meta, roadmap, ppb, ppb-mcp, benchmarking, feature-wishlist

gpt-oss-20b at 1,491 tok/s: What OpenAI's Open Weights Can Do Locally

OpenAI's first open-weight language model since GPT-2 hits 1,491 tok/s on RTX 5090, the highest throughput we've ever recorded for a frontier-quality model. Here's what that actually means in practice.

gpt-oss, openai, rtx-5090, throughput, frontier-models

Gemma 4 Benchmarked: Google's E2B and E4B MoE Models on RTX 5090

Gemma 4 uses an Extreme-MoE architecture with only 2B or 4B active parameters at inference time. We benchmarked both variants on RTX 5090 and compared them against same-activation-size dense models.

gemma4, google, moe, extreme-moe, rtx-5090, efficiency

The GB10 Grace Blackwell: 120 GB of Unified Memory for Local LLMs

NVIDIA's GB10 Grace Blackwell Superchip in the DGX Spark puts 120+ GB of LPDDR5X unified memory at your disposal. We benchmarked it against the RTX 5090 and Mac mini to find out when the memory advantage actually matters.

gb10, grace-blackwell, unified-memory, dgx-spark, throughput, large-models

Dual RTX 4060 Ti: Does 2×16 GB Actually Help for LLMs?

We benchmarked a dual RTX 4060 Ti setup (32 GB combined VRAM via tensor split) against single-GPU alternatives. The results are more nuanced than 'more VRAM = better'.

rtx-4060-ti, dual-gpu, tensor-split, multi-gpu, vram

DeepSeek-R1-Distill at 469 tok/s: Reasoning Models Are Finally Fast

DeepSeek-R1-Distill-Qwen-32B hits 469 tok/s on RTX 5090 at Q2_K — which means even thinking tokens come fast enough to not be annoying. We compare it against non-reasoning Qwen models at the same size.

deepseek, reasoning-models, chain-of-thought, rtx-5090, quantization

Context Length Scaling: How Throughput Holds Up from 2K to 130K Tokens

One of the underrated dimensions in local LLM benchmarking is context window performance. We charted throughput degradation across 2K, 8K, 32K, 65K, and 130K tokens on every machine — and found surprisingly little penalty on some hardware.

context-length, throughput-scaling, long-context, kv-cache, gb10, rtx-5090

Unsloth Dynamic vs Standard GGUF: When Mixed-Precision Quantization Pays Off

Unsloth Dynamic (UD) quants promise better quality at low bit widths through mixed-precision. We benchmarked every UD variant against its standard counterpart on the RTX 5090.

analysis, qwen, rtx-5090, quantization, unsloth-dynamic

RTX 5090 Quantization Guide: Every Qwen3.5 Variant Benchmarked

22 quantization formats across 5 model sizes on the RTX 5090 — a comprehensive guide to finding the right quant for your workload.

analysis, qwen, rtx-5090, quantization, guide

Heavy Lifting: Qwen3.5-9B and 27B — Where Architecture Really Matters

At 9B and 27B parameters, hardware choices stop being preferences and start being constraints. Here's how three platforms cope with serious models.

analysis, qwen, rtx-5090, gb10, m4-pro, architecture-comparison

Three GPUs, One Model: Qwen3.5-0.8B Across RTX 5090, GB10, and M4 Pro

A head-to-head comparison of Qwen3.5-0.8B inference performance across three architectures — NVIDIA RTX 5090, NVIDIA GB10, and Apple M4 Pro.

analysis, qwen, rtx-5090, gb10, m4-pro, architecture-comparison

Scaling Up: Qwen3.5-2B and 4B Across Three Architectures

How do the RTX 5090, GB10, and M4 Pro handle Qwen3.5 at 2B and 4B parameters? The answer depends on more than just raw speed.

analysis, qwen, rtx-5090, gb10, m4-pro, architecture-comparison

First Benchmark Roundup: Qwen3.5-0.8B on RTX 5090 vs M4 Pro

Our first look at the data — comparing inference throughput and latency for Qwen3.5-0.8B Q8_0 on an NVIDIA RTX 5090 and Apple M4 Pro.

analysis, qwen, rtx-5090, m4-pro

Welcome to poorpaul.dev

Introducing the official companion site for Poor Paul's Benchmark — the easiest way to explore open LLM inference benchmarks.

announcement, open-source