Feature Wishlist: What Would Make PPB + ppb-mcp Actually Great?

The PPB ecosystem has a solid foundation: a reproducible benchmark runner, a structured dataset, and an MCP interface that makes the data queryable by AI assistants. But using it daily as an actual user exposes gaps that have nothing to do with how many GPUs we've tested. These are pure software problems — features that could be built, and would meaningfully improve the experience.

This is that list.

The Stack, Briefly

ppb (ppb.py): benchmark runner that launches llama-server, varies concurrency, measures tok/s and TTFT, and writes JSONL
HF Dataset (paulplee/ppb-results): 39,513 rows, 85 columns, single Parquet file
ppb-mcp: MCP server that exposes tools for querying the dataset

The gaps below are organized by layer.

1. ppb Benchmark Runner

Latency Percentiles, Not Just Averages

Right now, avg_ttft_ms is a single average. Under load, latency distributions are almost always skewed — a few slow requests drag the average up while P50 might be fast. Without P50/P90/P99 breakdowns, you can't tell if a machine has consistent low latency or a bimodal distribution where most requests are fast but 10% are glacial.

What to build: Record and store p50_ttft_ms, p90_ttft_ms, p99_ttft_ms alongside the existing average.
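A minimal sketch of how these fields could be derived from per-request TTFT samples, using the nearest-rank method. The function names, and the assumption that the runner keeps a list of per-request samples, are illustrative and not taken from ppb.py.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def ttft_summary(ttft_ms):
    """Return the existing average plus the proposed percentile fields."""
    return {
        "avg_ttft_ms": sum(ttft_ms) / len(ttft_ms),
        "p50_ttft_ms": percentile(ttft_ms, 50),
        "p90_ttft_ms": percentile(ttft_ms, 90),
        "p99_ttft_ms": percentile(ttft_ms, 99),
    }

# Eight fast requests and two glacial ones (illustrative numbers).
summary = ttft_summary([100] * 8 + [2000] * 2)
```

With this input the average is 480 ms, a latency no request actually saw, while P50 (100 ms) and P90 (2000 ms) expose the bimodal split directly.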

GPU Power Draw During Inference

We measure gpu_power_limit_w (the configured cap) but not actual power draw during the benchmark run. The result: we can't compute tokens per watt — the efficiency metric that should matter as much as raw throughput.

What to build: Sample nvidia-smi --query-gpu=power.draw at 1-second intervals during the run, store avg_power_draw_w and peak_power_draw_w in the result.
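One way the sampling could look, as a sketch. nvidia-smi requires a --format option alongside --query-gpu, so the command below adds csv,noheader,nounits. Summing readings across GPUs, the helper names, and the shared samples list are assumptions made for illustration.

```python
import subprocess

NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=power.draw",
    "--format=csv,noheader,nounits",
]

def parse_power_output(text):
    """Sum the per-GPU readings (one line per GPU) into total watts for one sample."""
    watts = []
    for line in text.splitlines():
        try:
            watts.append(float(line.strip()))
        except ValueError:
            continue  # skip blank lines and non-numeric readings
    return sum(watts) if watts else None

def sample_power(stop_event, samples, interval_s=1.0):
    """Poll nvidia-smi until stop_event is set, appending total watts to samples."""
    while not stop_event.is_set():
        result = subprocess.run(
            NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
        )
        total = parse_power_output(result.stdout)
        if total is not None:
            samples.append(total)
        stop_event.wait(interval_s)

def summarize_power(samples):
    """Produce the two proposed result fields from the collected samples."""
    return {
        "avg_power_draw_w": sum(samples) / len(samples),
        "peak_power_draw_w": max(samples),
    }
```

The sampler is meant to run in a background threading.Thread with a threading.Event as stop_event, started just before the benchmark and stopped right after it, so that the samples cover only the inference window.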

Warmup Phase

The first few iterations of any llama-server run are slower while weights are paged in and GPU caches warm up. These samples skew the throughput average downward. The runner currently treats all samples equally.

What to build: Add a configurable warmup period (e.g., 10% of total requests) whose results are recorded separately and excluded from the primary throughput calculation.
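A sketch of the split, assuming the runner holds per-request throughput samples in completion order. The function name and the warmup_fraction parameter are hypothetical.

```python
import math

def split_warmup(samples, warmup_fraction=0.10):
    """Split samples (in completion order) into (warmup, measured) lists."""
    if not 0 <= warmup_fraction < 1:
        raise ValueError("warmup_fraction must be in [0, 1)")
    n_warmup = math.floor(len(samples) * warmup_fraction)
    return samples[:n_warmup], samples[n_warmup:]

# One slow first request followed by nine steady ones (illustrative numbers).
samples = [20.0] + [40.0] * 9
warmup, measured = split_warmup(samples)

mean_all = sum(samples) / len(samples)          # includes the cold start
mean_measured = sum(measured) / len(measured)   # primary throughput figure
```

Keeping the warmup samples in their own list, rather than discarding them, preserves the cold-start behavior for anyone who wants to study it.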

Flaky Run Detection

Some runs produce throughput numbers that are outliers relative to the same config run at other times — caused by thermal throttling, background processes, or transient cache pressure. These runs are silently included in the dataset, adding noise to comparisons.

What to build: Compare each completed run's throughput against historical runs with the same run_fingerprint. Flag runs where throughput deviates more than 2σ from the historical median as quality_flag = "suspect" in the JSONL output.
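The check could be sketched as follows. It uses the historical median as the center and the sample standard deviation of the same history as σ. The min_history guard and the "ok" value are assumptions; only "suspect" comes from the proposal above.

```python
import statistics

def quality_flag(throughput, history, n_sigma=2.0, min_history=5):
    """Flag a run whose throughput is far from earlier runs with the same fingerprint."""
    if len(history) < min_history:
        return "ok"  # too little history to judge; assumption, not a ppb rule
    median = statistics.median(history)
    sigma = statistics.stdev(history)
    if sigma == 0:
        return "ok" if throughput == median else "suspect"
    deviation = abs(throughput - median)
    return "suspect" if deviation > n_sigma * sigma else "ok"
```

Using the median as the center keeps one earlier bad run from shifting the reference point. The standard deviation is still sensitive to outliers, so a robust spread such as the median absolute deviation may be worth considering if the history itself is noisy.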

Inter-Token Latency (ITL) Distribution

We record avg_itl_ms but not its variance. For streaming applications, a stable ITL matters more than raw throughput — spiky ITL creates a "stuttering" user experience even when aggregate tok/s looks fine.

What to build: Add p90_itl_ms and p99_itl_ms to the result schema.
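A sketch of how these fields might be computed from token arrival timestamps for a single streamed response. It assumes the runner can record a timestamp per token; the helper names are hypothetical.

```python
import math

def inter_token_gaps_ms(token_times_s):
    """Gaps between consecutive token arrival times, in milliseconds."""
    return [(b - a) * 1000 for a, b in zip(token_times_s, token_times_s[1:])]

def nearest_rank(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def itl_summary(token_times_s):
    """Existing average plus the proposed tail fields."""
    gaps = inter_token_gaps_ms(token_times_s)
    return {
        "avg_itl_ms": sum(gaps) / len(gaps),
        "p90_itl_ms": nearest_rank(gaps, 90),
        "p99_itl_ms": nearest_rank(gaps, 99),
    }

# Four steady 250 ms gaps followed by a one-second stall (illustrative).
itl = itl_summary([0.0, 0.25, 0.5, 0.75, 1.0, 2.0])
```

Here the average is 400 ms, but the tail fields report the 1000 ms stall that a streaming user would actually notice.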

2. Data Pipeline & Schema

Schema Validation on Ingest

Currently, the flattener silently drops malformed JSONL rows and logs nothing. A run with a schema bug (wrong field names, wrong types, missing required fields) fails silently — the data is just gone.

What to build: A validation step that reports which rows were dropped and why, with a structured error log per source file. Ideally a dry-run mode that reports all validation errors without writing output.
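A minimal sketch of such a validation step. The required-field table is a hypothetical subset chosen for illustration (the real schema has far more columns), and the ingest function stands in for whatever the flattener does with valid rows.

```python
import json

# Hypothetical required fields and their expected types.
REQUIRED_FIELDS = {
    "machine_id": str,
    "throughput_tok_s": (int, float),
    "avg_ttft_ms": (int, float),
}

def validate_jsonl(lines, source):
    """Return (valid_rows, errors) so that no row disappears without a reason."""
    valid, errors = [], []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are not rows
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            reasons = [f"invalid JSON: {exc.msg}"]
            errors.append({"source": source, "line": line_no, "reasons": reasons})
            continue
        if not isinstance(row, dict):
            reasons = ["row is not a JSON object"]
            errors.append({"source": source, "line": line_no, "reasons": reasons})
            continue
        reasons = []
        for field, expected in REQUIRED_FIELDS.items():
            if field not in row:
                reasons.append(f"missing field: {field}")
            elif isinstance(row[field], bool) or not isinstance(row[field], expected):
                reasons.append(f"wrong type for {field}: {type(row[field]).__name__}")
        if reasons:
            errors.append({"source": source, "line": line_no, "reasons": reasons})
        else:
            valid.append(row)
    return valid, errors

def ingest(lines, source, dry_run=False):
    """In dry-run mode, report every error but hand back nothing to write."""
    valid, errors = validate_jsonl(lines, source)
    return ([] if dry_run else valid), errors
```

Each error entry carries the source file, line number, and all reasons for that row, so a single pass lists every problem instead of stopping at the first.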

Deduplication Transparency

Fingerprint-based deduplication is opaque. When a row is deduplicated, there's no record of which canonical row "won" or how many duplicates were dropped. For debugging data quality issues, this is frustrating.

What to build: A migrate --explain-dupes mode that outputs a report: for each deduplicated group, which fingerprint, how many rows, which machine/timestamp was kept, and why.
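The report could be produced along these lines. The sketch assumes rows carry run_fingerprint, machine_id, and an ISO-formatted timestamp field, and that the newest row in each group is the one kept. Neither the timestamp field name nor that tie-break rule is confirmed by the current pipeline.

```python
from collections import defaultdict

def explain_dupes(rows):
    """Describe every fingerprint group that contains more than one row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["run_fingerprint"]].append(row)

    report = []
    for fingerprint, members in groups.items():
        if len(members) < 2:
            continue  # nothing was deduplicated for this fingerprint
        # Assumed rule: keep the most recent row. ISO timestamps in one
        # consistent format sort correctly as plain strings.
        kept = max(members, key=lambda r: r["timestamp"])
        report.append({
            "run_fingerprint": fingerprint,
            "row_count": len(members),
            "dropped_count": len(members) - 1,
            "kept_machine_id": kept["machine_id"],
            "kept_timestamp": kept["timestamp"],
            "reason": "most recent timestamp in group",
        })
    return report
```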

Confidence Intervals in the Dataset

For model-hardware combinations that have multiple runs, the dataset stores individual rows but provides no aggregate statistics. Every downstream consumer (ppb-mcp, articles, the website) has to recompute averages from raw rows.

What to build: A derived Parquet layer (ppb_results_aggregated.parquet) that pre-computes mean_throughput_tok_s, std_throughput_tok_s, sample_count, and confidence intervals for each (machine_id, model_base, quant, n_ctx, concurrent_users) combination. This becomes the fast query layer; the raw rows remain for full access.
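A sketch of the aggregation using only the standard library. The interval is a normal-approximation 95% interval (z = 1.96), which is an assumption; with only a handful of runs per group, a t-based interval would be wider and more honest. The ci95_low and ci95_high names are hypothetical, and writing the result to Parquet is left out.

```python
import math
import statistics
from collections import defaultdict

GROUP_KEY = ("machine_id", "model_base", "quant", "n_ctx", "concurrent_users")

def aggregate(rows, z=1.96):
    """Pre-compute per-configuration statistics from raw result rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in GROUP_KEY)].append(row["throughput_tok_s"])

    aggregated = []
    for key, values in groups.items():
        n = len(values)
        mean = statistics.fmean(values)
        record = dict(zip(GROUP_KEY, key))
        record.update({
            "mean_throughput_tok_s": mean,
            "std_throughput_tok_s": None,
            "sample_count": n,
            "ci95_low": None,
            "ci95_high": None,
        })
        if n > 1:  # spread is undefined for a single run
            std = statistics.stdev(values)
            half_width = z * std / math.sqrt(n)
            record["std_throughput_tok_s"] = std
            record["ci95_low"] = mean - half_width
            record["ci95_high"] = mean + half_width
        aggregated.append(record)
    return aggregated
```

Leaving the spread fields empty for single-run groups avoids reporting a zero-width interval that would look like perfect certainty.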

3. ppb-mcp Query Interface

This is where the largest feature gap lives. The MCP tools give AI assistants access to the data, but the tools themselves are designed around what the old schema knew about. Several high-value queries are impossible today.

Budget-Aware Recommendation Tool

The most common homelab question is: "What GPU should I buy for X use case at $Y budget?" This requires mapping hardware to current market prices — something no ppb-mcp tool currently does.

What to build: A recommend_hardware tool that accepts budget_usd, target_model, target_quant, concurrent_users, and optionally context_length. It looks up GPU prices (a small maintained JSON file of MSRPs or typical street prices is sufficient), filters the dataset to configs that fit the budget, and ranks by throughput.
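The core of such a tool might look like the sketch below. The parameter names follow the list above; the gpu_name and gpu_count row fields, the prices_usd mapping (standing in for the maintained JSON file), and the total_price_usd output field are assumptions.

```python
def recommend_hardware(rows, prices_usd, budget_usd, target_model, target_quant,
                       concurrent_users, context_length=None):
    """Return matching configs within budget, fastest first."""
    candidates = []
    for row in rows:
        unit_price = prices_usd.get(row["gpu_name"])
        if unit_price is None:
            continue  # no price on file, so the budget cannot be checked
        total_price = unit_price * row.get("gpu_count", 1)
        if total_price > budget_usd:
            continue
        if row["model_base"] != target_model or row["quant"] != target_quant:
            continue
        if row["concurrent_users"] != concurrent_users:
            continue
        if context_length is not None and row["n_ctx"] < context_length:
            continue
        candidates.append({**row, "total_price_usd": total_price})
    return sorted(candidates, key=lambda r: r["throughput_tok_s"], reverse=True)
```

Multiplying by gpu_count matters for multi-GPU rows: a dual-card config should be compared against the budget at the price of both cards.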

Tokens-Per-Watt Ranking

Once power draw data exists (from the runner fix above), this becomes straightforward. Right now it's impossible.

What to build: A rank_by_efficiency tool that returns hardware sorted by throughput_tok_s / avg_power_draw_w for a given model and quant. This is the metric that should guide always-on homelab deployments.
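A sketch of the ranking, assuming rows already carry the avg_power_draw_w field proposed for the runner. The tok_s_per_watt output name is hypothetical.

```python
def rank_by_efficiency(rows, model_base, quant):
    """Sort matching rows by throughput per watt, most efficient first."""
    ranked = []
    for row in rows:
        if row["model_base"] != model_base or row["quant"] != quant:
            continue
        power = row.get("avg_power_draw_w")
        if not power:
            continue  # rows recorded before power sampling existed
        efficiency = row["throughput_tok_s"] / power
        ranked.append({**row, "tok_s_per_watt": efficiency})
    ranked.sort(key=lambda r: r["tok_s_per_watt"], reverse=True)
    return ranked
```

Since a watt is one joule per second, tok/s divided by watts is the same quantity as tokens per joule, which makes the number easy to turn into an energy cost per million tokens.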

New Schema Column Queries

The v0.9.0 schema added gpu_compute_capability, gpu_pcie_gen, gpu_pcie_width, gpu_power_limit_w, and unified_memory. None of these are exposed in any ppb-mcp tool today. The questions they enable:

"Does PCIe 5.0 vs 4.0 affect throughput for tensor-split configurations?"
"Which machines have unified memory architecture?"
"What compute capability is needed to run MXFP4 quantizations?"

What to build: Extend get_hardware_configs and compare_hardware tools to accept and return these columns. Add a filter_by_architecture helper that lets AI assistants reason about hardware tiers.

Percentile Queries (P90, Not Just Peak)

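Every tool currently returns "peak throughput": the highest single result for a configuration. Peak is useful for theoretical ceiling calculations, but for capacity planning you want a P90-style floor: the throughput that 90% of runs meet or exceed. Because higher throughput is better, that floor is the 10th percentile of the throughput distribution, not the 90th, which is the mirror image of how P90 works for latency.

What to build: Add a percentile parameter to the existing throughput query tools, and document its direction explicitly. The convention that matches the P90 framing is that percentile=90 returns the throughput met or exceeded by 90% of runs for that configuration (statistically, the 10th percentile) rather than the max.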

Date-Range Filtering

The dataset spans months of runs. A model that was benchmarked on an older llama.cpp version might show different performance than a recent run. Currently, there's no way for an AI assistant to ask "give me only results from the last 30 days."

What to build: Add run_after and run_before ISO date parameters to the core query tools.
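A sketch of the filter, assuming each row has an ISO 8601 timestamp field (the field name is hypothetical). It treats run_after as inclusive and run_before as exclusive, and interprets timestamps without an offset as UTC. Both are design assumptions.

```python
from datetime import datetime, timezone

def parse_iso(value):
    """Parse an ISO date or datetime; naive values are assumed to be UTC."""
    # Note: before Python 3.11, fromisoformat does not accept a trailing "Z".
    parsed = datetime.fromisoformat(value)
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)

def filter_by_date(rows, run_after=None, run_before=None):
    """Keep rows with run_after <= timestamp < run_before."""
    after = parse_iso(run_after) if run_after else None
    before = parse_iso(run_before) if run_before else None
    kept = []
    for row in rows:
        ts = parse_iso(row["timestamp"])
        if after is not None and ts < after:
            continue
        if before is not None and ts >= before:
            continue
        kept.append(row)
    return kept
```

A half-open range lets callers chain consecutive windows, such as one per month, without counting a run on the boundary twice.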

"Explain This Result" Context Tool

When ppb-mcp returns a surprising number — say, dual RTX 4060 Ti underperforming a single RTX 5060 Ti — there's no way to ask why. The AI assistant has to guess based on general knowledge.

What to build: An explain_result tool that, given a specific result (machine + model + quant + concurrency), returns structured context: VRAM utilization, PCIe config, whether tensor split was used, split_mode, the run's TTFT vs. throughput ratio, and how this result compares to the P50 for that machine.

4. Website (poorpaul.dev)

Interactive Comparison Table

All article data is static. A reader who wants to compare "Qwen3.5-9B at Q4_K_M across all machines" has to read multiple articles and mentally aggregate. The data to answer that question is in the HF dataset.

What to build: A client-side filterable table component that queries the HF Parquet file directly via DuckDB-WASM. Filters: model family, quant, concurrency, machine. Columns: machine, tok/s, TTFT, VRAM used. Sortable. This is a weekend project that would make the site dramatically more useful.

Auto-Updating Leaderboard

The top throughput numbers on the site are baked into article text. When a new benchmark run beats an existing record, nothing updates automatically.

What to build: A leaderboard page that reads from the HF dataset at build time (or via incremental static regeneration, ISR), showing the all-time top results per model family, with machine, quant, and date. Automatically reflects new data on the next build.

"Match My Hardware" Page

A reader with an RTX 4070 Super (12 GB VRAM) can't use this site efficiently today: their exact hardware isn't in the dataset, and no tool helps them extrapolate.

What to build: A page where a user enters their GPU model (dropdown from a maintained list), VRAM amount, and PCIe generation. The tool finds the closest matching hardware in the dataset by compute capability + VRAM tier, and returns a table of expected throughput for common models — with appropriate disclaimers about interpolation confidence.
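The matching step could be sketched as a nearest-neighbor lookup. This version compares compute capability first and breaks ties on relative VRAM difference. That ordering, the vram_gb field name, and treating compute capability as a plain number are all assumptions; PCIe generation is ignored to keep the example short.

```python
def closest_hardware(user, known):
    """Pick the dataset entry nearest to the user's GPU."""
    def distance(hw):
        cc_gap = abs(hw["gpu_compute_capability"] - user["gpu_compute_capability"])
        vram_gap = abs(hw["vram_gb"] - user["vram_gb"]) / user["vram_gb"]
        # Tuples compare element by element: architecture first, then VRAM.
        return (cc_gap, vram_gap)

    best = min(known, key=distance)
    cc_gap, vram_gap = distance(best)
    return {
        "match": best,
        "same_compute_capability": cc_gap == 0,
        "vram_relative_gap": vram_gap,
    }
```

Returning the gaps alongside the match gives the page something concrete to base its interpolation disclaimer on: an exact architecture match with a small VRAM gap deserves more trust than a match from a different GPU generation.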

Priority Order

If forced to rank these by impact-to-effort ratio:

Aggregated Parquet layer — immediately accelerates every downstream query, low implementation cost
Interactive comparison table on the website — highest user-visible impact, DuckDB-WASM makes it achievable in a single weekend
ppb-mcp percentile queries — changes recommendations from theoretical to practical
ppb-mcp budget-aware recommendation tool — answers the most common homelab question
GPU power draw measurement in the runner — enables the efficiency metric that should drive hardware decisions
Latency percentiles (P50/P90/P99) — makes capacity planning actually possible
Schema validation with error reporting — prevents silent data loss
Auto-updating leaderboard — keeps the site relevant without manual article updates

The stack is genuinely useful today. With these features, it would be hard to compete with.