Performance Analytics

>_ Module 04 — Benchmarks, Dashboards, and Scientific Validation

Four typed benchmark runners feed a three-tier logging system, a modular analysis engine rolls that data into metrics with confidence intervals, and Streamlit dashboards turn the result into something a reviewer can read. The runners are the Module 03 API in anger — the analyzers and dashboards are what makes the numbers trustworthy enough to publish.

Benchmark Runners

Four scripts under benchmarks/, each narrow in scope and each backed by the same shared harness. Every runner is a thin Hydra entry point — parse the composed config, build the engine via the Module 03 API, loop over iterations with the clock-locked profiling helper, and write one CSV row per measurement window. No bespoke logging code in any runner; all of it lives in the harness.
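The shared measurement loop can be sketched in a few lines — this is an illustrative stand-in, not the harness itself: `step` stands in for one engine invocation under the profiling helper, and the warmup tagging mirrors the "warmup is isolated" rule the runners follow.

```python
import time

def run_benchmark(step, iterations=5000, warmup_ratio=0.3):
    """Shape of the shared loop every runner delegates to: warmup iterations
    are tagged so the summary can exclude them, and each iteration produces
    one measurement row (one CSV row per window in the real harness).

    `step` is a hypothetical stand-in for a single engine invocation."""
    warmup = int(iterations * warmup_ratio)
    rows = []
    for i in range(iterations):
        t0 = time.perf_counter()
        step()
        latency_us = (time.perf_counter() - t0) * 1e6
        rows.append({"iteration": i, "warmup": i < warmup, "latency_us": latency_us})
    return rows
```

The real runners time with CUDA events or `std::chrono` as described below; `perf_counter` here only marks where the clock-locked helper would sit.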

run_latency.py
Per-Frame Latency
Submit fixed-size frames under locked GPU clocks, record per-frame latency, emit mean / P50 / P95 / P99 / max. Warmup is isolated and excluded from the steady-state summary.
run_throughput.py
MSPS Under Concurrency
Sustained samples-per-second across channel counts. Scaling curves fall out of the same runner — one multi-run sweep produces the whole 1 / 2 / 4 / 8-channel plot.
run_realtime.py
RTF Validation
Stream at a target sample rate for a fixed wall-clock duration. Report RTF, overrun count, and steady-state queue depth. An RTF under 1.0 is necessary but not sufficient — overruns flag buffer-bound regressions.
run_accuracy.py
Spectral SNR
Compare output spectra against a reference implementation across tone, chirp, and noise inputs. Report SNR in dB; failing tones surface as named failures, not aggregate dips.
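The steady-state summary the latency runner emits can be sketched as follows — a minimal nearest-rank version, assuming latencies in microseconds; the function names are illustrative, not the runner's real API.

```python
import statistics

def percentile(sorted_vals, q):
    """Nearest-rank percentile on an already-sorted list (0 < q <= 100)."""
    idx = max(0, min(len(sorted_vals) - 1, round(q / 100 * len(sorted_vals)) - 1))
    return sorted_vals[idx]

def summarize_latency(samples_us, warmup_ratio=0.3):
    """Drop the warmup prefix, then report mean / P50 / P95 / P99 / max
    over the steady-state window only."""
    steady = samples_us[int(len(samples_us) * warmup_ratio):]
    s = sorted(steady)
    return {
        "mean": statistics.fmean(s),
        "p50": percentile(s, 50),
        "p95": percentile(s, 95),
        "p99": percentile(s, 99),
        "max": s[-1],
    }
```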

Timing Methodology

Benchmark numbers are only as good as the clock behind them. The runners use a hybrid timing strategy plus a set of stability knobs that together reduced the coefficient of variation from 40% on an untuned baseline to 18.7% under locked GPU clocks.

Hybrid Timing
Latency runs use CUDA events — ~5000 isolated measurements where event overhead is negligible and event-based timing captures true kernel cost. Realtime runs use CPU std::chrono — at ~10 kFPS, CUDA-event overhead dominates and warps the result. Two tools, two regimes, one informed choice per runner.
Blocking Sync
cudaDeviceScheduleBlockingSync is set globally in the executor. Without it, cudaStreamSynchronize spin-waits on the CPU and picks up OS scheduler noise. With it, the thread blocks cleanly — CV dropped ~26% the moment this flag landed.
Warmup + Outlier Trim
30% warmup ratio (1500 of 5000 iterations for latency) lets JIT and caches settle. A 1% tail trim on each side removes OS-interrupt spikes. Median and IQR are reported alongside mean and std dev — robust statistics, not just Gaussian ones.
GPU Clock Locking
--lock-clocks pins graphics and memory clocks through the Module 03 profiling helper. Eliminates GPU Boost and thermal throttling noise. Warmup drift falls from ~13 µs to ~2 µs; final CV lands at 18.7% on an RTX 3090 Ti.
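The warmup exclusion, 1% tail trim, and robust-statistics reporting can be sketched together — a stdlib-only illustration of the policy, not the harness code; the parameter defaults mirror the numbers quoted above.

```python
import statistics

def trimmed_stats(samples, warmup_ratio=0.3, trim=0.01):
    """Exclude the warmup prefix, trim `trim` fraction from each tail to
    drop OS-interrupt spikes, then report mean/stdev alongside the robust
    median/IQR pair."""
    steady = sorted(samples[int(len(samples) * warmup_ratio):])
    k = int(len(steady) * trim)
    core = steady[k:len(steady) - k] if k else steady
    q = statistics.quantiles(core, n=4)  # quartiles of the trimmed window
    return {
        "mean": statistics.fmean(core),
        "stdev": statistics.stdev(core),
        "median": statistics.median(core),
        "iqr": q[2] - q[0],
    }
```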

Experiment Logging Architecture

Three sinks, one source. The harness emits a single measurement; a fan-out helper lands it in the right place for each consumer. Ephemeral artifacts live under artifacts/ and regenerate on demand; persistent datasets under datasets/ are promoted manually and survive a clean.

01
MLflow — Tracking Server
Every run registers as an experiment with parameters, metrics, and the config hash that produced it. The tracking server is the ground-truth history — compare any two runs by ID without reloading their CSVs.
02
CSV — Dashboard Source of Truth
Flat files under artifacts/data/ using the Module 03 schema. Streamlit reads CSV directly; the analyzers use CSV when MLflow isn't available; the format stays portable across tools.
03
Parquet — Analysis Store
Long-running sweeps are promoted to Parquet for column-oriented queries. The analyzer loads the same columns whether the source is CSV or Parquet — consumers don't care which format is behind the loader.
04
Marker Files — Idempotency
A successful write closes with a marker file. Snakemake reads the markers when rebuilding the DAG — missing marker means the run is incomplete and re-runs. Detail in experiment-logging-system.md.
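The CSV sink and the closing marker file can be sketched as a pair — an illustrative stand-in for the fan-out helper (the MLflow and Parquet sinks are omitted, and the file names here are assumptions, not the real layout).

```python
import csv
import hashlib
import json
from pathlib import Path

def log_measurement(run_dir, row, fieldnames):
    """Append one measurement row to the run's CSV, writing the header on
    first use — called once per measurement window."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    csv_path = run_dir / "measurements.csv"
    write_header = not csv_path.exists()
    with csv_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(row)

def close_run(run_dir, config):
    """A successful run closes with a marker file carrying the config hash;
    a missing marker tells the DAG rebuild the run is incomplete."""
    digest = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
    (Path(run_dir) / ".complete").write_text(digest + "\n")
```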

Modular Analysis Engine

The analyzers sit above the logs. AnalyzerBase handles loading, schema validation, and aggregation; five specialized subclasses each own one angle of the data. AnalysisEngine orchestrates them — pass in a run directory, get back a structured report.

LatencyAnalyzer
Percentile summaries, tail analysis, stage-timing breakdown from the Module 03 helper
ThroughputAnalyzer
MSPS curves across channel counts, scaling fits, concurrency saturation points
RealtimeAnalyzer
RTF statistics, overrun counts, queue-depth steady state over the run window
AccuracyAnalyzer
Spectral SNR per input class, pass / fail against reference tolerances
ScientificMetricsAnalyzer
Frequency resolution, hop analysis, ionosphere suitability scoring
AnalysisEngine
Orchestrator — composes any subset of analyzers, emits a consolidated report object
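The base/subclass/engine split can be sketched as a skeleton — column names and return shapes here are assumptions for illustration, not the real schema or API.

```python
from abc import ABC, abstractmethod

class AnalyzerBase(ABC):
    """Shared validation and orchestration; each subclass owns one angle
    of the data. (Loading and aggregation are elided in this sketch.)"""
    required_columns: tuple = ()

    def run(self, rows):
        self.validate(rows)
        return self.analyze(rows)

    def validate(self, rows):
        for col in self.required_columns:
            if any(col not in r for r in rows):
                raise ValueError(f"missing column: {col}")

    @abstractmethod
    def analyze(self, rows): ...

class LatencyAnalyzer(AnalyzerBase):
    required_columns = ("latency_us",)

    def analyze(self, rows):
        vals = sorted(float(r["latency_us"]) for r in rows)
        return {"p50": vals[len(vals) // 2], "max": vals[-1]}

class AnalysisEngine:
    """Composes any subset of analyzers, emits a consolidated report."""
    def __init__(self, analyzers):
        self.analyzers = analyzers

    def report(self, rows):
        return {type(a).__name__: a.run(rows) for a in self.analyzers}
```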

Statistical Rigor

Benchmark numbers come with uncertainty. The analysis layer is explicit about it — intervals, tests, and effect sizes sit beside every aggregate. Two runs that differ by 2 µs at the mean aren't different if the confidence intervals overlap; the analyzer says so instead of letting a plot imply otherwise.

Confidence Intervals
Bootstrapped CIs on mean and percentile statistics. The interval is reported alongside the point estimate — a naked number is never the answer in this codebase.
Hypothesis Testing
Two-sample Welch's t-test for parametric comparisons, Mann-Whitney U for distribution-free checks. A regression alert is backed by a p-value, not a single fluky run.
Effect Sizes
Cohen's d for standardized mean differences. Statistical significance without effect size is noise dressed up — the analyzer reports both so a 0.2 µs "significant" change doesn't get waved around.
Scaling Fits
Linear, logarithmic, and power-law fits for throughput-vs-channels, latency-vs-NFFT, and similar curves. The reported fit includes residuals and R² so a bad fit doesn't quietly become a conclusion.
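The percentile-bootstrap CI the analyzer pairs with each point estimate can be sketched in stdlib Python — an illustration of the method, not the analyzer's implementation.

```python
import random
import statistics

def bootstrap_ci(samples, stat=statistics.fmean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for `stat`: resample with
    replacement, take the empirical alpha/2 and 1-alpha/2 quantiles, and
    return the interval alongside the point estimate."""
    rng = random.Random(seed)
    n = len(samples)
    boots = sorted(
        stat([rng.choice(samples) for _ in range(n)]) for _ in range(n_boot)
    )
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    return stat(samples), (lo, hi)
```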

Streamlit Dashboard

app.py plus five pages. Each page is a focused view — one executor mode, one application domain, or one cross-cutting concern. The whole app reads CSV from artifacts/data/ on every interaction; a file-watcher reloads when a new run lands so a long sweep populates the UI in place.

BATCH
Batch-executor latency, throughput, and accuracy — direct-memcpy pipeline
STREAMING
Streaming-executor RTF, overrun rate, ring-buffer pressure over time
Ionosphere
Application view — suitability scoring, spectrogram gallery, VLF / ULF use cases
Methods
Methods-paper figures — reproducible plots sourced straight from the analyzers
Config Explorer
Interactive config diff — pick two run IDs, see every parameter that differs
Shared Utils
Cross-page filters, multi-select run comparison, cached loaders over the CSV tree
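The Config Explorer's core operation — every parameter that differs between two runs — reduces to a recursive dict diff; this is an illustrative sketch, not the page's actual code, and the config keys in the usage below are made up.

```python
def config_diff(cfg_a, cfg_b, prefix=""):
    """Walk two nested config dicts in parallel and yield
    (dotted_path, value_a, value_b) for every parameter that differs."""
    for key in sorted(set(cfg_a) | set(cfg_b)):
        a, b = cfg_a.get(key), cfg_b.get(key)
        path = f"{prefix}{key}"
        if isinstance(a, dict) and isinstance(b, dict):
            yield from config_diff(a, b, prefix=path + ".")
        elif a != b:
            yield path, a, b
```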

Validation

Separate from day-to-day benchmarking, the experiments/validation/ folder and a set of CUDA-level correctness choices exist to check the benchmarks themselves. Narrow on purpose — each validation study proves one methodological choice, not a result.

warmup_impact.py
Isolates warmup latency from steady-state latency by sweeping warmup counts and plotting convergence. Answers the question "how many warmup frames are enough" with data rather than a gut number — the 30% warmup ratio in the main runners is justified by this study.
IEEE-754 Numerical Correctness
The magnitude kernel uses hypotf(x, y) instead of sqrtf(x*x + y*y) — IEEE-754-safe against overflow and underflow in intermediates. Compiler flags --fmad=false and --ftz=false disable fused-multiply-add and preserve denormals, so Debug and Release builds produce bit-identical results. Validation against SciPy double-precision (Parseval energy, FFT linearity, per-tone SNR) brought the accuracy-suite pass rate from ~75% to 98.8% and lifted typical SNR into the 130–140 dB range.
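The overflow hazard that motivates `hypotf` is easy to demonstrate on the CPU with Python's `math.hypot`, which makes the same guarantee: the naive formula overflows in the intermediate square, while `hypot` scales internally.

```python
import math

def magnitude_naive(x, y):
    """sqrt(x*x + y*y): the intermediate x*x overflows to inf for large |x|."""
    return math.sqrt(x * x + y * y)

def magnitude_safe(x, y):
    """hypot scales its intermediates, so the result stays finite whenever
    the true magnitude is representable."""
    return math.hypot(x, y)
```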

RTF in Practice

Module 03 defines RTF. This module interprets it. An RTF value on its own is half an answer — the same 0.01 means "comfortably real-time" for a 100 kHz ionospheric stream and "dangerously close" for a 10 MHz one. The ScientificMetricsAnalyzer is the bridge from raw RTF to a decision.

Headroom vs. Hard Limit
RTF under 1.0 means the pipeline finishes before the next window arrives. RTF under 0.1 means there's room to add channels, stages, or a larger NFFT without losing real-time. The analyzer reports both the absolute RTF and the headroom in those terms.
Time / Frequency Resolution
Bigger NFFT buys frequency resolution at the cost of latency. The analyzer surfaces both numbers side by side so a config choice becomes a legible trade-off, not an invisible one.
Ionosphere Suitability
A domain-specific score combining RTF headroom, frequency resolution, and SNR into one number — a quick "this config is usable for the VLF beacon monitoring task" verdict without reading four plots.
Overrun as Ground Truth
If the streaming runner reports any overruns, the configuration is not real-time — regardless of the mean RTF. Overrun count beats mean latency as a go / no-go signal, and the dashboard surfaces it first.
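Those interpretation rules can be condensed into one small report function — a sketch using the 1.0 and 0.1 rules of thumb from the text; the field names are illustrative, not the analyzer's real output schema.

```python
def rtf_report(processing_time_s, window_duration_s, overruns):
    """Turn raw RTF into a decision: any overrun overrides the mean RTF
    as the go / no-go signal, and headroom is reported explicitly."""
    rtf = processing_time_s / window_duration_s
    realtime = rtf < 1.0 and overruns == 0
    return {
        "rtf": rtf,
        "headroom": 1.0 / rtf if rtf > 0 else float("inf"),
        "realtime": realtime,
        "comfortable": realtime and rtf < 0.1,
    }
```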