Performance Analytics

>_ Module 04 — Benchmarks, Dashboards, and Scientific Validation

Four typed benchmark runners feed a three-tier logging system, a modular analysis engine rolls that data into metrics with confidence intervals, and Streamlit dashboards turn the result into something a reviewer can read. The runners are the Module 03 API in anger — the analyzers and dashboards are what makes the numbers trustworthy enough to publish.

Benchmark Runners

Four scripts under benchmarks/, each narrow in scope and each backed by the same shared harness. Every runner is a thin Hydra entry point — parse the composed config, build the engine via the Module 03 API, loop over iterations with the clock-locked profiling helper, and write one CSV row per measurement window. No bespoke logging code in any runner; all of it lives in the harness.
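The shared measurement loop can be sketched in a few lines — this is an illustrative stand-in, not the harness itself: `step` stands in for one engine invocation under the profiling helper, and the warmup tagging mirrors the "warmup is isolated" rule the runners follow.

```python
import time

def run_benchmark(step, iterations=5000, warmup_ratio=0.3):
    """Shape of the shared loop every runner delegates to: warmup iterations
    are tagged so the summary can exclude them, and each iteration produces
    one measurement row (one CSV row per window in the real harness).

    `step` is a hypothetical stand-in for a single engine invocation."""
    warmup = int(iterations * warmup_ratio)
    rows = []
    for i in range(iterations):
        t0 = time.perf_counter()
        step()
        latency_us = (time.perf_counter() - t0) * 1e6
        rows.append({"iteration": i, "warmup": i < warmup, "latency_us": latency_us})
    return rows
```

The real runners time with CUDA events or `std::chrono` as described below; `perf_counter` here only marks where the clock-locked helper would sit.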

run_latency.py
Per-Frame Latency
Submit fixed-size frames under locked GPU clocks, record per-frame latency, emit mean / P50 / P95 / P99 / max. Warmup is isolated and excluded from the steady-state summary.
run_throughput.py
MSPS Under Concurrency
Sustained samples-per-second across channel counts. Scaling curves fall out of the same runner — one multi-run sweep produces the whole 1 / 2 / 4 / 8-channel plot.
run_realtime.py
RTF Validation
Stream at a target sample rate for a fixed wall-clock duration. Report RTF, overrun count, and steady-state queue depth. An RTF under 1.0 is necessary but not sufficient — overruns flag buffer-bound regressions.
run_accuracy.py
Spectral SNR
Compare output spectra against a reference implementation across tone, chirp, and noise inputs. Report SNR in dB; failing tones surface as named failures, not aggregate dips.
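The steady-state summary the latency runner emits can be sketched as follows — a minimal nearest-rank version, assuming latencies in microseconds; the function names are illustrative, not the runner's real API.

```python
import statistics

def percentile(sorted_vals, q):
    """Nearest-rank percentile on an already-sorted list (0 < q <= 100)."""
    idx = max(0, min(len(sorted_vals) - 1, round(q / 100 * len(sorted_vals)) - 1))
    return sorted_vals[idx]

def summarize_latency(samples_us, warmup_ratio=0.3):
    """Drop the warmup prefix, then report mean / P50 / P95 / P99 / max
    over the steady-state window only."""
    steady = samples_us[int(len(samples_us) * warmup_ratio):]
    s = sorted(steady)
    return {
        "mean": statistics.fmean(s),
        "p50": percentile(s, 50),
        "p95": percentile(s, 95),
        "p99": percentile(s, 99),
        "max": s[-1],
    }
```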

Timing Methodology

Benchmark numbers are only as good as the clock behind them. The runners use a hybrid timing strategy plus a set of stability knobs that together reduced the coefficient of variation from 40% on an untuned baseline to 18.7% under locked GPU clocks.

Hybrid Timing
Latency runs use CUDA events — ~5000 isolated measurements where event overhead is negligible and event-based timing captures true kernel cost. Realtime runs use CPU std::chrono — at ~10 kFPS, CUDA-event overhead dominates and warps the result. Two tools, two regimes, one informed choice per runner.
Blocking Sync
cudaDeviceScheduleBlockingSync is set globally in the executor. Without it, cudaStreamSynchronize spin-waits on the CPU and picks up OS scheduler noise. With it, the thread blocks cleanly — CV dropped ~26% the moment this flag landed.
Warmup + Outlier Trim
30% warmup ratio (1500 of 5000 iterations for latency) lets JIT and caches settle. A 1% tail trim on each side removes OS-interrupt spikes. Median and IQR are reported alongside mean and std dev — robust statistics, not just Gaussian ones.
GPU Clock Locking
--lock-clocks pins graphics and memory clocks through the Module 03 profiling helper. Eliminates GPU Boost and thermal throttling noise. Warmup drift falls from ~13 µs to ~2 µs; final CV lands at 18.7% on an RTX 3090 Ti.
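The warmup exclusion, 1% tail trim, and robust-statistics reporting can be sketched together — a stdlib-only illustration of the policy, not the harness code; the parameter defaults mirror the numbers quoted above.

```python
import statistics

def trimmed_stats(samples, warmup_ratio=0.3, trim=0.01):
    """Exclude the warmup prefix, trim `trim` fraction from each tail to
    drop OS-interrupt spikes, then report mean/stdev alongside the robust
    median/IQR pair."""
    steady = sorted(samples[int(len(samples) * warmup_ratio):])
    k = int(len(steady) * trim)
    core = steady[k:len(steady) - k] if k else steady
    q = statistics.quantiles(core, n=4)  # quartiles of the trimmed window
    return {
        "mean": statistics.fmean(core),
        "stdev": statistics.stdev(core),
        "median": statistics.median(core),
        "iqr": q[2] - q[0],
    }
```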

Experiment Logging Architecture

Three sinks, one source. The harness emits a single measurement; a fan-out helper lands it in the right place for each consumer. Ephemeral artifacts live under artifacts/ and regenerate on demand; persistent datasets under datasets/ are promoted manually and survive a clean.

01
MLflow — Tracking Server
Every run registers as an experiment with parameters, metrics, and the config hash that produced it. The tracking server is the ground-truth history — compare any two runs by ID without reloading their CSVs.
02
CSV — Dashboard Source of Truth
Flat files under artifacts/data/ using the Module 03 schema. Streamlit reads CSV directly; the analyzers use CSV when MLflow isn't available; the format stays portable across tools.
03
Parquet — Analysis Store
Long-running sweeps are promoted to Parquet for column-oriented queries. The analyzer loads the same columns whether the source is CSV or Parquet — consumers don't care which format is behind the loader.
04
Marker Files — Idempotency
A successful write closes with a marker file. Snakemake reads the markers when rebuilding the DAG — missing marker means the run is incomplete and re-runs. Detail in experiment-logging-system.md.
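The CSV sink and the closing marker file can be sketched as a pair — an illustrative stand-in for the fan-out helper (the MLflow and Parquet sinks are omitted, and the file names here are assumptions, not the real layout).

```python
import csv
import hashlib
import json
from pathlib import Path

def log_measurement(run_dir, row, fieldnames):
    """Append one measurement row to the run's CSV, writing the header on
    first use — called once per measurement window."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    csv_path = run_dir / "measurements.csv"
    write_header = not csv_path.exists()
    with csv_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(row)

def close_run(run_dir, config):
    """A successful run closes with a marker file carrying the config hash;
    a missing marker tells the DAG rebuild the run is incomplete."""
    digest = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
    (Path(run_dir) / ".complete").write_text(digest + "\n")
```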

Modular Analysis Engine

The analyzers sit above the logs. AnalyzerBase handles loading, schema validation, and aggregation; five specialized subclasses each own one angle of the data. AnalysisEngine orchestrates them — pass in a run directory, get back a structured report.

LatencyAnalyzer
Percentile summaries, tail analysis, stage-timing breakdown from the Module 03 helper
ThroughputAnalyzer
MSPS curves across channel counts, scaling fits, concurrency saturation points
RealtimeAnalyzer
RTF statistics, overrun counts, queue-depth steady state over the run window
AccuracyAnalyzer
Spectral SNR per input class, pass / fail against reference tolerances
ScientificMetricsAnalyzer
Frequency resolution, hop analysis, ionosphere suitability scoring
AnalysisEngine
Orchestrator — composes any subset of analyzers, emits a consolidated report object
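The base/subclass/engine split can be sketched as a skeleton — column names and return shapes here are assumptions for illustration, not the real schema or API.

```python
from abc import ABC, abstractmethod

class AnalyzerBase(ABC):
    """Shared validation and orchestration; each subclass owns one angle
    of the data. (Loading and aggregation are elided in this sketch.)"""
    required_columns: tuple = ()

    def run(self, rows):
        self.validate(rows)
        return self.analyze(rows)

    def validate(self, rows):
        for col in self.required_columns:
            if any(col not in r for r in rows):
                raise ValueError(f"missing column: {col}")

    @abstractmethod
    def analyze(self, rows): ...

class LatencyAnalyzer(AnalyzerBase):
    required_columns = ("latency_us",)

    def analyze(self, rows):
        vals = sorted(float(r["latency_us"]) for r in rows)
        return {"p50": vals[len(vals) // 2], "max": vals[-1]}

class AnalysisEngine:
    """Composes any subset of analyzers, emits a consolidated report."""
    def __init__(self, analyzers):
        self.analyzers = analyzers

    def report(self, rows):
        return {type(a).__name__: a.run(rows) for a in self.analyzers}
```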

Statistical Rigor

Benchmark numbers come with uncertainty. The analysis layer is explicit about it — intervals, tests, and effect sizes sit beside every aggregate. Two runs that differ by 2 µs at the mean aren't different if the confidence intervals overlap; the analyzer says so instead of letting a plot imply otherwise.

Confidence Intervals
Bootstrapped CIs on mean and percentile statistics. The interval is reported alongside the point estimate — a naked number is never the answer in this codebase.
Hypothesis Testing
Two-sample Welch's t-test for parametric comparisons, Mann-Whitney U for distribution-free checks. A regression alert is backed by a p-value, not a single fluky run.
Effect Sizes
Cohen's d for standardized mean differences. Statistical significance without effect size is noise dressed up — the analyzer reports both so a 0.2 µs "significant" change doesn't get waved around.
Scaling Fits
Linear, logarithmic, and power-law fits for throughput-vs-channels, latency-vs-NFFT, and similar curves. The reported fit includes residuals and R² so a bad fit doesn't quietly become a conclusion.
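The percentile-bootstrap CI the analyzer pairs with each point estimate can be sketched in stdlib Python — an illustration of the method, not the analyzer's implementation.

```python
import random
import statistics

def bootstrap_ci(samples, stat=statistics.fmean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for `stat`: resample with
    replacement, take the empirical alpha/2 and 1-alpha/2 quantiles, and
    return the interval alongside the point estimate."""
    rng = random.Random(seed)
    n = len(samples)
    boots = sorted(
        stat([rng.choice(samples) for _ in range(n)]) for _ in range(n_boot)
    )
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    return stat(samples), (lo, hi)
```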

Streamlit Dashboard

app.py plus five pages. Each page is a focused view — one executor mode, one application domain, or one cross-cutting concern. The whole app reads CSV from artifacts/data/ on every interaction; a file-watcher reloads when a new run lands so a long sweep populates the UI in place.

BATCH
Batch-executor latency, throughput, and accuracy — direct-memcpy pipeline
STREAMING
Streaming-executor RTF, overrun rate, ring-buffer pressure over time
Ionosphere
Application view — suitability scoring, spectrogram gallery, VLF / ULF use cases
Methods
Methods-paper figures — reproducible plots sourced straight from the analyzers
Config Explorer
Interactive config diff — pick two run IDs, see every parameter that differs
Shared Utils
Cross-page filters, multi-select run comparison, cached loaders over the CSV tree
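The Config Explorer's core operation — every parameter that differs between two runs — reduces to a recursive dict diff; this is an illustrative sketch, not the page's actual code, and the config keys in the usage below are made up.

```python
def config_diff(cfg_a, cfg_b, prefix=""):
    """Walk two nested config dicts in parallel and yield
    (dotted_path, value_a, value_b) for every parameter that differs."""
    for key in sorted(set(cfg_a) | set(cfg_b)):
        a, b = cfg_a.get(key), cfg_b.get(key)
        path = f"{prefix}{key}"
        if isinstance(a, dict) and isinstance(b, dict):
            yield from config_diff(a, b, prefix=path + ".")
        elif a != b:
            yield path, a, b
```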

Validation

Separate from day-to-day benchmarking, the experiments/validation/ folder and a set of CUDA-level correctness choices exist to check the benchmarks themselves. Narrow on purpose — each validation study proves one methodological choice, not a result.

warmup_impact.py
Isolates warmup latency from steady-state latency by sweeping warmup counts and plotting convergence. Answers the question "how many warmup frames are enough" with data rather than a gut number — the 30% warmup ratio in the main runners is justified by this study.
IEEE-754 Numerical Correctness
The magnitude kernel uses hypotf(x, y) instead of sqrtf(x*x + y*y) — IEEE-754-safe against overflow and underflow in intermediates. Compiler flags --fmad=false and --ftz=false disable fused-multiply-add and preserve denormals, so Debug and Release builds produce bit-identical results. Validation against SciPy double-precision (Parseval energy, FFT linearity, per-tone SNR) brought the accuracy-suite pass rate from ~75% to 98.8% and lifted typical SNR into the 130–140 dB range.
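The overflow hazard that motivates `hypotf` is easy to demonstrate on the CPU with Python's `math.hypot`, which makes the same guarantee: the naive formula overflows in the intermediate square, while `hypot` scales internally.

```python
import math

def magnitude_naive(x, y):
    """sqrt(x*x + y*y): the intermediate x*x overflows to inf for large |x|."""
    return math.sqrt(x * x + y * y)

def magnitude_safe(x, y):
    """hypot scales its intermediates, so the result stays finite whenever
    the true magnitude is representable."""
    return math.hypot(x, y)
```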

RTF in Practice

Module 03 defines RTF. This module interprets it. An RTF value on its own is half an answer — the same 0.01 means "comfortably real-time" for a 100 kHz ionospheric stream and "dangerously close" for a 10 MHz one. The ScientificMetricsAnalyzer is the bridge from raw RTF to a decision.

Headroom vs. Hard Limit
RTF under 1.0 means the pipeline finishes before the next window arrives. RTF under 0.1 means there's room to add channels, stages, or a larger NFFT without losing real-time. The analyzer reports both the absolute RTF and the headroom in those terms.
Time / Frequency Resolution
Bigger NFFT buys frequency resolution at the cost of latency. The analyzer surfaces both numbers side by side so a config choice becomes a legible trade-off, not an invisible one.
Ionosphere Suitability
A domain-specific score combining RTF headroom, frequency resolution, and SNR into one number — a quick "this config is usable for the VLF beacon monitoring task" verdict without reading four plots.
Overrun as Ground Truth
If the streaming runner reports any overruns, the configuration is not real-time — regardless of the mean RTF. Overrun count beats mean latency as a go / no-go signal, and the dashboard surfaces it first.
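Those interpretation rules can be condensed into one small report function — a sketch using the 1.0 and 0.1 rules of thumb from the text; the field names are illustrative, not the analyzer's real output schema.

```python
def rtf_report(processing_time_s, window_duration_s, overruns):
    """Turn raw RTF into a decision: any overrun overrides the mean RTF
    as the go / no-go signal, and headroom is reported explicitly."""
    rtf = processing_time_s / window_duration_s
    realtime = rtf < 1.0 and overruns == 0
    return {
        "rtf": rtf,
        "headroom": 1.0 / rtf if rtf > 0 else float("inf"),
        "realtime": realtime,
        "comfortable": realtime and rtf < 0.1,
    }
```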