Core GPU Engine

>_ Module 01 — C++17 / CUDA Backend

The engine under SigTekX: a C++17 / CUDA pipeline that turns a raw PCM stream into a magnitude spectrum in under two hundred microseconds on average. Two executors, one built for maximum throughput and one built for continuous streaming, share the same kernels but take opposite paths through memory. This page walks the architecture layer by layer: from the pybind11 surface down through the ring buffer, the three-stream pipeline, the cuFFT plans, and the RAII wrappers that keep it all leak-free.

Architecture at a Glance

The C++ side is split into eight focused layers. Public headers stay CUDA-free so the Python build can include them without pulling in the toolkit. Everything CUDA-specific lives in implementation files behind the Pimpl boundary.

Python Bindings

pybind11 zero-copy bridge — NumPy arrays straight into executor submit()

Executor API

BatchExecutor and StreamingExecutor behind a shared PipelineExecutor interface

Processing Pipeline

Window → FFT → Magnitude stages composed via Strategy + Factory patterns

Ring Buffer

Lock-free SPSC circular buffer in pinned memory with zero-copy peek_frame()

CUDA Resources

RAII wrappers — CudaStream, CudaEvent, DeviceBuffer, PinnedHostBuffer, CufftPlan

CUDA Kernels

Window kernel, cuFFT real-to-complex, magnitude kernel — coalesced access only

Utilities

window_utils, signal_utils — generators, symmetry modes, reference paths

Profiling

NVTX ranges throughout the pipeline for Nsight Systems and Compute
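The CUDA-free header rule and the RAII wrappers listed above can be sketched in plain C++. This is a minimal illustration, not the project's code: Engine, ScopedHandle, and MockHandleApi are hypothetical names, and a counter stands in for real CUDA handles so the sketch compiles without the toolkit.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a CUDA handle API. A real wrapper would create
// and destroy an actual stream, event, or plan; here a counter tracks how
// many handles are open.
struct MockHandleApi {
    static inline int live = 0;     // handles currently open
    static inline int next_id = 0;  // last id handed out (0 means "no handle")
    static int create() { ++live; return ++next_id; }
    static void destroy(int) { assert(live > 0); --live; }
};

// Move-only RAII wrapper: acquires in the constructor, releases in the destructor.
class ScopedHandle {
public:
    ScopedHandle() : id_(MockHandleApi::create()) {}
    ~ScopedHandle() { reset(); }
    ScopedHandle(const ScopedHandle&) = delete;
    ScopedHandle& operator=(const ScopedHandle&) = delete;
    ScopedHandle(ScopedHandle&& other) noexcept : id_(std::exchange(other.id_, 0)) {}
    ScopedHandle& operator=(ScopedHandle&& other) noexcept {
        if (this != &other) { reset(); id_ = std::exchange(other.id_, 0); }
        return *this;
    }
    bool valid() const { return id_ != 0; }
private:
    void reset() { if (id_ != 0) { MockHandleApi::destroy(id_); id_ = 0; } }
    int id_ = 0;
};

// "Public header" part: only a forward-declared Impl, no handle types.
class Engine {
public:
    Engine();
    ~Engine();  // defined below, where Impl is a complete type
    Engine(Engine&&) noexcept;
    Engine& operator=(Engine&&) noexcept;
    bool ready() const;
private:
    struct Impl;
    std::unique_ptr<Impl> impl_;
};

// "Implementation file" part: the handle types are visible only here.
struct Engine::Impl {
    ScopedHandle h2d, compute, d2h;
};

Engine::Engine() : impl_(std::make_unique<Impl>()) {}
Engine::~Engine() = default;
Engine::Engine(Engine&&) noexcept = default;
Engine& Engine::operator=(Engine&&) noexcept = default;
bool Engine::ready() const {
    return impl_ && impl_->h2d.valid() && impl_->compute.valid() && impl_->d2h.valid();
}
```

Constructing an Engine opens three handles, moving it transfers ownership without opening new ones, and leaving scope closes all three, including when an exception unwinds the stack.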

Dual Executors

Two executors, one pipeline. BatchExecutor takes fixed-size input and copies it straight into device memory for maximum throughput. StreamingExecutor accepts arbitrary chunks, accumulates them in per-channel ring buffers, and drains frames as soon as they're available. The gap between them, about 41% on mean latency (86.79 µs vs. 122.25 µs) and about 46% at P95, is the architectural cost of real-time flexibility, not a bug.

BatchExecutor

High-Throughput Direct Pipeline

Mean Latency: 86.79 µs
P95 Latency: 105.47 µs
Input Shape: Fixed batch
Memory: ~164 KB

Single memcpy — host buffer straight into d_input
Round-robin device buffers for H2D/compute overlap
One submit() call equals exactly one batch
Zero ring-buffer overhead on the CPU side
Use for offline analysis and benchmarking

StreamingExecutor

Low-Latency Continuous Stream

Mean Latency: 122.25 µs
P95 Latency: 153.82 µs
Input Shape: Arbitrary chunks
Memory: ~262 KB

Per-channel pinned ring buffers (3 × NFFT capacity)
Zero-copy peek_frame() — DMA direct from pinned memory
Drains all ready frames per submit() — no overflow on warmup
Optional background consumer thread (lock-free SPSC)
Use for sensor integration and real-time monitoring
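The ring-buffer behavior in the list above (chunked pushes, an in-place frame view, consumption deferred until the caller is done) can be sketched as a single-producer, single-consumer buffer. This is a minimal sketch under assumptions: SpscRing, Span, and FrameView are hypothetical names, the method signatures are guesses, and a std::vector stands in for pinned memory so the code runs without CUDA.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct Span { const float* data = nullptr; std::size_t size = 0; };

// A frame viewed in place. `second` is non-empty only when the frame wraps.
struct FrameView { Span first; Span second; bool ok = false; };

class SpscRing {
public:
    explicit SpscRing(std::size_t capacity) : buf_(capacity) { assert(capacity > 0); }

    // Producer side: copy n samples in, or reject the whole chunk if it does not fit.
    bool push(const float* src, std::size_t n) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (n > buf_.size() - (head - tail)) return false;
        for (std::size_t i = 0; i < n; ++i) buf_[(head + i) % buf_.size()] = src[i];
        head_.store(head + n, std::memory_order_release);
        return true;
    }

    // Consumer side: view the next frame without copying or consuming it.
    FrameView peek_frame(std::size_t frame) const {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        FrameView view;
        if (head - tail < frame) return view;  // not enough samples yet
        const std::size_t start = tail % buf_.size();
        const std::size_t first = std::min(frame, buf_.size() - start);
        view.first = {buf_.data() + start, first};
        view.second = {buf_.data(), frame - first};
        view.ok = true;
        return view;
    }

    // Consumer side: release n samples once the views are no longer in use.
    void advance(std::size_t n) {
        tail_.store(tail_.load(std::memory_order_relaxed) + n, std::memory_order_release);
    }

private:
    // Counters grow monotonically; wraparound of the 64-bit counters
    // themselves is ignored in this sketch.
    std::vector<float> buf_;
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

In the engine as described, each span would presumably be the source of one cudaMemcpyAsync call, with advance() held back until the transfer has completed.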

Processing Pipelines

Both executors use the same three-stage kernel pipeline (Window → FFT → Magnitude) and the same three CUDA streams (H2D, Compute, D2H). What differs is how input samples arrive at d_input. BatchExecutor copies once. StreamingExecutor pushes into a ring buffer first, then DMAs directly from it.
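The overlap between the three streams can be illustrated with an idealized schedule. The sketch below is a CPU-only model, not the engine's scheduler: it assumes every stage takes one equal time step, and the names Tick, pipeline_schedule, and buffer_slot are hypothetical. On a GPU, the same ordering would typically be enforced with cudaEventRecord and cudaStreamWaitEvent.

```cpp
#include <cassert>
#include <vector>

// Frame index handled by each stream during one time step; -1 means idle.
struct Tick { int h2d; int compute; int d2h; };

// Idealized schedule: at step t, H2D works on frame t, Compute on frame t-1,
// and D2H on frame t-2. Two extra steps drain the pipeline at the end.
std::vector<Tick> pipeline_schedule(int num_frames) {
    assert(num_frames >= 0);
    std::vector<Tick> ticks;
    if (num_frames == 0) return ticks;
    auto in_range = [num_frames](int f) { return (f >= 0 && f < num_frames) ? f : -1; };
    for (int t = 0; t < num_frames + 2; ++t) {
        ticks.push_back({in_range(t), in_range(t - 1), in_range(t - 2)});
    }
    return ticks;
}

// Round-robin buffer choice: frame f uses device buffer f % num_buffers.
int buffer_slot(int frame, int num_buffers) {
    assert(frame >= 0 && num_buffers > 0);
    return frame % num_buffers;
}
```

With three frames in flight, three round-robin buffers are enough for each in-flight frame to own a distinct slot, so nothing needs to be allocated while the pipeline runs.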

FIG 01: BatchExecutor — direct memcpy pipeline across three CUDA streams

FIG 02: StreamingExecutor — ring buffer accumulation, zero-copy peek_frame() DMA

Memory Model

The memory footprint mirrors the behavioral split. Batch mode has no streaming accumulator, just double-buffered device memory sized for one fixed batch. Streaming mode adds per-channel pinned ring buffers with capacity for three frames' worth of samples (3 × NFFT), enough to survive long warmup runs without overflowing.

FIG 03: Batch memory — ~164 KB total, no ring buffers, double-buffered pipeline

FIG 04: Streaming memory — pinned ring buffers serve directly as H2D DMA source

Engineering Highlights

Zero-Copy Ring Buffer

Ring buffers live in CUDA pinned memory. The peek_frame() API returns a span pointing directly into that memory, with no staging hop, so cudaMemcpyAsync DMAs straight from the ring buffer. Wraparound frames issue two spans; advance() runs only after the D2H sync so pointers stay valid during transfer.

Three-Stream Pipeline

Three CUDA streams (H2D, Compute, D2H) are linked by event-based dependencies. Frame N+1's H2D overlaps frame N's compute, which overlaps frame N−1's D2H. Round-robin device buffers let the pipeline run two or three frames deep without ever allocating on the hot path.

Pimpl + RAII

Every public header is CUDA-free: the Pimpl idiom hides cudaStream_t, cufftHandle, and the rest behind a unique_ptr<Impl>. CUDA resources are RAII-wrapped (CudaStream, CufftPlan, DeviceBuffer, PinnedHostBuffer), move-only, and self-destructing, so there is no manual cleanup and no leak on exception paths.

Strategy + Factory

Each pipeline stage (WindowStage, FFTStage, MagnitudeStage) implements a minimal ProcessingStage interface. StageFactory composes pipelines from configuration at runtime, so adding a new stage (bandpass, PSD, log-magnitude) is a single class implementation plus one factory entry.
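The "one class plus one factory entry" claim can be sketched as follows. This is an illustration under assumptions: the names ProcessingStage and StageFactory come from the text, but the method signatures, the string-keyed registry, and the two toy stages (ScaleStage and AbsStage, standing in for windowing and magnitude) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Buffer = std::vector<float>;

// Strategy interface: each stage transforms the buffer in place.
struct ProcessingStage {
    virtual ~ProcessingStage() = default;
    virtual void process(Buffer& data) = 0;
};

// Toy stand-in for a windowing stage: scales every sample by 0.5.
struct ScaleStage : ProcessingStage {
    void process(Buffer& data) override { for (float& x : data) x *= 0.5f; }
};

// Toy stand-in for a magnitude stage: takes the absolute value.
struct AbsStage : ProcessingStage {
    void process(Buffer& data) override { for (float& x : data) x = std::fabs(x); }
};

using Pipeline = std::vector<std::unique_ptr<ProcessingStage>>;

// Factory: maps a configured stage name to a function that builds the stage.
class StageFactory {
public:
    using Maker = std::function<std::unique_ptr<ProcessingStage>()>;

    void register_stage(const std::string& name, Maker maker) {
        assert(static_cast<bool>(maker));  // maker must be callable
        makers_[name] = std::move(maker);
    }

    // Builds the stages in the order the configuration lists them.
    Pipeline build(const std::vector<std::string>& names) const {
        Pipeline pipeline;
        for (const std::string& name : names) {
            auto it = makers_.find(name);
            if (it == makers_.end()) throw std::invalid_argument("unknown stage: " + name);
            pipeline.push_back(it->second());
        }
        return pipeline;
    }

private:
    std::map<std::string, Maker> makers_;
};

// Runs every stage over the buffer in sequence.
void run(const Pipeline& pipeline, Buffer& data) {
    for (const auto& stage : pipeline) stage->process(data);
}
```

Adding a stage in this sketch means writing one ProcessingStage subclass and one register_stage call; the code that builds and runs pipelines does not change.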

Measured Performance

End-to-end benchmarks from the Python layer: same pipeline, measured across the pybind11 boundary with locked GPU clocks for low-variance numbers. The C++ engine in isolation runs faster still; the ~61 µs gap in mean latency is the per-frame cost of crossing the pybind11 boundary, zero-copy NumPy handoff included.

Mean Latency: 171.1 µs
P99 Latency: 305.5 µs
Spectral SNR: 138.4 dB
Throughput (1 Ch): 4,834 FPS
Peak Throughput (8 Ch): 237.6 MSPS
RTF (100 kHz Batch): 0.003

>_ RTX 3090 Ti · Ryzen 9 5950X · Python E2E · NFFT=4096 · 100 kHz streaming · GPU clocks locked
>_ C++ backend isolated: 109.6 µs mean / 287.7 µs p99 — pybind11 overhead ≈ 61 µs / frame
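The overhead figure follows from subtracting the isolated C++ latencies from the Python end-to-end latencies. A small worked sketch of that arithmetic, with Latency and binding_overhead as hypothetical helper names:

```cpp
#include <cassert>

struct Latency { double mean_us; double p99_us; };

// Binding overhead per frame: end-to-end latency minus isolated C++ latency.
Latency binding_overhead(Latency end_to_end, Latency cpp_isolated) {
    assert(end_to_end.mean_us >= cpp_isolated.mean_us);
    assert(end_to_end.p99_us >= cpp_isolated.p99_us);
    return {end_to_end.mean_us - cpp_isolated.mean_us,
            end_to_end.p99_us - cpp_isolated.p99_us};
}
```

Calling binding_overhead({171.1, 305.5}, {109.6, 287.7}) gives about 61.5 µs at the mean and about 17.8 µs at p99, so the ≈ 61 µs figure describes the mean rather than the tail.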

Test Coverage

The C++ test suite uses Google Test with gcovr reports. Headline coverage is strong across the hot path — executors, kernels, ring buffer, pipeline. The uncovered portions concentrate in files that are intentionally not unit-tested at the C++ layer: signal_utils.cpp (test signal generation, validated against reference implementations), window_functions.hpp (pre-computed window coefficients cross-checked against NumPy/SciPy), and executor_config.hpp (a plain POD configuration struct).
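Cross-checking window coefficients against NumPy/SciPy depends on matching the symmetry mode. The sketch below assumes a Hann window as the example (the page does not say which windows are pre-computed); the hann function and the Symmetry enum are hypothetical names. The symmetric form corresponds to numpy.hanning(n), and the periodic form to scipy.signal.windows.hann(n, sym=False).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class Symmetry { Symmetric, Periodic };

// Hann window: w[i] = 0.5 - 0.5 * cos(2 * pi * i / D),
// where D = n - 1 for the symmetric form and D = n for the periodic form.
std::vector<float> hann(std::size_t n, Symmetry mode) {
    assert(n > 0);
    std::vector<float> w(n, 1.0f);
    if (n == 1) return w;  // a one-point window is 1 in both conventions
    const double pi = 3.14159265358979323846;
    const double denom = (mode == Symmetry::Symmetric) ? static_cast<double>(n - 1)
                                                       : static_cast<double>(n);
    for (std::size_t i = 0; i < n; ++i) {
        w[i] = static_cast<float>(0.5 - 0.5 * std::cos(2.0 * pi * static_cast<double>(i) / denom));
    }
    return w;
}
```

The periodic form is the usual choice for FFT-based spectral analysis, while the symmetric form is common in filter design; comparing one against a reference generated in the other mode shows up as a small but systematic coefficient mismatch.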

Overall gcovr coverage — executors, kernels, ring buffer

Uncovered hot spots isolated to non-critical support files

Component Architecture

The full class and dependency map for the C++ side: every namespace, every public type, every processing stage. The diagram is dense by design.

FIG 05: Full class / component map — all eight layers and their dependencies
