Core GPU Engine

>_ Module 01 — C++17 / CUDA Backend

The engine under SigTekX: a C++17 / CUDA pipeline that turns a raw PCM stream into a magnitude spectrum in under two hundred microseconds on average. Two executors, one built for maximum throughput and one built for continuous streaming, share the same kernels but take opposite paths through memory. This page walks the architecture layer by layer: from the pybind11 surface down through the ring buffer, the three-stream pipeline, the cuFFT plans, and the RAII wrappers that keep it all leak-free.

Architecture at a Glance

The C++ side is split into eight focused layers. Public headers stay CUDA-free so the Python build can include them without pulling in the toolkit. Everything CUDA-specific lives in implementation files behind the Pimpl boundary.

Python Bindings: pybind11 zero-copy bridge, NumPy arrays straight into executor submit()
Executor API: BatchExecutor and StreamingExecutor behind a shared PipelineExecutor interface
Processing Pipeline: Window → FFT → Magnitude stages composed via Strategy + Factory patterns
Ring Buffer: lock-free SPSC circular buffer in pinned memory with zero-copy peek_frame()
CUDA Resources: RAII wrappers (CudaStream, CudaEvent, DeviceBuffer, PinnedHostBuffer, CufftPlan)
CUDA Kernels: window kernel, cuFFT real-to-complex, magnitude kernel; coalesced access only
Utilities: window_utils, signal_utils (generators, symmetry modes, reference paths)
Profiling: NVTX ranges throughout the pipeline for Nsight Systems and Compute

Dual Executors

Two executors, one pipeline. BatchExecutor takes fixed-size input and copies it straight into device memory for maximum throughput. StreamingExecutor accepts arbitrary chunks, accumulates them in per-channel ring buffers, and drains frames as soon as they're available. The latency gap between them, about 41% on the mean (122.25 µs versus 86.79 µs) and about 46% at P95, is the architectural cost of real-time flexibility, not a bug.

BatchExecutor
High-Throughput Direct Pipeline
Mean Latency: 86.79 µs
P95 Latency: 105.47 µs
Input Shape: Fixed batch
Memory: ~164 KB
  • Single memcpy — host buffer straight into d_input
  • Round-robin device buffers for H2D/compute overlap
  • One submit() call equals exactly one batch
  • Zero ring-buffer overhead on the CPU side
  • Use for offline analysis and benchmarking
StreamingExecutor
Low-Latency Continuous Stream
Mean Latency: 122.25 µs
P95 Latency: 153.82 µs
Input Shape: Arbitrary chunks
Memory: ~262 KB
  • Per-channel pinned ring buffers (3 × NFFT capacity)
  • Zero-copy peek_frame() — DMA direct from pinned memory
  • Drains all ready frames per submit() — no overflow on warmup
  • Optional background consumer thread (lock-free SPSC)
  • Use for sensor integration and real-time monitoring
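
The "arbitrary chunks in, all ready frames out" behavior of StreamingExecutor can be sketched on the host alone. The class below is a hypothetical illustration, not SigTekX's API: FrameAccumulator and its submit() are invented names, a std::vector stands in for the pinned ring buffer, and frames are assumed not to overlap (hop size equal to NFFT), which the page does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one channel of a streaming input path.
// Not the SigTekX API; names and layout are illustrative only.
class FrameAccumulator {
public:
    explicit FrameAccumulator(std::size_t nfft) : nfft_(nfft) { assert(nfft > 0); }

    // Accept a chunk of any length and return every complete frame now available.
    // Assumption: frames do not overlap (hop == NFFT).
    std::vector<std::vector<float>> submit(const std::vector<float>& chunk) {
        pending_.insert(pending_.end(), chunk.begin(), chunk.end());

        std::vector<std::vector<float>> frames;
        std::size_t offset = 0;
        while (pending_.size() - offset >= nfft_) {
            const auto first = pending_.begin() + static_cast<std::ptrdiff_t>(offset);
            const auto last = first + static_cast<std::ptrdiff_t>(nfft_);
            frames.emplace_back(first, last);
            offset += nfft_;
        }
        // Drop the samples that were handed out; keep the partial frame.
        pending_.erase(pending_.begin(),
                       pending_.begin() + static_cast<std::ptrdiff_t>(offset));
        return frames;
    }

    // Samples waiting for the next frame to fill up.
    std::size_t buffered() const { return pending_.size(); }

private:
    std::size_t nfft_;
    std::vector<float> pending_;
};
```

With NFFT = 4, submitting three samples yields no frames, and a following chunk of six samples yields two frames with one sample left buffered. Draining in a loop rather than returning at most one frame per call is what keeps a burst of input from piling up.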

Processing Pipelines

Both executors use the same three-stage kernel pipeline (Window → FFT → Magnitude) and the same three CUDA streams (H2D, Compute, D2H). What differs is how input samples arrive at d_input. BatchExecutor copies once. StreamingExecutor pushes into a ring buffer first, then DMAs directly from it.
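
As a point of reference for the three stages, here is a minimal CPU sketch of Window → FFT → Magnitude. It makes several assumptions: a periodic Hann window (the page does not name the window), a naive O(N²) DFT standing in for the cuFFT real-to-complex transform, and invented function names. It keeps the N/2 + 1 non-redundant bins, which is the output size of a real-to-complex transform.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

constexpr double kPi = 3.14159265358979323846;

// Stage 1: multiply each sample by a periodic Hann window (assumed window type).
std::vector<double> apply_hann(const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double w =
            0.5 - 0.5 * std::cos(2.0 * kPi * static_cast<double>(i) / static_cast<double>(n));
        out[i] = x[i] * w;
    }
    return out;
}

// Stage 2: naive real-to-complex DFT, keeping bins 0..N/2.
std::vector<std::complex<double>> dft_r2c(const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<std::complex<double>> bins(n / 2 + 1);
    for (std::size_t k = 0; k < bins.size(); ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t i = 0; i < n; ++i) {
            const double angle =
                -2.0 * kPi * static_cast<double>(k * i) / static_cast<double>(n);
            acc += x[i] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        bins[k] = acc;
    }
    return bins;
}

// Stage 3: magnitude of each complex bin.
std::vector<double> magnitude(const std::vector<std::complex<double>>& bins) {
    std::vector<double> mag(bins.size());
    for (std::size_t k = 0; k < bins.size(); ++k) mag[k] = std::abs(bins[k]);
    return mag;
}

// The three stages chained in pipeline order.
std::vector<double> reference_spectrum(const std::vector<double>& frame) {
    return magnitude(dft_r2c(apply_hann(frame)));
}
```

Feeding a 32-sample cosine that completes exactly four cycles per frame puts the spectral peak in bin 4, which makes a convenient sanity check for a GPU implementation of the same stages.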

BatchExecutor sequence diagram
FIG 01: BatchExecutor — direct memcpy pipeline across three CUDA streams
StreamingExecutor sequence diagram
FIG 02: StreamingExecutor — ring buffer accumulation, zero-copy peek_frame() DMA

Memory Model

The memory footprint mirrors the behavioral split. Batch mode has no streaming accumulator, just double-buffered device memory sized for one fixed batch. Streaming mode adds per-channel pinned ring buffers, each with capacity for three frames' worth of samples (3 × NFFT), enough to survive long warmup runs without overflowing.

BatchExecutor memory layout diagram
FIG 03: Batch memory — ~164 KB total, no ring buffers, double-buffered pipeline
StreamingExecutor memory layout diagram
FIG 04: Streaming memory — pinned ring buffers serve directly as H2D DMA source

Engineering Highlights

Zero-Copy Ring Buffer
Ring buffers live in CUDA pinned memory. The peek_frame() API returns a span pointing directly into that memory — no staging hop. cudaMemcpyAsync DMAs straight from the ring buffer. Wraparound frames issue two spans; advance() runs only after the D2H sync so pointers stay valid during transfer.
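
The peek-then-advance contract described above can be sketched as a host-only SPSC ring. This is an illustration under stated assumptions, not the project's implementation: SpscRing, Span, and FrameView are invented names, std::vector stands in for pinned memory (a CUDA build would allocate it with cudaHostAlloc or similar), and a pointer/length pair replaces std::span, which C++17 lacks.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// A view into the ring's storage (pointer plus length).
struct Span {
    const float* data = nullptr;
    std::size_t size = 0;
};

// One frame, exposed in place. `second` is non-empty only when the frame wraps.
struct FrameView {
    Span first;
    Span second;
    bool ready = false;
};

// Hypothetical single-producer / single-consumer ring buffer.
class SpscRing {
public:
    explicit SpscRing(std::size_t capacity) : buf_(capacity), cap_(capacity) {
        assert(capacity > 0);
    }

    // Producer: copy in as many samples as fit and return how many were written.
    std::size_t push(const float* src, std::size_t n) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        const std::size_t free_space = cap_ - (head - tail);
        const std::size_t to_write = n < free_space ? n : free_space;
        for (std::size_t i = 0; i < to_write; ++i) buf_[(head + i) % cap_] = src[i];
        head_.store(head + to_write, std::memory_order_release);
        return to_write;
    }

    // Consumer: expose the next frame without copying or consuming it.
    FrameView peek_frame(std::size_t frame_len) const {
        assert(frame_len <= cap_);
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        FrameView view;
        if (head - tail < frame_len) return view;  // not enough samples yet
        const std::size_t start = tail % cap_;
        const std::size_t until_end = cap_ - start;
        const std::size_t first_len = frame_len < until_end ? frame_len : until_end;
        view.first = Span{buf_.data() + start, first_len};
        view.second = Span{buf_.data(), frame_len - first_len};
        view.ready = true;
        return view;
    }

    // Consumer: release samples once the transfer that reads them has finished.
    void advance(std::size_t n) {
        tail_.store(tail_.load(std::memory_order_relaxed) + n, std::memory_order_release);
    }

private:
    std::vector<float> buf_;
    std::size_t cap_;
    std::atomic<std::size_t> head_{0};  // total samples written
    std::atomic<std::size_t> tail_{0};  // total samples released
};
```

Because peek_frame() hands out pointers into live storage, the producer must not be allowed to overwrite that region until advance() is called. Delaying advance() until the transfer has completed is what makes the zero-copy path safe.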
Three-Stream Pipeline
Three CUDA streams — H2D, Compute, D2H — with event-based dependencies. Frame N+1's H2D overlaps frame N's compute, which overlaps frame N−1's D2H. Round-robin device buffers let the pipeline run two or three frames deep without ever allocating on the hot path.
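
The overlap pattern can be modeled on the host without any CUDA calls. The sketch below is only a scheduling model with invented names (Tick, schedule, buffer_slot): it shows which frame each stream would be handling at a given pipeline step and which round-robin slot a frame maps to. In a real CUDA build, the ordering between streams would be enforced with cudaEventRecord and cudaStreamWaitEvent rather than by a tick counter.

```cpp
#include <cassert>
#include <cstddef>

constexpr long kIdle = -1;  // marks a stream with nothing to do on this tick

// Frame index handled by each stream during one pipeline tick.
struct Tick {
    long h2d = kIdle;
    long compute = kIdle;
    long d2h = kIdle;
};

// Steady-state model: on tick t, H2D handles frame t, compute handles
// frame t-1, and D2H handles frame t-2. Ticks past the last frame drain
// the pipeline.
Tick schedule(long tick, long total_frames) {
    assert(tick >= 0 && total_frames >= 0);
    Tick t;
    if (tick < total_frames) t.h2d = tick;
    if (tick >= 1 && tick - 1 < total_frames) t.compute = tick - 1;
    if (tick >= 2 && tick - 2 < total_frames) t.d2h = tick - 2;
    return t;
}

// Round-robin mapping from frame index to a preallocated buffer slot,
// so no allocation is needed while frames are in flight.
std::size_t buffer_slot(std::size_t frame, std::size_t num_buffers) {
    assert(num_buffers > 0);
    return frame % num_buffers;
}
```

For a four-frame run, tick 2 has all three streams busy (frames 2, 1, and 0), while ticks 4 and 5 only drain the tail of the pipeline.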
Pimpl + RAII
Every public header is CUDA-free — the Pimpl idiom hides cudaStream_t, cufftHandle, and the rest behind a unique_ptr<Impl>. CUDA resources are RAII-wrapped (CudaStream, CufftPlan, DeviceBuffer, PinnedHostBuffer), move-only, and self-destructing — no manual cleanup, no leaks on exception paths.
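
A minimal sketch of the two idioms working together, written so it compiles without the CUDA toolkit. Everything here is hypothetical: fake_create and fake_destroy stand in for a C-style pair such as cudaStreamCreate / cudaStreamDestroy, and Handle and Engine are invented names rather than the project's classes.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Stand-in for a C-style resource API. Hypothetical, for illustration only.
inline int& live_handles() { static int count = 0; return count; }
inline int fake_create() { static int next_id = 0; ++live_handles(); return ++next_id; }
inline void fake_destroy(int /*id*/) { --live_handles(); }

// Move-only RAII wrapper: acquires in the constructor, releases in the destructor.
class Handle {
public:
    Handle() : id_(fake_create()) {}
    ~Handle() { reset(); }

    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;

    Handle(Handle&& other) noexcept : id_(std::exchange(other.id_, 0)) {}
    Handle& operator=(Handle&& other) noexcept {
        if (this != &other) {
            reset();
            id_ = std::exchange(other.id_, 0);
        }
        return *this;
    }

    bool valid() const { return id_ != 0; }

private:
    void reset() {
        if (id_ != 0) {
            fake_destroy(id_);
            id_ = 0;
        }
    }
    int id_;  // 0 means "moved from, owns nothing"
};

// Public class: only a forward-declared Impl appears here, so a header
// containing this declaration would need no resource-specific includes.
class Engine {
public:
    Engine();
    ~Engine();
    Engine(Engine&&) noexcept;
    Engine& operator=(Engine&&) noexcept;
    bool ready() const;

private:
    struct Impl;
    std::unique_ptr<Impl> impl_;
};

// Below this line is what would live in the implementation file.
struct Engine::Impl {
    Handle stream;
    Handle plan;
};

Engine::Engine() : impl_(std::make_unique<Impl>()) {}
Engine::~Engine() = default;  // defined where Impl is a complete type
Engine::Engine(Engine&&) noexcept = default;
Engine& Engine::operator=(Engine&&) noexcept = default;
bool Engine::ready() const {
    return impl_ != nullptr && impl_->stream.valid() && impl_->plan.valid();
}
```

The destructor and move operations are declared in the class but defined after Impl, because std::unique_ptr needs the complete type at the point where it may delete the object. Destroying an Engine releases both handles with no explicit cleanup code, including when an exception unwinds the stack.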
Strategy + Factory
Each pipeline stage (WindowStage, FFTStage, MagnitudeStage) implements a minimal ProcessingStage interface. StageFactory composes pipelines from configuration at runtime, so adding a new stage — bandpass, PSD, log-magnitude — is a single class implementation plus one factory entry.
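
The "one class plus one factory entry" claim can be illustrated with a small host-only sketch. The interface signature is an assumption: the real ProcessingStage presumably operates on device buffers and a CUDA stream rather than a std::vector. AbsStage, ScaleStage, register_stage, and run_pipeline are invented for this example.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Strategy: every stage exposes the same minimal interface (assumed signature).
struct ProcessingStage {
    virtual ~ProcessingStage() = default;
    virtual void process(std::vector<float>& data) = 0;
};

// Two toy stages standing in for real ones.
struct AbsStage : ProcessingStage {
    void process(std::vector<float>& data) override {
        for (float& v : data) v = std::fabs(v);
    }
};

struct ScaleStage : ProcessingStage {
    void process(std::vector<float>& data) override {
        for (float& v : data) v *= 2.0f;
    }
};

// Factory: maps configuration names to stage constructors.
class StageFactory {
public:
    using Creator = std::function<std::unique_ptr<ProcessingStage>()>;

    void register_stage(const std::string& name, Creator creator) {
        creators_[name] = std::move(creator);
    }

    // Build a pipeline in the order the names are listed.
    std::vector<std::unique_ptr<ProcessingStage>> build(
        const std::vector<std::string>& names) const {
        std::vector<std::unique_ptr<ProcessingStage>> pipeline;
        for (const std::string& name : names) {
            const auto it = creators_.find(name);
            if (it == creators_.end()) {
                throw std::invalid_argument("unknown stage: " + name);
            }
            pipeline.push_back(it->second());
        }
        return pipeline;
    }

private:
    std::map<std::string, Creator> creators_;
};

// Run every stage over the data in order.
inline void run_pipeline(const std::vector<std::unique_ptr<ProcessingStage>>& pipeline,
                         std::vector<float>& data) {
    for (const auto& stage : pipeline) stage->process(data);
}
```

Adding a stage under this design means writing one more ProcessingStage subclass and one more register_stage call; the code that builds and runs pipelines stays unchanged.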

Measured Performance

End-to-end benchmarks from the Python layer: same pipeline, measured across the pybind11 boundary with locked GPU clocks for low-variance numbers. The C++ engine in isolation runs faster still; the ~61 µs gap (171.1 µs end to end versus 109.6 µs for the isolated backend) is the per-frame cost of crossing the pybind11 boundary, even with zero-copy NumPy handoff.

Mean Latency: 171.1 µs
P99 Latency: 305.5 µs
Spectral SNR: 138.4 dB
Throughput (1 Ch): 4,834 FPS
Peak Throughput (8 Ch): 237.6 MSPS
RTF (100 kHz Batch): 0.003
>_ RTX 3090 Ti · Ryzen 9 5950X · Python E2E · NFFT=4096 · 100 kHz streaming · GPU clocks locked
>_ C++ backend isolated: 109.6 µs mean / 287.7 µs p99 — pybind11 overhead ≈ 61 µs / frame

Test Coverage

The C++ test suite uses Google Test with gcovr reports. Headline coverage is strong across the hot path — executors, kernels, ring buffer, pipeline. The uncovered portions concentrate in files that are intentionally not unit-tested at the C++ layer: signal_utils.cpp (test signal generation, validated against reference implementations), window_functions.hpp (pre-computed window coefficients cross-checked against NumPy/SciPy), and executor_config.hpp (a plain POD configuration struct).

C++ test coverage summary
Overall gcovr coverage — executors, kernels, ring buffer
Annotated view of low-coverage files
Uncovered hot spots isolated to non-critical support files

Component Architecture

The full class and dependency map for the C++ side: every namespace, every public type, every processing stage. The diagram is dense by design.

Full C++ component and class architecture
FIG 05: Full class / component map — all eight layers and their dependencies