RTNeural Trainer started with a simple product shape: give the app a dry guitar recording, give it the same performance through an amp or pedal, train a compact model, and export something that can run in a real-time audio plugin.
That sounds like machine learning. In practice, it quickly became a DSP systems problem with a model in the middle.
The model cannot merely be accurate in Python. It has to be causal. It has to stream at audio rate. It has to fit through RTNeural's JSON runtime. It has to match TensorFlow/Keras closely enough after export. It has to benchmark above real time in native C++. It has to survive a DAW buffer size that is measured in samples, not vibes.
That constraint is what shaped the research.
We published the technical preprint today on Zenodo: Aliasing-Aware RTNeural-Compatible WaveNet Modeling of Guitar Amplifier and Pedal Captures. The paper is the formal version of the work: the architecture comparisons, capture notes, aliasing diagnostics, export gates, and runtime conclusions behind the trainer.
This post is the workshop version.
The Problem Is Not Just Tone Matching
The supervised-learning view is clean enough:
dry guitar x[n] -> amplifier or pedal H -> target y[n]
train f_theta(x[n]) ~= y[n]
But guitar rigs are not static transfer functions. They have memory. They compress, slew, filter, saturate, recover, and react differently to palm mutes, single notes, hard pick attack, chords, and decays into the noise floor.
High-gain tones are especially unforgiving. A model can produce a waveform that looks close in aggregate while still leaking metallic foldback, missing pick transients, or shaving off the small dynamics that make a capture feel connected to the hands.
And then there is the export target. We are not training a giant offline model for rendering. RTNeural Trainer is aimed at compact models that can become real-time audio DSP:
- Prepared 48 kHz mono paired captures
- TensorFlow/Keras training
- RTNeural-compatible JSON export
- Python parity checks
- Native C++ RTNeural validation
- Native benchmark reports
- Aliasing probes
- DAW/plugin smoke testing
That pipeline changed what counted as a successful model. A checkpoint with a nice validation loss was only a candidate. The real question was whether the exact exported package still behaved after leaving the training environment.
Why WaveNet Became The Main Lane
Early versions of the trainer kept the architecture menu broad: dense layers, GRU, LSTM, shallow Conv1D, BatchNorm/PReLU variants, Conv-GRU hybrids, and WaveNet-style temporal convolutional networks.
That breadth was useful for building exporter coverage, but it did not stay useful as a product recommendation. Across clean, crunch, rhythm, edge-of-breakup, lead, overdrive pedal, and later real hardware captures, the best-tested path kept coming back to WaveNet-family causal Conv1D stacks.
The reason is practical. A WaveNet-style temporal convolutional network has the right kind of memory for this job:
- Causal layers can run forward in time.
- Dilations create a finite receptive field without recurrent state at every sample.
- Conv1D layers map cleanly into RTNeural JSON.
- Width, depth, kernel size, and dilation schedule can be adjusted by preset.
- Native validation remains tractable.
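The "finite receptive field" point can be made concrete with a little arithmetic. The sketch below uses the standard formula for a stack of causal dilated convolutions, where each layer extends the look-back window by (kernel_size - 1) * dilation samples. The schedule shown is hypothetical and is not one of the trainer's presets.

```python
def receptive_field(layers):
    """Receptive field, in samples, of a stack of causal dilated Conv1D layers.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    return 1 + sum((k - 1) * d for k, d in layers)


# Hypothetical schedule for illustration only, not a trainer preset.
schedule = [(3, 1), (3, 2), (3, 4), (3, 8), (3, 16), (3, 32)]
rf = receptive_field(schedule)
print(rf, "samples =", round(1000 * rf / 48_000, 2), "ms at 48 kHz")
# 127 samples = 2.65 ms at 48 kHz
```

Mixed kernel sizes and non-power-of-two dilations fit the same formula, which makes it a quick way to check how much past signal a preset can see.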
The current standard graph is intentionally simple:
x -> [causal Conv1D -> activation] * N -> output layer -> y_hat
That is not full academic WaveNet with residual and skip connections. It is a deliberately exportable approximation: enough temporal modeling to learn amp and pedal behavior, but constrained enough that the model can be serialized, loaded, compared, and benchmarked inside the real target runtime.
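To show what "sequential and causal" buys at runtime, here is a minimal single-channel sketch of that chain in plain Python. It is not RTNeural or Keras code, and the class name, taps, and dilations are made up for illustration. Each layer keeps only a short history of past inputs, so processing a signal in host-sized blocks gives the same result as processing it in one pass.

```python
import math


class CausalConvLayer:
    """Single-channel causal dilated convolution followed by an activation."""

    def __init__(self, taps, dilation, activation=math.tanh):
        self.taps = taps                # taps[0] weights the current sample
        self.dilation = dilation
        self.activation = activation
        # Past inputs carried between blocks; zeros stand for silence before the stream.
        self.history = [0.0] * ((len(taps) - 1) * dilation)

    def process(self, block):
        offset = len(self.history)
        buf = self.history + list(block)
        out = []
        for n in range(len(block)):
            acc = sum(w * buf[offset + n - i * self.dilation]
                      for i, w in enumerate(self.taps))
            out.append(self.activation(acc))
        # Keep only the tail that the next block can still reach.
        self.history = buf[len(buf) - offset:]
        return out


def make_stack():
    # Hypothetical taps and dilations, chosen only for illustration.
    hidden = [CausalConvLayer([0.6, -0.3, 0.2], d) for d in (1, 2, 5)]
    output = CausalConvLayer([0.8], 1, activation=lambda v: v)  # linear output layer
    return hidden + [output]


def run_stack(layers, block):
    for layer in layers:
        block = layer.process(block)
    return block


signal = [math.sin(0.05 * n) for n in range(256)]
offline = run_stack(make_stack(), signal)

stack = make_stack()
streamed = []
for start in range(0, len(signal), 32):   # 32-sample host buffers
    streamed += run_stack(stack, signal[start:start + 32])

# Largest difference between one-pass and block-wise processing.
print(max(abs(a - b) for a, b in zip(offline, streamed)))
```

The real models have many channels per layer, but the state-keeping idea is the same: a fixed-size history per layer and no recurrent state to update at every sample.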
The product-facing presets followed from that evidence. wavenet_tcn_balanced became the practical first quality run. wavenet_tcn_quality became the higher-capacity choice for crunch, high-gain rhythm, and maximum-fidelity exports. wavenet_tcn_edge handled clean-to-edge captures where a mostly linear model missed light breakup. wavenet_tcn_a2_prelu became the strongest high-gain research candidate while still staying RTNeural-safe.
The older non-WaveNet presets are still useful as internal fixtures. They help test RTNeural layer coverage and historical comparisons. But they are no longer the trainer's center of gravity.
A2-Inspired, But Not A2 Yet
Neural Amp Modeler's A2 architecture was one of the important references for the work. A2 points in a strong direction: WaveNet-like temporal modeling, small enough to run in real time, shaped by the realities of guitar amp capture.
But exact A2 support is not a free export target for RTNeural's current dynamic JSON path. True A2 includes residual behavior, skip accumulation, head convolutions, input mix-in paths, and slimmable model selection. The current RTNeural dynamic model path is much closer to a sequential chain.
So the trainer uses an A2-inspired preset rather than claiming A2 equivalence. The wavenet_tcn_a2_prelu preset uses mixed kernel sizes, non-power-of-two dilations, and PReLU nonlinearities while remaining a sequential RTNeural-compatible graph.
That compromise turned out to be musically useful. On the RHYTHM4 high-gain case, the A2-inspired preset improved substantially on the earlier quality/smoothed-tanh result: preview error-to-signal ratio (ESR) moved from 0.0646 to 0.0381 after continuation, average aliasing-to-signal ratio (ASR) fell from 0.0419 to 0.0145, and the native Eigen benchmark still reported a worst-case real-time factor (RTF) of roughly 6.62x on the reference system.
That was the point where the preset stopped feeling like a side experiment. It was slower and larger than the basic quality model, but it was still inside the real-time envelope on the test machine, and it sounded better in the plugin smoke path.
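For readers who have not met ESR before, the sketch below uses the common definition: residual energy divided by target energy. This post does not spell out the trainer's exact preview metric, so treat any pre-emphasis, windowing, or normalization details as outside this example. The function name and sample values are illustrative.

```python
def esr(target, prediction):
    """Error-to-signal ratio: residual energy divided by target energy."""
    if len(target) != len(prediction):
        raise ValueError("signals must have the same length")
    signal_energy = sum(y * y for y in target)
    if signal_energy == 0.0:
        raise ValueError("target is silent, ESR is undefined")
    error_energy = sum((y - p) ** 2 for y, p in zip(target, prediction))
    return error_energy / signal_energy


# Made-up samples: a prediction that is uniformly 10% too quiet.
target = [0.5, -1.0, 0.25, 0.8]
prediction = [0.9 * y for y in target]
print(esr(target, prediction))   # close to 0.01, since (0.1)^2 = 0.01
```

Read this way, an ESR of 0.0381 means the residual carries about 3.8 percent of the target's energy over the evaluated audio.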
The next deeper step is probably a fused a2_wavenet or residual_tcn layer in our RTNeural fork, not a full generic graph runtime as the first move. NAM Core itself treats A2 as a specialized fast path, and that is a good clue. Real-time audio often rewards the known narrow primitive over the elegant general abstraction.
Aliasing Needed Its Own Gate
Nonlinear audio models generate harmonics. If those harmonics cross Nyquist, they fold back.
That matters a lot for distorted guitar because upper harmonics, fizz, pick edge, and cabinet filtering all sit near the line between desirable and awful. A model can look better on waveform error while producing aliasing that is obvious on sustained high notes.
So RTNeural Trainer added a probe-based aliasing-to-signal ratio (ASR) report. The current diagnostic renders deterministic sine probes through the exported model, compares harmonic energy with non-harmonic folded energy, and reports ASR as an engineering warning.
It is not a psychoacoustic truth machine. It does not replace listening. But it catches a class of failure that ESR alone can miss.
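A stripped-down version of the probe idea looks like the sketch below. It is not the trainer's diagnostic. It assumes a coherently sampled sine whose frequency sits exactly on a DFT bin, treats integer multiples of that bin below Nyquist as harmonic, and counts everything else except DC as folded. The stand-in models are memoryless toy functions, so no warm-up trimming is needed. A real exported model has memory and would need its transient discarded first.

```python
import cmath
import math


def power_spectrum(x):
    """Power at bins 0..N/2 from a direct DFT (adequate for short probes)."""
    n = len(x)
    twiddle = [cmath.exp(-2j * math.pi * j / n) for j in range(n)]
    return [abs(sum(x[t] * twiddle[(k * t) % n] for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def probe_asr(model, probe_bin=53, n=512, amplitude=1.0):
    """Aliasing-to-signal ratio for one sine probe.

    `model` maps a list of samples to a list of samples. With `n` a power of
    two, an odd `probe_bin` keeps folded harmonics off the harmonic bins.
    """
    probe = [amplitude * math.sin(2 * math.pi * probe_bin * t / n) for t in range(n)]
    power = power_spectrum(model(probe))
    harmonic_bins = set(range(probe_bin, n // 2 + 1, probe_bin))
    harmonic = sum(power[k] for k in harmonic_bins)
    folded = sum(p for k, p in enumerate(power) if k > 0 and k not in harmonic_bins)
    return folded / harmonic


# Toy stand-ins for an exported model (assumptions for illustration).
def linear(x):
    return [0.5 * v for v in x]


def hard_clip(x):
    return [max(-0.25, min(0.25, v)) for v in x]


print("linear    ASR:", probe_asr(linear))
print("hard clip ASR:", probe_asr(hard_clip))
```

At 48 kHz, bin 53 of 512 is a probe just under 5 kHz. Its fifth harmonic already lands above Nyquist, which is why the hard clipper reports clearly non-zero folded energy while the linear gain reports essentially none.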
The smoothed-tanh experiments made that obvious. Training with gentler hidden tanh(x / alpha) activations sometimes reduced probe ASR, but not universally. On one rhythm test, smoothing lowered aliasing while worsening waveform fit. On RHYTHM4, tanh(x / 1.5) improved both ESR and ASR relative to the plain quality model. That is exactly the kind of result that needs an export report rather than a slogan.
The practical export gate became multi-objective:
ready =
    parity passes
    AND native validation passes
    AND native benchmark has headroom
    AND preview metrics are acceptable
    AND ASR/listening review is acceptable
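The same gate can be sketched as a small function that returns the reasons for a failure instead of a bare boolean. The field names, the function name, and every threshold below are placeholders invented for this example. They are not the trainer's actual schema or limits.

```python
from dataclasses import dataclass


@dataclass
class ExportEvidence:
    parity_passed: bool
    native_validation_passed: bool
    worst_case_rtf: float
    preview_esr: float
    average_asr: float
    listening_approved: bool


def export_gate(evidence, min_rtf=2.0, max_esr=0.05, max_asr=0.02):
    """Return (ready, failures). All thresholds are hypothetical defaults."""
    checks = {
        "parity": evidence.parity_passed,
        "native validation": evidence.native_validation_passed,
        "benchmark headroom": evidence.worst_case_rtf >= min_rtf,
        "preview metrics": evidence.preview_esr <= max_esr,
        "ASR/listening review": (evidence.average_asr <= max_asr
                                 and evidence.listening_approved),
    }
    failures = [name for name, ok in checks.items() if not ok]
    return not failures, failures


# Made-up evidence: every automated check passes, but nobody has listened yet.
candidate = ExportEvidence(True, True, 6.0, 0.03, 0.01, listening_approved=False)
print(export_gate(candidate))   # (False, ['ASR/listening review'])
```

Returning the failing checks by name keeps the decision inspectable, which is the point of the gate.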
That last line matters. The app should not pretend a single number knows whether a capture feels right. It should expose the evidence so the user can make a better decision.
Capture Quality Was A First-Order Variable
Some of the biggest improvements did not come from changing the model. They came from respecting the measurement problem.
Latency is the fragile one. If the dry input and processed target are off by a few samples, the model learns a confused version of the target. Clean tones are often easy to align. Heavy distortion can make the processed waveform only weakly correlated with the DI, and several offsets can look plausible.
The workflow now treats latency as something to inspect:
- Estimate the offset.
- Show confidence and window agreement.
- Preserve top candidates.
- Allow manual nudge.
- Prefer known DAW-render latency when possible.
- Encourage a transient preamble before the musical pass.
That preamble is simple but valuable: a few seconds of clear dry attacks, palm mutes, single-note plucks, and different registers before the main performance. It gives the estimator something less ambiguous than dense distorted music.
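As a rough illustration of "estimate the offset and preserve top candidates", the sketch below scores each candidate delay by normalized cross-correlation and keeps the best few. It is a simplification under stated assumptions: the trainer's estimator is not shown in this post, the function name is invented, and the synthetic preamble is decaying noise bursts pushed through a tanh as a stand-in for the processed target.

```python
import math
import random


def rank_offsets(dry, wet, max_lag, top=3):
    """Score each delay of `wet` relative to `dry` and return the best few.

    Scores are absolute normalized cross-correlations, so an inverted-polarity
    target still ranks highly. Returns a list of (lag, score) pairs.
    """
    scored = []
    for lag in range(max_lag + 1):
        n = min(len(dry), len(wet) - lag)
        a = dry[:n]
        b = wet[lag:lag + n]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        scored.append((lag, abs(num) / den if den else 0.0))
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top]


# Synthetic preamble: four decaying noise bursts (illustrative only).
rng = random.Random(0)
dry = [rng.uniform(-1.0, 1.0) * math.exp(-(n % 500) / 80.0) for n in range(2000)]

# Stand-in "rig": 37 samples of latency followed by a saturating curve.
true_delay = 37
wet = [0.0] * true_delay + [math.tanh(5.0 * v) for v in dry]

candidates = rank_offsets(dry, wet, max_lag=100)
print(candidates)
```

The gap between the first and second scores is a crude confidence signal. When that gap is small, as it can be with dense distorted material, it is a cue to check the candidates by eye or fall back to a known DAW-render latency.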
The other capture lesson was source material density. Ten minutes of audio is not automatically better than three focused minutes. Long captures can include too much silence, repeated material, or low-information decay. Shorter captures worked well when they were varied, trimmed, and kept at healthy levels.
The strongest hardware export trio came after those workflow changes had matured. A real amp clean/edge capture, a real overdrive pedal capture, and a real amp rhythm capture all passed native RTNeural validation with low probe ASR:
- Clean/edge: ESR 0.00217, average ASR 0.00251, native RTF 20.72x
- Overdrive pedal: ESR 0.00043, average ASR 0.00080, native RTF 6.49x
- Rhythm amp: ESR 0.00355, average ASR 0.00359, native RTF 6.63x
That was one of the best signs that the pipeline was real. Not just a notebook. Not just a preview WAV. Actual export packages surviving the native validation path.
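To put those RTF numbers in buffer terms, the sketch below assumes the usual definition of real-time factor: audio duration divided by processing time. The 64-sample buffer is an arbitrary example, and the figures only describe the model's own processing on the benchmark machine, not a full plugin with a host around it.

```python
def buffer_budget(block_size, sample_rate, rtf):
    """Return (deadline_ms, processing_ms, load) for one audio buffer.

    Assumes RTF = audio duration / processing time, so load = 1 / RTF.
    """
    deadline_ms = 1000.0 * block_size / sample_rate
    processing_ms = deadline_ms / rtf
    return deadline_ms, processing_ms, processing_ms / deadline_ms


# RTF values are the ones reported above; the 64-sample buffer is an assumption.
for name, rtf in [("clean/edge", 20.72), ("overdrive pedal", 6.49), ("rhythm amp", 6.63)]:
    deadline, processing, load = buffer_budget(64, 48_000, rtf)
    print(f"{name}: {processing:.3f} ms of a {deadline:.3f} ms buffer ({load:.1%})")
```

Under that definition, an RTF around 6.5x means the model uses roughly 15 percent of each buffer's time budget on the test machine, leaving the rest for the host and the remainder of the signal chain.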
Runtime Proof Changed The Product
One of the healthiest habits in this project has been refusing to separate training quality from runtime behavior.
Every serious export goes through RTNeural validation and benchmarking. The native validator loads the JSON model, compares against the Keras reference path, checks block-size and channel cases, and records the worst real-time factor. The desktop app surfaces that evidence instead of hiding it behind a vague "export complete" message.
That has already changed product decisions. A theoretically smaller grouped/depthwise Conv1D experiment looked attractive on paper, but the current dynamic JSON runtime did not make it faster than the balanced WaveNet path. The lesson was not "separable convolution is bad." The lesson was more specific: in this runtime, with this dynamic layer overhead, lower theoretical MAC count did not automatically mean better plugin behavior.
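The "on paper" part is easy to reproduce. The sketch below counts multiply-accumulates per output sample for a standard Conv1D and for a depthwise-plus-pointwise replacement. The channel count and kernel size are hypothetical, not the experiment's real shapes, and biases and activations are ignored.

```python
def conv1d_macs(kernel_size, in_channels, out_channels, groups=1):
    """Multiply-accumulates per output sample for a (grouped) Conv1D layer."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    return kernel_size * (in_channels // groups) * out_channels


# Hypothetical layer shape: 16 channels in and out, kernel size 3.
channels, kernel = 16, 3
standard = conv1d_macs(kernel, channels, channels)
depthwise = conv1d_macs(kernel, channels, channels, groups=channels)
pointwise = conv1d_macs(1, channels, channels)

print("standard:", standard)                 # standard: 768
print("separable:", depthwise + pointwise)   # separable: 304
```

The separable version needs about 40 percent of the arithmetic here, but it is also two layers instead of one. In a runtime where each dynamically dispatched layer carries fixed overhead, that extra layer can cancel the saving, which matches what the benchmark showed.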
The same thinking applies to future work. Dynamic JSON is an excellent validation and user-export format. It may not be the final performance ceiling for product plugins. Static ModelT shapes, fused Conv1D/activation blocks, backend-specific kernels, oversampling, and constrained residual TCN layers are all on the table when the plugin path needs more headroom.
But those should be pulled by measurements, not by anxiety. The current WaveNet and A2-inspired exports are already viable enough to keep moving.
What The Paper Claims, Carefully
The Zenodo preprint does not claim that WaveNet is universally best for every neural audio effect. It makes a narrower and more useful claim:
Within the current 48 kHz mono paired-capture workflow, and under RTNeural JSON export constraints, WaveNet-family temporal convolutional models are the strongest product path we have tested so far.
That scope matters. The dataset is internal. The benchmarks are hardware-specific. The ASR thresholds are early. Listening tests still matter. Other architectures may win under different runtime targets, datasets, or model formats.
But for this tool, right now, the evidence is strong enough to simplify the app around the path that works.
That is the part I like most about this research. It did not end with a model zoo. It ended with a sharper workflow:
- Capture cleanly.
- Align carefully.
- Train a curated WaveNet-family preset.
- Compare preview and residual evidence.
- Export to RTNeural JSON.
- Validate against the source model.
- Benchmark in native C++.
- Review aliasing.
- Load it in a real plugin path.
That is the difference between "we trained a neural network" and "we built a model that can leave the lab."
RTNeural Trainer is still a local research workbench, but the direction is clearer now. The interesting work is not just making neural amp captures more accurate. It is making them inspectable, repeatable, and boring enough to trust inside a real-time audio thread.
