Transcribe Module Upgrade Roadmap
Overview
The transcribe module in cnc-formats (behind the transcribe feature flag) provides WAV/PCM-to-MIDI audio transcription — converting audio waveforms into symbolic MIDI note data. The current implementation uses basic YIN pitch detection with energy-based onset detection and produces SMF Type 0 MIDI output. This roadmap defines a phased upgrade path from “basic demo” to commercial-tool-competitive quality.
Current state (2026-03-14): 60 tests passing, clippy clean, fmt clean. Files: pitch.rs (YIN pitch detection + freq_to_midi_note / midi_note_to_freq), onset.rs (energy-based onset detection, DetectedNote, velocity estimation), quantize.rs (note-to-MIDI events, VLQ encoding, SMF Type 0 assembly), mod.rs (public API: pcm_to_mid, pcm_to_notes, notes_to_mid, wav_to_mid, wav_to_xmi, mid_to_xmi), tests.rs (20 tests: functionality, errors, Display, determinism, XMI roundtrip), tests_validation.rs (19 tests: boundary, overflow, V38 adversarial).
Quality Tiers
The upgrade has two tiers, each behind its own feature flag:
| Aspect | PRISM ($69) | Current (basic YIN) | DSP-only (Phases 1-6, transcribe) | ML-enhanced (Phase 7, transcribe-ml) |
|---|---|---|---|---|
| Approach | Proprietary neural nets | Single-threshold YIN | pYIN + HMM + spectral flux | Basic Pitch CNN (~17K params) |
| Polyphonic | Yes (trained models) | No | Basic (HPS, 2-6 voices) | Yes (native multi-pitch) |
| Instrument-specific | Yes (piano, guitar, general) | No | No (generic spectral) | No (instrument-agnostic) |
| Monophonic quality | Excellent | Basic | Good (pYIN proven) | Excellent |
| Polyphonic quality | Excellent | N/A | Moderate (simple textures) | Good-Excellent |
| Pitch bend | Yes | No | Phase 6 only | Native (model output head) |
| Dependencies | Large ML runtime | Zero | Zero (Phases 1-4, 6) | ort or candle-core + ~3 MB model |
| Tuning | Sensitivity knob | yin_threshold only | Many knobs | min_confidence + onset_threshold |
| License | Proprietary | MIT/Apache-2.0 | MIT/Apache-2.0 | Apache-2.0 (model + code) |
PRISM (Aurally Sound, $69 plugin) is the commercial quality benchmark. DSP-only gets comparable to aubio/librosa/essentia with pYIN. ML-enhanced approaches PRISM-competitive quality using Spotify’s open-source Basic Pitch model.
Phase Plan
Phase 1: pYIN + Viterbi (~600 lines) – HIGHEST PRIORITY
Why: Eliminates octave errors (the biggest quality issue with basic YIN), smooths pitch transitions, gives voicing probability per frame.
Algorithm:
- Replace single YIN threshold with 100 thresholds (0.01-1.0, step 0.01)
- Weight candidates using Beta(alpha=2, beta=18) prior distribution
- Unvoiced probability = residual mass where no CMNDF dip found
- HMM Viterbi decoding: states = 480 pitch values (10-cent resolution, 50-800 Hz) + 1 unvoiced state
- Transition matrix: Gaussian kernel (sigma=13 cents) favoring small pitch changes
- Decode entire sequence -> smoothed pitch track
New parameters: use_pyin: bool (default: true), beta_alpha: f32 (2.0), beta_beta: f32 (18.0), hmm_transition_width: f32 (13.0 cents), voicing_penalty: f32 (0.01)
References: Mauch & Dixon, “pYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions” (ICASSP 2014). pyin-rs crate on crates.io (study, don’t depend on).
Phase 2: SuperFlux Onset Detection (~300 lines)
Why: Better note boundaries, handles fast passages, vibrato-tolerant.
Algorithm:
- Compute STFT (n_fft=2048, hop=441 i.e. 10 ms at 44.1 kHz, Hann window)
- Apply log compression:
S_log = log(1 + gamma * |X|)where gamma=10-100 - Apply 138-band mel filterbank (quarter-tone resolution, 27.5-16000 Hz)
- Maximum filter of width 3 along frequency axis on previous frame (absorbs vibrato)
- Half-wave rectified spectral flux:
SF(n) = sum max(0, S(n) - S(n-lag)) - Adaptive threshold:
threshold(n) = median(SF[n-W:n+W]) + delta - Peak picking with minimum inter-onset interval
New parameters: onset_method: OnsetMethod (Energy | SpectralFlux | SuperFlux, default Energy), onset_gamma: f32 (10.0), onset_threshold_delta: f32 (0.05), min_inter_onset_ms: u32 (30), onset_lag: u8 (2)
FFT requirement: Requires a basic FFT implementation. Options: inline radix-2 Cooley-Tukey in ~150 lines (sufficient for fixed power-of-2 sizes), or add rustfft as optional dep behind the transcribe feature. This decision also applies to Phase 5 (polyphonic HPS).
References: Bock & Widmer, “Maximum Filter Vibrato Suppression for Onset Detection” (DAFx 2013).
Phase 3: Confidence Scoring (~100 lines)
Why: Lets users filter by quality – “only keep notes I’m sure about.”
Algorithm: Fuse three signals per frame:
yin_confidence = 1.0 - cmndf_min(already available from YIN)hnr = 10 * log10(r(tau) / (1 - r(tau)))(harmonic-to-noise ratio from autocorrelation)spectral_flatness = geometric_mean(|X|) / arithmetic_mean(|X|)(Wiener entropy; 0=tonal, 1=noise)
Combined: confidence = 0.5*(1-cmndf) + 0.3*sigmoid(hnr-5) + 0.2*(1-flatness)
New parameter: min_confidence: f32 (default: 0.0, i.e. keep all). Adds confidence: f32 field to DetectedNote.
Phase 4: Median Filter Smoothing (~50 lines)
Why: Removes isolated glitch frames (single-frame octave jumps, noise spikes).
Algorithm: After pitch detection, before onset segmentation, apply a median filter of configurable width to the MIDI note sequence.
New parameter: median_filter_width: u8 (odd number, 0=disabled, default: 3)
Phase 5: Basic Polyphonic Detection (~400 lines)
Why: Detect 2-6 simultaneous voices without ML.
Algorithm: Harmonic Product Spectrum (HPS) with iterative subtraction:
- Compute FFT magnitude spectrum
- Downsample by factors 2..H, multiply -> HPS peak = fundamental
- Subtract detected harmonics (Gaussian spectral template, width ~20 Hz)
- Repeat on residual to find next voice
- Stop when HPS peak-to-median ratio < threshold
New parameters: max_voices: u8 (1=mono default, 2-6=poly), num_harmonics: u8 (5), subtraction_gain: f32 (0.9), peak_threshold: f32 (3.0)
Output change: Multi-voice produces Type 1 MIDI (one track per voice) or all on channel 0 with overlapping notes.
Phase 6: Pitch Bend Output (~100 lines)
Why: Preserves expression – portamento, vibrato, micro-tuning.
Algorithm: When the detected frequency deviates from the nearest MIDI note by more than a configurable threshold, emit MIDI pitch bend events alongside the note.
New parameter: pitch_bend: bool (default: false)
Phase 7: ML-Enhanced Transcription (~500 lines) – PREMIUM
Why: Replaces the entire DSP pitch+onset pipeline with a single neural model that natively outputs polyphonic notes, onsets, and pitch bends. This is the path to commercial-competitive quality.
Model: Spotify Basic Pitch
- Apache-2.0 license (code and weights)
- ~17,000 parameters, <20 MB peak memory, ~3 MB ONNX weights
- Architecture: harmonic stacking input -> shallow CNN -> 3 output heads (notes, onsets, pitch bends)
- Polyphonic, instrument-agnostic, includes pitch bend detection natively
- Paper: Bittner et al., “A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation” (ICASSP 2022)
- ONNX weights ship with the official Python package and are on Hugging Face (
spotify/basic-pitch) - Prior art:
basicpitch.cpp(C++20 port with ONNXRuntime) proves the model runs outside Python
Integration – two options:
Option A: ort (ONNX Runtime) | Option B: candle (pure Rust) | |
|---|---|---|
| Crate | ort v2.x (pyke.io) | candle-core + candle-nn v0.9.x |
| How | Load Basic Pitch .onnx directly | Reimplement ~17K-param CNN in Rust, load safetensors weights |
| Native deps | Links to ONNX Runtime C library | None (pure Rust) |
| Code effort | ~200 lines (glue + pre/post processing) | ~400 lines (model architecture + weight loading) |
| GPU support | CUDA, DirectML, CoreML via execution providers | CUDA, Metal via candle backends |
| Maturity | Production-grade (Microsoft-backed runtime) | Newer but actively maintained (Hugging Face) |
| Model format | .onnx (standard, portable) | .safetensors (needs weight conversion from ONNX) |
Recommendation: Start with Option A (ort) for fastest path to working ML inference. The ort crate wraps ONNX Runtime which is battle-tested. Basic Pitch already ships ONNX weights. If pure-Rust becomes a hard requirement later, port to candle – the model is small enough that reimplementing the architecture is straightforward.
Feature flag structure:
[features]
transcribe = ["midi"] # DSP-only
transcribe-ml = ["transcribe", "dep:ort"] # ML-enhanced via ONNX Runtime
# Alternative pure-Rust path:
# transcribe-ml = ["transcribe", "dep:candle-core", "dep:candle-nn"]
New files: src/transcribe/ml.rs (model loading, pre-processing, inference, post-processing), src/transcribe/ml_tests.rs (synthetic audio tests, comparison with DSP path, adversarial inputs)
How it integrates with existing API:
#![allow(unused)]
fn main() {
pub fn pcm_to_mid(samples: &[f32], sample_rate: u32, config: &TranscribeConfig) -> Result<Vec<u8>> {
#[cfg(feature = "transcribe-ml")]
if config.use_ml {
return ml::pcm_to_notes_ml(samples, sample_rate, config)
.map(|notes| notes_to_mid(¬es, config));
}
// Fall back to DSP pipeline
let pitches = pitch::detect_pitches(/* ... */);
// ...
}
}
New config parameters: use_ml: bool (prefer ML model when transcribe-ml is enabled, default: true when feature active), ml_onset_threshold: f32 (0.5), ml_note_threshold: f32 (0.5), ml_model_path: Option<PathBuf> (custom model path, default: embedded or downloaded)
Model weight distribution options:
- Embed in binary via
include_bytes!(~3 MB increase in binary size) – simplest - Separate
cnc-formats-modelscrate – keeps main crate small, model downloaded as dep - Download on first use – smallest binary, requires network access
Reusability beyond MIDI transcription: The ML infrastructure (ort or candle) unlocked by this phase enables future modules: audio classification (instrument detection), format detection (classify unknown binary blobs), sprite upscaling (SHP super-resolution), palette optimization (learned palette generation). These would be separate feature flags sharing the same runtime dependency.
Complete TranscribeConfig (after all phases)
#![allow(unused)]
fn main() {
pub struct TranscribeConfig {
// --- Core (Phase 0, already implemented) ---
pub yin_threshold: f32, // 0.0-1.0, default 0.15
pub window_size: usize, // default 2048
pub hop_size: usize, // default 512
pub min_freq: f32, // Hz, default 80.0
pub max_freq: f32, // Hz, default 2000.0
pub min_duration_ms: u32, // default 50
pub ticks_per_beat: u16, // default 480
pub tempo_bpm: u16, // default 120
pub channel: u8, // 0-15, default 0
pub velocity: u8, // 1-127, default 100
pub estimate_velocity: bool, // default false
// --- Phase 1: pYIN ---
pub use_pyin: bool, // default true
pub beta_alpha: f32, // Beta prior shape, default 2.0
pub beta_beta: f32, // Beta prior shape, default 18.0
pub hmm_transition_width: f32, // cents, default 13.0
pub voicing_penalty: f32, // default 0.01
// --- Phase 2: Onset ---
pub onset_method: OnsetMethod, // Energy|SpectralFlux|SuperFlux, default Energy
pub onset_gamma: f32, // log compression, default 10.0
pub onset_threshold_delta: f32, // adaptive offset, default 0.05
pub min_inter_onset_ms: u32, // default 30
pub onset_lag: u8, // SuperFlux lag, default 2
// --- Phase 3: Confidence ---
pub min_confidence: f32, // 0.0-1.0, default 0.0
// --- Phase 4: Smoothing ---
pub median_filter_width: u8, // 0=disabled, default 3
// --- Phase 5: Polyphonic ---
pub max_voices: u8, // 1=mono, 2-6=poly, default 1
pub num_harmonics: u8, // HPS harmonics, default 5
pub subtraction_gain: f32, // harmonic removal, default 0.9
pub peak_threshold: f32, // HPS acceptance, default 3.0
// --- Phase 6: Expression ---
pub pitch_bend: bool, // default false
// --- Phase 7: ML (behind transcribe-ml feature) ---
pub use_ml: bool, // prefer ML model when available, default true
pub ml_onset_threshold: f32, // onset activation threshold, default 0.5
pub ml_note_threshold: f32, // note activation threshold, default 0.5
pub ml_model_path: Option<std::path::PathBuf>, // custom .onnx path, None=embedded
// --- Post-processing ---
pub quantize_grid: Option<u32>, // snap to grid (ticks), None=free
}
}
Test Strategy Per Phase
Each phase adds tests following AGENTS.md testing requirements:
- Happy path with synthetic audio (known frequencies -> known MIDI notes)
- Comparison: pYIN should detect the same note as YIN on clean sine, but NOT produce octave errors on edge cases
- Adversarial: NaN, Infinity, all-zeros, all-ones -> no panic
- Determinism: same input -> same output
- Boundary: min/max frequency, single frame, huge input
Key Design Constraints
DSP path (Phases 1-4, 6): No new crate dependencies. All algorithms are pure arithmetic on f32 slices. Phase 5 (polyphonic) needs FFT – either inline radix-2 Cooley-Tukey (~150 lines) or optional rustfft dep. Phase 2 (SuperFlux) also needs FFT, so the FFT question must be decided at Phase 2.
ML path (Phase 7): Gated behind transcribe-ml feature flag. The DSP path must remain fully functional without ML deps – transcribe alone never pulls in ort or candle. Users who don’t want the ML overhead get the same zero-dep DSP pipeline. The ML path is strictly additive.
External References
DSP:
- pYIN paper: Mauch & Dixon, ICASSP 2014
- SuperFlux: Bock & Widmer, DAFx 2013
- HPS: Schroeder (1968), improved by Noll (1970)
- YIN: de Cheveigne & Kawahara, JASA 2002
pyin-rscrate: crates.io (Rust pYIN reference, study only)pitch-detectioncrate: crates.io (McLeod pitch method, Rust)
ML:
- Basic Pitch paper: Bittner et al., ICASSP 2022
- Basic Pitch repo: github.com/spotify/basic-pitch (Apache-2.0)
- Basic Pitch weights: huggingface.co/spotify/basic-pitch
- basicpitch.cpp (C++ port): github.com/sevagh/basicpitch.cpp
- CREPE: Kim et al., ICASSP 2018 (MIT)
- CREPE ONNX weights: github.com/yqzhishen/onnxcrepe
ortcrate: crates.io/crates/ort (ONNX Runtime for Rust, pyke.io)candleframework: github.com/huggingface/candle (MIT/Apache-2.0)
Commercial:
- PRISM: aurallysound.com (ML-based, $69, proprietary – quality benchmark only)