Differentiable Additive Synthesizer

medium (<30 mins) pytorch generative ddsp audio

this year by E

Description

Differentiable Digital Signal Processing (DDSP) combines classic signal processing with deep learning: synthesizer components are written as differentiable operations, so their parameters can be learned by gradient descent [1]. Your task is to implement the core of DDSP: a differentiable additive synthesizer, which generates audio by summing a series of harmonically related sine waves.

Guidance

Your synthesizer should be a function that takes learnable parameters (an overall amplitude, a per-harmonic distribution, and a fundamental frequency) and outputs a raw audio waveform. The process is:

1. Generate a time axis t for the audio signal.
2. Create a sinusoid for each harmonic, at integer multiples of the fundamental frequency.
3. Scale each sinusoid by its corresponding amplitude.
4. Sum the scaled sinusoids to produce the final waveform.

Crucially, every operation must be a differentiable PyTorch operation, so that gradients can flow back to the parameters.
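Written as a formula, the four steps amount to the expression below. The symbols are notation chosen here for illustration and are not part of the original task: A is the overall amplitude, c_k the weight of harmonic k, f_0 the fundamental frequency in Hz, f_s the sample rate, K the number of harmonics, and N the number of samples.

```latex
t_n = \frac{n}{f_s}, \qquad
y[n] = A \sum_{k=1}^{K} c_k \sin\!\left(2\pi\, k f_0\, t_n\right), \qquad
n = 0, 1, \dots, N-1
```

Each term is a smooth function of A, c_k, and f_0, which is what allows gradients to reach all three parameters.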

Starter Code

import torch

def differentiable_additive_synth(amplitudes, harmonic_distribution, fundamental_frequency, n_samples, sample_rate):
    """
    Generates audio by summing harmonically related sine waves.

    amplitudes: (batch, 1) overall amplitude.
    harmonic_distribution: (batch, n_harmonics) relative strength of each harmonic.
    fundamental_frequency: (batch, 1) the fundamental frequency in Hz.
    n_samples: number of audio samples to generate.
    sample_rate: sampling rate in Hz.
    Returns: (batch, n_samples) audio waveform.
    """
    # 1. Create a time vector 't' from 0 (inclusive) to n_samples / sample_rate (exclusive).
    # 2. Create harmonic frequencies by multiplying the fundamental by (1, 2, 3, ...).
    # 3. Calculate the instantaneous phase of each harmonic at each time step.
    #    (Hint: phase = cumulative sum over time of 2 * pi * frequency / sample_rate)
    # 4. Generate the sinusoids from the phase.
    # 5. Scale the harmonics by the distribution and the overall amplitude.
    # 6. Sum over the harmonics to get the final waveform.
    pass
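
One possible way to fill in the stub is sketched below. It is a minimal version that follows the cumulative-sum hint, not a reference solution: it assumes the fundamental frequency is constant over the clip and does nothing about harmonics above the Nyquist frequency. Because the phase is accumulated from the first sample onward, sample n carries the phase 2 * pi * f * (n + 1) / sample_rate, one sample ahead of a time axis that starts at 0.

```python
import math

import torch


def differentiable_additive_synth(amplitudes, harmonic_distribution,
                                  fundamental_frequency, n_samples, sample_rate):
    """Sketch of an additive synth; returns a (batch, n_samples) waveform."""
    n_harmonics = harmonic_distribution.shape[-1]
    dtype = fundamental_frequency.dtype
    device = fundamental_frequency.device

    # Harmonic numbers 1..K, then frequencies k * f0 with shape (batch, K).
    harmonic_numbers = torch.arange(1, n_harmonics + 1, dtype=dtype, device=device)
    harmonic_freqs = fundamental_frequency * harmonic_numbers

    # Repeat each (constant) frequency over time: (batch, K, n_samples).
    freqs_over_time = harmonic_freqs.unsqueeze(-1).expand(-1, -1, n_samples)

    # Phase is the running sum of the per-sample angular increment.
    phase = torch.cumsum(2 * math.pi * freqs_over_time / sample_rate, dim=-1)
    sinusoids = torch.sin(phase)

    # Per-harmonic gain = overall amplitude * relative strength: (batch, K, 1).
    gains = (amplitudes * harmonic_distribution).unsqueeze(-1)

    # Sum over the harmonic axis to get (batch, n_samples).
    return (gains * sinusoids).sum(dim=1)


# Usage: one second of a 440 Hz tone with three harmonics.
audio = differentiable_additive_synth(
    amplitudes=torch.tensor([[0.5]]),
    harmonic_distribution=torch.tensor([[0.6, 0.3, 0.1]]),
    fundamental_frequency=torch.tensor([[440.0]]),
    n_samples=16000,
    sample_rate=16000,
)
print(audio.shape)  # (batch, n_samples) = torch.Size([1, 16000])
```

The cumulative sum is more general than multiplying by a time vector: if the frequency tensor varied from sample to sample, the same code would still produce a continuous phase.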

Verification

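Create a simple target waveform (e.g., a sine wave at 440 Hz). Define the synthesizer parameters (amplitudes, harmonic_distribution, fundamental_frequency) as torch.nn.Parameter tensors. In a training loop, generate audio with your synthesizer, compute the MSE loss against the target waveform, and update the parameters with an optimizer. The loss should fall toward zero as the parameters converge to values that reproduce the target sound. Initialize fundamental_frequency close to the target pitch: a sample-wise MSE loss has many local minima with respect to frequency, so a distant starting value may never converge.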
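
The check described above could look roughly like the sketch below. To run on its own, it repeats a compact version of the synthesizer that uses the closed-form phase 2 * pi * k * f0 * t, which is valid only because the frequency is constant. The clip length, initial values, optimizer (Adam), learning rate, and step count are all assumptions chosen for illustration. The fundamental deliberately starts just 1 Hz from the target.

```python
import math

import torch


def differentiable_additive_synth(amplitudes, harmonic_distribution,
                                  fundamental_frequency, n_samples, sample_rate):
    # Compact closed-form variant: phase = 2 * pi * (k * f0) * t.
    dtype = fundamental_frequency.dtype
    t = torch.arange(n_samples, dtype=dtype) / sample_rate           # (n_samples,)
    k = torch.arange(1, harmonic_distribution.shape[-1] + 1, dtype=dtype)
    freqs = fundamental_frequency * k                                 # (batch, K)
    sinusoids = torch.sin(2 * math.pi * freqs.unsqueeze(-1) * t)      # (batch, K, n_samples)
    gains = (amplitudes * harmonic_distribution).unsqueeze(-1)        # (batch, K, 1)
    return (gains * sinusoids).sum(dim=1)                             # (batch, n_samples)


sample_rate, n_samples = 16000, 1024

# Target: a pure 440 Hz sine with unit amplitude, shape (1, n_samples).
t = torch.arange(n_samples, dtype=torch.float32) / sample_rate
target = torch.sin(2 * math.pi * 440.0 * t).unsqueeze(0)

# Learnable parameters; the starting values are illustrative assumptions.
amplitudes = torch.nn.Parameter(torch.tensor([[0.5]]))
harmonic_distribution = torch.nn.Parameter(torch.full((1, 4), 0.25))
fundamental_frequency = torch.nn.Parameter(torch.tensor([[441.0]]))

optimizer = torch.optim.Adam(
    [amplitudes, harmonic_distribution, fundamental_frequency], lr=0.02
)

losses = []
for step in range(500):
    optimizer.zero_grad()
    audio = differentiable_additive_synth(
        amplitudes, harmonic_distribution, fundamental_frequency,
        n_samples, sample_rate,
    )
    loss = torch.nn.functional.mse_loss(audio, target)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
print(f"learned f0: {fundamental_frequency.item():.2f} Hz")
```

Only the product of amplitudes and harmonic_distribution is determined by the target, so the two tensors can settle on different individual values from run to run while producing the same waveform.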

References

[1] Engel, J., et al. (2020). DDSP: Differentiable Digital Signal Processing. ICLR 2020.