SE on DAPS

Perceptually-motivated
Environment-specific
Speech Enhancement

Jiaqi Su, Adam Finkelstein, Zeyu Jin

We introduce a data-driven method to enhance speech recordings made in a specific environment. The method handles denoising, de-reverberation, and equalization matching due to recording nonlinearities in a unified framework. It relies on a new perceptual loss function that combines adversarial loss with spectrogram features. We show that the method offers an improvement over state-of-the-art baseline methods in both subjective and objective evaluations.

SAMPLES

This page contains the sentences used in the MOS test described in Section 3. Click a button in the grid below to play its audio. Colors (ranging from red = bad to green = good) encode the scores, which are also shown as text in the button labels.

INPUT: Noisy reverberant input recording that is to be enhanced
WPE: Weighted Prediction Error, a traditional baseline method
BLSTM: A baseline spectrogram masking method using bidirectional LSTM
WN: Feedforward WaveNet architecture with L1 loss
SPEC: Feedforward WaveNet architecture with log spectrogram loss
GAN: Generative adversarial network, composed of the SPEC generator and the discriminator from StarGAN-VC
REF: Studio-quality target recording
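
The WN and SPEC entries above differ only in their training loss. The sketch below illustrates that difference and is not the authors' implementation. It is written in pure Python so that it runs without extra packages. The function names, the frame length, the hop size, the Hann window, and the use of a mean absolute difference on log magnitudes are all assumptions made for the example; this page does not specify them.

```python
import cmath
import math


def l1_loss(pred, target):
    """Mean absolute difference between two equal-length waveforms."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def log_spectrogram(signal, frame_len=64, hop=32, eps=1e-6):
    """Log-magnitude spectrogram from a naive DFT over Hann-windowed frames.

    frame_len, hop, and eps are illustrative values, not taken from the paper.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(math.log(abs(acc) + eps))  # eps avoids log(0)
        frames.append(bins)
    return frames


def log_spectrogram_loss(pred, target, **kwargs):
    """Mean absolute difference between the two log spectrograms (assumed form)."""
    spec_pred = log_spectrogram(pred, **kwargs)
    spec_target = log_spectrogram(target, **kwargs)
    total = sum(abs(a - b)
                for fp, ft in zip(spec_pred, spec_target)
                for a, b in zip(fp, ft))
    count = sum(len(f) for f in spec_target)
    return total / count


# Toy signals: a sine wave and the same wave with inverted polarity.
target = [math.sin(2 * math.pi * 5 * n / 256) for n in range(256)]
inverted = [-x for x in target]

waveform_error = l1_loss(inverted, target)
spectral_error = log_spectrogram_loss(inverted, target)
```

In this toy case the polarity-inverted signal is heavily penalized by the waveform L1 loss, while the log spectrogram loss treats it as a match because only magnitudes are compared. That is one intuition for why a spectrogram-based loss can track perceived quality more closely than a sample-wise loss.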