HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks


Jiaqi Su, Zeyu Jin, Adam Finkelstein

Real-world audio recordings are often degraded by factors such as noise, reverberation, and equalization distortion. This paper introduces HiFi-GAN, a deep learning method that transforms recorded speech to sound as though it had been recorded in a studio. We use an end-to-end feed-forward WaveNet architecture, trained with multi-scale adversarial discriminators in both the time domain and the time-frequency domain. Training relies on the deep feature matching losses of the discriminators to improve the perceptual quality of the enhanced speech. The proposed model generalizes well to new speakers, new speech content, and new environments. It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.
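
To make the idea of a deep feature matching loss more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the discriminator exposes one feature map per layer for both the clean and the enhanced signal, and that the loss is an L1 distance averaged within each layer and then across layers. The function name `feature_matching_loss` and the choice of averaging are illustrative assumptions; the page does not specify them.

```python
import numpy as np


def feature_matching_loss(clean_feats, enhanced_feats):
    """Average L1 distance between paired discriminator feature maps.

    Both arguments are lists with one array per discriminator layer.
    Assumption: each layer contributes equally to the final loss.
    """
    if len(clean_feats) == 0 or len(clean_feats) != len(enhanced_feats):
        raise ValueError("expected two non-empty lists with one entry per layer")

    per_layer = []
    for clean, enhanced in zip(clean_feats, enhanced_feats):
        clean = np.asarray(clean, dtype=float)
        enhanced = np.asarray(enhanced, dtype=float)
        if clean.shape != enhanced.shape:
            raise ValueError("paired feature maps must have the same shape")
        # Mean absolute difference over every element of this layer.
        per_layer.append(np.mean(np.abs(clean - enhanced)))

    # Average the per-layer distances into a single scalar.
    return float(np.mean(per_layer))
```

In a training loop, a term like this would be added to the generator's adversarial loss so that the enhanced speech matches clean speech in the discriminator's internal representation, not only in its final real/fake decision.
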
Architecture: the generator \(G\) consists of a feed-forward WaveNet for speech enhancement, followed by a convolutional Postnet for cleanup. Discriminators evaluate the resulting waveform at multiple resolutions (\(D_W\)) and its mel-spectrogram (\(D_S\)).
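
The two kinds of discriminator input described above can be illustrated with a small NumPy sketch. It is only an approximation under stated assumptions: the waveform resolutions are produced by repeatedly averaging adjacent samples (a factor of 2 per scale), and a plain magnitude spectrogram stands in for the mel-spectrogram, with the mel filterbank omitted for brevity. The helper names, the number of scales, and the frame and hop sizes are hypothetical and not taken from the page.

```python
import numpy as np


def downsample_by_two(wave):
    """Average adjacent sample pairs, halving the length (drops an odd tail sample)."""
    even = len(wave) - (len(wave) % 2)
    return wave[:even].reshape(-1, 2).mean(axis=1)


def multi_resolution_views(wave, num_scales=3):
    """Return the waveform at full, 1/2, 1/4, ... resolution (assumed factors)."""
    views = [np.asarray(wave, dtype=float)]
    for _ in range(num_scales - 1):
        views.append(downsample_by_two(views[-1]))
    return views


def magnitude_spectrogram(wave, frame_len=512, hop=128):
    """Hann-windowed STFT magnitudes with shape (frames, frame_len // 2 + 1).

    A mel-spectrogram would additionally project these bins onto a mel
    filterbank; that step is left out of this sketch.
    """
    wave = np.asarray(wave, dtype=float)
    if len(wave) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [wave[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))
```

Each waveform view would be passed to its own time-domain discriminator, while the spectrogram would be passed to the time-frequency discriminator.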

Real demos: TED Talk, VCTK Noisy, and DAPS, each pairing the original input with the HiFi-GAN enhanced result.

* Using a model trained on our augmented synthetic dataset with speech corpus from the DAPS Dataset [7] and room impulse responses from MIT IR Survey Dataset [8].
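
As a rough illustration of how a synthetic training pair could be built from clean speech and a room impulse response, the sketch below convolves the clean signal with the impulse response and optionally mixes in noise at a chosen signal-to-noise ratio. This is an assumed recipe, not the authors' augmentation pipeline: the function names, the truncation to the clean signal's length, and the noise-mixing step are illustrative choices that the page does not describe.

```python
import numpy as np


def apply_room_response(clean, rir):
    """Convolve clean speech with a room impulse response.

    The output is truncated to the clean signal's length so that the
    degraded input and the clean target stay sample-aligned (an assumption).
    """
    clean = np.asarray(clean, dtype=float)
    rir = np.asarray(rir, dtype=float)
    return np.convolve(clean, rir)[: len(clean)]


def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR in dB, then add it."""
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if signal.shape != noise.shape:
        raise ValueError("signal and noise must have the same shape")
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    if signal_power == 0 or noise_power == 0:
        raise ValueError("signal and noise must both be non-silent")
    # Choose the gain so that signal_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + gain * noise
```

A training example would then pair the degraded signal, for instance `add_noise_at_snr(apply_room_response(clean, rir), noise, 15.0)`, with the original `clean` recording as the target.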


REFERENCES

[1] P. Scalart et al., “Speech enhancement based on a priori signal to noise estimation,” in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 2. IEEE, 1996, pp. 629–632.
[2] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.
[3] F. G. Germain, Q. Chen, and V. Koltun, “Speech denoising with deep feature losses,” in Proc. Interspeech 2019, pp. 2723–2727.
[4] R. Giri, U. Isik, and A. Krishnaswamy, “Attention Wave-U-Net for speech enhancement,” in WASPAA 2019, pp. 249–253.
[5] W. Mack, S. Chakrabarty, F.-R. Stöter, S. Braun, B. Edler, and E. Habets, “Single-channel dereverberation using direct MMSE optimization and bidirectional LSTM networks,” in Proc. Interspeech 2018, pp. 1314–1318.
[6] S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin, “MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement,” in International Conference on Machine Learning, pp. 2031–2041.
[7] G. J. Mysore, “Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech? A dataset, insights, and challenges,” IEEE Signal Processing Letters, vol. 22, no. 8, pp. 1006–1010, 2015.
[8] J. Traer and J. H. McDermott, “Statistics of natural reverberation enable perceptual separation of sound and space,” Proceedings of the National Academy of Sciences, vol. 113, no. 48, pp. E7856–E7865, 2016.