HiFi-GAN-2: Studio-quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features

Jiaqi Su, Zeyu Jin, Adam Finkelstein

Real Demo: TED Talk

Original input:

HiFi-GAN-2 result at 48 kHz:

HiFi-GAN (previous work) result at 48 kHz:



Real Demo: VCTK Noisy

Original input:

HiFi-GAN-2 result at 48 kHz:

HiFi-GAN (previous work) result at 48 kHz:



Real Demo: DAPS

Original input:

HiFi-GAN-2 result at 48 kHz:

HiFi-GAN (previous work) result at 48 kHz:

* Using a model trained on our augmented synthetic dataset, which pairs clean speech from the DAPS dataset [5] with room impulse responses from the MIT IR Survey [6]; a sketch of this kind of synthesis appears below.

SAMPLES
REFERENCES
  1. X. Hao, X. Su, R. Horaud, and X. Li, “FullSubNet: A full-band and sub-band fusion model for real-time single-channel speech enhancement,” arXiv:2010.15508, 2020.
  2. A. Défossez, G. Synnaeve, and Y. Adi, “Real time speech enhancement in the waveform domain,” in Proc. Interspeech 2020, pp. 3291–3295.
  3. A. Polyak, L. Wolf, Y. Adi, O. Kabeli, and Y. Taigman, “High fidelity speech regeneration with application to speech enhancement,” arXiv:2102.00429, 2021.
  4. J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks,” in Proc. Interspeech 2020.
  5. G. J. Mysore, “Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech? A dataset, insights, and challenges,” IEEE Signal Processing Letters, vol. 22, no. 8, pp. 1006–1010, 2015.
  6. J. Traer and J. H. McDermott, “Statistics of natural reverberation enable perceptual separation of sound and space,” Proceedings of the National Academy of Sciences, vol. 113, no. 48, pp. E7856–E7865, 2016.