Bandwidth Extension is All You Need

Jiaqi Su, Yunyun Wang, Adam Finkelstein, Zeyu Jin

Speech generation and enhancement have seen recent breakthroughs in quality thanks to deep learning. These methods typically operate at a limited sampling rate of 16-22kHz due to computational complexity and available datasets. This limitation imposes a gap between the output of such methods and that of high-fidelity (>=44kHz) real-world audio applications. This paper proposes a new bandwidth extension (BWE) method that expands 8-16kHz speech signals to 48kHz. The method is based on a feed-forward WaveNet architecture trained with a GAN-based deep feature loss. A mean-opinion-score (MOS) experiment shows significant improvement in quality over state-of-the-art BWE methods. An AB test reveals that our 16-to-48kHz BWE is able to achieve fidelity that is typically indistinguishable from real high-fidelity recordings. We use our method to enhance the output of recent speech generation and denoising methods, and experiments demonstrate significant improvement in sound quality over these baselines. We propose this as a general approach to narrow the gap between generated speech and recorded speech, without the need to adapt such methods to higher sampling rates.
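To make the sampling-rate gap concrete, the following is a minimal sketch, not the method proposed in the paper. It assumes NumPy and SciPy are available, uses a synthetic two-tone signal as a stand-in for 16kHz speech, and introduces the hypothetical helper names `naive_upsample` and `band_energy_fraction` purely for illustration. It shows that plain resampling to 48kHz adds samples but leaves the band above 8kHz essentially empty, which is the content a BWE model has to generate.

```python
import math

import numpy as np
from scipy.signal import resample_poly


def naive_upsample(x, sr_in=16000, sr_out=48000):
    """Polyphase resampling: raises the sampling rate but adds no new high-band content."""
    g = math.gcd(sr_in, sr_out)
    return resample_poly(x, sr_out // g, sr_in // g)


def band_energy_fraction(x, sr, f_lo):
    """Fraction of the signal's spectral energy at or above f_lo Hz."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return spec[freqs >= f_lo].sum() / spec.sum()


# Assumption: one second of tones below 8kHz stands in for narrowband speech.
sr_in, sr_out = 16000, 48000
t = np.arange(sr_in) / sr_in
x16 = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

x48 = naive_upsample(x16, sr_in, sr_out)

# Energy above the original Nyquist frequency (8kHz) stays negligible.
high = band_energy_fraction(x48, sr_out, f_lo=8000)
print(f"samples: {len(x48)}, energy fraction above 8kHz: {high:.2e}")
```

A learned BWE model would replace `naive_upsample` here; the same energy measurement could then be used as a rough check that the upper band is actually being filled.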

The CMU Arctic Dataset: Vocoding + BWE

REFERENCES

WaveNet [1]: J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in ICASSP, 2018.
HiNet [13]: Y. Ai and Z.-H. Ling, “A neural vocoder with hierarchical generation of amplitude and phase spectra for statistical parametric speech synthesis,” TASLP, vol. 28, 2020.
WaveRNN [12]: N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” arXiv:1802.08435, 2018.