Input: input audio
Ours-AC: our method, with absolute pitch and AIC loss
Ours-AC-48k: our method, with absolute pitch and AIC loss, bandwidth extended to 48k
Ours-A: our method, with absolute pitch
Ours-AC-noGAN: our method, with absolute pitch and AIC loss, but without GAN training (all other "Ours" variants use GAN training)
Ours-AC2: our method, with absolute pitch and AIC loss 2
Ours-ACE: our method, with absolute pitch, energy condition and AIC loss
Ours-R: our method, with relative pitch
Ours-RC: our method, with relative pitch and AIC loss
PitchShift [1]: a pitch shifter using PSOLA
AutoVC-F0 [2]: an autoencoder voice conversion method
Wav2Vec [3]: voice reconstruction from Wav2Vec features
Target: target audio
MOS and Similarity Scores
MOS and similarity scores show that our best models compare favorably with the baselines across gender and seen/unseen speaker conversion cases.
Code Space Comparison
Panels (a)-(f): the same sentence uttered by two different speakers (a, d); after pitch shift (b, e); and transcription (c, f).
Audio Samples (VCTK Dataset)
REFERENCES
PSOLA [1]: F. Charpentier and M. Stella, “Diphone synthesis using an overlap-add technique for speech waveforms concatenation,” in ICASSP ’86, IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986, vol. 11, pp. 2015–2018.
AutoVC-F0 [2]: Kaizhi Qian, Zeyu Jin, Mark Hasegawa-Johnson, and Gautham J. Mysore, “F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder,” in ICASSP, 2020, pp. 6284–6288.
Wav2Vec [3]: Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 12449–12460.