Controllable Speech Representation Learning
via Voice Conversion and AIC Loss

Yunyun Wang, Jiaqi Su, Adam Finkelstein, Zeyu Jin

Coming soon: [Paper] [GitHub]

Methods
For details please refer to our paper.

Input:   input audio
Ours-AC:   our method, with absolute pitch and AIC loss (see the illustrative sketch after this list)
Ours-AC-48k:   our method, with absolute pitch and AIC loss, bandwidth-extended to 48 kHz
Ours-A:   our method, with absolute pitch
Ours-AC-noGAN:   our method, with absolute pitch and AIC loss, but without GAN training (all other variants of our method use GAN training)
Ours-AC2:   our method, with absolute pitch and a second variant of the AIC loss
Ours-ACE:   our method, with absolute pitch, energy conditioning, and AIC loss
Ours-R:   our method, with relative pitch
Ours-RC:   our method, with relative pitch and AIC loss
PitchShift[1]:   a pitch shifter using PSOLA
AutoVC-F0[2]:   AutoVC-F0, an autoencoder-based voice conversion method
Wav2Vec[3]:   voice reconstruction from Wav2Vec features
Target:   target audio
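
To make the legend concrete, here is a minimal, hypothetical sketch of a pitch-conditioned autoencoder for voice conversion, showing where "absolute pitch" versus "relative pitch" conditioning could enter such a model. This is an illustration under our own assumptions, not the paper's implementation: all module names and dimensions are invented, and the AIC and GAN terms in the legend are training objectives that are omitted here.

```python
# Illustrative sketch only (NOT the paper's implementation): a minimal
# pitch-conditioned autoencoder for voice conversion. All module names
# and sizes are assumptions made for illustration.
import torch
import torch.nn as nn


class PitchConditionedVC(nn.Module):
    def __init__(self, n_mels=80, content_dim=128, spk_dim=64, pitch_dim=16):
        super().__init__()
        # Content encoder: ideally keeps linguistic content and discards
        # speaker identity and pitch (a training loss would enforce this).
        self.content_encoder = nn.GRU(n_mels, content_dim, batch_first=True)
        # Pitch is projected separately so it can be swapped at conversion time.
        self.pitch_proj = nn.Linear(1, pitch_dim)
        # Decoder reconstructs mels from content + speaker + pitch.
        self.decoder = nn.GRU(content_dim + spk_dim + pitch_dim, n_mels,
                              batch_first=True)

    @staticmethod
    def relative_pitch(log_f0, voiced):
        # "Relative" pitch: log-F0 mean-normalized per utterance, so the
        # model sees the contour shape rather than the absolute register.
        # voiced is a 0/1 float mask over frames.
        n_voiced = voiced.sum(dim=1, keepdim=True).clamp(min=1)
        mean = (log_f0 * voiced).sum(dim=1, keepdim=True) / n_voiced
        return (log_f0 - mean) * voiced

    def forward(self, mels, f0, voiced, spk_emb, use_relative=False):
        # mels: (B, T, n_mels); f0, voiced: (B, T); spk_emb: (B, spk_dim)
        content, _ = self.content_encoder(mels)
        log_f0 = f0.clamp(min=1.0).log() * voiced
        pitch = self.relative_pitch(log_f0, voiced) if use_relative else log_f0
        pitch = self.pitch_proj(pitch.unsqueeze(-1))
        spk = spk_emb.unsqueeze(1).expand(-1, mels.size(1), -1)
        out, _ = self.decoder(torch.cat([content, spk, pitch], dim=-1))
        return out  # reconstructed / converted mel-spectrogram
```

At conversion time, one would encode the source utterance, substitute the target speaker's embedding (and, for the absolute-pitch variants, a shifted F0 track), and decode.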

MOS and Similarity Scores
MOS and similarity scores show that our best models compare favorably with the baselines across gender and seen/unseen speaker conversion cases.
Code Space Comparison
[Figure: code-space visualizations, panels (a)-(f)]

The same sentence uttered by two different speakers (a, d); after pitch shift (b, e); and its transcription (c, f). Click PLAY to hear the corresponding audio.
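
The page does not state how the code-space figure was produced. As a purely hypothetical illustration, one could compare frame-level codes from the sketch above by projecting them to 2-D with PCA; the choice of PCA and the `plot_codes` helper below are assumptions, not the paper's procedure.

```python
# Hypothetical sketch: visualize frame-level codes for the same sentence
# spoken by two speakers. `model` is the illustrative PitchConditionedVC
# above; PCA is an assumed (not stated) projection choice.
import matplotlib.pyplot as plt
import torch
from sklearn.decomposition import PCA


@torch.no_grad()
def plot_codes(model, mels_a, mels_b, labels=("speaker A", "speaker B")):
    codes_a, _ = model.content_encoder(mels_a)  # (1, T_a, content_dim)
    codes_b, _ = model.content_encoder(mels_b)  # (1, T_b, content_dim)
    stacked = torch.cat([codes_a[0], codes_b[0]]).cpu().numpy()
    xy = PCA(n_components=2).fit_transform(stacked)
    n = codes_a.size(1)
    plt.scatter(xy[:n, 0], xy[:n, 1], s=8, label=labels[0])
    plt.scatter(xy[n:, 0], xy[n:, 1], s=8, label=labels[1])
    plt.legend()
    plt.title("Code space: same sentence, two speakers")
    plt.show()
```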
Audio Samples (VCTK Dataset) [More Samples]