Voice Conversion DAPS Listening Test Samples (Unseen Speakers and Utterances)
ID: The source speaker and utterance ID, followed by "x" and the target speaker and utterance ID (for example, f1u1xf2u2).
Source: The source audio (content source).
Target: The target audio (target speaker reference).
Ours-TFdecoder: Our method.
Ours-TFdecoder+: Our method enhanced with HiFiGAN 2 [37] for even higher quality.
Resemblyzer: A baseline trained by replacing our GR0-encoder with Resemblyzer [8], a speaker embedding model.
YourTTS: A baseline using the state-of-the-art multi-speaker TTS model [15], evaluated in a zero-shot voice conversion setting.
AutovcAIC: A baseline encoder-decoder model with AIC loss [13].
AutovcF0: A baseline encoder-decoder model with F0 conditioning [12].
For more details, please refer to Section 4.1.
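The ID format described above can be split into its four parts with a small helper. This is only a sketch: the pattern is inferred from the IDs listed on this page (a speaker label made of "f" or "m" plus a number, then an utterance label made of "u" plus a number), and the names `SampleId` and `parse_sample_id` are hypothetical, not part of any released code.

```python
import re
from typing import NamedTuple


class SampleId(NamedTuple):
    """Hypothetical container for the four parts of a sample ID."""
    source_speaker: str
    source_utterance: str
    target_speaker: str
    target_utterance: str


# Assumed pattern, inferred from IDs such as "f1u1xm3u2":
# <speaker><utterance> "x" <speaker><utterance>
_ID_PATTERN = re.compile(r"^([fm]\d+)(u\d+)x([fm]\d+)(u\d+)$")


def parse_sample_id(sample_id: str) -> SampleId:
    """Split an ID into source and target speaker/utterance labels."""
    match = _ID_PATTERN.match(sample_id)
    if match is None:
        raise ValueError(f"unrecognized sample ID: {sample_id!r}")
    return SampleId(*match.groups())


if __name__ == "__main__":
    print(parse_sample_id("f1u1xm3u2"))
```

For example, `parse_sample_id("f1u1xm3u2")` yields source speaker `f1`, source utterance `u1`, target speaker `m3`, and target utterance `u2`.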
ID | Source | Target | Ours-TFdecoder | Ours-TFdecoder+ | Resemblyzer | YourTTS | AutovcAIC | AutovcF0 |
f1u1xf1u2 | | | | | | | | |
f1u1xf2u2 | | | | | | | | |
f1u1xf3u2 | | | | | | | | |
f1u1xf4u2 | | | | | | | | |
f1u1xf5u2 | | | | | | | | |
f1u1xm1u2 | | | | | | | | |
f1u1xm2u2 | | | | | | | | |
f1u1xm3u2 | | | | | | | | |
f1u1xm4u2 | | | | | | | | |
f1u1xm5u2 | | | | | | | | |