Voice Conversion DAPS Listening Test Samples (Unseen Speakers and Utterances)

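ID: The source speaker and utterance ID, followed by "x" and the target speaker and utterance ID. For example, f1u1xm2u2 pairs utterance u1 of speaker f1 (source) with utterance u2 of speaker m2 (target).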

Source: The source audio (content source).

Target: The target audio (target speaker reference).

Ours-TFdecoder: Our method.

Ours-TFdecoder+: Our method enhanced with HiFiGAN 2 [37] for even higher quality.

Resemblyzer: A baseline trained by replacing our GR0-encoder with Resemblyzer [8], a speaker embedding model.

YourTTS: A baseline from the state-of-the-art multi-speaker TTS model [15], evaluated in a zero-shot voice conversion setting.

AutovcAIC: A baseline encoder-decoder model with AIC loss [13].

AutovcF0: A baseline encoder-decoder model with F0 conditioning [12].

For more details, please refer to Section 4.1.

ID Source Target Ours-TFdecoder Ours-TFdecoder+ Resemblyzer YourTTS AutovcAIC AutovcF0

f1u1xf1u2

f1u1xf2u2

f1u1xf3u2

f1u1xf4u2

f1u1xf5u2

f1u1xm1u2

f1u1xm2u2

f1u1xm3u2

f1u1xm4u2

f1u1xm5u2
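
The sample IDs listed above can be split into their source and target parts programmatically, for example when matching audio files to table rows. The following is a minimal sketch, not part of the original page. It assumes every ID has the shape speaker + utterance + "x" + speaker + utterance, with speakers written as "f" or "m" plus a number and utterances as "u" plus a number. The names `PairId` and `parse_pair_id` are hypothetical.

```python
import re
from typing import NamedTuple


class PairId(NamedTuple):
    """Source/target parts of a sample ID (hypothetical helper type)."""
    src_speaker: str
    src_utterance: str
    tgt_speaker: str
    tgt_utterance: str


# Assumed pattern, inferred from the IDs shown in the table:
# <speaker><utterance>x<speaker><utterance>, e.g. "f1u1xm3u2".
_ID_RE = re.compile(r"^([fm]\d+)(u\d+)x([fm]\d+)(u\d+)$")


def parse_pair_id(pair_id: str) -> PairId:
    """Split an ID such as 'f1u1xm3u2' into its four components."""
    match = _ID_RE.match(pair_id)
    if match is None:
        raise ValueError(f"unrecognized sample ID: {pair_id!r}")
    return PairId(*match.groups())


if __name__ == "__main__":
    pair = parse_pair_id("f1u1xm3u2")
    print(pair.src_speaker, pair.src_utterance, "->",
          pair.tgt_speaker, pair.tgt_utterance)
```

An ID that does not fit the assumed pattern raises `ValueError` rather than being silently mis-split, which makes naming mismatches easy to spot.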