Voice Conversion DAPS Listening Test Samples (Unseen Speakers and Utterances)

ID: The source speaker and utterance ID x the target speaker and utterance ID.

Source: The source audio (content source).

Target: The target audio (target speaker reference).

Ours-TFdecoder: Our method.

Ours-TFdecoder+: Our method enhanced with HiFiGAN 2 [37] for even higher quality.

Resemblyzer: A baseline trained by replacing our GR0-encoder with Resemblyzer [8], a speaker embedding model.

YourTTS: A baseline from the state-of-the-art multi-speaker TTS model [15], evaluated in a zero-shot voice conversion setting.

AutovcAIC: A baseline encoder-decoder model with AIC loss [13].

AutovcF0: A baseline encoder-decoder model with F0 conditioning [12].

For more details, please refer to Section 4.1.