NON-PARALLEL MANY-TO-MANY VOICE CONVERSION BY KNOWLEDGE TRANSFER FROM A TEXT-TO-SPEECH MODEL

Citation Author(s):
Submitted by:
Brian Mak
Last updated:
22 June 2021 - 6:45am
Document Type:
Poster
Document Year:
2021
Event:
Presenters:
Xinyuan YU
Paper Code:
SPE-11.2
 

In this paper, we present a simple but novel framework to train a non-parallel many-to-many voice conversion (VC) model based on the encoder-decoder architecture. We observe that an encoder-decoder text-to-speech (TTS) model and an encoder-decoder VC model have the same structure. Thus, we propose to pre-train a multi-speaker encoder-decoder TTS model and transfer knowledge from the TTS model to a VC model by (1) adopting the TTS acoustic decoder as the VC acoustic decoder, and (2) forcing the VC speech encoder to learn the same speaker-agnostic linguistic features from the TTS text encoder so as to achieve speaker disentanglement in the VC encoder output. We further control the conversion of the pitch contour from source speech to target speech, and condition the VC decoder on the converted pitch contour during inference. Subjective evaluation shows that our proposed model is able to handle VC between any pair of speakers in the training speech corpus of over 200 speakers with high naturalness and speaker similarity.
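Step (2) of the knowledge transfer can be read as a feature-matching objective: the VC speech encoder is trained so that its output stays close to what the pre-trained TTS text encoder produces for the same utterance. The sketch below is only an illustration of that idea, not the authors' implementation. It assumes that the TTS features have already been aligned frame by frame with the VC encoder output and that the distance is a mean squared error; neither detail is stated in the abstract, and the name `feature_matching_loss` is hypothetical.

```python
from typing import Sequence

Frame = Sequence[float]


def feature_matching_loss(vc_feats: Sequence[Frame], tts_feats: Sequence[Frame]) -> float:
    """Mean squared error between two aligned sequences of feature vectors.

    vc_feats:  output of the VC speech encoder, one vector per frame.
    tts_feats: target features from the TTS text encoder, ASSUMED to be
               already aligned to the same number of frames.
    """
    if len(vc_feats) != len(tts_feats):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    count = 0
    for v, t in zip(vc_feats, tts_feats):
        if len(v) != len(t):
            raise ValueError("feature dimensions differ")
        for a, b in zip(v, t):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("no feature values to compare")
    # Average over every frame and every feature dimension.
    return total / count


if __name__ == "__main__":
    vc = [[0.0, 0.0], [1.0, 1.0]]
    tts = [[1.0, 3.0], [1.0, 1.0]]
    print(feature_matching_loss(vc, tts))
```

In such a setup the loss would be minimized with respect to the VC encoder parameters only, while the TTS text encoder supplies fixed targets; whether the adopted acoustic decoder is also kept fixed is not specified in the abstract.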
