NON-PARALLEL MANY-TO-MANY VOICE CONVERSION BY KNOWLEDGE TRANSFER FROM A TEXT-TO-SPEECH MODEL
- Submitted by: Brian Mak
- Last updated: 22 June 2021 - 6:45am
- Document Type: Poster
- Document Year: 2021
- Presenters: Xinyuan YU
- Paper Code: SPE-11.2
In this paper, we present a simple yet novel framework for training a non-parallel many-to-many voice conversion (VC) model based on the encoder-decoder architecture. We observe that an encoder-decoder text-to-speech (TTS) model and an encoder-decoder VC model share the same structure. We therefore propose to pre-train a multi-speaker encoder-decoder TTS model and transfer knowledge from it to a VC model by (1) adopting the TTS acoustic decoder as the VC acoustic decoder, and (2) forcing the VC speech encoder to learn the same speaker-agnostic linguistic features as those produced by the TTS text encoder, so as to achieve speaker disentanglement in the VC encoder output. We further control the conversion of the pitch contour from source speech to target speech, and condition the VC decoder on the converted pitch contour during inference. Subjective evaluation shows that our proposed model can handle VC between any speaker pair in the training speech corpus of over 200 speakers with high naturalness and speaker similarity.
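The two knowledge-transfer steps can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not state the loss, the feature dimensions, or how text and speech frames are aligned. The sketch assumes a mean-squared-error feature-matching loss, frame-aligned teacher features, and hypothetical stand-ins (`ToySpeechEncoder`, `frozen_tts_decoder`) in place of real neural networks. Pitch conditioning is omitted.

```python
# Toy, pure-Python sketch of the two knowledge-transfer steps.
# Every name, shape, and the MSE loss are illustrative assumptions.

def mse(pred, target):
    """Mean squared error over all frames and feature dimensions."""
    diffs = [(p - t) ** 2 for pf, tf in zip(pred, target) for p, t in zip(pf, tf)]
    return sum(diffs) / len(diffs)


class ToySpeechEncoder:
    """Stand-in for the VC speech encoder: a per-dimension affine map."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = [0.0] * dim

    def __call__(self, frames):
        return [[w * x + b for w, x, b in zip(self.w, frame, self.b)]
                for frame in frames]

    def sgd_step(self, frames, targets, lr):
        """One gradient-descent step on the MSE feature-matching loss."""
        out = self(frames)
        n = len(frames) * len(self.w)
        for d in range(len(self.w)):
            grad_w = sum(2 * (o[d] - t[d]) * f[d]
                         for o, t, f in zip(out, targets, frames)) / n
            grad_b = sum(2 * (o[d] - t[d]) for o, t in zip(out, targets)) / n
            self.w[d] -= lr * grad_w
            self.b[d] -= lr * grad_b


def frozen_tts_decoder(linguistic_frames, speaker_embedding):
    """Stand-in for the pre-trained TTS acoustic decoder, reused unchanged (step 1)."""
    return [[h + s for h, s in zip(frame, speaker_embedding)]
            for frame in linguistic_frames]


# Source-speaker acoustic frames (toy values).
speech_frames = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]]
# Teacher features from the TTS text encoder, assumed already frame-aligned.
text_encoder_features = [[0.5 * x - 1.0 for x in frame] for frame in speech_frames]

# Step 2: train the speech encoder to reproduce the text encoder's features.
encoder = ToySpeechEncoder(dim=2)
initial_loss = mse(encoder(speech_frames), text_encoder_features)
for _ in range(200):
    encoder.sgd_step(speech_frames, text_encoder_features, lr=0.1)
final_loss = mse(encoder(speech_frames), text_encoder_features)

# Conversion: source linguistic features combined with the target speaker identity.
target_speaker_embedding = [0.3, -0.2]
converted = frozen_tts_decoder(encoder(speech_frames), target_speaker_embedding)
print(initial_loss, final_loss, len(converted))
```

Only the speech encoder is updated here. The decoder is never touched, mirroring the idea that the VC model inherits the TTS decoder and only needs an encoder whose output lives in the same speaker-agnostic feature space.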