Sorry, you need to enable JavaScript to visit this website.

'EMOTIONAL VOICE CONVERSION USING MULTITASK LEARNING WITH TEXT-TO-SPEECH

Citation Author(s):
Tae-Ho Kim, Sungjae Cho, Shinkook Choi, Sejik Park, Soo-Young Lee
Submitted by:
Tae-Ho Kim
Last updated:
14 May 2020 - 1:47am
Document Type:
Presentation Slides
Document Year:
2020
Event:
Presenters:
Tae-Ho Kim
 

Voice conversion (VC) is a task that alters the voice of a person to suit different styles while conserving the linguistic content. Previous state-of-the-art technology used in VC was based on the sequence-to-sequence (seq2seq) model, which could lose linguistic information. There was an attempt to overcome this problem using textual supervision; however, this required explicit alignment, and therefore the benefit of using seq2seq model was lost. In this study, a voice converter that utilizes multitask learning with text-to-speech (TTS) is presented. By using multitask learning, VC is expected to capture linguistic information and preserve the training stability. This method does not require explicit alignment for capturing abundant text information. Experiments on VC were performed on a male-Korean-emotional-text-speech dataset to convert the neutral voice to emotional voice. It was shown that multitask learning helps to preserve the linguistic contents.

up
0 users have voted: