We propose an end-to-end speech synthesizer, Fast DCTTS, that synthesizes speech in real time on a single CPU thread. The proposed model is a carefully tuned lightweight network designed by applying multiple network-reduction and fidelity-improvement techniques. In addition, we propose a novel group highway activation that balances computational efficiency against the regularization effect of the gating mechanism. We also introduce a new metric, elastic mel-cepstral distortion (EMCD), to measure the fidelity of the output mel-spectrogram.
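As a rough sketch of how a grouped gate can cut computation, the following PyTorch snippet computes one transform-gate value per channel group and broadcasts it across the group, so the gate projection shrinks by a factor of the group count. This is an illustration under assumed details (module name, 1x1 convolutions, ReLU transform), not the paper's exact layer.

```python
# Hypothetical sketch of a group highway activation: the gate is computed
# once per channel group and broadcast over the group's channels.
import torch
import torch.nn as nn

class GroupHighway(nn.Module):
    """Highway activation whose transform gate is shared within channel groups.

    A standard highway layer computes y = t * h + (1 - t) * x with a
    per-channel gate t. Sharing one gate value per group of channels
    shrinks the gate projection by a factor of n_groups, trading some of
    the gate's regularization effect for speed.
    """

    def __init__(self, channels: int, n_groups: int):
        super().__init__()
        assert channels % n_groups == 0
        self.n_groups = n_groups
        self.transform = nn.Conv1d(channels, channels, kernel_size=1)
        # Gate projection outputs one value per group instead of per channel.
        self.gate = nn.Conv1d(channels, n_groups, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))                 # (batch, n_groups, time)
        t = t.repeat_interleave(x.size(1) // self.n_groups, dim=1)
        return t * h + (1.0 - t) * x

x = torch.randn(2, 64, 100)
y = GroupHighway(channels=64, n_groups=8)(x)  # same shape as x
```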
Non-Parallel Many-to-Many Voice Conversion by Knowledge Transfer from a Text-to-Speech Model
In this paper, we present a simple but novel framework to train a non-parallel many-to-many voice conversion (VC) model based on the encoder-decoder architecture. It is observed that an encoder-decoder text-to-speech (TTS) model and an encoder-decoder VC model have the same structure.
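To make the structural correspondence concrete, here is a minimal PyTorch sketch in which a TTS model and a VC model share one decoder, so only the encoder is swapped when transferring knowledge. All modules and dimensions are hypothetical placeholders, not the authors' architecture.

```python
# Illustrative sketch of the knowledge-transfer idea: the TTS text encoder
# is replaced by a speech encoder while the pretrained decoder is reused.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, x):
        h, _ = self.encoder(x)   # content representation
        y, _ = self.decoder(h)   # mel-spectrogram frames
        return y

text_encoder = nn.GRU(256, 256, batch_first=True)   # reads text embeddings
speech_encoder = nn.GRU(80, 256, batch_first=True)  # reads source mels
decoder = nn.GRU(256, 80, batch_first=True)         # writes target mels

tts = Seq2Seq(text_encoder, decoder)
# ... pretrain `tts` on <text, speech> pairs ...

# Build the VC model around the same decoder: only the encoder changes,
# so the decoder's knowledge of generating speech transfers directly.
vc = Seq2Seq(speech_encoder, decoder)
out = vc(torch.randn(1, 120, 80))   # (1, 120, 80) converted mel frames
```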
Seen and Unseen Emotional Style Transfer for Voice Conversion with a New Emotional Speech Dataset
Emotional voice conversion aims to transform the emotional prosody of speech while preserving the linguistic content and speaker identity. Prior studies show that emotional prosody can be disentangled using an encoder-decoder network conditioned on discrete representations such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles.
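A minimal sketch of this conditioning setup shows why such a decoder can only recall emotions it has seen: the style comes from a fixed one-hot lookup. Dimensions, module names, and the four-emotion label set below are assumptions for illustration.

```python
# Sketch of conditioning an encoder-decoder on one-hot emotion labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_EMOTIONS = 4  # assumed label set, e.g. neutral / happy / sad / angry

class EmotionConditionedDecoder(nn.Module):
    def __init__(self, content_dim=256, emo_dim=64, mel_dim=80):
        super().__init__()
        self.emo_embed = nn.Linear(N_EMOTIONS, emo_dim)  # one-hot -> embedding
        self.rnn = nn.GRU(content_dim + emo_dim, mel_dim, batch_first=True)

    def forward(self, content, emotion_id):
        # content: (batch, time, content_dim); emotion_id: (batch,)
        one_hot = F.one_hot(emotion_id, N_EMOTIONS).float()
        emo = self.emo_embed(one_hot)                        # (batch, emo_dim)
        emo = emo.unsqueeze(1).expand(-1, content.size(1), -1)
        y, _ = self.rnn(torch.cat([content, emo], dim=-1))
        return y  # predicted mel frames in the requested emotional style

content = torch.randn(2, 50, 256)
mel = EmotionConditionedDecoder()(content, torch.tensor([0, 3]))
```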
FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention
Any-to-any voice conversion aims to convert voices between arbitrary speakers, even those unseen during training; this is much more challenging than one-to-one or many-to-many conversion, but far more attractive in real-world scenarios. In this paper, we propose FragmentVC, in which the latent phonetic structure of the source speaker's utterance is obtained from Wav2Vec 2.0, while the spectral features of the target speaker's utterance(s) are obtained from log mel-spectrograms.
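The fusion step can be pictured as cross-attention from source content to target voice fragments. The sketch below uses assumed dimensions and PyTorch's generic multi-head attention rather than the authors' exact architecture.

```python
# Rough sketch of the fusion idea: source-side Wav2Vec 2.0 features query
# target-speaker mel features via cross-attention, so output frames are
# assembled from the target's voice fragments.
import torch
import torch.nn as nn

class FragmentFusion(nn.Module):
    def __init__(self, w2v_dim=768, mel_dim=80, d_model=256, n_heads=4):
        super().__init__()
        self.src_proj = nn.Linear(w2v_dim, d_model)   # phonetic content
        self.tgt_proj = nn.Linear(mel_dim, d_model)   # target voice fragments
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, mel_dim)

    def forward(self, src_w2v, tgt_mel):
        q = self.src_proj(src_w2v)             # (batch, T_src, d_model)
        kv = self.tgt_proj(tgt_mel)            # (batch, T_tgt, d_model)
        fused, weights = self.attn(q, kv, kv)  # attend over target fragments
        return self.out(fused), weights        # converted mel + alignment

src = torch.randn(1, 120, 768)   # Wav2Vec 2.0 features of source utterance
tgt = torch.randn(1, 300, 80)    # log mel-spectrogram of target utterance(s)
mel, attn = FragmentFusion()(src, tgt)
```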
Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech
The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances whose voice and speaking style resemble those of a reference speaker, given only a few reference samples. In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations. Among the different types of embeddings, the embedding pretrained by voice conversion achieves the best performance.
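One plausible way to integrate the two kinds of representation is to fuse a frozen pretrained embedding with a jointly trained lookup table; the sketch below uses simple concatenation and assumed sizes, and the actual combination strategy may differ.

```python
# Hedged sketch: combine a pretrained speaker embedding (e.g. from a
# voice-conversion encoder, kept frozen) with a learnable per-speaker
# embedding trained jointly with the TTS model.
import torch
import torch.nn as nn

class SpeakerRepresentation(nn.Module):
    def __init__(self, n_speakers, pretrained_dim=256, learnable_dim=128):
        super().__init__()
        # Learnable lookup, optimized together with the synthesizer.
        self.table = nn.Embedding(n_speakers, learnable_dim)
        self.proj = nn.Linear(pretrained_dim + learnable_dim, 256)

    def forward(self, speaker_id, pretrained_emb):
        # pretrained_emb: frozen embedding computed by an external model
        combined = torch.cat([pretrained_emb, self.table(speaker_id)], dim=-1)
        return self.proj(combined)

emb = SpeakerRepresentation(n_speakers=10)(
    torch.tensor([3]), torch.randn(1, 256))   # (1, 256) speaker condition
```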
The ThinkIT System for the ICASSP 2021 M2VoC Challenge
Prosody is an integral part of communication, but it remains an open problem in state-of-the-art speech synthesis. Two major issues arise when modelling prosody: (1) prosody varies at a slower rate than other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word-level representation.
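As an illustration of a word-level representation, the snippet below average-pools frame-level prosodic features within known word boundaries; the pooling choice and inputs are assumptions for illustration, not necessarily the paper's mechanism.

```python
# Sketch: collapse fast-varying frame features into one slow-varying
# vector per word by averaging within word boundaries.
import torch

def word_level_pool(frames: torch.Tensor, word_ids: torch.Tensor) -> torch.Tensor:
    """Average frame-level features per word.

    frames:   (T, D) frame-level prosodic features (e.g. F0, energy).
    word_ids: (T,) index of the word each frame belongs to, 0..n_words-1.
    returns:  (n_words, D) word-level representation.
    """
    n_words = int(word_ids.max()) + 1
    sums = torch.zeros(n_words, frames.size(1)).index_add_(0, word_ids, frames)
    counts = torch.bincount(word_ids, minlength=n_words).clamp(min=1)
    return sums / counts.unsqueeze(1)

frames = torch.randn(12, 4)
word_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3])
word_repr = word_level_pool(frames, word_ids)   # (4, 4)
```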
Universal Neural Vocoding with Parallel WaveNet
Fast and High-Quality Singing Voice Synthesis System based on Convolutional Neural Networks
This paper describes singing voice synthesis based on convolutional neural networks (CNNs). Singing voice synthesis systems based on deep neural networks (DNNs) have recently been proposed and improve the naturalness of synthesized singing voices. Because singing voices are a rich form of expression, a powerful technique is required to model them accurately. In the proposed technique, the long-term dependencies of singing voices are modeled by CNNs.
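One standard way for CNNs to capture long-term dependencies is a stack of dilated 1-D convolutions, whose receptive field grows exponentially with depth. The sketch below is illustrative, with assumed layer sizes and residual connections rather than the paper's exact architecture.

```python
# Illustrative dilated-convolution stack for long-range frame dependencies.
import torch
import torch.nn as nn

class DilatedConvStack(nn.Module):
    def __init__(self, channels=256, n_layers=6, kernel_size=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            d = 2 ** i  # doubling dilation widens the receptive field fast
            self.layers.append(nn.Conv1d(
                channels, channels, kernel_size,
                dilation=d, padding=d * (kernel_size - 1) // 2))

    def forward(self, x):
        # x: (batch, channels, frames); receptive field ~ 2**n_layers frames
        for conv in self.layers:
            x = x + torch.relu(conv(x))  # residual keeps gradients healthy
        return x

feats = torch.randn(1, 256, 400)  # frame-level musical-score features
out = DilatedConvStack()(feats)   # same shape, long-range context mixed in
```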