Speech Synthesis and Generation, including TTS (SPE-SYNT)

FastDCTTS: Efficient Deep Convolutional Text-to-Speech


We propose an end-to-end speech synthesizer, Fast DCTTS, that synthesizes speech in real time on a single CPU thread. The proposed model is a carefully tuned lightweight network designed by applying multiple network reduction and fidelity improvement techniques. In addition, we propose a novel group highway activation that trades off computational efficiency against the regularization effect of the gating mechanism. We also introduce a new metric, elastic mel-cepstral distortion (EMCD), to measure the fidelity of the output mel-spectrogram.
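The abstract names EMCD but does not define it here. For orientation only, the sketch below computes the conventional mel-cepstral distortion (MCD) that the name refers to, assuming the two mel-cepstrum sequences are already time-aligned frame by frame and that the 0th (energy) coefficient has been dropped. The function name and input layout are illustrative assumptions, and this is not the paper's EMCD.

```python
import math


def mel_cepstral_distortion(ref, syn):
    """Mean conventional MCD in dB between two mel-cepstrum sequences.

    ref, syn: lists of frames, each frame a list of mel-cepstral
    coefficients. Assumption: the sequences are already time-aligned and
    the 0th (energy) coefficient has been removed by the caller.
    """
    if not ref or len(ref) != len(syn):
        raise ValueError("sequences must be non-empty and equally long")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref_frame, syn_frame in zip(ref, syn):
        if len(ref_frame) != len(syn_frame):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance between the two cepstral vectors.
        squared = sum((a - b) ** 2 for a, b in zip(ref_frame, syn_frame))
        total += scale * math.sqrt(2.0 * squared)
    # Average the per-frame distortion over all frames.
    return total / len(ref)
```

Identical sequences give a distortion of 0 dB, and larger values indicate a larger spectral mismatch. Because this version requires equal-length, pre-aligned sequences, it cannot compare outputs whose timing differs from the reference.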

IEEE-ICASSP2021_FastDCTTS(4829)_final_wo_video.pdf

FastDCTTS presentation slide (318)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

37 Views

Non-Parallel Many-to-Many Voice Conversion by Knowledge Transfer from a Text-to-Speech Model

In this paper, we present a simple but novel framework to train a non-parallel many-to-many voice conversion (VC) model based on the encoder-decoder architecture. It is observed that an encoder-decoder text-to-speech (TTS) model and an encoder-decoder VC model have the same structure.

poster.pdf

ICASSP2021 YU and MAK poster on non-parallel many-to-many VC (255)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

26 Views

Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles.

icassp_poster.pdf

Poster (398)

icassp_slides.pdf

Slides (412)

Categories: Audio Analysis and Synthesis
Speech Synthesis and Generation, including TTS (SPE-SYNT)

42 Views

FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention

Any-to-any voice conversion aims to convert speech from and to any speaker, including speakers unseen during training, which is much more challenging than one-to-one or many-to-many tasks but much more attractive in real-world scenarios. In this paper, we propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0, while the spectral features of the utterance(s) from the target speaker are obtained from log mel-spectrograms.

ICASSP_FragmentVC.pdf

Slides (312)

FragmentVC.pdf

Poster (251)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

18 Views

Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with a voice and speaking style similar to those of a reference speaker, given only a few reference samples. In this work, we investigate different speaker representations and propose integrating pretrained and learnable speaker representations. Among the different types of embeddings, the embedding pretrained by voice conversion achieves the best performance.

ICASSP_M2VoC.pdf

Slides (310)

M2VoC.pdf

Poster (210)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

14 Views

The THINKIT System for ICASSP2021 M2VoC Challenge


ZengqiangShang.pptx

ZengqiangShang.pptx (277)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

31 Views

CAMP: A Two-Stage Approach To Modelling Prosody In Context


Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word level representation.

poster.pdf

poster.pdf (255)

slides.pdf

slides.pdf (227)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

12 Views

Universal Neural Vocoding with Parallel Wavenet


icassp2021_universal_vocoding_with_pw.pdf

icassp2021_universal_vocoding_with_pw.pdf (260)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

5 Views

Universal Neural Vocoding with Parallel Wavenet


poster_a0_landscape.pdf

poster_a0_landscape.pdf (396)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

12 Views

Fast and High-Quality Singing Voice Synthesis System based on Convolutional Neural Networks

The present paper describes singing voice synthesis based on convolutional neural networks (CNNs). Singing voice synthesis systems based on deep neural networks (DNNs) are currently being proposed and are improving the naturalness of synthesized singing voices. As singing voices represent a rich form of expression, a powerful technique to model them accurately is required. In the proposed technique, long-term dependencies of singing voices are modeled by CNNs.

ICASSP2020_slide_20200417b.pdf

ICASSP2020_slide_20200417b.pdf (444)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)
Neural network learning (MLR-NNLR)

119 Views
