Speech Synthesis and Generation, including TTS (SPE-SYNT)

Transformer-based text-to-speech with weighted forced attention

This paper investigates state-of-the-art Transformer- and FastSpeech-based high-fidelity neural text-to-speech (TTS) with full-context label input for pitch accent languages. The aim is to realize faster training than conventional Tacotron-based models. Introducing phoneme durations into Tacotron-based TTS models improves both synthesis quality and stability.
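One common way to use phoneme durations in a duration-informed TTS model is to expand phoneme-level vectors to frame level by repeating each vector for its duration, in the style of a FastSpeech length regulator. The sketch below illustrates that idea only; the function name `expand_by_duration` and the toy values are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def expand_by_duration(phoneme_feats, durations):
    """Repeat each phoneme-level vector durations[i] times along the time axis."""
    phoneme_feats = np.asarray(phoneme_feats)
    durations = np.asarray(durations, dtype=int)
    if phoneme_feats.shape[0] != durations.shape[0]:
        raise ValueError("one duration per phoneme is required")
    if np.any(durations < 0):
        raise ValueError("durations must be non-negative")
    return np.repeat(phoneme_feats, durations, axis=0)

# Toy input (hypothetical): 3 phonemes with 2-dimensional features.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
durs = [2, 3, 1]  # durations in frames
frames = expand_by_duration(feats, durs)
print(frames.shape)  # (6, 2)
```

Because the output length is fixed by the durations, the decoder does not have to learn the alignment through attention alone, which is one plausible reason explicit durations help stability.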

ICASSP_2020_okamoto.pdf (510)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

289 Views

AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

This paper describes a method for voice conversion (VC) based on sequence-to-sequence (Seq2Seq) learning with attention and context preservation mechanisms. Seq2Seq learning has performed outstandingly on numerous sequence modeling tasks, such as speech synthesis and recognition, machine translation, and image captioning.

2019_05_ICASSP_KouTanaka.pdf (740)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT); Spoken Language Processing

73 Views

An End-to-End Network to Synthesize Intonation using a Generalized Command Response Model - Poster

The generalized command response (GCR) model represents intonation as a superposition of muscle responses to spike command signals. We have previously shown that the spikes can be predicted by a two-stage system, consisting of a recurrent neural network and a post-processing procedure, but the responses themselves were fixed dictionary atoms. We propose an end-to-end neural architecture that replaces the dictionary atoms with trainable second-order recurrent elements analogous to recursive filters. We demonstrate
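To make the idea of a second-order recurrent element concrete, the following is a minimal sketch of a second-order recursive filter driven by spike commands, with two filter outputs superposed into one contour. The coefficients, the critically damped parameterization, and all function names are assumptions chosen for illustration; in the architecture described above the coefficients would be trainable, and the actual formulation may differ.

```python
import numpy as np

def second_order_response(spikes, a1, a2):
    """Run the recurrence y[t] = spikes[t] + a1*y[t-1] + a2*y[t-2]."""
    y = np.zeros(len(spikes))
    for t in range(len(spikes)):
        y[t] = spikes[t]
        if t >= 1:
            y[t] += a1 * y[t - 1]
        if t >= 2:
            y[t] += a2 * y[t - 2]
    return y

def critically_damped(r):
    """Coefficients for a double real pole at r (0 < r < 1), an assumed choice."""
    return 2.0 * r, -r * r

n = 50
spikes_slow = np.zeros(n)
spikes_slow[5] = 1.0    # positive command at frame 5
spikes_fast = np.zeros(n)
spikes_fast[20] = -0.5  # negative command at frame 20

# Superpose the responses of a slow and a fast element into one contour.
contour = (second_order_response(spikes_slow, *critically_damped(0.9))
           + second_order_response(spikes_fast, *critically_damped(0.6)))
```

With a double pole at r, the impulse response is (k + 1) * r**k, so each spike produces a smooth rise and decay rather than an instantaneous jump.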

An End-to-End Network to Synthesize Intonation using a Generalized Command Response Model Poster.pdf

Presentation Poster (527)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT); Neural network learning (MLR-NNLR)

163 Views

Investigations of real-time Gaussian FFTNet and parallel WaveNet neural vocoders with simple acoustic features

This paper examines four approaches to improving real-time neural vocoders with simple acoustic features (SAF) constructed from fundamental frequency and mel-cepstra rather than mel-spectrograms.

icassp_2019_okamoto_1.pdf (723)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

348 Views

CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. The task is important but has been challenging because of the disadvantageous training conditions. Recently, CycleGAN-VC provided a breakthrough, performing comparably to a parallel VC method without relying on any extra data, modules, or time alignment procedures. However, a large gap remains between real target speech and converted speech, and bridging this gap is still a challenge.

Kaneko_CycleGAN-VC2_ICASSP_2019_poster.pdf (480)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

67 Views

CROSS-LINGUAL VOICE CONVERSION WITH BILINGUAL PHONETIC POSTERIORGRAM AND AVERAGE MODELING

This paper presents a cross-lingual voice conversion approach using bilingual Phonetic PosteriorGram (PPG) and average modeling. The proposed approach makes use of bilingual PPGs to represent speaker-independent features of speech signals from different languages in the same feature space. In particular, a bilingual PPG is formed by stacking two monolingual PPG vectors, which are extracted from two monolingual speech recognition systems. The conversion model is trained to learn the relationship between bilingual PPGs and the corresponding acoustic features.
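The stacking step described above can be sketched as a frame-wise concatenation of two posterior matrices. This is only an illustration: the function name `stack_bilingual_ppg`, the class counts, and the random posteriors are hypothetical stand-ins for the outputs of two real monolingual recognizers.

```python
import numpy as np

def stack_bilingual_ppg(ppg_lang_a, ppg_lang_b):
    """Concatenate two monolingual PPGs frame by frame along the feature axis."""
    ppg_lang_a = np.asarray(ppg_lang_a)
    ppg_lang_b = np.asarray(ppg_lang_b)
    if ppg_lang_a.shape[0] != ppg_lang_b.shape[0]:
        raise ValueError("both PPGs must cover the same number of frames")
    return np.concatenate([ppg_lang_a, ppg_lang_b], axis=1)

def softmax(x):
    """Row-wise softmax, used here only to fabricate valid toy posteriors."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
num_frames = 4
ppg_a = softmax(rng.normal(size=(num_frames, 5)))  # 5 phonetic classes (assumed)
ppg_b = softmax(rng.normal(size=(num_frames, 3)))  # 3 phonetic classes (assumed)

bilingual = stack_bilingual_ppg(ppg_a, ppg_b)
print(bilingual.shape)  # (4, 8)
```

Each row of the stacked matrix sums to 2 rather than 1, since it holds two separate posterior distributions side by side.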

Poster_ICASSP2019.pdf

cross lingual voice conversion with bilingual PPG (500)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

70 Views

POSTER OF PAPER 3809 (SLP-P20)

Poster for the paper "ENHANCED VIRTUAL SINGERS GENERATION BY INCORPORATING SINGING DYNAMICS TO PERSONALIZED TEXT-to-SPEECH-to-SINGING", presented at the "Speech Synthesis II" poster session of ICASSP 2019.

POSTER_PAPER_3809.pdf (423)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

26 Views

Cycle-consistent adversarial networks for non-parallel vocal effort based speaking style conversion

Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we propose the use of cycle-consistent adversarial networks (CycleGANs) for converting styles with varying vocal effort, and focus on conversion between normal and Lombard styles as a case study of this problem. We propose a parametric approach that uses the Pulse Model in Log domain (PML) vocoder to extract speech features. These features are mapped using the CycleGAN from utterances in the source style to the corresponding features of target speech.
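The cycle-consistency idea behind CycleGANs can be illustrated with a small numerical sketch: features mapped to the other style and back should return close to where they started. The toy linear mappings, feature shapes, and names below are assumptions for illustration; they stand in for trained generator networks operating on vocoder features and do not reproduce the study's setup.

```python
import numpy as np

def cycle_consistency_loss(x_a, x_b, g_ab, g_ba):
    """L1 cycle loss: mean|g_ba(g_ab(x_a)) - x_a| + mean|g_ab(g_ba(x_b)) - x_b|."""
    loss_a = np.mean(np.abs(g_ba(g_ab(x_a)) - x_a))
    loss_b = np.mean(np.abs(g_ab(g_ba(x_b)) - x_b))
    return loss_a + loss_b

rng = np.random.default_rng(0)
normal_feats = rng.normal(size=(10, 4))   # toy "normal style" features
lombard_feats = rng.normal(size=(10, 4))  # toy "Lombard style" features

# Hypothetical stand-ins for the two generators.
def to_lombard(x):
    return 2.0 * x + 1.0

def to_normal_exact(y):
    return (y - 1.0) / 2.0  # exact inverse of to_lombard

def to_normal_rough(y):
    return 0.4 * y          # deliberately imperfect inverse

exact = cycle_consistency_loss(normal_feats, lombard_feats, to_lombard, to_normal_exact)
rough = cycle_consistency_loss(normal_feats, lombard_feats, to_lombard, to_normal_rough)
print(exact < rough)  # True
```

In a full CycleGAN this term is combined with adversarial losses; on its own it only constrains the two mappings to be approximate inverses, which is what allows training without parallel utterances.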

Seshadri_ICASSP2019_final.pdf (479)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

25 Views

DNN-BASED SPEAKER-ADAPTIVE POSTFILTERING WITH LIMITED ADAPTATION DATA FOR STATISTICAL SPEECH SYNTHESIS SYSTEMS

Deep neural networks (DNNs) have been successfully deployed for acoustic modelling in statistical parametric speech synthesis (SPSS) systems. Moreover, DNN-based postfilters (PF) have also been shown to outperform conventional postfilters that are widely used in SPSS systems for increasing the quality of synthesized speech. However, existing DNN-based postfilters are trained with speaker-dependent databases. Given that SPSS systems can rapidly adapt to new speakers from generic models, there is a need for DNN-based postfilters that can adapt to new speakers with minimal adaptation data.