Speech Synthesis and Generation, including TTS (SPE-SYNT)

Cyborg Speech: Deep Multilingual Speech Synthesis for Generating Segmental Foreign Accent with Natural Prosody

We describe a new application of deep-learning-based speech synthesis, namely multilingual speech synthesis for generating controllable foreign accent. Specifically, we train a DBLSTM-based acoustic model on non-accented multilingual speech recordings from a speaker who is native in several languages. Natural prosody is achieved by copying durations and pitch contours from a pre-recorded utterance of the desired prompt. We call this paradigm "cyborg speech" as it combines human and machine speech parameters.

ghenter_cyborg_talk_20180429.pdf

Cyborg Speech presentation slides (480)

Categories:: Speech Synthesis and Generation, including TTS (SPE-SYNT)

36 Views

An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features

Although a WaveNet vocoder can synthesize more natural-sounding speech waveforms than conventional vocoders at sampling frequencies of 16 and 24 kHz, it is difficult to directly extend the sampling frequency to 48 kHz to cover the entire human audible frequency range for higher-quality synthesis, because the model becomes too large to train on a consumer GPU. To make a 48 kHz WaveNet vocoder trainable on a consumer GPU, this paper introduces a subband WaveNet architecture into a speaker-dependent WaveNet vocoder and proposes a subband WaveNet vocoder.

ICASSP_2018_subband_WaveNet_vocoder.pdf

ICASSP_2018_subband_WaveNet_vocoder.pdf (931)

Categories:: Speech Synthesis and Generation, including TTS (SPE-SYNT)

217 Views

NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS

ICASSP 2018 - Tacotron 2.pdf

ICASSP 2018 - Tacotron 2.pdf (842)

Categories:: Speech Synthesis and Generation, including TTS (SPE-SYNT)

77 Views

TEXT-TO-SPEECH SYNTHESIS USING STFT SPECTRA BASED ON LOW-/MULTI-RESOLUTION GENERATIVE ADVERSARIAL NETWORKS

saito18icassp_tts.pdf

saito18icassp_tts.pdf (593)

Categories:: Speech Synthesis and Generation, including TTS (SPE-SYNT)

27 Views

NON-PARALLEL VOICE CONVERSION USING VARIATIONAL AUTOENCODERS CONDITIONED BY PHONETIC POSTERIORGRAMS AND D-VECTORS

saito18icassp_vc_v2.pdf

saito18icassp_vc_v2.pdf (496)

Categories:: Speech Synthesis and Generation, including TTS (SPE-SYNT)

26 Views

On the use of WaveNet as a Statistical Vocoder

WaveNet_Vocoder_Poster_4cols_v2.pdf

WaveNet_Vocoder_Poster_4cols_v2.pdf (1139)

Categories:: Speech Synthesis and Generation, including TTS (SPE-SYNT)

15 Views

An Investigation of Noise Shaping with Perceptual Weighting for WaveNet-based Speech Generation

We propose a noise shaping method to improve the sound quality of speech signals generated by WaveNet, which is a convolutional neural network (CNN) that predicts a waveform sample sequence as a discrete symbol sequence. Speech signals generated by WaveNet often suffer from noise signals caused by the quantization error generated by representing waveform samples as discrete symbols and the prediction error of the CNN.
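The quantization error mentioned above can be illustrated with a small round trip through a discrete symbol alphabet. This is only a sketch: it assumes 8-bit mu-law companding (256 symbols), which the abstract does not state, and the function names are made up for illustration. It does not implement the proposed noise shaping method; it only shows where the quantization noise comes from.

```python
import math

MU = 255  # assumption: 8-bit mu-law companding, i.e. 256 discrete symbols


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer symbol in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1.0) / 2.0 * mu))


def mu_law_decode(symbol, mu=MU):
    """Map an integer symbol in [0, mu] back to a sample in [-1, 1]."""
    compressed = 2.0 * symbol / mu - 1.0
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)


def quantization_error(samples, mu=MU):
    """Per-sample difference between the original and the decoded symbol."""
    return [s - mu_law_decode(mu_law_encode(s, mu), mu) for s in samples]


# A 440 Hz tone sampled at 16 kHz (10 ms), used as a stand-in for speech.
samples = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
errors = quantization_error(samples)
max_error = max(abs(e) for e in errors)
print(f"largest absolute quantization error: {max_error:.5f}")
```

Even with a perfect predictor, this residual remains in the generated waveform; the prediction error of the network adds to it.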

ICASSP2018_NS.pdf

Poster pdf (1019)

Categories:: Speech Synthesis and Generation, including TTS (SPE-SYNT)

297 Views

On the analysis of training data for wavenet-based speech synthesis

In this paper, we analyze how much data, and how consistent and accurate that data must be, for a WaveNet-based speech synthesis method to generate speech of good quality. We do this by adding artificial noise to the description of our training data and observing how well WaveNet trains and produces speech. More specifically, we add noise to both phonetic segmentation and annotation accuracy, and we also reduce the size of the training data by using fewer sentences during training of a WaveNet model.
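The three kinds of degradation described above can be sketched as follows. This is a hypothetical illustration, not the authors' procedure: the uniform boundary jitter, the random label substitution, the function names, and all parameter values are assumptions chosen to keep the example small.

```python
import random


def jitter_boundaries(boundaries, max_shift, rng):
    """Shift interior phone boundaries by uniform noise, keeping them ordered.

    boundaries: increasing times in seconds, including utterance start and end.
    max_shift: largest absolute shift in seconds (hypothetical noise level).
    """
    noisy = list(boundaries)
    for i in range(1, len(boundaries) - 1):
        shifted = boundaries[i] + rng.uniform(-max_shift, max_shift)
        # Clamp between the already-shifted left neighbour and the original
        # right neighbour so segments never get a negative duration.
        noisy[i] = min(max(shifted, noisy[i - 1]), boundaries[i + 1])
    return noisy


def corrupt_labels(labels, inventory, error_rate, rng):
    """Replace each phone label with a different one with probability error_rate."""
    noisy = []
    for label in labels:
        if rng.random() < error_rate:
            noisy.append(rng.choice([p for p in inventory if p != label]))
        else:
            noisy.append(label)
    return noisy


def subsample_sentences(sentences, fraction, rng):
    """Keep a random subset of the training sentences."""
    keep = max(1, int(len(sentences) * fraction))
    return rng.sample(sentences, keep)


rng = random.Random(0)
boundaries = [0.0, 0.10, 0.25, 0.40]
labels = ["k", "ae", "t"]
print(jitter_boundaries(boundaries, 0.02, rng))
print(corrupt_labels(labels, ["k", "ae", "t", "s", "ih"], 0.3, rng))
print(subsample_sentences(list(range(100)), 0.25, rng))
```

Sweeping `max_shift`, `error_rate`, and `fraction` independently would give one axis per factor, which matches the spirit of asking how much, how consistent, and how accurate the data must be.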

poster.pdf

poster.pdf (598)

Categories:: Speech Synthesis and Generation, including TTS (SPE-SYNT)

36 Views

MODELING-BY-GENERATION-STRUCTURED NOISE COMPENSATION ALGORITHM FOR GLOTTAL VOCODING SPEECH SYNTHESIS SYSTEM

This paper proposes a novel noise compensation algorithm for a glottal excitation model in a deep learning (DL)-based speech synthesis system.
To generate high-quality speech synthesis outputs, the balance between harmonic and noise components of the glottal excitation signal should be well-represented by the DL network.
However, it is hard to accurately model the noise component because the DL training process inevitably results in statistically smoothed outputs; thus, it is essential to introduce an additional noise compensation process.

ICASSP2018_MbG_glottal.pdf

ICASSP2018_MbG_glottal.pdf (729)

ICASSP2018_MbG_glottal.pdf

ICASSP2018_MbG_glottal.pdf (412)

Categories:: Speech Synthesis and Generation, including TTS (SPE-SYNT)

23 Views

CONVOLUTIONAL SEQUENCE TO SEQUENCE MODEL WITH NON-SEQUENTIAL GREEDY DECODING FOR GRAPHEME TO PHONEME CONVERSION

The greedy decoding method used in the conventional sequence-to-sequence models is prone to producing a model with a compounding of errors, mainly because it makes inferences in a fixed order, regardless of whether or not the model’s previous guesses are correct. We propose a non-sequential greedy decoding method that generalizes the greedy decoding schemes proposed in the past. The proposed method determines not only which token to consider, but also which position in the output sequence to infer at each inference step.
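The idea of choosing both the position and the token at each step can be sketched with a toy scorer. Everything here is an assumption for illustration: `toy_scores` is a hand-written stand-in for a trained model, the output length is fixed in advance, and the selection rule (take the single most probable position/token pair) is one plausible reading of the abstract rather than the paper's exact algorithm.

```python
def toy_scores(partial):
    """Hypothetical scorer: {position: {token: probability}} for unfilled slots.

    Position 0 is ambiguous until position 1 has been decided.
    """
    scores = {}
    if partial[0] is None:
        if partial[1] == "B":
            scores[0] = {"A": 0.8, "X": 0.2}
        else:
            scores[0] = {"A": 0.4, "X": 0.6}
    if partial[1] is None:
        scores[1] = {"B": 0.9, "Y": 0.1}
    if partial[2] is None:
        scores[2] = {"C": 0.7, "D": 0.3}
    return scores


def non_sequential_greedy_decode(score_fn, length):
    """Fill the most confident (position, token) pair at every step."""
    output = [None] * length
    order = []
    while any(tok is None for tok in output):
        scores = score_fn(output)
        _, pos, tok = max(
            (prob, pos, tok)
            for pos, dist in scores.items()
            for tok, prob in dist.items()
        )
        output[pos] = tok
        order.append(pos)
    return output, order


def left_to_right_greedy_decode(score_fn, length):
    """Conventional greedy decoding in a fixed left-to-right order."""
    output = [None] * length
    for pos in range(length):
        dist = score_fn(output)[pos]
        output[pos] = max(dist, key=dist.get)
    return output


print(left_to_right_greedy_decode(toy_scores, 3))   # ['X', 'B', 'C']
print(non_sequential_greedy_decode(toy_scores, 3))  # (['A', 'B', 'C'], [1, 0, 2])
```

In this toy case the fixed-order decoder commits to the ambiguous first position too early, while the non-sequential decoder resolves the confident middle position first and uses it as context.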

NSGD_poster_at_ICASSP2018_v1.1.pdf

NSGD_poster_at_ICASSP2018_v1.1.pdf (731)

Categories:: Applications in Music and Audio Processing (MLR-MUSI)
Speech Synthesis and Generation, including TTS (SPE-SYNT)

379 Views
