Sorry, you need to enable JavaScript to visit this website.

We describe a new application of deep-learning-based speech synthesis, namely multilingual speech synthesis for generating controllable foreign accent. Specifically, we train a DBLSTM-based acoustic model on non-accented multilingual speech recordings from a speaker native in several languages. By copying durations and pitch contours from a pre-recorded utterance of the desired prompt, natural prosody is achieved. We call this paradigm "cyborg speech" as it combines human and machine speech parameters.


Although a WaveNet vocoder can synthesize more natural-sounding speech waveforms than conventional vocoders with sampling frequencies of 16 and 24 kHz, it is difficult to directly extend the sampling frequency to 48 kHz to cover the entire human audible frequency range for higher-quality synthesis because the model size becomes too large to train with a consumer GPU. For a WaveNet vocoder with a sampling frequency of 48 kHz with a consumer GPU, this paper introduces a subband WaveNet architecture to a speaker-dependent WaveNet vocoder and proposes a subband WaveNet vocoder.


We propose a noise shaping method to improve the sound quality of speech signals generated by WaveNet, which is a convolutional neural network (CNN) that predicts a waveform sample sequence as a discrete symbol sequence. Speech signals generated by WaveNet often suffer from noise signals caused by the quantization error generated by representing waveform samples as discrete symbols and the prediction error of the CNN.


In this paper, we analyze how much, how consistent and how accurate data WaveNet-based speech synthesis method needs to be abletogeneratespeechofgoodquality. Wedothisbyaddingartificial noise to the description of our training data and observing how well WaveNet trains and produces speech. More specifically, we add noise to both phonetic segmentation and annotation accuracy, and we also reduce the size of training data by using a fewer number of sentences during training of a WaveNet model.


This paper proposes a novel noise compensation algorithm for a glottal excitation model in a deep learning (DL)-based speech synthesis system.
To generate high-quality speech synthesis outputs, the balance between harmonic and noise components of the glottal excitation signal should be well-represented by the DL network.
However, it is hard to accurately model the noise component because the DL training process inevitably results in statistically smoothed outputs; thus, it is essential to introduce an additional noise compensation process.


The greedy decoding method used in the conventional sequence-to-sequence models is prone to producing a model with a compounding
of errors, mainly because it makes inferences in a fixed order, regardless of whether or not the model’s previous guesses are correct.
We propose a non-sequential greedy decoding method that generalizes the greedy decoding schemes proposed in the past. The proposed
method determines not only which token to consider, but also which position in the output sequence to infer at each inference step.