
We propose a noise shaping method to improve the sound quality of speech signals generated by WaveNet, a convolutional neural network (CNN) that predicts a waveform sample sequence as a discrete symbol sequence. Speech generated by WaveNet often suffers from noise caused both by the quantization error introduced when waveform samples are represented as discrete symbols and by the prediction error of the CNN.
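The quantization error mentioned above comes from mapping continuous waveform samples onto a small discrete alphabet; WaveNet conventionally uses 8-bit µ-law companding (µ = 255, 256 symbols). A minimal sketch of that encode/decode round trip and the residual error it leaves (standard µ-law formula; the grid of test samples is arbitrary):

```python
import numpy as np

def mulaw_encode(x, mu=255):
    # Compand x in [-1, 1], then round to one of mu + 1 discrete symbols.
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int32)

def mulaw_decode(q, mu=255):
    # Map symbols back to [-1, 1] and expand the companding.
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.linspace(-1.0, 1.0, 1001)
err = x - mulaw_decode(mulaw_encode(x))
print(np.max(np.abs(err)))  # worst-case quantization error on this grid
```

Because µ-law steps widen toward full scale, the error is largest near |x| = 1; this residual is exactly the kind of structured noise a shaping method can push into perceptually less audible frequency bands.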


In this paper, we analyze how much data a WaveNet-based speech synthesis method needs, and how consistent and accurate that data must be, for it to generate speech of good quality. We do this by adding artificial noise to the description of our training data and observing how well WaveNet trains and produces speech. More specifically, we add noise to both the phonetic segmentation and the annotation accuracy, and we also reduce the size of the training data by using fewer sentences when training a WaveNet model.
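One simple way to simulate the segmentation noise described above is to jitter phone boundary times by a bounded random shift. The sketch below is only illustrative (the shift range and millisecond units are assumptions, not the paper's protocol):

```python
import random

def jitter_boundaries(boundaries, max_shift_ms=20, seed=0):
    # Toy version of artificial segmentation noise: perturb each interior
    # phone boundary by a uniform shift, keeping the utterance endpoints
    # fixed and the boundary sequence sorted.
    rng = random.Random(seed)
    out = [boundaries[0]]
    for b in boundaries[1:-1]:
        out.append(b + rng.uniform(-max_shift_ms, max_shift_ms))
    out.append(boundaries[-1])
    return sorted(out)

noisy = jitter_boundaries([0, 100, 200, 300])
```

Training the same model on progressively noisier boundaries then isolates how sensitive synthesis quality is to segmentation accuracy.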


This paper proposes a novel noise compensation algorithm for a glottal excitation model in a deep learning (DL)-based speech synthesis system.
To generate high-quality speech synthesis outputs, the balance between harmonic and noise components of the glottal excitation signal should be well-represented by the DL network.
However, it is hard to accurately model the noise component because the DL training process inevitably results in statistically smoothed outputs; thus, it is essential to introduce an additional noise compensation process.
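A toy illustration of why the smoothing happens, not the paper's algorithm: an MSE-style objective drives the network toward the conditional mean of its targets, which cancels the zero-mean noise component of the excitation and leaves only the harmonic part. A compensation step must then reintroduce noise energy matched to what averaging removed (all signal shapes and levels below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200) / 200.0
harmonic = np.sin(2 * np.pi * 5 * t)                  # shared harmonic part
frames = harmonic + 0.3 * rng.standard_normal((100, t.size))

smoothed = frames.mean(axis=0)            # what mean-seeking training tends to output
noise_power = frames.var(axis=0).mean()   # noise energy that the averaging removed
compensated = smoothed + np.sqrt(noise_power) * rng.standard_normal(t.size)

print(frames.var(), smoothed.var(), compensated.var())
```

The smoothed signal loses almost all of the noise variance; adding back noise with the measured power restores the harmonic-to-noise balance, which is the intuition behind an explicit noise compensation stage.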


The greedy decoding method used in conventional sequence-to-sequence models is prone to compounding errors, mainly because it
makes inferences in a fixed order, regardless of whether or not the model’s previous guesses are correct.
We propose a non-sequential greedy decoding method that generalizes the greedy decoding schemes proposed in the past. The proposed
method determines not only which token to consider, but also which position in the output sequence to infer at each inference step.


Speaking style plays an important role in the expressivity of speech for communication; hence it is also important for synthetic speech. Speaking style adaptation faces the difficulty that data for specific styles may be limited and hard to obtain in large amounts. A possible solution is to leverage data from speaking styles that are more readily available to train the speech synthesizer, and then adapt it to the target style for which data is scarce.


In previous work we presented a Sparse, Anchor-Based Representation of speech (SABR) that uses phonemic “anchors” to represent an utterance with a set of sparse non-negative weights. SABR is speaker-independent: combining weights from a source speaker with anchors from a target speaker can be used for voice conversion. Here, we present an extension of the original SABR that significantly improves voice conversion synthesis.
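The anchor-swapping idea can be sketched in a few lines. This toy is illustrative only: the dimensions and anchor matrices are random placeholders, and it uses ordinary least squares on a noiseless synthetic frame where SABR proper solves a sparse non-negative coding problem:

```python
import numpy as np

rng = np.random.default_rng(1)
# Columns are phonemic "anchor" spectra, one matrix per speaker (made-up data).
src_anchors = np.abs(rng.standard_normal((20, 5)))   # source speaker
tgt_anchors = np.abs(rng.standard_normal((20, 5)))   # target speaker, same phonemes

w_true = np.array([0.7, 0.0, 0.3, 0.0, 0.0])         # sparse, non-negative weights
frame = src_anchors @ w_true                          # a source-speaker frame

# Recover the weights from the source frame (least squares stands in for
# the sparse non-negative solver here, since the toy frame is noiseless).
w = np.linalg.lstsq(src_anchors, frame, rcond=None)[0]
converted = tgt_anchors @ w                           # reuse weights with target anchors
```

Because the weights encode *what* is being said and the anchors encode *who* is saying it, applying source-derived weights to target anchors performs the voice conversion.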