
Improving LPCNet-based Text-to-Speech with Linear Prediction-structured Mixture Density Network

Citation Author(s):
Min-Jae Hwang, Eunwoo Song, Ryuichi Yamamoto, Frank Soong, and Hong-Goo Kang
Submitted by:
Min-Jae Hwang
Last updated:
14 May 2020 - 2:40am
Document Type:
Presentation Slides
Event:
 

In this paper, we propose an improved LPCNet vocoder using a linear prediction (LP)-structured mixture density network (MDN).
The recently proposed LPCNet vocoder enables high-quality, lightweight speech synthesis by combining a vocal tract LP filter with a WaveRNN-based vocal source (i.e., excitation) generator.
However, the quality of synthesized speech is often unstable because the vocal source component is insufficiently represented by the mu-law quantization method, and the model is trained without considering the entire speech production mechanism.
To address this problem, we first introduce LP-MDN, which enables the autoregressive neural vocoder to structurally represent the interactions between the vocal tract and vocal source components.
Then, we propose to incorporate the LP-MDN into the LPCNet vocoder by replacing the conventional discretized output with a continuous density distribution.
The experimental results verify that the proposed system provides high-quality synthetic speech, achieving a mean opinion score of 4.41 within a text-to-speech framework.
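
As a rough illustration of the idea above, the following sketch shows one way an LP-structured mixture density output could be formed: the network predicts a continuous mixture over the excitation, and the LP prediction from past samples shifts the mixture means so that the same mixture describes the speech sample. This is a minimal sketch under stated assumptions, not the authors' implementation. The Gaussian components, the function names (`lp_prediction`, `lp_mdn_speech_params`, `gaussian_mixture_nll`), and all numeric values are hypothetical; the abstract does not specify the component distribution or the model sizes.

```python
import numpy as np


def lp_prediction(history, lp_coeffs):
    """Linear prediction p_t = sum_k a_k * x_{t-k}.

    history[-1] is the most recent sample x_{t-1};
    lp_coeffs[0] is a_1, lp_coeffs[1] is a_2, and so on.
    """
    order = len(lp_coeffs)
    past = history[-order:][::-1]  # x_{t-1}, x_{t-2}, ...
    return float(np.dot(lp_coeffs, past))


def lp_mdn_speech_params(weights, exc_means, scales, p_t):
    """Turn excitation mixture parameters into speech mixture parameters.

    Assumption: since x_t = p_t + e_t, adding p_t to every component mean
    converts a density over the excitation e_t into a density over x_t.
    Weights and scales are left unchanged.
    """
    return weights, exc_means + p_t, scales


def gaussian_mixture_nll(x, weights, means, scales):
    """Negative log-likelihood of scalar x under a 1-D Gaussian mixture."""
    norm = scales * np.sqrt(2.0 * np.pi)
    comp = weights * np.exp(-0.5 * ((x - means) / scales) ** 2) / norm
    return float(-np.log(np.sum(comp)))


# Toy values (hypothetical): three past samples and a 2nd-order LP filter.
history = np.array([0.10, 0.20, 0.30])
lp_coeffs = np.array([0.5, -0.2])
p_t = lp_prediction(history, lp_coeffs)

# Hypothetical network outputs for the excitation mixture.
weights = np.array([0.7, 0.3])
exc_means = np.array([0.0, 0.05])
scales = np.array([0.02, 0.10])

w, speech_means, s = lp_mdn_speech_params(weights, exc_means, scales, p_t)

# Scoring the speech sample under the shifted mixture is equivalent to
# scoring the excitation x_t - p_t under the unshifted mixture.
x_t = 0.12
nll_speech = gaussian_mixture_nll(x_t, w, speech_means, s)
nll_excitation = gaussian_mixture_nll(x_t - p_t, weights, exc_means, scales)
```

Because the shift only moves the means, a training loss can be computed directly on the continuous speech sample without mu-law quantization, while the network itself still only has to model the excitation.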
