An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features

Although a WaveNet vocoder can synthesize more natural-sounding speech waveforms than conventional vocoders at sampling frequencies of 16 and 24 kHz, directly extending the sampling frequency to 48 kHz, which would cover the entire human audible frequency range for higher-quality synthesis, is difficult because the model becomes too large to train on a consumer GPU. To realize a 48 kHz WaveNet vocoder that can be trained on a consumer GPU, this paper applies a subband WaveNet architecture to a speaker-dependent WaveNet vocoder and proposes a subband WaveNet vocoder. In experiments, each conditional subband WaveNet, operating at a sampling frequency of 8 kHz, was trained successfully on a consumer GPU. Subjective evaluations with a Japanese male speech corpus indicate that the proposed subband WaveNet vocoder with simple 36-dimensional acoustic features significantly outperformed conventional source-filter model-based vocoders, including STRAIGHT with 86-dimensional features.
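
The sampling-rate arithmetic behind the subband approach can be sketched as follows. This is only an illustration: the abstract gives the fullband rate (48 kHz) and the per-subband rate (8 kHz), but not the number of subbands or the filterbank design. The function name `subband_plan` and the critically sampled uniform filterbank assumed for the band-width figure are hypothetical.

```python
def subband_plan(fullband_hz: int, subband_hz: int) -> dict:
    """Derive basic rate figures for a subband decomposition.

    Assumption (not from the abstract): a uniform, critically sampled
    filterbank, so each band spans (Nyquist frequency / decimation factor).
    """
    if fullband_hz % subband_hz != 0:
        raise ValueError("fullband rate must be an integer multiple of the subband rate")

    # Each subband signal is the fullband signal band-limited and
    # downsampled by this factor.
    decimation = fullband_hz // subband_hz

    nyquist_hz = fullband_hz / 2
    return {
        "decimation_factor": decimation,
        "fullband_nyquist_hz": nyquist_hz,
        # Width of one band under the critically sampled assumption.
        "band_width_hz": nyquist_hz / decimation,
        # Each subband model generates this many samples per second of audio.
        "samples_per_second_per_model": subband_hz,
    }


plan = subband_plan(48_000, 8_000)
for key, value in plan.items():
    print(f"{key}: {value}")
```

Under these assumptions, each subband model generates one sixth as many samples per second as a fullband 48 kHz model would, which fits the abstract's report that each 8 kHz subband WaveNet could be trained on a consumer GPU.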

ICASSP_2018_subband_WaveNet_vocoder.pdf
