Transformer-based text-to-speech with weighted forced attention
- Citation Author(s):
- Submitted by: Takuma Okamoto
- Last updated: 6 May 2020 - 9:36pm
- Document Type: Presentation Slides
- Document Year: 2020
- Event: ICASSP 2020
- Presenters: Takuma Okamoto
- Paper Code: SPE-P3.10
This paper investigates state-of-the-art Transformer- and FastSpeech-based high-fidelity neural text-to-speech (TTS) with full-context label input for pitch accent languages. The aim is to realize faster training than conventional Tacotron-based models. Introducing phoneme durations into Tacotron-based TTS models improves both synthesis quality and stability. Therefore, a Transformer-based acoustic model with weighted forced attention obtained from phoneme durations is proposed to improve synthesis accuracy and stability, where both encoder–decoder attention and forced attention are used with a weighting factor. Furthermore, FastSpeech without a duration predictor, in which the phoneme durations are predicted by another conventional model, is also investigated. The results of experiments using a Japanese female corpus and the WaveGlow vocoder indicate that the proposed Transformer using forced attention with a weighting factor of 0.5 outperforms other models, and removing the duration predictor from FastSpeech improves synthesis quality, although the proposed weighted forced attention does not improve synthesis stability.
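The core idea of the weighted forced attention can be illustrated with a minimal sketch: blend the learned encoder–decoder attention with a hard alignment matrix derived from phoneme durations, using a weighting factor (0.5 performed best in the paper's experiments). The abstract does not specify implementation details such as whether the blend is applied before or after the softmax, so the code below is an assumption-laden illustration rather than the authors' implementation; all function names, tensor shapes, and the post-softmax blending are hypothetical.

```python
# Illustrative sketch (not the authors' code) of weighted forced attention:
# combine the learned encoder-decoder attention with a duration-based hard
# alignment, weighted by a factor w.

import torch
import torch.nn.functional as F


def duration_to_alignment(durations: torch.Tensor, n_frames: int) -> torch.Tensor:
    """Build a hard (0/1) phoneme-to-frame alignment matrix from durations.

    durations: (n_phonemes,) integer frame counts per phoneme.
    Returns: (n_frames, n_phonemes) matrix whose row t is one-hot on the
    phoneme that frame t belongs to.
    """
    alignment = torch.zeros(n_frames, durations.numel())
    frame = 0
    for p, d in enumerate(durations.tolist()):
        alignment[frame:frame + d, p] = 1.0
        frame += d
    return alignment


def weighted_forced_attention(scores: torch.Tensor,
                              durations: torch.Tensor,
                              w: float = 0.5) -> torch.Tensor:
    """Blend learned attention with duration-based forced attention.

    scores: (n_frames, n_phonemes) raw encoder-decoder attention scores.
    durations: (n_phonemes,) phoneme durations in frames.
    w: weighting factor; w = 1.0 uses only forced attention,
       w = 0.0 uses only the learned attention.
    """
    learned = F.softmax(scores, dim=-1)                        # soft alignment
    forced = duration_to_alignment(durations, scores.size(0))  # hard alignment
    return w * forced + (1.0 - w) * learned                    # rows still sum to 1


if __name__ == "__main__":
    durations = torch.tensor([3, 2, 4])            # 3 phonemes, 9 frames total
    scores = torch.randn(int(durations.sum()), 3)  # dummy decoder attention scores
    attn = weighted_forced_attention(scores, durations, w=0.5)
    print(attn.shape)  # torch.Size([9, 3])
```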
IEEE Xplore: https://ieeexplore.ieee.org/document/9053915
Presentation video: https://confcats-event-sessions.s3.amazonaws.com/icassp20/videos/2269.mp4
Demo samples: https://ast-astrec.nict.go.jp/demo_samples/icassp_2020_okamoto/index.html