Transformer-based text-to-speech with weighted forced attention

Abstract: 

This paper investigates state-of-the-art Transformer- and FastSpeech-based high-fidelity neural text-to-speech (TTS) with full-context label input for pitch accent languages. The aim is to realize faster training than conventional Tacotron-based models. Introducing phoneme durations into Tacotron-based TTS models improves both synthesis quality and stability. Therefore, a Transformer-based acoustic model with weighted forced attention obtained from phoneme durations is proposed to improve synthesis accuracy and stability, where both encoder–decoder attention and forced attention are used with a weighting factor. Furthermore, FastSpeech without a duration predictor, in which the phoneme durations are predicted by another conventional model, is also investigated. The results of experiments using a Japanese female corpus and the WaveGlow vocoder indicate that the proposed Transformer using forced attention with a weighting factor of 0.5 outperforms other models, and removing the duration predictor from FastSpeech improves synthesis quality, although the proposed weighted forced attention does not improve synthesis stability.
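
As a rough illustration of the weighted forced attention described in the abstract, the sketch below blends a hard alignment matrix built from phoneme durations with the learned encoder–decoder attention matrix using a single weighting factor. This is a minimal sketch under the assumption that the weighting is a simple convex combination; the function names are hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def forced_alignment(durations, n_frames):
    """Hard alignment matrix (n_frames x n_phonemes) built from
    per-phoneme durations: decoder frame t attends only to the
    phoneme whose duration span covers t."""
    align = np.zeros((n_frames, len(durations)))
    start = 0
    for p, dur in enumerate(durations):
        align[start:start + dur, p] = 1.0
        start += dur
    return align

def weighted_forced_attention(soft_attn, durations, w=0.5):
    """Blend duration-derived forced attention with the learned
    encoder-decoder attention (assumed convex combination).
    A weighting factor of 0.5 performed best in the paper's
    experiments."""
    forced = forced_alignment(durations, soft_attn.shape[0])
    return w * forced + (1.0 - w) * soft_attn

# Toy example: 3 phonemes spanning 6 decoder frames,
# with a uniform learned attention matrix.
durations = [2, 3, 1]
soft_attn = np.full((6, 3), 1.0 / 3.0)
print(weighted_forced_attention(soft_attn, durations, w=0.5))
```

With w = 1 this reduces to purely duration-driven (forced) alignment; with w = 0 it falls back to the standard learned attention.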

IEEE Xplore: https://ieeexplore.ieee.org/document/9053915

Presentation video: https://confcats-event-sessions.s3.amazonaws.com/icassp20/videos/2269.mp4

Demo samples: https://ast-astrec.nict.go.jp/demo_samples/icassp_2020_okamoto/index.html

Paper Details

Authors: Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
Submitted On: 6 May 2020 - 9:36pm
Short Link: http://sigport.org/5128
Type: Presentation Slides
Event: ICASSP 2020
Presenter's Name: Takuma Okamoto
Paper Code: SPE-P3.10
Document Year: 2020

Document Files

ICASSP_2020_okamoto.pdf

Cite

[1] Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai, "Transformer-based text-to-speech with weighted forced attention", IEEE SigPort, 2020. [Online]. Available: http://sigport.org/5128. Accessed: Jun. 06, 2020.
@article{5128-20,
  author    = {Takuma Okamoto and Tomoki Toda and Yoshinori Shiga and Hisashi Kawai},
  title     = {Transformer-based text-to-speech with weighted forced attention},
  publisher = {IEEE SigPort},
  url       = {http://sigport.org/5128},
  year      = {2020}
}