Speech Synthesis and Generation, including TTS (SPE-SYNT)

EPOCH EXTRACTION FROM A SPEECH SIGNAL USING GAMMATONE WAVELETS IN A SCATTERING NETWORK

In speech production, epochs are the glottal closure instants at which significant energy is released from the lungs. Accurate epoch extraction is important in speech synthesis, analysis, and pitch-oriented studies. The time-varying characteristics of the source and the system, together with the attenuation of low-frequency components by telephone channels, make epoch estimation from a speech signal a challenging task.

Epoch_estimation_ICASSP_2020_v1.pdf

Epoch Extraction using gammatone wavelets (342)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

32 Views

ICASSP 2020 Presentation Poster Slides

ONE-SHOT VOICE CONVERSION USING STAR-GAN

A0_poster_new.pptx (905)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

815 Views

Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attention mechanisms: location-relative GMM-based mechanisms and additive energy-based mechanisms.
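As a rough illustration of what "location-relative" means here, the following is a minimal sketch of one decoder step of a GMM-based attention mechanism. It is not the authors' implementation: the function names, the softmax/softplus parameterization, and the `params` layout are assumptions chosen for this example. The point it shows is that the alignment depends only on encoder positions and on how far the mixture means have advanced, with no content-based query/key comparison.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def gmm_attention_step(params, prev_means, num_positions):
    """One decoder step of a location-relative GMM attention (illustrative).

    params: array of shape (K, 3) with unnormalized (weight, delta, scale)
            values for K mixture components. In a real model these would be
            predicted from the decoder state; here they are plain inputs.
    prev_means: array of shape (K,), component means from the previous step.
    num_positions: number of encoder positions to attend over.
    Returns (alignment of shape (num_positions,), updated means).
    """
    w_hat, delta_hat, sigma_hat = params[:, 0], params[:, 1], params[:, 2]

    # Softmax over components gives mixture weights that sum to one.
    weights = np.exp(w_hat - w_hat.max())
    weights /= weights.sum()

    # Softplus keeps the step non-negative, so the means never move backward.
    delta = softplus(delta_hat)
    sigma = softplus(sigma_hat) + 1e-3  # small floor avoids division by zero

    means = prev_means + delta

    # Evaluate each Gaussian at every encoder position j = 0..num_positions-1.
    j = np.arange(num_positions)[None, :]
    density = np.exp(-0.5 * ((j - means[:, None]) / sigma[:, None]) ** 2)
    density /= sigma[:, None] * np.sqrt(2.0 * np.pi)

    alignment = (weights[:, None] * density).sum(axis=0)
    return alignment, means
```

Because the means can only increase, the alignment moves monotonically forward through the input, which is one intuition for why such mechanisms may be less prone to skipping or repeating text on long inputs.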

Location-Relative Attention (slides).pdf (662)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

26 Views

Improving LPCNet-based Text-to-Speech with Linear Prediction-structured Mixture Density Network

In this paper, we propose an improved LPCNet vocoder using a linear prediction (LP)-structured mixture density network (MDN).
The recently proposed LPCNet vocoder has successfully achieved high-quality and lightweight speech synthesis systems by combining a vocal tract LP filter with a WaveRNN-based vocal source (i.e., excitation) generator.
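The source-filter split described above can be sketched in a few lines. This is only an illustration of the linear prediction recursion, in which each sample is a weighted sum of past samples plus an excitation term; it is not the LPCNet code, and it does not include the proposed mixture density network. The function names and the fixed coefficient vector are assumptions for this example. In an actual vocoder the excitation would come from the neural generator and the coefficients would vary per frame.

```python
import numpy as np


def lp_synthesize(excitation, lp_coeffs):
    """Run excitation through an all-pole LP filter (illustrative).

    Implements s[t] = e[t] + sum_{k=1..p} a_k * s[t-k], where a_k are the
    entries of lp_coeffs and samples before t = 0 are taken to be zero.
    """
    order = len(lp_coeffs)
    signal = np.zeros(len(excitation))
    for t in range(len(excitation)):
        prediction = 0.0
        for k in range(1, order + 1):
            if t - k >= 0:
                prediction += lp_coeffs[k - 1] * signal[t - k]
        signal[t] = prediction + excitation[t]
    return signal


def lp_residual(signal, lp_coeffs):
    """Inverse operation: recover e[t] = s[t] - sum_k a_k * s[t-k]."""
    order = len(lp_coeffs)
    residual = np.zeros(len(signal))
    for t in range(len(signal)):
        prediction = 0.0
        for k in range(1, order + 1):
            if t - k >= 0:
                prediction += lp_coeffs[k - 1] * signal[t - k]
        residual[t] = signal[t] - prediction
    return residual
```

Since the filter carries the spectral envelope, the generator only has to model the residual, which is part of why this kind of design can stay lightweight.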

20200507_minjae.pdf (371)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

19 Views

EMOTIONAL VOICE CONVERSION USING MULTITASK LEARNING WITH TEXT-TO-SPEECH

Voice conversion (VC) is a task that alters the voice of a person to suit different styles while preserving the linguistic content. The previous state of the art in VC was based on the sequence-to-sequence (seq2seq) model, which could lose linguistic information. One attempt to overcome this problem used textual supervision; however, it required explicit alignment, so the benefit of using a seq2seq model was lost. In this study, a voice converter that utilizes multitask learning with text-to-speech (TTS) is presented.

ICASSP_v0.1.pdf

Slides (441)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

31 Views

PARALLEL WAVEGAN: A FAST WAVEFORM GENERATION MODEL BASED ON GENERATIVE ADVERSARIAL NETWORKS WITH MULTI-RESOLUTION SPECTROGRAM

We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require density distillation used in the conventional teacher-student framework, the entire model can be easily trained.
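To make the idea of a multi-resolution spectrogram loss concrete, here is a minimal NumPy sketch. It assumes one common formulation, a spectral convergence term plus a log-magnitude term averaged over several STFT settings; the abstract does not spell out the exact terms, and the function names and the `(fft_size, hop, win_length)` values below are illustrative choices, not the paper's configuration. A trainable version would use an autodiff framework rather than NumPy.

```python
import numpy as np


def stft_magnitude(x, fft_size, hop, win_length):
    """Magnitude spectrogram from Hann-windowed frames (no padding)."""
    window = np.hanning(win_length)
    num_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_length] * window for i in range(num_frames)]
    )
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=1))


def stft_loss(target, generated, fft_size, hop, win_length, eps=1e-7):
    """Spectral convergence plus mean absolute log-magnitude difference."""
    s_t = np.maximum(stft_magnitude(target, fft_size, hop, win_length), eps)
    s_g = np.maximum(stft_magnitude(generated, fft_size, hop, win_length), eps)
    spectral_convergence = np.linalg.norm(s_t - s_g) / np.linalg.norm(s_t)
    log_magnitude = np.mean(np.abs(np.log(s_t) - np.log(s_g)))
    return spectral_convergence + log_magnitude


def multi_resolution_stft_loss(
    target,
    generated,
    # Illustrative (fft_size, hop, win_length) triples, assumed for this sketch.
    resolutions=((256, 64, 256), (512, 128, 512), (1024, 256, 1024)),
):
    """Average the single-resolution loss over several STFT settings."""
    losses = [stft_loss(target, generated, *res) for res in resolutions]
    return float(np.mean(losses))
```

Using several window sizes at once means no single time-frequency trade-off dominates: short windows penalize timing errors, while long windows penalize errors in fine spectral structure.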

20200418_icassp2020_paralll_wavegan-final.pdf

Final presentation slides (674)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

216 Views

IMPROVING PROSODY WITH LINGUISTIC AND BERT DERIVED FEATURES IN MULTI-SPEAKER BASED MANDARIN CHINESE NEURAL TTS

Recent advances in neural TTS have made “human parity” synthesized speech possible when a large amount of studio-quality training data from a voice talent is available. However, with only limited, casual recordings from an ordinary speaker, human-like TTS remains a major challenge, and the synthesized speech can also exhibit artifacts such as incomplete sentences and repeated words.

Slides_icassp2020_upload.pptx (720)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

82 Views

A HYBRID TEXT NORMALIZATION SYSTEM USING MULTI-HEAD SELF-ATTENTION FOR MANDARIN

In this paper, we propose a hybrid text normalization system using multi-head self-attention. The system combines the advantages of a rule-based model and a neural model for text preprocessing tasks. Previous studies in Mandarin text normalization usually use a set of hand-written rules, which are hard to extend to general cases. Our proposed system is motivated by the neural models from recent studies and achieves better performance on our internal news corpus. This paper also describes different attempts to deal with the imbalanced pattern distribution of the dataset.