Speech Synthesis and Generation, including TTS (SPE-SYNT)

SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

ICASSP2022With UKY template.pdf (194)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

7 Views

DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning

DRVC-presentation.pdf (194)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

12 Views

ROBUST DISENTANGLED VARIATIONAL SPEECH REPRESENTATION LEARNING FOR ZERO-SHOT VOICE CONVERSION

Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers. Good voice conversion quality is obtained by exploring better alignment modules or expressive mapping functions. In this study, we investigate zero-shot VC from a novel perspective of self-supervised disentangled speech representation learning. Specifically, we achieve the disentanglement by balancing the information flow between global speaker representation and time-varying content representation in a sequential variational autoencoder (VAE).

VC-2022icassp.pdf (257)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

27 Views

Unsupervised Word-level Prosody Tagging for Controllable Speech Synthesis

poster.pdf (179)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

9 Views

Recurrent Phase Reconstruction Using Estimated Phase Derivatives from Deep Neural Networks

This paper presents a deep neural network (DNN)-based system for reconstructing the phase of speech signals solely from their magnitude spectrograms. Because the phase is very sensitive to time shifts, it is more meaningful to estimate the phase derivatives than the phase itself, e.g., using DNNs, and then to apply a phase reconstruction method that recombines these estimates into a suitable phase spectrum. In this paper, we propose three changes for such a two-stage phase reconstruction system.
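The sensitivity of phase to time shifts mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the paper's system; it only checks the standard DFT shift property on a random signal, and the signal length and shift value are arbitrary choices made for illustration.

```python
import numpy as np

N = 256                      # signal length (arbitrary, for illustration)
shift = 3                    # circular delay in samples (arbitrary)
rng = np.random.default_rng(0)

x = rng.standard_normal(N)
y = np.roll(x, shift)        # y[n] = x[(n - shift) mod N]

X = np.fft.rfft(x)
Y = np.fft.rfft(y)

# The magnitude spectrum is unchanged by the shift ...
mag_diff = np.max(np.abs(np.abs(X) - np.abs(Y)))

# ... while every bin k picks up a phase rotation of -2*pi*k*shift/N.
k = np.arange(X.shape[0])
expected_rotation = np.exp(-2j * np.pi * k * shift / N)
phase_factor_err = np.max(np.abs(Y - X * expected_rotation))

print(mag_diff, phase_factor_err)
```

Both printed values should be at the level of floating-point rounding error: a small delay leaves the magnitudes untouched but changes the phase of every bin by an amount that grows linearly with frequency, which is one way to see why derivatives of the phase can be a more stable estimation target than the phase itself.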

Presentation_Recurrent_Phase_Reconstruction_Using_Estimated_Phase_Derivatives_from_Deep_Neural_Networks.pdf (277)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

38 Views

Recurrent Phase Reconstruction Using Estimated Phase Derivatives from Deep Neural Networks

Poster_Recurrent_Phase_Reconstruction_Using_Estimated_Phase_Derivatives_from_Deep_Neural_Networks.pdf (314)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

51 Views

MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames

This work is concerned with non-parallel voice conversion. In particular, motivated by the recent advances in mel-spectrogram-based vocoders, we focus on conversions in the mel-spectrogram domain based on CycleGAN. The challenge is how to make the converter able to convert only the voice factors while retaining the linguistic content factors that underlie input mel-spectrograms. To solve this, we propose MaskCycleGAN-VC, which is an extension of CycleGAN-VC2 and is trained using a novel auxiliary task called filling in frames (FIF). With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in missing frames based on surrounding frames. This task allows the converter to learn time-frequency structures in a self-supervised manner. A subjective evaluation of the naturalness and speaker similarity showed that MaskCycleGAN-VC outperformed previous CycleGAN-VCs.
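The masking step of filling in frames (FIF) described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function name `apply_temporal_mask`, the mask length, the random placement of the mask, and the choice to return the mask alongside the masked input are all hypothetical.

```python
import numpy as np

def apply_temporal_mask(mel, mask_len, rng):
    """Zero out a random run of consecutive frames in a mel-spectrogram.

    mel:      array of shape (n_mels, n_frames)
    mask_len: number of consecutive frames to hide (assumed hyperparameter)
    Returns the masked mel-spectrogram and a (1, n_frames) mask,
    where 1 marks a kept frame and 0 marks a missing frame.
    """
    n_mels, n_frames = mel.shape
    mask_len = min(mask_len, n_frames)
    # Pick the first masked frame so the run fits inside the sequence.
    start = rng.integers(0, n_frames - mask_len + 1)
    mask = np.ones((1, n_frames), dtype=mel.dtype)
    mask[:, start:start + mask_len] = 0.0
    return mel * mask, mask

rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 64)).astype(np.float32)  # stand-in for a real mel-spectrogram
masked, mask = apply_temporal_mask(mel, mask_len=16, rng=rng)
```

In a training loop, `masked` would be fed to the converter, which is then encouraged to reproduce plausible content in the zeroed frames from the surrounding ones. How the mask itself is presented to the converter is not specified in the abstract and is left out of this sketch.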

MaskCycleGAN-VC_slides.pdf

Presentation slides (230)

MaskCycleGAN-VC_poster.pdf

Poster (273)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

51 Views

NON-PARALLEL MANY-TO-MANY VOICE CONVERSION BY KNOWLEDGE TRANSFER FROM A TEXT-TO-SPEECH MODEL

5305.pdf (255)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

16 Views

Context-Aware Prosody Correction for Text-Based Speech Editing

Text-based speech editors speed up the editing of speech recordings by allowing intuitive cut, copy, and paste operations on a speech transcript. A major drawback of current systems, however, is that edited recordings often sound unnatural because of prosody mismatches around edited regions. In our work, we propose a new context-aware method for more natural-sounding text-based editing of speech.

icassp-2021-poster.pdf (264)

icassp_2021_context-aware.pdf (239)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

15 Views

Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators

This paper proposes voicing-aware conditional discriminators for Parallel WaveGAN-based waveform synthesis systems. In this framework, we adopt a projection-based conditioning method that can significantly improve the discriminator's performance. Furthermore, the conventional discriminator is separated into two waveform discriminators for modeling voiced and unvoiced speech.
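The general idea behind projection-based conditioning in a discriminator can be sketched as an unconditional score plus an inner product between the discriminator features and a projection of the conditioning features. The sketch below illustrates that general idea only, not the architecture used in the paper; the function name, the per-frame formulation, and all shapes and parameter names (`w`, `V`) are assumptions.

```python
import numpy as np

def projection_logit(h, c, w, V):
    """Per-frame discriminator logits with projection-based conditioning.

    h: (T, D) features computed by the discriminator from the waveform
    c: (T, C) conditioning features (e.g. acoustic features per frame)
    w: (D,)   weights of the unconditional output head
    V: (C, D) learned projection of the conditioning features
    """
    unconditional = h @ w                        # (T,)
    conditional = np.sum((c @ V) * h, axis=1)    # inner product per frame, (T,)
    return unconditional + conditional

rng = np.random.default_rng(0)
T, D, C = 10, 8, 4                               # toy sizes for illustration
h = rng.standard_normal((T, D))
c = rng.standard_normal((T, C))
w = rng.standard_normal(D)
V = rng.standard_normal((C, D))
logits = projection_logit(h, c, w, V)
```

With this form, the conditioning features influence the score only through their alignment with the discriminator features, rather than by being concatenated to the input.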

20210421_icassp2021_pwgvuvd_v4.pdf (293)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

18 Views
