
Scaling NVIDIA’s Multi-Speaker Multi-Lingual TTS Systems with Zero-Shot TTS to Indic Languages

DOI: 10.60864/8abj-ty05
Citation Author(s): Sungwon Kim, Rafael Valle, Bryan Catanzaro
Submitted by: Akshit Arora
Last updated: 17 April 2024 - 2:02am
Document Type: Presentation Slides
Document Year: 2024
Event:
Presenters: Akshit Arora, Sungwon Kim
Paper Code: 11939

In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we use RAD-MMM to perform few-shot TTS by additionally training on 5 minutes of target-speaker data. In Track 3, we use P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3 with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.
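
To illustrate the difference between the two settings described in the abstract, the sketch below shows the general shape of a zero-shot TTS inference pipeline: a short reference clip of the target speaker is encoded into a speaker representation, an acoustic model generates a mel-spectrogram from text conditioned on that representation, and a neural vocoder converts the spectrogram to a waveform. This is a minimal, hypothetical sketch; the class names, shapes, and random placeholder outputs are assumptions for illustration and do not reflect the actual RAD-MMM, P-Flow, or HiFi-GAN implementations used in the submissions.

```python
# Illustrative sketch only: hypothetical stand-ins, not the models used in the paper.
import numpy as np


class SpeakerEncoder:
    """Hypothetical speech-prompt encoder: maps a short reference clip of the
    target speaker to a fixed-size speaker embedding."""

    def __call__(self, reference_audio: np.ndarray) -> np.ndarray:
        # Placeholder: a real encoder would be a trained neural network.
        return np.random.randn(256)


class AcousticModel:
    """Hypothetical flow-based acoustic model: generates a mel-spectrogram
    from text, conditioned on the speaker embedding."""

    def __call__(self, text: str, speaker_embedding: np.ndarray) -> np.ndarray:
        n_frames = max(1, 10 * len(text))      # crude duration placeholder
        return np.random.randn(80, n_frames)   # 80 mel bands x frames


class Vocoder:
    """Hypothetical HiFi-GAN-style vocoder: converts a mel-spectrogram to a
    waveform (256 samples per frame as a stand-in for the hop size)."""

    def __call__(self, mel: np.ndarray) -> np.ndarray:
        return np.random.randn(mel.shape[1] * 256)


def zero_shot_tts(text: str, reference_audio: np.ndarray) -> np.ndarray:
    """Zero-shot pipeline: no fine-tuning on the target speaker; the only
    speaker information comes from the short reference clip."""
    speaker_embedding = SpeakerEncoder()(reference_audio)
    mel = AcousticModel()(text, speaker_embedding)
    return Vocoder()(mel)


if __name__ == "__main__":
    reference = np.random.randn(22050 * 3)  # 3 seconds of dummy reference audio
    waveform = zero_shot_tts("Namaste, this is a test sentence.", reference)
    print(waveform.shape)
```

The few-shot setting of Tracks 1 and 2 differs at the first step: instead of relying only on a reference clip at inference time, the acoustic model is additionally fine-tuned on roughly 5 minutes of target-speaker data before synthesis.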
