Scaling NVIDIA’s Multi-Speaker Multi-Lingual TTS Systems with Zero-Shot TTS to Indic Languages
- DOI: 10.60864/8abj-ty05
- Submitted by: Akshit Arora
- Last updated: 6 June 2024 - 10:23am
- Document Type: Presentation Slides
- Document Year: 2024
- Presenters: Akshit Arora, Sungwon Kim
- Paper Code: 11939
In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we use RAD-MMM to perform few-shot TTS by additionally training on 5 minutes of target-speaker data. In Track 3, we use P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.