
Scaling NVIDIA’s Multi-Speaker Multi-Lingual TTS Systems with Zero-Shot TTS to Indic Languages

DOI: 10.60864/8abj-ty05
Citation Author(s): Sungwon Kim, Rafael Valle, Bryan Catanzaro
Submitted by: Akshit Arora
Last updated: 17 April 2024 - 2:02am
Document Type: Presentation Slides
Document Year: 2024
Event:
Presenters: Akshit Arora, Sungwon Kim
Paper Code: 11939

In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we use RAD-MMM to perform few-shot TTS by additionally training on 5 minutes of target-speaker data. In Track 3, we use P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3 with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.
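
To illustrate the difference between the two settings described in the abstract, the sketch below shows the general shape of a zero-shot TTS inference pipeline: a short reference clip of the target speaker is encoded into a speaker representation, an acoustic model generates a mel-spectrogram from text conditioned on that representation, and a neural vocoder converts the spectrogram to a waveform. This is a minimal, hypothetical sketch; the class names, shapes, and random placeholder outputs are assumptions for illustration and do not reflect the actual RAD-MMM, P-Flow, or HiFi-GAN implementations used in the submissions.

```python
# Illustrative sketch only: hypothetical stand-ins, not the models used in the paper.
import numpy as np


class SpeakerEncoder:
    """Hypothetical speech-prompt encoder: maps a short reference clip of the
    target speaker to a fixed-size speaker embedding."""

    def __call__(self, reference_audio: np.ndarray) -> np.ndarray:
        # Placeholder: a real encoder would be a trained neural network.
        return np.random.randn(256)


class AcousticModel:
    """Hypothetical flow-based acoustic model: generates a mel-spectrogram
    from text, conditioned on the speaker embedding."""

    def __call__(self, text: str, speaker_embedding: np.ndarray) -> np.ndarray:
        n_frames = max(1, 10 * len(text))      # crude duration placeholder
        return np.random.randn(80, n_frames)   # 80 mel bands x frames


class Vocoder:
    """Hypothetical HiFi-GAN-style vocoder: converts a mel-spectrogram to a
    waveform (256 samples per frame as a stand-in for the hop size)."""

    def __call__(self, mel: np.ndarray) -> np.ndarray:
        return np.random.randn(mel.shape[1] * 256)


def zero_shot_tts(text: str, reference_audio: np.ndarray) -> np.ndarray:
    """Zero-shot pipeline: no fine-tuning on the target speaker; the only
    speaker information comes from the short reference clip."""
    speaker_embedding = SpeakerEncoder()(reference_audio)
    mel = AcousticModel()(text, speaker_embedding)
    return Vocoder()(mel)


if __name__ == "__main__":
    reference = np.random.randn(22050 * 3)  # 3 seconds of dummy reference audio
    waveform = zero_shot_tts("Namaste, this is a test sentence.", reference)
    print(waveform.shape)
```

The few-shot setting of Tracks 1 and 2 differs at the first step: instead of relying only on a reference clip at inference time, the acoustic model is additionally fine-tuned on roughly 5 minutes of target-speaker data before synthesis.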
