
In this paper, we describe the model developed by the NLP_POSTECH team for the LIMMITS 2024 Grand Challenge. Among the three tracks, we focus on Track 1, which requires building a few-shot text-to-speech (TTS) system that generates natural speech across diverse languages. To realize multilingual capability, we incorporate a learnable language embedding. In addition, for precise imitation of target speaker voices, we leverage an inductive speaker bias conditioning methodology.
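The learnable language embedding mentioned above can be sketched as a trainable lookup table whose vector is broadcast-added to the encoder states. This is a minimal illustration, not the paper's architecture; the language codes, embedding dimension, and additive conditioning are all assumptions.

```python
import random

# Hypothetical sketch: one trainable vector per language, added to every
# encoder frame. Codes and EMB_DIM are illustrative, not from the paper.
LANGUAGES = ["hi", "te", "bn"]
EMB_DIM = 4

random.seed(0)
lang_table = {lang: [random.gauss(0.0, 0.1) for _ in range(EMB_DIM)]
              for lang in LANGUAGES}

def condition_on_language(encoder_states, lang):
    """Add the language embedding to each encoder frame (broadcast add)."""
    emb = lang_table[lang]
    return [[h + e for h, e in zip(frame, emb)] for frame in encoder_states]

states = [[0.0] * EMB_DIM, [1.0] * EMB_DIM]   # two dummy encoder frames
out = condition_on_language(states, "hi")
```

In a real model the table would be a trainable parameter (e.g., an embedding layer) updated jointly with the TTS network.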


Diffusion-based generative models have recently gained attention in speech enhancement (SE), providing an alternative to conventional supervised methods. These models transform clean speech training samples into Gaussian noise, usually centered on noisy speech, and subsequently learn a parameterized reverse process.
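The forward corruption process described above can be sketched as follows: the clean signal is interpolated toward the noisy recording while Gaussian noise of growing scale is injected, so the terminal distribution is a Gaussian centered on the noisy speech. The linear schedule and `sigma_max` below are illustrative assumptions, not the abstract's exact formulation.

```python
import random

# Sketch of a diffusion forward process for SE: sample
#   x_t ~ N((1 - t) * clean + t * noisy, (t * sigma_max)^2)
# elementwise, for t in [0, 1]. Schedule is an assumption for illustration.
random.seed(0)

def perturb(clean, noisy, t, sigma_max=0.5):
    assert 0.0 <= t <= 1.0
    return [(1 - t) * c + t * n + t * sigma_max * random.gauss(0.0, 1.0)
            for c, n in zip(clean, noisy)]

clean = [0.2, -0.1, 0.4]
noisy = [0.5, 0.3, 0.1]
x0 = perturb(clean, noisy, 0.0)   # t = 0 recovers the clean signal exactly
```

The enhancement model is then trained to reverse this process, starting from a sample near the noisy speech and iteratively denoising toward clean speech.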



The vast majority of approaches to speaker anonymization involve extracting fundamental frequency estimates, linguistic features, and a speaker embedding that is perturbed to obfuscate the speaker identity before an anonymized speech waveform is resynthesized using a vocoder.
Recent work has shown that x-vector transformations are difficult to control consistently: other sources of speaker information contained in the fundamental frequency and linguistic features are re-entangled upon vocoding, meaning that anonymized speech signals still contain speaker information.
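The embedding-perturbation step in the pipeline above is often realized by replacing the source x-vector with an average of dissimilar embeddings drawn from an external pool. The sketch below illustrates that common strategy; the pool, the cosine distance, and the choice of the two farthest vectors are assumptions, not this paper's exact recipe.

```python
import math

# Illustrative x-vector anonymization: swap the source embedding for the
# mean of the least-similar pool embeddings (a common baseline strategy).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def anonymize_xvector(src, pool, n_farthest=2):
    ranked = sorted(pool, key=lambda v: cosine(src, v))  # least similar first
    chosen = ranked[:n_farthest]
    dim = len(src)
    return [sum(v[i] for v in chosen) / len(chosen) for i in range(dim)]

src = [1.0, 0.0]
pool = [[0.9, 0.1], [-1.0, 0.05], [0.0, 1.0]]
anon = anonymize_xvector(src, pool)
```

The finding quoted above is precisely that this step alone is insufficient: speaker cues surviving in F0 and linguistic features leak back in at vocoding time.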


We explore contextual biasing with Large Language Models (LLMs) to enhance Automatic Speech Recognition (ASR) in second-pass rescoring. Our approach introduces prompting the LLM during rescoring, without the need for fine-tuning. The prompts incorporate a biasing list and a set of few-shot examples, serving as supplementary sources of information when evaluating the hypothesis score. Furthermore, we introduce multi-task training for LLMs to predict both the entity class and the subsequent token.
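A minimal sketch of this rescoring setup: build a prompt from the biasing list and few-shot examples, score each first-pass hypothesis with the LLM, and interpolate with the first-pass score. `llm_logprob` here is a toy stand-in stub, and the prompt format, interpolation weight, and heuristic are all assumptions for illustration; a real system would query an actual LLM.

```python
# Sketch of LLM second-pass rescoring with a biasing prompt.

def build_prompt(biasing_list, few_shot, hypothesis):
    examples = "\n".join(few_shot)
    entities = ", ".join(biasing_list)
    return f"Relevant entities: {entities}\n{examples}\nTranscript: {hypothesis}"

def llm_logprob(prompt):
    # Hypothetical stub: reward hypotheses that contain a prompted entity.
    header = prompt.split("\n", 1)[0]
    hyp = prompt.rsplit("Transcript: ", 1)[1]
    entities = header.split(": ", 1)[1].split(", ")
    return 0.0 if any(e in hyp for e in entities) else -5.0

def rescore(nbest, biasing_list, few_shot, lam=0.5):
    """Pick the hypothesis maximizing lam*LLM + (1-lam)*first-pass score."""
    scored = []
    for hyp, first_pass in nbest:
        lp = llm_logprob(build_prompt(biasing_list, few_shot, hyp))
        scored.append((lam * lp + (1 - lam) * first_pass, hyp))
    return max(scored)[1]

nbest = [("call joe biden", -1.0), ("call joe bidden", -0.8)]
best = rescore(nbest, ["joe biden"], ["Transcript: text my mom"])
```

The biasing list steers the second pass toward in-domain entities that the first-pass model may have misrecognized, even though the LLM itself is never fine-tuned.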


Self-supervised learning (SSL) of speech with discrete tokenization (pseudo-labels), while demonstrating performance improvements in low-resource speech recognition, has faced challenges in achieving context-invariant and noise-robust representations. In this paper, we propose a self-supervised framework based on a contrastive loss over pseudo-labels obtained from an offline k-means quantizer (tokenizer). We refer to the proposed setting as pseudo-con.
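A contrastive loss over pseudo-labels can be sketched in a supervised-contrastive style: frames sharing the same offline k-means token act as positives, all other frames as negatives. The temperature, similarity function, and averaging below are illustrative assumptions, not the paper's exact objective.

```python
import math

# Sketch: InfoNCE-style loss where positives are frames with the same
# k-means pseudo-label. Details are illustrative, not pseudo-con's exact form.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pseudo_label_contrastive_loss(feats, labels, tau=0.1):
    n, total = len(feats), 0.0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        denom = sum(math.exp(dot(feats[i], feats[j]) / tau)
                    for j in range(n) if j != i)
        for j in pos:
            total += -math.log(math.exp(dot(feats[i], feats[j]) / tau) / denom)
    return total / n

feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = [0, 0, 1]          # k-means cluster ids (pseudo-labels)
loss = pseudo_label_contrastive_loss(feats, labels)
```

Pulling together frames that the offline quantizer assigns to the same token encourages representations that are stable across context and noise, which is the stated motivation above.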


The mismatch between an external language model (LM) and the implicitly learned internal LM (ILM) of the RNN-Transducer (RNN-T) can limit the performance of LM integration methods such as simple shallow fusion. A Bayesian interpretation suggests removing this sequence prior as ILM correction. In this work, we study various ILM correction-based LM integration methods formulated in a common RNN-T framework. We provide a decoding-time interpretation of two major reasons for the performance improvement with ILM correction, which we further verify experimentally with detailed analysis.
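The ILM correction described above amounts to subtracting a weighted internal-LM score from the shallow-fusion score, removing the transducer's implicit sequence prior before the external LM is applied. The weights and scores below are illustrative; how the ILM score is estimated (e.g., from the prediction network without acoustics) varies across the methods the paper compares.

```python
# Sketch of ILM-corrected shallow fusion for RNN-T hypothesis scoring:
#   score = log P_RNNT + lam_ext * log P_LM - lam_ilm * log P_ILM
# Weights and log-probabilities are illustrative values.

def fused_score(log_p_rnnt, log_p_ext_lm, log_p_ilm,
                lam_ext=0.6, lam_ilm=0.4):
    return log_p_rnnt + lam_ext * log_p_ext_lm - lam_ilm * log_p_ilm

# A hypothesis the internal LM already favors (log_p_ilm close to 0) is
# penalized relative to plain shallow fusion; one it disfavors is boosted.
s_sf  = fused_score(-2.0, -1.0, 0.0, lam_ilm=0.0)   # plain shallow fusion
s_ilm = fused_score(-2.0, -1.0, -0.5)               # with ILM correction
```

Subtracting the internal prior lets the external LM act on something closer to a pure acoustic likelihood, which is one intuition behind the performance gains analyzed in the paper.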

