Self-supervised learning (SSL) has shown promise in learning representations of audio that are useful for automatic speech recognition (ASR). However, training SSL models like wav2vec 2.0 requires a two-stage pipeline. In this paper we demonstrate single-stage training of ASR models that can utilize both unlabeled and labeled data. During training, we alternately minimize two losses: an unsupervised masked Contrastive Predictive Coding (CPC) loss and the supervised Connectionist Temporal Classification (CTC) audio-to-text alignment loss.
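
A minimal sketch of the alternating scheme, not the authors' code: a toy encoder (shapes, masking strategy, and the simple stop-gradient contrastive target are assumptions) is updated with a masked contrastive loss on unlabeled batches and a CTC loss on labeled batches, in alternating steps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=32):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab)   # logits for the CTC loss
        self.proj = nn.Linear(hidden, hidden)      # projection for the contrastive loss

    def forward(self, x):
        h, _ = self.rnn(x)
        return h

def masked_contrastive_loss(enc, feats, mask_prob=0.15, temperature=0.1):
    """CPC-style masked contrastive loss: predict the clean frame at masked positions."""
    with torch.no_grad():
        targets = enc(feats)                        # "clean" frame representations (stop-grad)
    mask = torch.rand(feats.shape[:2]) < mask_prob  # (B, T) boolean mask
    corrupted = feats.clone()
    corrupted[mask] = 0.0                            # simple zero-masking of input frames
    context = enc.proj(enc(corrupted))
    loss, count = 0.0, 0
    for b in range(feats.size(0)):
        idx = mask[b].nonzero(as_tuple=True)[0]
        if len(idx) == 0:
            continue
        # score each masked context vector against all frames of the utterance (negatives)
        logits = context[b, idx] @ targets[b].T / temperature
        loss = loss + F.cross_entropy(logits, idx)
        count += 1
    return loss / max(count, 1)

enc = Encoder()
opt = torch.optim.Adam(enc.parameters(), lr=1e-4)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

for step in range(10):
    if step % 2 == 0:                                # unlabeled batch -> contrastive loss
        feats = torch.randn(4, 100, 80)
        loss = masked_contrastive_loss(enc, feats)
    else:                                            # labeled batch -> CTC loss
        feats = torch.randn(4, 100, 80)
        labels = torch.randint(1, 32, (4, 20))
        log_probs = enc.ctc_head(enc(feats)).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        loss = ctc(log_probs, labels,
                   torch.full((4,), 100), torch.full((4,), 20))
    opt.zero_grad(); loss.backward(); opt.step()
```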

Acoustic differences between children’s and adults’ speech degrade automatic speech recognition performance when a system trained on adults’ speech is tested on children’s speech. The key acoustic mismatch factors are formant frequencies, speaking rate, and pitch. In this paper, we propose a linear prediction based spectral warping method that uses knowledge of vowel and non-vowel regions in the speech signal to mitigate the formant frequency differences between child and adult speakers.
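
A minimal sketch of linear-prediction-based formant warping on a single frame (the warp factor, frame construction, and per-frame treatment are assumptions, not the authors' exact procedure): the LPC pole angles, which track formant frequencies, are scaled for frames marked as vowel regions, and the warped spectral envelope is compared with the original.

```python
import numpy as np
import librosa
from scipy.signal import freqz

def warp_frame_envelope(frame, order=12, alpha=0.88, sr=16000):
    """Return the frequency axis plus original and formant-warped LPC envelopes of one frame."""
    a = librosa.lpc(frame, order=order)              # LPC coefficients, a[0] == 1
    poles = np.roots(a)
    warped_poles = []
    for p in poles:
        ang, mag = np.angle(p), np.abs(p)
        warped_poles.append(mag * np.exp(1j * ang * alpha))   # compress formant frequencies
    a_warped = np.real(np.poly(warped_poles))
    w, h_orig = freqz([1.0], a, worN=512, fs=sr)
    _, h_warp = freqz([1.0], a_warped, worN=512, fs=sr)
    return w, np.abs(h_orig), np.abs(h_warp)

# toy usage on a synthetic vowel-like frame (two sinusoids plus noise)
sr = 16000
t = np.arange(0, 0.025, 1 / sr)
frame = (np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
         + 0.01 * np.random.randn(len(t))).astype(np.float32)
freqs, env, env_warped = warp_frame_envelope(frame)
print(freqs[np.argmax(env)], freqs[np.argmax(env_warped)])   # dominant formant shifts down
```

Non-vowel frames would simply be passed through unwarped, which is where the vowel/non-vowel region knowledge enters.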

In this paper, we study zero-shot learning in audio classification through factored linear and nonlinear acoustic-semantic projections between audio instances and sound classes. Zero-shot learning in audio classification refers to recognizing audio instances of sound classes for which no training data are available, only semantic side information.
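
A minimal sketch of the projection idea under assumed dimensions and random stand-in data (not the paper's model or features): audio embeddings and class semantic embeddings are mapped into a shared space, the model is trained on seen classes, and an unseen class is predicted by the highest compatibility score with its semantic embedding alone.

```python
import torch
import torch.nn as nn

class FactoredProjection(nn.Module):
    def __init__(self, audio_dim=128, sem_dim=300, shared_dim=64, nonlinear=False):
        super().__init__()
        act = nn.Tanh() if nonlinear else nn.Identity()
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, shared_dim), act)
        self.sem_proj = nn.Sequential(nn.Linear(sem_dim, shared_dim), act)

    def forward(self, audio_emb, class_emb):
        # compatibility score matrix: (n_audio, n_classes)
        return self.audio_proj(audio_emb) @ self.sem_proj(class_emb).T

model = FactoredProjection(nonlinear=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# training on seen classes (random tensors stand in for audio / word embeddings)
seen_class_emb = torch.randn(10, 300)
for _ in range(100):
    audio = torch.randn(32, 128)
    labels = torch.randint(0, 10, (32,))
    loss = loss_fn(model(audio, seen_class_emb), labels)
    opt.zero_grad(); loss.backward(); opt.step()

# zero-shot inference: rank unseen classes using only their semantic embeddings
unseen_class_emb = torch.randn(5, 300)
test_audio = torch.randn(4, 128)
print(model(test_audio, unseen_class_emb).argmax(dim=1))
```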

To improve device robustness, a key feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system relies on an ad-hoc score combination of two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finer-grained classes.
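
A minimal sketch of the two-stage idea with toy CNNs and a hypothetical fine-to-broad class mapping (the actual score combination and architectures in the paper differ): fine-class probabilities are reweighted by the probability of their broad parent class before the final decision.

```python
import torch
import torch.nn as nn

def small_cnn(n_classes):
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(), nn.Linear(16, n_classes))

broad_cnn = small_cnn(3)     # e.g. indoor / outdoor / transportation (assumed grouping)
fine_cnn = small_cnn(10)     # ten acoustic scene classes

# assumed mapping from each fine-grained scene to its broad parent class
fine_to_broad = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

def classify(log_mel):                                  # log_mel: (B, 1, mels, frames)
    p_broad = broad_cnn(log_mel).softmax(dim=1)         # (B, 3)
    p_fine = fine_cnn(log_mel).softmax(dim=1)           # (B, 10)
    combined = p_fine * p_broad[:, fine_to_broad]       # reweight by parent-class score
    return combined.argmax(dim=1)

print(classify(torch.randn(2, 1, 64, 128)))
```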

Neural text-to-speech (TTS) approaches generally require a large amount of high-quality speech data, which makes it difficult to obtain such a dataset with additional emotion labels. In this paper, we propose a novel approach for emotional TTS synthesis on a TTS dataset without emotion labels. Specifically, our proposed method consists of a cross-domain speech emotion recognition (SER) model and an emotional TTS model. Firstly, we train the cross-domain SER model on both SER and TTS datasets.
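
A minimal sketch of the labeling step under assumptions (a toy SER model, a hypothetical four-emotion inventory, and a soft-embedding conditioning scheme that may differ from the paper's): the cross-domain SER model assigns emotion posteriors to each unlabeled TTS utterance, and those posteriors are turned into an emotion conditioning vector for the emotional TTS model.

```python
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry"]          # assumed emotion inventory

class ToySER(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, n_emotions=len(EMOTIONS)):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, feats):                            # feats: (B, T, feat_dim)
        _, h = self.rnn(feats)
        return self.head(h[-1])                          # (B, n_emotions) logits

ser = ToySER()                                           # assume already trained on SER + TTS domains
emotion_table = nn.Embedding(len(EMOTIONS), 64)          # emotion embeddings used by the TTS model

def emotion_condition(feats):
    with torch.no_grad():
        post = ser(feats).softmax(dim=1)                 # soft emotion labels, (B, 4)
    # soft mixture of emotion embeddings, passed to the TTS decoder as a global condition
    return post @ emotion_table.weight                   # (B, 64)

print(emotion_condition(torch.randn(2, 200, 80)).shape)
```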

We propose a complex-valued deep neural network (cDNN) for speech enhancement and source separation. While existing end-to-end systems use complex-valued gradients to pass the training error to a real-valued DNN used for gain mask estimation, we exploit the full potential of complex-valued LSTMs, MLPs, and activation functions to estimate complex-valued beamforming weights directly from complex-valued microphone array data. By doing so, our cDNN is able to locate and track different moving sources by exploiting the phase information in the data.
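
A minimal sketch of the core idea with toy complex layers and an assumed four-microphone setup (not the paper's architecture): a complex-valued network maps the multichannel complex STFT directly to complex beamforming weights, which are applied as y = w^H x per time-frequency bin.

```python
import torch
import torch.nn as nn

class ComplexLinear(nn.Module):
    """Complex affine layer built from two real Linear layers."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.re = nn.Linear(d_in, d_out)
        self.im = nn.Linear(d_in, d_out, bias=False)

    def forward(self, z):
        # (a + jb)(W_r + jW_i) expanded into real matrix products
        return torch.complex(self.re(z.real) - self.im(z.imag),
                             self.re(z.imag) + self.im(z.real))

def split_tanh(z):
    # simple complex activation: tanh applied to real and imaginary parts separately
    return torch.complex(torch.tanh(z.real), torch.tanh(z.imag))

class WeightEstimator(nn.Module):
    def __init__(self, n_mics=4, hidden=64):
        super().__init__()
        self.l1 = ComplexLinear(n_mics, hidden)
        self.l2 = ComplexLinear(hidden, n_mics)

    def forward(self, stft):                          # stft: (freq, time, n_mics), complex
        w = self.l2(split_tanh(self.l1(stft)))        # per-bin complex beamforming weights
        return (w.conj() * stft).sum(dim=-1)          # y = w^H x -> enhanced single-channel STFT

net = WeightEstimator()
noisy = torch.randn(257, 100, 4, dtype=torch.cfloat)  # toy 4-mic STFT
print(net(noisy).shape)                               # torch.Size([257, 100])
```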
