- Transducers
- Spatial and Multichannel Audio
- Source Separation and Signal Enhancement
- Room Acoustics and Acoustic System Modeling
- Network Audio
- Audio for Multimedia
- Audio Processing Systems
- Audio Coding
- Audio Analysis and Synthesis
- Active Noise Control
- Auditory Modeling and Hearing Aids
- Bioacoustics and Medical Acoustics
- Music Signal Processing
- Loudspeaker and Microphone Array Signal Processing
- Echo Cancellation
- Content-Based Audio Processing
- Continuous Speech Separation with Conformer
- Zero-Shot Audio Classification with Factored Linear and Nonlinear Acoustic-Semantic Projections
In this paper, we study zero-shot learning in audio classification, which aims to recognize audio instances of sound classes for which no training data are available, only semantic side information. We address this problem through factored linear and nonlinear acoustic-semantic projections between audio instances and sound classes.
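The projection idea can be illustrated as a bilinear compatibility score between an audio embedding and each class's semantic embedding; the sketch below is a minimal illustration under assumed dimensions and scoring rule, not the paper's exact model.

```python
import numpy as np

def zero_shot_classify(audio_emb, class_embs, W):
    """Score each sound class by the bilinear compatibility
    audio_emb @ W @ class_emb and return the best class index.
    W plays the role of a learned acoustic-semantic projection
    (here a single linear factor, purely for illustration)."""
    scores = audio_emb @ W @ class_embs.T  # one score per unseen class
    return int(np.argmax(scores))

# Toy example: 2-D acoustic space, 2-D semantic space, identity projection.
W = np.eye(2)
class_embs = np.array([[1.0, 0.0],   # semantic embedding of class 0
                       [0.0, 1.0]])  # semantic embedding of class 1
audio_emb = np.array([0.9, 0.2])     # closer to class 0's embedding
print(zero_shot_classify(audio_emb, class_embs, W))  # -> 0
```

At test time no audio from the unseen classes is needed: only their semantic embeddings enter the score.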
- A Two-Stage Approach to Device-Robust Acoustic Scene Classification
To improve device robustness, a key feature of any competitive data-driven acoustic scene classification (ASC) system, we propose a novel two-stage system based on fully convolutional neural networks (CNNs). Our two-stage system leverages an ad-hoc score combination of two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finer-grained classes.
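One simple way to realize such a score combination is multiplicative fusion of the two classifiers' outputs; the 3-to-10 class mapping and the fusion rule below are assumptions for illustration, not necessarily the paper's ad-hoc scheme.

```python
import numpy as np

# Hypothetical mapping from the ten fine-grained scene classes to the
# three broad classes (e.g. indoor / outdoor / transportation).
FINE_TO_COARSE = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

def combine_scores(coarse_probs, fine_probs, fine_to_coarse=FINE_TO_COARSE):
    """Fuse the two CNN outputs: weight each fine-class probability by
    the probability its broad parent class received, then renormalize."""
    parent = np.asarray(fine_to_coarse)
    fused = np.asarray(fine_probs) * np.asarray(coarse_probs)[parent]
    return fused / fused.sum()

coarse = np.array([0.1, 0.1, 0.8])  # stage-1 CNN over 3 broad classes
fine = np.full(10, 0.1)             # stage-2 CNN, uniform over 10 classes
fused = combine_scores(coarse, fine)
print(int(np.argmax(fused)))        # a fine class under broad class 2
```

The coarse classifier thus acts as a prior that sharpens the fine-grained decision.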
- Emotion Controllable Speech Synthesis Using Emotion-Unlabeled Dataset with the Assistance of Cross-Domain Speech Emotion Recognition
Neural text-to-speech (TTS) approaches generally require large amounts of high-quality speech data, which makes it difficult to obtain such a dataset with additional emotion labels. In this paper, we propose a novel approach to emotional TTS synthesis on a TTS dataset without emotion labels. Specifically, our proposed method consists of a cross-domain speech emotion recognition (SER) model and an emotional TTS model. First, we train the cross-domain SER model on both the SER and TTS datasets.
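The step that connects the two models, attaching emotion labels predicted by the cross-domain SER model to the unlabeled TTS utterances, can be sketched as a pseudo-labeling pass; the `ser_model.predict` interface below is a hypothetical stand-in, not the paper's API.

```python
def pseudo_label(tts_utterances, ser_model):
    """Label each unlabeled TTS utterance with the emotion predicted by
    the cross-domain SER model, so the TTS model can later be trained
    with emotion conditioning. `ser_model` is a hypothetical interface."""
    return [(utt, ser_model.predict(utt)) for utt in tts_utterances]

# Minimal stand-in SER model, purely for illustration.
class DummySER:
    def predict(self, utterance):
        return "happy" if "!" in utterance else "neutral"

labeled = pseudo_label(["great news!", "the meeting is at noon"], DummySER())
print(labeled)
# [('great news!', 'happy'), ('the meeting is at noon', 'neutral')]
```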
- ICASSP 2019 presentation slides
We propose a complex-valued deep neural network (cDNN) for speech enhancement and source separation. While existing end-to-end systems use complex-valued gradients to pass the training error to a real-valued DNN used for gain mask estimation, we use the full potential of complex-valued LSTMs, MLPs and activation functions to estimate complex-valued beamforming weights directly from complex-valued microphone array data. By doing so, our cDNN is able to locate and track different moving sources by exploiting the phase information in the data.
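A fully complex-valued layer and the filter-and-sum step it feeds can be sketched as follows; the split-tanh activation and the layer shapes are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def complex_dense(x, W, b):
    """One fully complex-valued dense layer with a split-type activation
    (tanh applied separately to the real and imaginary parts), a common
    choice of nonlinearity in complex-valued networks."""
    z = W @ x + b
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

def beamform(w, x):
    """Apply estimated beamforming weights to one frequency-domain
    microphone snapshot: y = w^H x (filter and sum)."""
    return np.vdot(w, x)  # np.vdot conjugates its first argument

# Toy check: uniform weights on identical in-phase signals pass unchanged.
mics = np.ones(4, dtype=complex)        # 4-channel snapshot at one bin
weights = np.full(4, 0.25 + 0j)         # could come from complex_dense
print(beamform(weights, mics))          # -> (1+0j)
```

Because the weights stay complex throughout, both the magnitude and the inter-channel phase of the array data inform the estimate, which is what lets the network exploit spatial cues.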
- Time-Frequency Feature Decomposition Based on Sound Duration for Acoustic Scene Classification
- An Attention Enhanced Multi-Task Model for Objective Speech Assessment in Real-World Environments