We propose a novel adversarial multi-task learning scheme that actively curtails inter-talker feature variability while maximizing senone discriminability, so as to enhance the performance of a deep neural network (DNN)-based ASR system. We call the scheme speaker-invariant training (SIT). In SIT, a DNN acoustic model and a speaker classifier network are jointly optimized to minimize the senone (tied triphone state) classification loss and simultaneously mini-maximize the speaker classification loss.
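
The mini-max objective is commonly realized with a gradient reversal layer between the shared acoustic model and the speaker classifier; the sketch below illustrates that general pattern in PyTorch. The layer sizes, the lambda weight, and the name SITModel are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: adversarial speaker-invariant training via gradient reversal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Reverse (and scale) the gradient flowing back into the shared encoder.
        return -ctx.lam * grad_out, None

class SITModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, n_senones=3000, n_speakers=100, lam=0.5):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.senone_head = nn.Linear(hidden, n_senones)    # loss to minimize
        self.speaker_head = nn.Linear(hidden, n_speakers)  # loss to mini-maximize

    def forward(self, x):
        h = self.encoder(x)
        senone_logits = self.senone_head(h)
        speaker_logits = self.speaker_head(GradReverse.apply(h, self.lam))
        return senone_logits, speaker_logits

# One joint update: both heads minimize their own cross-entropy, while the
# reversed gradient pushes the encoder to maximize the speaker loss, thereby
# suppressing speaker cues in the shared representation.
model = SITModel()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
feats = torch.randn(8, 40)
senone_lbl = torch.randint(0, 3000, (8,))
spk_lbl = torch.randint(0, 100, (8,))
senone_logits, spk_logits = model(feats)
loss = nn.functional.cross_entropy(senone_logits, senone_lbl) + \
       nn.functional.cross_entropy(spk_logits, spk_lbl)
opt.zero_grad(); loss.backward(); opt.step()
```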

Teacher-student (T/S) learning has been shown to be effective in unsupervised domain adaptation [ts_adapt]. It is a form of transfer learning, not in terms of the transfer of recognition decisions, but of the knowledge of posterior probabilities in the source domain as evaluated by the teacher model. It learns to handle the speaker and environment variability inherent in, and restricted to, the speech signal in the target domain, without proactively addressing robustness to other likely conditions. Performance degradation may thus ensue.
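
As a rough illustration of the posterior-transfer idea, the sketch below trains a student to match a teacher's senone posteriors (soft labels) via KL divergence on parallel target-domain frames. The shapes, the temperature parameter, and the name ts_loss are assumptions for illustration, not the cited paper's exact recipe.

```python
# Minimal sketch of T/S adaptation: match teacher posteriors with KL divergence.
import torch
import torch.nn.functional as F

def ts_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over senone posteriors, averaged over frames."""
    teacher_post = F.softmax(teacher_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_post, reduction="batchmean")

# Example: teacher logits come from source-domain (e.g. clean) features, student
# logits from the parallel target-domain (e.g. noisy) features of the same frames.
teacher_logits = torch.randn(32, 3000)                       # frozen teacher outputs
student_logits = torch.randn(32, 3000, requires_grad=True)   # student outputs
loss = ts_loss(student_logits, teacher_logits)
loss.backward()
```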

Continuous prediction of dimensional emotions (e.g. arousal and valence) has attracted increasing research interest recently. When processing emotional speech signals, phonetic features have been rarely used due to the assumption that phonetic variability is a confounding factor that degrades emotion recognition/prediction performance. In this paper, instead of eliminating phonetic variability, we investigated whether Phone Log-Likelihood Ratio (PLLR) features could be used to index arousal and valence in a pairwise low/high framework.
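
For orientation, PLLR features are usually obtained by applying a logit transform to a phone recognizer's frame-level posteriors; the short NumPy sketch below shows that standard transform. The toy posterior matrix is illustrative and not taken from the paper's data.

```python
# Minimal sketch: Phone Log-Likelihood Ratio (PLLR) features from phone posteriors.
import numpy as np

def pllr(phone_posteriors, eps=1e-10):
    """Per-frame, per-phone log-likelihood ratios: log(p / (1 - p))."""
    p = np.clip(phone_posteriors, eps, 1.0 - eps)   # avoid log(0) and division by zero
    return np.log(p / (1.0 - p))

# frames x phones posteriors (rows sum to 1), e.g. output of a phone classifier
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])
features = pllr(post)        # same shape, now unbounded log-ratio features
print(features)
```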

Automatic emotion recognition from speech is a challenging task that relies heavily on the effectiveness of the speech features used for classification. In this work, we study the use of deep learning to automatically discover emotionally relevant features from speech. It is shown that, using a deep recurrent neural network, we can learn both the short-time frame-level acoustic features that are emotionally relevant and an appropriate temporal aggregation of those features into a compact utterance-level representation.
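
The sketch below illustrates the general pattern described: a recurrent network produces frame-level representations that are pooled into one utterance-level vector for emotion classification. The layer sizes, the mean-pooling choice, and the name EmotionRNN are assumptions, not the paper's exact architecture.

```python
# Minimal sketch: frame-level RNN features + temporal aggregation to utterance level.
import torch
import torch.nn as nn

class EmotionRNN(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_emotions=4):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, frames):              # frames: (batch, time, feat_dim)
        frame_feats, _ = self.rnn(frames)   # emotionally relevant frame-level features
        utt_vec = frame_feats.mean(dim=1)   # temporal aggregation to a compact utterance vector
        return self.classifier(utt_vec)

model = EmotionRNN()
logits = model(torch.randn(2, 300, 40))    # 2 utterances, 300 frames each
print(logits.shape)                        # torch.Size([2, 4])
```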

Semantic role labeling (SRL) is the task of assigning semantic role labels to sentence elements. This paper describes the initial development of an Indonesian semantic role labeling system and its application to extracting event information from Tweets. We compare two feature types when designing the SRL systems: Word-to-Word and Phrase-to-Phrase. Our experiments showed that the Word-to-Word feature approach outperforms the Phrase-to-Phrase approach. Applying the SRL system to an event extraction problem resulted in an overlap-based accuracy of 0.94 for actor identification.
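
The abstract does not define its overlap-based accuracy, so the sketch below shows one common token-overlap formulation purely as an illustration; the scoring function and example strings are assumptions, not the paper's metric.

```python
# Minimal sketch of a token-overlap score between a predicted and a gold span.
def overlap_score(predicted_span, gold_span):
    """Fraction of gold-span tokens covered by the predicted span."""
    pred = set(predicted_span.lower().split())
    gold = set(gold_span.lower().split())
    return len(pred & gold) / len(gold) if gold else 0.0

# Example: predicted actor vs. annotated actor in a Tweet
print(overlap_score("presiden joko widodo", "joko widodo"))   # 1.0
print(overlap_score("joko", "joko widodo"))                   # 0.5
```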

Natural and synthesized speech in L2 Mandarin produced by American English learners was evaluated by native Mandarin speakers to identify focus status and rate the naturalness of the speech. The results reveal that natural speech was recognized and rated better than synthesized speech, early learners’ speech better than late learners’ speech, focused sentences better than no-focus sentences, and initial focus and medial focus better than final focus. Tones of in-focus words interacted with focus status of the sentence and speaker group.
