- Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)
- General Topics in Speech Recognition (SPE-GASR)
- Large Vocabulary Continuous Recognition/Search (SPE-LVCR)
- Lexical Modeling and Access (SPE-LEXI)
- Multilingual Recognition and Identification (SPE-MULT)
- Resource Constrained Speech Recognition (SPE-RCSR)
- Robust Speech Recognition (SPE-ROBU)
- Speaker Recognition and Characterization (SPE-SPKR)
- Speech Adaptation/Normalization (SPE-ADAP)
- Speech Analysis (SPE-ANLS)
- Speech Coding (SPE-CODI)
- Speech Enhancement (SPE-ENHA)
- Speech Perception and Psychoacoustics (SPE-SPER)
- Speech Production (SPE-SPRD)
- Speech Synthesis and Generation, including TTS (SPE-SYNT)
- Partially Fake Audio Detection by Self-attention-based Fake Span Discovery
The past few years have witnessed significant advances in speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of widely deployed biometric identification models and can be exploited by in-the-wild attackers for illegal purposes. The ASVspoof challenge mainly focuses on audio synthesized by advanced speech synthesis and voice conversion models, and on replay attacks. Recently, the first Audio Deep Synthesis Detection challenge (ADD 2022) extended the attack scenarios to cover further aspects.
- UNIVERSAL PARALINGUISTIC SPEECH REPRESENTATIONS USING SELF-SUPERVISED CONFORMERS - ICASSP 2022 Poster
- Universal Paralinguistic Speech Representations using Self-Supervised Conformers - ICASSP 2022 slides
- Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling
A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions: a) phonemes can be recognized from speech with high accuracy, and b) there is a single correct way to pronounce a sentence. These assumptions do not always hold, which can result in a significant number of false mispronunciation alarms.
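The baseline comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the function name `flag_mispronunciations`, the ARPAbet-style phoneme labels, and the use of Python's `difflib.SequenceMatcher` for alignment are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def flag_mispronunciations(expected, recognized):
    """Align two phoneme sequences and return every non-matching segment.

    Hypothetical helper: each result is (operation, expected_segment,
    recognized_segment), where operation is 'replace', 'delete' or 'insert'.
    """
    # autojunk=False disables difflib's popularity heuristic, which is
    # meant for long text and is not wanted for short phoneme sequences.
    matcher = SequenceMatcher(a=expected, b=recognized, autojunk=False)
    issues = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            issues.append((op, expected[i1:i2], recognized[j1:j2]))
    return issues


# Assumed example: the word "this" with the first consonant substituted.
expected = ["DH", "IH", "S"]    # single reference pronunciation (assumption b)
recognized = ["D", "IH", "S"]   # phoneme recognizer output, taken as correct (assumption a)
print(flag_mispronunciations(expected, recognized))
```

The sketch inherits both simplifying assumptions named above: any recognizer error or any valid alternative pronunciation would be reported as a mispronunciation.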
- PRE-TRAINING TRANSFORMER DECODER FOR END-TO-END ASR MODEL WITH UNPAIRED TEXT DATA
- Speech Emotion Recognition based on Listener Adaptive Models
- Have You Made A Decision? Where? A Pilot Study on Interpretability of Polarity Analysis Based on Advising Problem
General approaches to polarity analysis in dialogue, e.g. Multiple Instance Learning (MIL), have achieved significant progress.
However, one significant drawback of current approaches is that the contribution of each utterance to the overall polarity remains a black box.
For existing methods, the polarity contained in each utterance, which we call meta-polarity, is not explicitly utilized.
In this paper, we study the problem of adding interpretability to the overall polarity by predicting the meta-polarity at the same time.
- REDAT: ACCENT-INVARIANT REPRESENTATION FOR END-TO-END ASR BY DOMAIN ADVERSARIAL TRAINING WITH RELABELING
Accent mismatch is a critical problem for end-to-end ASR. This paper addresses it by building an accent-robust RNN-T system with domain adversarial training (DAT). We explain why DAT works and provide, for the first time, a theoretical guarantee that DAT learns accent-invariant representations. We also prove that performing gradient reversal in DAT is equivalent to minimizing the Jensen-Shannon divergence between domain output distributions.
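For readers unfamiliar with the quantity named in this abstract, the sketch below computes the Jensen-Shannon divergence between two discrete distributions using its standard definition. It does not reproduce the paper's proof or training procedure; the function names and the example distributions are assumptions for illustration only.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats for discrete distributions.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical accent-classifier outputs for two accent domains.
same = js_divergence([0.5, 0.5], [0.5, 0.5])      # identical distributions
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])  # fully separable domains
print(same, disjoint, math.log(2))
```

The divergence is zero when the two distributions coincide and reaches its maximum of log 2 (in nats) when they do not overlap, so driving it down corresponds to making the domains harder to tell apart.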