Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)

THE ROYALFLUSH SYSTEM OF SPEECH RECOGNITION FOR M2MET CHALLENGE

This paper describes our RoyalFlush system for the multi-speaker automatic speech recognition (ASR) track of the M2MeT challenge. We adopted a serialized output training (SOT) based multi-speaker ASR system trained with large-scale simulated data. First, we investigated a set of front-end methods, including multi-channel weighted prediction error (WPE), beamforming, speech separation, and speech enhancement, to process the training, validation and test sets. Based on the experimental results, we selected only WPE and beamforming as our front-end methods.
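
As a rough illustration of what a serialized output training target looks like, the sketch below joins the transcripts of overlapping speakers into one token sequence, ordered by start time and separated by a speaker-change token. The token name `<sc>`, the function name `build_sot_target`, and the toy utterances are assumptions made for this example; the abstract does not give the system's actual data format.

```python
from typing import List, Tuple

# Assumed name for the speaker-change token; the abstract does not specify it.
SPEAKER_CHANGE = "<sc>"


def build_sot_target(utterances: List[Tuple[float, str]]) -> str:
    """Serialize (start_time_sec, transcript) pairs into one target string.

    Transcripts are ordered by start time (first in, first out) and joined
    with the speaker-change token, so a single-output model can be trained
    on speech that contains several overlapping speakers.
    """
    ordered = sorted(utterances, key=lambda item: item[0])
    return f" {SPEAKER_CHANGE} ".join(text for _, text in ordered)


# Toy segment with three overlapping speakers (illustrative data only).
segment = [
    (1.2, "yes i agree"),
    (0.0, "shall we start the meeting"),
    (2.5, "go ahead"),
]
sot_target = build_sot_target(segment)
print(sot_target)
```

At inference time the recognized string can be split on the same token to recover one hypothesis per speaker.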

M2MeT_ICASSP2022_v2.0.pdf

THE ROYALFLUSH SYSTEM OF SPEECH RECOGNITION FOR M2MET CHALLENGE (183)

Categories: Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)

15 Views

Continuous Streaming Multi-talker ASR with Dual-path Transducers

icassp_2022_multi_surt_poster.pdf

Poster (174)

6 Views

HAVE BEST OF BOTH WORLDS: TWO-PASS HYBRID AND E2E CASCADING FRAMEWORK FOR SPEECH RECOGNITION

poster.pdf

ICASSP 2022 poster (164)

4 Views

END-TO-END MULTILINGUAL AUTOMATIC SPEECH RECOGNITION FOR LESS-RESOURCED LANGUAGES: THE CASE OF FOUR ETHIOPIAN LANGUAGES

The End-to-End (E2E) approach to Automatic Speech Recognition (ASR), which maps a sequence of input features into a sequence of graphemes or words, is an active research topic. It is attractive for less-resourced languages because it avoids the need for a pronunciation dictionary, one of the major components of traditional ASR systems. However, like any deep neural network (DNN) approach, E2E is data hungry, which makes its application to less-resourced languages questionable.

MarthaSolomonTanja_Poster.pdf

53 Views

Wake Word Detection with Streaming Transformers

Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTM and convolutional networks in various sequence modeling tasks, thanks to their better temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling such as wake word detection. Moreover, the vanilla Transformer is not directly applicable to the task because of its non-streaming nature and its quadratic time and space complexity.
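
One common way to make self-attention streamable and to bound its cost is to restrict each frame to its own chunk plus a fixed number of previous chunks. The sketch below builds such an attention mask. It is a generic illustration under assumed names and parameters (`chunk_attention_mask`, `chunk_size`, `left_chunks`) and is not necessarily the mechanism used in the paper.

```python
from typing import List


def chunk_attention_mask(num_frames: int, chunk_size: int,
                         left_chunks: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query frame q may attend to key frame k.

    Each query sees only its own chunk and `left_chunks` preceding chunks,
    so no future chunk is needed (streaming) and the work per frame is
    bounded by (left_chunks + 1) * chunk_size instead of num_frames.
    """
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        first_chunk = max(0, q_chunk - left_chunks)
        row = [first_chunk <= (k // chunk_size) <= q_chunk
               for k in range(num_frames)]
        mask.append(row)
    return mask


# Toy setting: 6 frames, chunks of 2 frames, 1 chunk of left context.
mask = chunk_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a fixed chunk size and left context, the number of attended frames per query stays constant as the input grows, which is what removes the quadratic dependence on sequence length.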

ICASSP2021_poster.pdf

20 Views

ADVANCING RNN TRANSDUCER TECHNOLOGY FOR SPEECH RECOGNITION

We investigate a set of techniques for RNN Transducers (RNN-Ts) that were instrumental in lowering the word error rate on three different tasks (Switchboard 300 hours, conversational Spanish 780 hours and conversational Italian 900 hours). The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe. First, we introduce a novel multiplicative integration of the encoder and prediction network vectors in the joint network (as opposed to additive).
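
The difference between additive and multiplicative integration in the joint network can be sketched as follows. The abstract only states that the encoder and prediction network vectors are combined multiplicatively rather than additively; the `tanh` nonlinearity, the function names, and the assumption that both vectors are already projected to a common joint dimension are choices made for this example.

```python
import math
from typing import List


def joint_additive(enc: List[float], pred: List[float]) -> List[float]:
    """Conventional joint: element-wise sum followed by a nonlinearity."""
    return [math.tanh(e + p) for e, p in zip(enc, pred)]


def joint_multiplicative(enc: List[float], pred: List[float]) -> List[float]:
    """Multiplicative joint: element-wise product followed by a nonlinearity."""
    return [math.tanh(e * p) for e, p in zip(enc, pred)]


# Toy vectors, assumed to be already projected to the same joint dimension.
enc_vec = [1.0, 0.0]
pred_vec = [2.0, 3.0]

additive = joint_additive(enc_vec, pred_vec)
multiplicative = joint_multiplicative(enc_vec, pred_vec)
print(additive, multiplicative)
```

In the multiplicative form, a zero in one vector gates the corresponding dimension of the other, whereas the additive form lets either vector contribute on its own.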

icassp2021-slides.pdf

10 Views

Phoneme based Neural Transducer for Large Vocabulary Speech Recognition

To combine the advantages of classical and end-to-end approaches to speech recognition, we present a simple, novel and competitive approach to phoneme-based neural transducer modeling. Different alignment label topologies are compared, and word-end-based phoneme label augmentation is proposed to improve performance. Exploiting the local dependency of phonemes, we adopt a simplified neural network structure and a straightforward integration with an external word-level language model to preserve the consistency of seq-to-seq modeling.
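
One plausible reading of word-end-based phoneme label augmentation is that the last phoneme of every word receives a distinct label, so word boundaries can be recovered from the phoneme sequence. The sketch below illustrates that reading only. The suffix `#`, the function name `words_to_phoneme_labels`, and the toy lexicon are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical marker appended to the final phoneme of each word.
WORD_END_SUFFIX = "#"


def words_to_phoneme_labels(words: List[str],
                            lexicon: Dict[str, List[str]],
                            mark_word_end: bool = True) -> List[str]:
    """Expand words into phoneme labels, optionally tagging word-final phonemes."""
    labels: List[str] = []
    for word in words:
        phonemes = list(lexicon[word])  # copy so the lexicon is not modified
        if mark_word_end:
            phonemes[-1] = phonemes[-1] + WORD_END_SUFFIX
        labels.extend(phonemes)
    return labels


# Toy pronunciation lexicon (illustrative entries only).
toy_lexicon = {
    "speech": ["s", "p", "iy", "ch"],
    "model": ["m", "aa", "d", "ah", "l"],
}
labels = words_to_phoneme_labels(["speech", "model"], toy_lexicon)
print(labels)
```

Under this scheme the label inventory grows by at most a factor of two, and every occurrence of a tagged label marks a point where a word-level language model score could be applied.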