
Cloud-based speech recognition APIs give developers and enterprises an easy way to add speech-enabled features to their applications. However, sending audio containing personal or company-internal information to the cloud raises privacy and security concerns, and the recognition results generated in the cloud may themselves reveal sensitive information. This paper proposes a deep polynomial network (DPN) that can be applied to encrypted speech as an acoustic model. Clients send their data to the cloud in encrypted form, so the data remains confidential, while the DPN can still make frame-level predictions over the encrypted speech and return them in encrypted form. One useful property of the DPN is that it can be trained on unencrypted speech features in the traditional way. To keep the cloud away from both the raw audio and the recognition results, a cloud-local joint decoding framework is also proposed. We demonstrate the effectiveness of the model and framework on the Switchboard and Cortana voice assistant tasks, with small performance degradation and latency increase compared with traditional cloud-based DNNs.
https://ieeexplore.ieee.org/document/8683721
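The defining trait of a DPN is that every operation is a polynomial (sums and products only), which is what makes the trained model evaluable over homomorphically encrypted features. Below is a minimal PyTorch sketch of that idea; the square activation, layer sizes, and senone count are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a deep polynomial network (DPN) acoustic model.
# Polynomial activations (here, squaring) keep every operation a sum or
# product, which is what allows the trained network to be evaluated over
# homomorphically encrypted features. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Square(nn.Module):
    def forward(self, x):
        return x * x  # polynomial activation: HE-friendly, unlike ReLU/sigmoid

class DPNAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, num_senones=9000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), Square(),
            nn.Linear(hidden, hidden), Square(),
            nn.Linear(hidden, num_senones),  # frame-level senone logits
        )

    def forward(self, frames):               # frames: (batch, feat_dim)
        return self.net(frames)

# Trained on plaintext features in the usual way (per-frame cross-entropy);
# at serving time the same weights are applied to encrypted features.
model = DPNAcousticModel()
logits = model(torch.randn(8, 40))
print(logits.shape)  # torch.Size([8, 9000])
```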

- Word Characters and Phone Pronunciation Embedding for ASR Confidence Classifier
Confidence scores are integral to ASR systems and are applied to data selection, adaptation, hypothesis ranking, arbitration, and more. A hybrid ASR system is inherently a match between pronunciations and AM+LM evidence, but current confidence features lack pronunciation information. We develop pronunciation embeddings to represent and factorize the acoustic score in relevant bases, and demonstrate an 8-10% relative reduction in false alarms (FA) on large-scale tasks. We generalize to standard NLP embeddings such as GloVe, and show a 16% relative reduction in FA in combination with GloVe.
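As a rough illustration of the combination step, the sketch below concatenates a few conventional confidence features with a word-embedding vector and trains a simple binary classifier. The feature set, embedding dimension, and logistic-regression classifier are assumptions for illustration, not the paper's recipe.

```python
# Hedged sketch: fold word/pronunciation embeddings into an ASR confidence
# classifier alongside standard lattice-derived features. All feature names
# and the classifier choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
baseline_feats = rng.normal(size=(n, 4))   # e.g. AM/LM scores, duration, lattice density
word_embed = rng.normal(size=(n, 50))      # e.g. GloVe vector of the hypothesized word
X = np.hstack([baseline_feats, word_embed])
y = rng.integers(0, 2, size=n)             # 1 = hypothesized word was correct

clf = LogisticRegression(max_iter=1000).fit(X, y)
confidence = clf.predict_proba(X[:5])[:, 1]  # per-word confidence scores
print(confidence)
```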

- Spectral Distortion Model for Training Phase-Sensitive Deep Neural Networks for Far-Field Speech Recognition
In this paper, we present an algorithm that introduces phase perturbation into the training database when training phase-sensitive deep neural network models. Traditional features such as log-mel or cepstral features do not carry any phase-relevant information. However, features such as raw waveforms or complex spectra do contain phase-relevant information. Phase-sensitive features have the advantage of being able to detect differences in time of
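A minimal sketch of phase perturbation as a training-time augmentation is shown below: rotate the phase of the complex STFT by small random amounts while keeping the magnitude, then resynthesize. The noise range and STFT settings are illustrative assumptions, not the paper's exact distortion model.

```python
# Sketch: phase perturbation of complex spectra for training phase-sensitive
# models. Magnitude is preserved; only the phase is randomly rotated.
import numpy as np
from scipy.signal import stft, istft

def perturb_phase(waveform, fs=16000, max_shift=0.2):
    f, t, spec = stft(waveform, fs=fs, nperseg=400, noverlap=240)
    noise = np.random.uniform(-max_shift, max_shift, size=spec.shape)
    spec_perturbed = spec * np.exp(1j * noise)   # rotate phase, keep magnitude
    _, out = istft(spec_perturbed, fs=fs, nperseg=400, noverlap=240)
    return out

audio = np.random.randn(16000)  # stand-in for one second of speech
augmented = perturb_phase(audio)
print(augmented.shape)
```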

- Acoustic modeling of speech waveform based on multi-resolution, neural network signal processing
Recently, several papers have demonstrated that neural networks (NN) are able to perform feature extraction as part of the acoustic model. Motivated by the Gammatone feature extraction pipeline, in this paper we extend the waveform-based NN model with a second level of time-convolutional elements. The proposed extension generalizes the envelope extraction block and allows the model to learn multi-resolution representations.
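The sketch below illustrates the general shape of such a front end: a first time-convolution acts as a learned filterbank, rectification approximates envelope extraction, and a second level of time-convolutions with several kernel widths yields multi-resolution representations. All sizes and nonlinearities are illustrative assumptions rather than the paper's exact configuration.

```python
# Rough sketch of a two-level time-convolutional waveform front end:
# level 1 ~ learned filterbank, level 2 ~ multi-resolution envelope processing.
import torch
import torch.nn as nn

class MultiResolutionFrontEnd(nn.Module):
    def __init__(self, n_filters=64, out_per_branch=32):
        super().__init__()
        # 25 ms filters, 10 ms hop at 16 kHz (illustrative).
        self.filterbank = nn.Conv1d(1, n_filters, kernel_size=400, stride=160)
        # Second-level time convolutions with different spans ~ multiple resolutions.
        self.branches = nn.ModuleList([
            nn.Conv1d(n_filters, out_per_branch, kernel_size=k, padding=k // 2)
            for k in (3, 9, 27)
        ])

    def forward(self, wav):                     # wav: (batch, 1, samples)
        x = torch.abs(self.filterbank(wav))     # rectification ~ envelope extraction
        feats = [torch.log1p(torch.relu(b(x))) for b in self.branches]
        min_len = min(f.shape[-1] for f in feats)
        return torch.cat([f[..., :min_len] for f in feats], dim=1)

frontend = MultiResolutionFrontEnd()
print(frontend(torch.randn(2, 1, 16000)).shape)  # (2, 96, frames)
```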

- Hybrid LSTM-FSMN Networks for Acoustic Modeling

- Dropout approaches for LSTM based speech recognition systems
In this paper we examine dropout approaches in a Long Short-Term Memory (LSTM) based automatic speech recognition (ASR) system trained with the Connectionist Temporal Classification (CTC) loss function. In particular, using an Eesen-based LSTM-CTC speech recognition system, we present dropout implementations that result in significant improvements in speech recognizer performance on the Librispeech and GALE Arabic datasets, with 24.64% and 13.75% relative reductions in word error rate (WER) from their respective baselines.
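One common variant, dropout applied between stacked LSTM layers, is easy to express directly in PyTorch, as in the minimal sketch below. The layer sizes, dropout rate, and vocabulary are illustrative assumptions, and the paper compares several other dropout placements as well.

```python
# Minimal sketch of an LSTM-CTC recognizer with inter-layer dropout.
# Sizes, dropout rate, and vocabulary are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMCTC(nn.Module):
    def __init__(self, feat_dim=40, hidden=320, vocab=29, p_drop=0.3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=4,
                            dropout=p_drop,          # dropout between stacked LSTM layers
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, vocab)     # vocab includes CTC blank (id 0)

    def forward(self, x):                            # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)
        return self.proj(h).log_softmax(dim=-1)

model = LSTMCTC()
x = torch.randn(4, 200, 40)
log_probs = model(x).transpose(0, 1)                 # CTC expects (frames, batch, vocab)
targets = torch.randint(1, 29, (4, 30))
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((4,), 200),
                           target_lengths=torch.full((4,), 30))
print(loss.item())
```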

- A Time-Restricted Self-Attention Layer for ASR
Self-attention -- an attention mechanism where the input and output sequence lengths are the same -- has recently been successfully applied to machine translation, caption generation, and phoneme recognition. In this paper we apply a restricted self-attention mechanism (with multiple heads) to speech recognition. By "restricted" we mean that the mechanism at a particular frame only sees input from a limited number of frames to the left and right. Restricting the context makes it easier to
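A minimal single-head sketch of the restriction is shown below: compute ordinary scaled dot-product attention, then mask out scores outside a fixed left/right window before the softmax. The window size and dimensions are illustrative assumptions.

```python
# Sketch of time-restricted (banded) self-attention: each frame attends
# only to a fixed window of frames to its left and right.
import torch
import torch.nn.functional as F

def time_restricted_attention(q, k, v, left=15, right=15):
    # q, k, v: (batch, frames, dim); frame i attends to [i - left, i + right]
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # (batch, T, T)
    T = q.shape[1]
    idx = torch.arange(T)
    offset = idx[None, :] - idx[:, None]                    # offset[i, j] = j - i
    forbidden = (offset < -left) | (offset > right)         # outside the window
    scores = scores.masked_fill(forbidden, float('-inf'))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 100, 64)
out = time_restricted_attention(x, x, x)                    # self-attention: q = k = v
print(out.shape)                                            # torch.Size([2, 100, 64])
```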

- Attention-based End-to-end Speech Recognition on Voice Search

- Improved TDNNs using Deep Kernels and Frequency Dependent Grid-RNNs
Time delay neural networks (TDNNs) are an effective acoustic model for large vocabulary speech recognition. The strength of the model can be attributed to its ability to effectively model long temporal contexts. However, current TDNN models are relatively shallow, which limits the modelling capability. This paper proposes a method of increasing the network depth by deepening the kernel used in the TDNN temporal convolutions. The best performing kernel consists of three fully connected layers with a residual (ResNet) connection from the output of the first to the output of the third.
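A sketch of one such layer is below: frames are spliced over a temporal context, then passed through a three-layer fully connected kernel with a skip connection from the first layer's output to the third's, matching the residual pattern described above. Dimensions and context width are illustrative assumptions.

```python
# Sketch of a TDNN layer with a deepened kernel: the usual one-layer temporal
# convolution is replaced by a three-layer FC kernel with a ResNet-style skip
# from the first layer's output to the third's. Sizes are illustrative.
import torch
import torch.nn as nn

class DeepKernelTDNNLayer(nn.Module):
    def __init__(self, in_dim=40, hidden=512, context=5):
        super().__init__()
        self.context = context                       # frames spliced per step
        self.fc1 = nn.Linear(in_dim * context, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, hidden)

    def forward(self, x):                            # x: (batch, frames, in_dim)
        # Splice a window of `context` frames at every position (stride 1).
        win = x.unfold(1, self.context, 1)           # (batch, T', in_dim, context)
        win = win.reshape(win.shape[0], win.shape[1], -1)
        h1 = torch.relu(self.fc1(win))
        h3 = self.fc3(torch.relu(self.fc2(h1)))
        return torch.relu(h3 + h1)                   # skip: fc1 output -> fc3 output

layer = DeepKernelTDNNLayer()
print(layer(torch.randn(8, 100, 40)).shape)          # torch.Size([8, 96, 512])
```
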
- Sequence-to-Sequence ASR Optimization via Reinforcement Learning
Despite the success of sequence-to-sequence approaches in automatic speech recognition (ASR) systems, the models still suffer from several problems, mainly due to the mismatch between the training and inference conditions. In the sequence-to-sequence architecture, the model is trained to predict the grapheme of the current time-step given the input of speech signal and the ground-truth grapheme history of the previous time-steps. However, it remains unclear how well the model approximates real-world speech during inference.
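One standard policy-gradient recipe consistent with the title is REINFORCE-style fine-tuning: sample a hypothesis from the model, score it against the reference (e.g., negative edit distance), and weight the sampled tokens' log-probabilities by that reward. The toy sketch below shows the objective on a stand-in decoder output; the reward choice and all sizes are illustrative assumptions, not necessarily the paper's method.

```python
# Toy REINFORCE objective for a seq2seq recognizer: reward = -edit distance.
import torch

def edit_distance(a, b):
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[-1][-1]

vocab, T = 29, 12
logits = torch.randn(T, vocab, requires_grad=True)      # stand-in for decoder outputs
reference = torch.randint(0, vocab, (T,))

dist = torch.distributions.Categorical(logits=logits)
sampled = dist.sample()                                  # sampled hypothesis
reward = -float(edit_distance(sampled.tolist(), reference.tolist()))
loss = -(reward * dist.log_prob(sampled).sum())          # REINFORCE objective
loss.backward()
print(reward, logits.grad.abs().sum().item())
```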