
Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)

SPECTRAL DISTORTION MODEL FOR TRAINING PHASE-SENSITIVE DEEP-NEURAL NETWORKS FOR FAR-FIELD SPEECH RECOGNITION


In this paper, we present an algorithm that introduces phase perturbation to the training database when training phase-sensitive deep neural-network models. Traditional features such as log-mel or cepstral features do not carry any phase-relevant information. However, features such as raw waveforms or complex spectra do contain phase-relevant information. Phase-sensitive features have the advantage of being able to detect differences in time of
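The core idea can be sketched as follows. This is a minimal illustration, not the paper's actual spectral distortion model (which is derived from far-field acoustics); the function name and the `max_shift` knob are hypothetical.

```python
import numpy as np

def perturb_phase(spectrogram, max_shift=np.pi / 4, rng=None):
    """Apply a random phase shift to every time-frequency bin of a
    complex spectrogram, leaving magnitudes unchanged.  Illustrative
    sketch only; the paper's distortion model is more principled."""
    rng = np.random.default_rng(0) if rng is None else rng
    shift = rng.uniform(-max_shift, max_shift, size=spectrogram.shape)
    return spectrogram * np.exp(1j * shift)

# Perturb a toy 2-frame, 3-bin complex spectrogram.
spec = np.array([[1 + 1j, 2 + 0j, 0 + 3j],
                 [0.5 - 0.5j, 0 + 1j, 1 + 0j]])
perturbed = perturb_phase(spec)
# Magnitudes survive the perturbation; only phases change, so a
# magnitude-based feature (e.g. log-mel) would see no difference.
assert np.allclose(np.abs(perturbed), np.abs(spec))
```

Because only the phase changes, this kind of augmentation is invisible to log-mel features but directly exercises a phase-sensitive front end.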

Paper Details

Authors:
Chanwoo Kim, Tara Sainath, Arun Narayanan, Ananya Misra, Rajeev Nongpiur, Michiel Bacchiani
Submitted On:
7 May 2018 - 12:19am

Document Files

icassp_4404_poster.pdf

[1] Chanwoo Kim, Tara Sainath, Arun Narayanan, Ananya Misra, Rajeev Nongpiur, Michiel Bacchiani, "SPECTRAL DISTORTION MODEL FOR TRAINING PHASE-SENSITIVE DEEP-NEURAL NETWORKS FOR FAR-FIELD SPEECH RECOGNITION", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/3202. Accessed: Sep. 20, 2018.

Acoustic modeling of speech waveform based on multi-resolution, neural network signal processing


Recently, several papers have demonstrated that neural networks (NNs) are able to perform feature extraction as part of the acoustic model. Motivated by the Gammatone feature-extraction pipeline, in this paper we extend the waveform-based NN model with a second level of time-convolutional elements. The proposed extension generalizes the envelope-extraction block and allows the model to learn multi-resolution representations.
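The two-level structure can be sketched like this. Filter shapes, the rectifying nonlinearity, and the envelope window lengths are illustrative assumptions, not the paper's learned configuration.

```python
import numpy as np

def two_level_features(waveform, filt, env_windows=(5, 11, 21)):
    """Level 1: time convolution of the raw waveform with a filter
    (learned in the paper; a fixed toy filter here).  Level 2:
    rectify, then convolve with envelope windows of several lengths,
    giving a multi-resolution envelope representation."""
    band = np.convolve(waveform, filt, mode="same")    # level 1
    rectified = np.abs(band)                           # nonlinearity
    envs = [np.convolve(rectified, np.ones(w) / w, mode="same")
            for w in env_windows]                      # level 2
    return np.stack(envs)                              # (n_resolutions, T)

wave = np.sin(2 * np.pi * 50 * np.arange(400) / 8000.0)
feats = two_level_features(wave, filt=np.hanning(25))
assert feats.shape == (3, 400)
```

Each row of the output tracks the band envelope at a different time resolution, which is the generalization of a single fixed envelope-extraction block.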

Paper Details

Authors:
Zoltán Tüske, Ralf Schlüter, Hermann Ney
Submitted On:
2 May 2018 - 3:00pm

Document Files

slides-template.pdf

[1] Zoltán Tüske, Ralf Schlüter, Hermann Ney, "Acoustic modeling of speech waveform based on multi-resolution, neural network signal processing", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/3199. Accessed: Sep. 20, 2018.

HYBRID LSTM-FSMN NETWORKS FOR ACOUSTIC MODELING

Paper Details

Authors:
Asa Oines, Eugene Weinstein, Pedro Moreno
Submitted On:
19 April 2018 - 9:47pm

Document Files

FLMN Poster.pdf

[1] Asa Oines, Eugene Weinstein, Pedro Moreno, "HYBRID LSTM-FSMN NETWORKS FOR ACOUSTIC MODELING", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/3054. Accessed: Sep. 20, 2018.

Dropout approaches for LSTM based speech recognition systems


In this paper we examine dropout approaches in a Long Short Term Memory (LSTM) based automatic speech recognition (ASR) system trained with the Connectionist Temporal Classification (CTC) loss function. In particular, using an Eesen based LSTM-CTC speech recognition system, we present dropout implementations that result in significant improvements in speech recognizer performance on Librispeech and GALE Arabic datasets, with 24.64% and 13.75% relative reduction in word error rates (WER) from their respective baselines.
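A generic form of the technique is inverted dropout applied to per-frame activations, sketched below. This is only one of several placements; the paper compares LSTM-specific variants, and the function name here is hypothetical.

```python
import numpy as np

def frame_dropout(activations, p, rng=None, train=True):
    """Inverted dropout on a (time, features) activation matrix:
    each unit is zeroed with probability p and survivors are scaled
    by 1/(1-p), so no rescaling is needed at inference time."""
    if not train or p == 0.0:
        return activations
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

acts = np.ones((4, 8))
out = frame_dropout(acts, p=0.5)
# Surviving units are scaled to 2.0; dropped ones are exactly 0.
assert set(np.unique(out)).issubset({0.0, 2.0})
```

The 1/(1-p) scaling keeps the expected activation unchanged between training and inference, which matters for a CTC system whose output distribution is consumed directly by the decoder.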

Paper Details

Submitted On:
19 April 2018 - 2:52pm

Document Files

ICASSP2018 Poster

[1] "Dropout approaches for LSTM based speech recognition systems", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/3004. Accessed: Sep. 20, 2018.

A TIME-RESTRICTED SELF-ATTENTION LAYER FOR ASR


Self-attention -- an attention mechanism where the input and output sequence lengths are the same -- has recently been successfully applied to machine translation, caption generation, and phoneme recognition. In this paper we apply a restricted self-attention mechanism (with multiple heads) to speech recognition. By "restricted" we mean that the mechanism at a particular frame only sees input from a limited number of frames to the left and right. Restricting the context makes it easier to

Paper Details

Authors:
Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, Sanjeev Khudanpur
Submitted On:
19 April 2018 - 1:23pm

Document Files

Poster - Self-attention.pdf

[1] Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, Sanjeev Khudanpur, "A TIME-RESTRICTED SELF-ATTENTION LAYER FOR ASR", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/2986. Accessed: Sep. 20, 2018.

Attention-based End-to-end Speech Recognition on Voice Search

Paper Details

Authors:
Changhao Shan, Junbo Zhang, Yujun Wang, Lei Xie
Submitted On:
17 April 2018 - 7:35pm

Document Files

SP-L1.4.pdf

[1] Changhao Shan, Junbo Zhang, Yujun Wang, Lei Xie, "Attention-based End-to-end Speech Recognition on Voice Search", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/2949. Accessed: Sep. 20, 2018.

Improved TDNNs using Deep Kernels and Frequency Dependent Grid-RNNs


Time delay neural networks (TDNNs) are an effective acoustic model for large vocabulary speech recognition. The strength of the model can be attributed to its ability to effectively model long temporal contexts. However, current TDNN models are relatively shallow, which limits the modelling capability. This paper proposes a method of increasing the network depth by deepening the kernel used in the TDNN temporal convolutions. The best performing kernel consists of three fully connected layers with a residual (ResNet) connection from the output of the first to the output of the third.
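The best-performing kernel described above can be sketched directly: three fully connected layers with a residual connection from the first layer's output to the third layer's output. Biases, the surrounding temporal splicing, and the frequency-dependent Grid-RNN part are omitted here for brevity.

```python
import numpy as np

def deep_kernel(x, w1, w2, w3):
    """A TDNN kernel deepened into three fully connected layers,
    with a ResNet-style skip from the first layer's output (h1)
    to the third layer's output."""
    relu = lambda a: np.maximum(a, 0.0)
    h1 = relu(x @ w1)
    h2 = relu(h1 @ w2)
    h3 = h1 + h2 @ w3        # residual: output of layer 1 added to layer 3
    return h3

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                      # 5 spliced context vectors
w1 = rng.normal(size=(16, 32))
w2 = rng.normal(size=(32, 32))
w3 = rng.normal(size=(32, 32))
assert deep_kernel(x, w1, w2, w3).shape == (5, 32)
```

The skip connection requires the first and third layers to share an output width, which is why w2 and w3 are square here.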

Paper Details

Authors:
Florian L. Kreyssig, Chao Zhang, Philip C. Woodland
Submitted On:
15 April 2018 - 2:43am

Document Files

tdnn_lecture_4.pdf

[1] Florian L. Kreyssig, Chao Zhang, Philip C. Woodland, "Improved TDNNs using Deep Kernels and Frequency Dependent Grid-RNNs", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/2885. Accessed: Sep. 20, 2018.

Sequence-to-Sequence ASR Optimization via Reinforcement Learning


Despite the success of sequence-to-sequence approaches in automatic speech recognition (ASR) systems, the models still suffer from several problems, mainly due to the mismatch between the training and inference conditions. In the sequence-to-sequence architecture, the model is trained to predict the grapheme of the current time-step given the input of speech signal and the ground-truth grapheme history of the previous time-steps. However, it remains unclear how well the model approximates real-world speech during inference.
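The mismatch described above can be made concrete with a scheduled-sampling-style decode loop: with some probability the decoder is fed the ground-truth symbol (the training condition), otherwise its own previous prediction (the inference condition). This sketch only illustrates the mismatch; the paper itself addresses it with reinforcement learning, and all names here are hypothetical.

```python
import random

def decode_history(ground_truth, model_step, teacher_forcing_prob, rng):
    """Build the grapheme history fed to a seq2seq decoder.  With
    probability teacher_forcing_prob the ground-truth grapheme is
    used; otherwise the model's own prediction is fed back."""
    history = []
    for gold in ground_truth:
        use_gold = rng.random() < teacher_forcing_prob
        history.append(gold if use_gold else model_step(history))
    return history

# With prob 1.0 the history is exactly the ground truth
# (pure teacher forcing, i.e. the training condition).
rng = random.Random(0)
gold = list("hello")
assert decode_history(gold, lambda h: "?", 1.0, rng) == gold
```

At inference time the probability is effectively 0, so any early prediction error propagates through the history, which is exactly the exposure-bias problem the abstract describes.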

Paper Details

Authors:
Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
Submitted On:
14 April 2018 - 10:37am

Document Files

Poster in PDF format

[1] Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, "Sequence-to-Sequence ASR Optimization via Reinforcement Learning", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/2834. Accessed: Sep. 20, 2018.

ADVANCING CONNECTIONIST TEMPORAL CLASSIFICATION WITH ATTENTION MODELING


In this study, we propose advancing all-neural speech recognition by directly incorporating attention modeling within the Connectionist Temporal Classification (CTC) framework. In particular, we derive new context vectors using time convolution features to model attention as part of the CTC network. To further improve attention modeling, we utilize content information extracted from a network representing an implicit language model. Finally, we introduce vector based attention weights that are applied on context vectors across both time and their individual components.
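A simplified version of the context-vector computation can be sketched as attention over a window of time-convolution features. The scoring vector `u` and the window width are illustrative assumptions; the paper's scoring function and the implicit language-model content are more involved.

```python
import numpy as np

def context_vector(features, center, width, u):
    """Attend over time-convolution features in a +/- width window
    around frame `center`: score each frame against a vector u,
    softmax the scores, and return the weighted sum."""
    lo, hi = max(0, center - width), min(len(features), center + width + 1)
    window = features[lo:hi]                 # (W, d) time-conv features
    scores = window @ u
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ window                  # (d,) context vector

rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 8))
ctx = context_vector(feats, center=10, width=2, u=rng.normal(size=8))
assert ctx.shape == (8,)
```

The resulting context vector replaces (or augments) the single-frame input at each CTC output step, which is how attention is folded into the CTC network.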

Paper Details

Authors:
Amit Das, Jinyu Li, Rui Zhao, Yifan Gong
Submitted On:
13 April 2018 - 7:29pm

Document Files

ctc_attention_slides.pdf

[1] Amit Das, Jinyu Li, Rui Zhao, Yifan Gong, "ADVANCING CONNECTIONIST TEMPORAL CLASSIFICATION WITH ATTENTION MODELING", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/2773. Accessed: Sep. 20, 2018.

REINFORCEMENT LEARNING OF SPEECH RECOGNITION SYSTEM BASED ON POLICY GRADIENT AND HYPOTHESIS SELECTION

Paper Details

Submitted On:
13 April 2018 - 6:18am

Document Files

ICASSP2018_poster_tkato.pdf

[1] "REINFORCEMENT LEARNING OF SPEECH RECOGNITION SYSTEM BASED ON POLICY GRADIENT AND HYPOTHESIS SELECTION", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/2680. Accessed: Sep. 20, 2018.
