Robust Speech Recognition (SPE-ROBU)

Multi-scale Octave Convolutions for Robust Speech Recognition


We propose a multi-scale octave convolution layer to learn robust speech representations efficiently. Octave convolutions were introduced by Chen et al. [1] in the computer vision field to reduce the spatial redundancy of feature maps by decomposing the output of a convolutional layer into feature maps at two different spatial resolutions, one octave apart. This approach improved both the efficiency and the accuracy of CNN models, with the accuracy gain attributed to the enlargement of the receptive field in the original input space.
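For orientation, here is a minimal PyTorch sketch of a plain (single-scale) octave convolution layer in the spirit of Chen et al. [1], not the multi-scale variant proposed in this paper; the channel-split ratio `alpha` and all names are illustrative assumptions.

```python
# Sketch of an octave convolution: features are split into a high-resolution
# and a low-resolution (one octave lower) branch, with four conv paths
# exchanging information between the two resolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctaveConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, alpha=0.5):
        super().__init__()
        self.in_lo = int(alpha * in_ch)       # low-resolution input channels
        self.out_lo = int(alpha * out_ch)     # low-resolution output channels
        in_hi, out_hi = in_ch - self.in_lo, out_ch - self.out_lo
        pad = kernel_size // 2
        self.hh = nn.Conv2d(in_hi, out_hi, kernel_size, padding=pad)       # high -> high
        self.hl = nn.Conv2d(in_hi, self.out_lo, kernel_size, padding=pad)  # high -> low
        self.lh = nn.Conv2d(self.in_lo, out_hi, kernel_size, padding=pad)  # low  -> high
        self.ll = nn.Conv2d(self.in_lo, self.out_lo, kernel_size, padding=pad)

    def forward(self, x_hi, x_lo):
        # x_hi: (B, C_hi, T, F); x_lo: (B, C_lo, T/2, F/2), one octave lower.
        y_hi = self.hh(x_hi) + F.interpolate(self.lh(x_lo), scale_factor=2, mode="nearest")
        y_lo = self.ll(x_lo) + self.hl(F.avg_pool2d(x_hi, 2))
        return y_hi, y_lo
```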

Paper Details

Authors:
Joanna Rownicka, Peter Bell, Steve Renals
Submitted On:
16 May 2020 - 8:58am

Document Files

ICASSP2020_JRownicka_slides.pdf

Cite:
Joanna Rownicka, Peter Bell, Steve Renals, "Multi-scale Octave Convolutions for Robust Speech Recognition", IEEE SigPort, 2020. [Online]. Available: http://sigport.org/5374. Accessed: Sep. 25, 2020.

An Analysis of Speech Enhancement and Recognition Losses in Limited Resources Multi-Talker Single Channel Audio-Visual ASR


In this paper, we analyzed how audio-visual speech enhancement can help to perform the ASR task in a cocktail-party scenario. To this end, we considered two simple end-to-end LSTM-based models that perform single-channel audio-visual speech enhancement and phone recognition, respectively. We then studied how the two models interact and how training them jointly affects the final result. We analyzed different training strategies that reveal some interesting and unexpected behaviors.
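As a rough illustration of what training the two models jointly can mean here, the sketch below chains an enhancement LSTM into a phone-recognition LSTM and back-propagates a weighted sum of the two losses; the feature dimensions, loss choices, and the weight `lam` are assumptions for illustration, not the authors' exact configuration.

```python
# Hedged sketch: audio-visual enhancement LSTM feeding a phone-recognition LSTM,
# trained jointly with a weighted sum of enhancement (MSE) and recognition (CE) losses.
import torch
import torch.nn as nn

enhancer = nn.LSTM(input_size=123 + 64, hidden_size=256, batch_first=True)  # audio + visual features
enh_proj = nn.Linear(256, 123)                                              # clean-feature estimate
recognizer = nn.LSTM(input_size=123, hidden_size=256, batch_first=True)
phone_proj = nn.Linear(256, 42)                                             # phone posteriors

mse = nn.MSELoss()
ce = nn.CrossEntropyLoss()
lam = 0.5  # trade-off between enhancement and recognition losses

def joint_loss(noisy_av, clean, phone_targets):
    h, _ = enhancer(noisy_av)
    enhanced = enh_proj(h)                       # (B, T, 123) enhanced features
    r, _ = recognizer(enhanced)
    logits = phone_proj(r)                       # (B, T, 42) phone logits
    loss_enh = mse(enhanced, clean)
    loss_asr = ce(logits.flatten(0, 1), phone_targets.flatten())
    return lam * loss_enh + (1.0 - lam) * loss_asr
```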

Paper Details

Authors:
Luca Pasa, Leonardo Badino
Submitted On:
13 May 2020 - 6:28pm

Document Files

slides_paper#3109.pdf

Cite:
Luca Pasa, Leonardo Badino, "An Analysis of Speech Enhancement and Recognition Losses in Limited Resources Multi-Talker Single Channel Audio-Visual ASR", IEEE SigPort, 2020. [Online]. Available: http://sigport.org/5164. Accessed: Sep. 25, 2020.

Small energy masking for improved neural network training for end-to-end speech recognition


In this paper, we present a Small Energy Masking (SEM) algorithm, which masks input features whose values fall below a certain threshold. More specifically, a time-frequency bin is masked if the filterbank energy in this bin is less than a certain energy threshold. A uniform distribution is employed to randomly generate the ratio of this energy threshold to the peak filterbank energy of each utterance in decibels. The unmasked feature elements are scaled so that the total sum of the feature values remains the same through this masking procedure.
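A minimal NumPy sketch of the masking step as described above; the dB range for the uniformly sampled threshold ratio is an illustrative assumption, not a value from the paper.

```python
# Sketch of Small Energy Masking: sample a threshold (in dB below the
# per-utterance peak filterbank energy), zero out bins below it, and
# rescale the surviving bins so the total feature sum is preserved.
import numpy as np

def small_energy_masking(fbank, max_drop_db=20.0, rng=np.random):
    """fbank: (T, F) filterbank energies (linear scale) for one utterance."""
    peak = fbank.max()
    drop_db = rng.uniform(0.0, max_drop_db)          # threshold-to-peak ratio in dB
    threshold = peak * 10.0 ** (-drop_db / 10.0)
    mask = (fbank >= threshold).astype(fbank.dtype)
    masked = fbank * mask
    scale = fbank.sum() / max(masked.sum(), 1e-10)   # preserve the total feature sum
    return masked * scale
```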

Paper Details

Authors:
Chanwoo Kim, Kwangyoun Kim, Sathish Reddy Indurthi
Submitted On:
5 May 2020 - 5:27pm

Document Files

20200508_icassp_small_energy_masking_paper_3965_presentation.pdf

Cite:
Chanwoo Kim, Kwangyoun Kim, Sathish Reddy Indurthi, "Small energy masking for improved neural network training for end-to-end speech recognition", IEEE SigPort, 2020. [Online]. Available: http://sigport.org/5125. Accessed: Sep. 25, 2020.

Learning Discriminative Features in Sequence Training without Requiring Framewise Labelled Data

Paper Details

Authors:
Jun Wang, Dan Su, Jie Chen, Shulin Feng, Dongpeng Ma, Na Li, Dong Yu
Submitted On:
15 May 2019 - 3:11am

Document Files

Slides.pdf

Cite:
Jun Wang, Dan Su, Jie Chen, Shulin Feng, Dongpeng Ma, Na Li, Dong Yu, "Learning Discriminative Features in Sequence Training without Requiring Framewise Labelled Data", IEEE SigPort, 2019. [Online]. Available: http://sigport.org/4513. Accessed: Sep. 25, 2020.

Conditional Teacher-Student Learning


Teacher-student (T/S) learning has been shown to be effective for a variety of problems such as domain adaptation and model compression. One shortcoming of T/S learning is that the teacher model, which is not always perfect, sporadically produces wrong guidance in the form of posterior probabilities that misleads the student model towards suboptimal performance.
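The remedy suggested by the title can be sketched as follows: let the student learn from the teacher's soft posteriors only where the teacher is correct, and from the ground-truth labels otherwise. This is a hedged, frame-level illustration; the exact selection rule and weighting used in the paper may differ.

```python
# Hedged sketch of a "conditional" teacher-student loss: distill from the
# teacher on frames it classifies correctly, otherwise use hard cross-entropy
# against the ground truth.
import torch
import torch.nn.functional as F

def conditional_ts_loss(student_logits, teacher_logits, labels):
    """student_logits, teacher_logits: (N, C); labels: (N,) frame labels."""
    teacher_post = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_correct = teacher_post.argmax(dim=-1).eq(labels)          # (N,) bool
    soft = -(teacher_post * student_logp).sum(dim=-1)                 # distillation term
    hard = F.cross_entropy(student_logits, labels, reduction="none")  # ground-truth term
    return torch.where(teacher_correct, soft, hard).mean()
```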

Paper Details

Authors:
Zhong Meng, Jinyu Li, Yong Zhao, Yifan Gong
Submitted On:
12 May 2019 - 9:23pm

Document Files

cts_poster.pptx

Cite:
Zhong Meng, Jinyu Li, Yong Zhao, Yifan Gong, "Conditional Teacher-Student Learning", IEEE SigPort, 2019. [Online]. Available: http://sigport.org/4472. Accessed: Sep. 25, 2020.

On reducing the effect of speaker overlap for CHiME-5


The CHiME-5 speech separation and recognition challenge was recently shown to pose a difficult task for current automatic speech recognition systems. Speaker overlap was one of the main difficulties of the challenge. The presence of noise, reverberation and moving speakers has made traditional source separation methods ineffective at improving recognition accuracy. In this paper we explore several enhancement strategies aimed at reducing the effect of speaker overlap for CHiME-5 without performing source separation.

Paper Details

Authors:
Catalin Zorila, Rama Doddipatla
Submitted On:
11 May 2019 - 8:58pm

Document Files

poster presentation

Cite:
Catalin Zorila, Rama Doddipatla, "On reducing the effect of speaker overlap for CHiME-5", IEEE SigPort, 2019. [Online]. Available: http://sigport.org/4454. Accessed: Sep. 25, 2020.

Analyzing Uncertainties in Speech Recognition Using Dropout


The performance of Automatic Speech Recognition (ASR) systems is often measured using Word Error Rate (WER), which requires time-consuming and expensive manually transcribed data. In this paper, we use state-of-the-art ASR systems based on Deep Neural Networks (DNNs) and propose a novel framework which uses dropout at test time to model uncertainty in prediction hypotheses. We systematically exploit this uncertainty to estimate WER without the need for explicit transcriptions.
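A hedged sketch of the test-time dropout idea: keep dropout active during inference, decode the same utterance several times, and use the disagreement between hypotheses as an uncertainty signal. The `model` interface and the disagreement measure below are hypothetical placeholders, not the framework's actual WER estimator.

```python
# Monte Carlo dropout at test time: stochastic forward passes over the same
# utterance yield different hypotheses whose disagreement reflects uncertainty.
import torch

def mc_dropout_hypotheses(model, features, n_samples=10):
    model.train()            # keep dropout layers stochastic at test time
    hyps = []
    with torch.no_grad():
        for _ in range(n_samples):
            posteriors = model(features)             # (T, num_states), placeholder interface
            hyps.append(posteriors.argmax(dim=-1))   # greedy state sequence
    model.eval()
    return hyps

def frame_disagreement(hyps):
    """Average fraction of frames where samples disagree with the first
    hypothesis: a crude proxy for the uncertainty used to estimate WER."""
    ref = hyps[0]
    return torch.stack([h.ne(ref).float().mean() for h in hyps[1:]]).mean().item()
```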

Paper Details

Authors:
Hervé Bourlard
Submitted On:
11 May 2019 - 8:55am

Document Files

Poster_avyas_ICASSP_2019.pdf

Cite:
Hervé Bourlard, "Analyzing Uncertainties in Speech Recognition Using Dropout", IEEE SigPort, 2019. [Online]. Available: http://sigport.org/4441. Accessed: Sep. 25, 2020.

MULTI-GEOMETRY SPATIAL ACOUSTIC MODELING FOR DISTANT SPEECH RECOGNITION


The use of spatial information with multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade speech enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such speech enhancement techniques do not always yield ASR accuracy improvement due to the difference between speech enhancement and ASR optimization objectives.

Paper Details

Authors:
Shiva Sundaram, Nikko Strom, Bjorn Hoffmeister
Submitted On:
10 May 2019 - 6:38pm

Document Files

poster file

manuscript file

Cite:
Shiva Sundaram, Nikko Strom, Bjorn Hoffmeister, "MULTI-GEOMETRY SPATIAL ACOUSTIC MODELING FOR DISTANT SPEECH RECOGNITION", IEEE SigPort, 2019. [Online]. Available: http://sigport.org/4420. Accessed: Sep. 25, 2020.

FREQUENCY DOMAIN MULTI-CHANNEL ACOUSTIC MODELING FOR DISTANT SPEECH RECOGNITION


Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such speech enhancement techniques do not always yield ASR accuracy improvement because the optimization criterion for speech enhancement is not directly relevant to the ASR objective. In this work, we develop new acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly.
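To make "optimize spatial filtering ... based on an ASR criterion directly" concrete, here is a hedged sketch of a trainable frequency-domain filter-and-sum front end feeding an LSTM acoustic model; the single-look design, dimensions, and layer names are assumptions for illustration, not the authors' exact architecture.

```python
# Sketch: one learnable complex linear combination of microphone channels per
# frequency bin (a trainable filter-and-sum beamformer); its log-power output
# feeds an LSTM acoustic model, and the ASR loss is back-propagated end to end.
import torch
import torch.nn as nn

class SpatialFilterLayer(nn.Module):
    def __init__(self, num_mics=7, num_bins=257):
        super().__init__()
        # Complex filter weights stored as real/imaginary parts: (num_bins, num_mics).
        self.w_re = nn.Parameter(torch.randn(num_bins, num_mics) * 0.1)
        self.w_im = nn.Parameter(torch.randn(num_bins, num_mics) * 0.1)

    def forward(self, stft):
        # stft: complex tensor (B, T, num_bins, num_mics) from all microphones.
        w = torch.complex(self.w_re, self.w_im)      # (num_bins, num_mics)
        beamformed = (stft * w).sum(dim=-1)          # filter-and-sum per frequency bin
        return torch.log1p(beamformed.abs() ** 2)    # log-power features for the LSTM

frontend = SpatialFilterLayer()
acoustic_model = nn.LSTM(input_size=257, hidden_size=512, batch_first=True)
output_layer = nn.Linear(512, 4000)  # e.g. senone posteriors
# Training: multi-channel STFTs -> frontend -> LSTM -> output_layer, with the
# ASR loss (e.g. cross-entropy) back-propagated through the spatial filter.
```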

Paper Details

Authors:
Shiva Sundaram, Nikko Strom, Bjorn Hoffmeister
Submitted On:
10 May 2019 - 6:36pm

Document Files

poster file

manuscript file

Cite:
Shiva Sundaram, Nikko Strom, Bjorn Hoffmeister, "FREQUENCY DOMAIN MULTI-CHANNEL ACOUSTIC MODELING FOR DISTANT SPEECH RECOGNITION", IEEE SigPort, 2019. [Online]. Available: http://sigport.org/4419. Accessed: Sep. 25, 2020.

REINFORCEMENT LEARNING BASED SPEECH ENHANCEMENT FOR ROBUST SPEECH RECOGNITION

Paper Details

Authors:
Yih-Liang Shen, Chao-Yuan Huang, Syu-Siang Wang, Yu Tsao, Hsin-Min Wang, Tai-Shih Chi
Submitted On:
9 May 2019 - 12:29pm

Document Files

REINFORCEMENT LEARNING BASED SPEECH ENHANCEMENT FOR ROBUST SPEECH RECOGNITION.pdf

Cite:
Yih-Liang Shen, Chao-Yuan Huang, Syu-Siang Wang, Yu Tsao, Hsin-Min Wang, Tai-Shih Chi, "REINFORCEMENT LEARNING BASED SPEECH ENHANCEMENT FOR ROBUST SPEECH RECOGNITION", IEEE SigPort, 2019. [Online]. Available: http://sigport.org/4221. Accessed: Sep. 25, 2020.
