Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)

Speaker-aware Training of Attention-based End-to-End Speech Recognition using Neural Speaker Embeddings

In speaker-aware training, a speaker embedding is appended to DNN input features. This allows the DNN to effectively learn representations, which are robust to speaker variability.
We apply speaker-aware training to attention-based end- to-end speech recognition. We show that it can improve over a purely end-to-end baseline. We also propose speaker-aware training as a viable method to leverage untranscribed, speaker annotated data.

icassp2020-slides.pdf

icassp2020-slides.pdf (399)

Categories:: Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)

55 Views

Small energy masking for improved neural network training for end-to-end speech recognition

In this paper, we present a Small Energy Masking (SEM) algorithm, which masks inputs having values below a certain threshold. More specifically, a time-frequency bin is masked if the filterbank energy in this bin is less than a certain energy threshold. A uniform distribution is employed to randomly generate the ratio of this energy threshold to the peak filterbank energy of each utterance in decibels. The unmasked feature elements are scaled so that the total sum of the feature values remain the same through this masking procedure.

20200508_icassp_small_energy_masking_paper_3965_presentation.pdf

20200508_icassp_small_energy_masking_paper_3965_presentation.pdf (499)

Categories:: Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)
Robust Speech Recognition (SPE-ROBU)

34 Views

Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction

Read more about Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction
2 comments
Log in to post comments

main.pdf

main.pdf (446)

Categories:: Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)

59 Views

END-TO-END FEEDBACK LOSS IN SPEECH CHAIN FRAMEWORK VIA STRAIGHT-THROUGH ESTIMATOR

Read more about END-TO-END FEEDBACK LOSS IN SPEECH CHAIN FRAMEWORK VIA STRAIGHT-THROUGH ESTIMATOR
Log in to post comments

The speech chain mechanism integrates automatic speech recognition (ASR) and text-to-speech synthesis (TTS) modules into a single cycle during training. In our previous work, we applied a speech chain mechanism as a semi-supervised learning. It provides the ability for ASR and TTS to assist each other when they receive unpaired data and let them infer the missing pair and optimize the model with reconstruction loss.

ICASSP19_Poster_V1.pdf

ICASSP19_Poster_V1.pdf (471)

Categories:: General Topics in Speech Recognition (SPE-GASR)
Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)

71 Views

PROMISING ACCURATE PREFIX BOOSTING FOR SEQUENCE-TO-SEQUENCE ASR

Read more about PROMISING ACCURATE PREFIX BOOSTING FOR SEQUENCE-TO-SEQUENCE ASR
Log in to post comments

PAPB_icassp-expanded-v2.pdf

PAPB_icassp-expanded-v2.pdf (442)

Categories:: Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)

11 Views

PROMISING ACCURATE PREFIX BOOSTING FOR SEQUENCE-TO-SEQUENCE ASR

Read more about PROMISING ACCURATE PREFIX BOOSTING FOR SEQUENCE-TO-SEQUENCE ASR
Log in to post comments

PAPB_icassp-expanded.pdf

PAPB_icassp-expanded.pdf (425)

Categories:: Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)

9 Views

Adversarial Speaker Adaptation

Read more about Adversarial Speaker Adaptation
Log in to post comments

We propose a novel adversarial speaker adaptation (ASA) scheme, in which adversarial learning is applied to regularize the distribution of deep hidden features in a speaker-dependent (SD) deep neural network (DNN) acoustic model to be close to that of a fixed speaker-independent (SI) DNN acoustic model during adaptation. An additional discriminator network is introduced to distinguish the deep features generated by the SD model from those produced by the SI model.

asa_oral_v3.pptx

asa_oral_v3.pptx (581)

Categories:: Speech Processing
Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)
Audio and Acoustic Signal Processing
Machine Learning for Signal Processing

21 Views

Conditional Teacher-Student Learning

Read more about Conditional Teacher-Student Learning
Log in to post comments

The teacher-student (T/S) learning has been shown to be effective for a variety of problems such as domain adaptation and model compression. One shortcoming of the T/S learning is that a teacher model, not always perfect, sporadically produces wrong guidance in form of posterior probabilities that misleads the student model towards a suboptimal performance.

cts_poster.pptx

cts_poster.pptx (524)

Categories:: Speech Processing
Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)
Robust Speech Recognition (SPE-ROBU)
Machine Learning for Signal Processing
Audio and Acoustic Signal Processing

52 Views

PROMISING ACCURATE PREFIX BOOSTING FOR SEQUENCE-TO-SEQUENCE ASR

Read more about PROMISING ACCURATE PREFIX BOOSTING FOR SEQUENCE-TO-SEQUENCE ASR
Log in to post comments

PAPB_icassp-expanded.pdf

PAPB_icassp-expanded.pdf (366)

Categories:: Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)

8 Views

MULTI-GEOMETRY SPATIAL ACOUSTIC MODELING FOR DISTANT SPEECH RECOGNITION

Read more about MULTI-GEOMETRY SPATIAL ACOUSTIC MODELING FOR DISTANT SPEECH RECOGNITION
Log in to post comments

The use of spatial information with multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade speech enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such speech enhancement techniques do not always yield ASR accuracy improvement due to the difference between speech enhancement and ASR optimization objectives.