Robust Recognition of Speech with Background Music in Acoustically Under-Resourced Scenarios

Citation Author(s):
Jiri Malek, Jindrich Zdansky, Petr Cerva
Submitted by:
Jiri Malek
Last updated:
12 April 2018 - 11:32am
Document Type:
Poster
Document Year:
2018
Event:
Presenters:
Jiri Malek
Paper Code:
SP-P13.6

This paper addresses the task of Automatic Speech Recognition
(ASR) in the presence of background music. We consider two different
situations: 1) an under-resourced scenario with a very small amount of labeled
training utterances (1 hour in total) and 2) a scenario with a large amount of
labeled training utterances (132 hours in total). In both situations,
we aim at robust recognition. To this end, we investigate
the following techniques: a) multi-condition training of the acoustic
model, b) denoising autoencoders for feature enhancement, and c)
joint training of the two aforementioned techniques.
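Multi-condition training corrupts the clean training utterances with the expected interference (here, music) at several signal-to-noise ratios, so the acoustic model sees matched conditions at training time. A minimal sketch of such data preparation, with toy signals and a hypothetical `mix_at_snr` helper (the mixing levels and signals are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def mix_at_snr(speech, music, snr_db):
    """Scale `music` so the speech-to-music power ratio equals `snr_db` dB,
    then add it to `speech`. Hypothetical helper, not from the paper."""
    p_speech = np.mean(speech ** 2)
    p_music = np.mean(music ** 2)
    gain = np.sqrt(p_speech / (p_music * 10 ** (snr_db / 10)))
    return speech + gain * music

# Toy stand-ins for a speech utterance and background music (1 s at 16 kHz).
speech = rng.standard_normal(16000)
music = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)

# Multi-condition set: the same utterance corrupted at several SNRs.
multi_condition = {snr: mix_at_snr(speech, music, snr) for snr in (0, 5, 10, 15)}
```

In practice the corrupted copies are pooled with (or replace) the clean data when training the acoustic model.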
We demonstrate that the considered methods can be trained successfully
even with the small amount of labeled acoustic data, yielding
substantially improved performance compared to acoustic models
trained on clean speech only. Further, we show a significant increase in
accuracy in the under-resourced scenario when an additional
amount of non-labeled data is utilized. Here, the non-labeled dataset is used to
improve the accuracy of the autoencoder-based feature enhancement.
Subsequently, the autoencoders are fine-tuned jointly with the
acoustic model using the small amount of labeled utterances.
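The feature-enhancement idea can be illustrated with a toy denoising autoencoder trained to map music-corrupted features back to their clean counterparts; the enhanced output would then be fed to the acoustic model (and, in the joint variant, fine-tuned together with it). All sizes, the noise level, and the learning rate below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 40-dim feature vectors, corrupted by additive "music" noise.
dim, hidden, n = 40, 64, 512
clean = rng.standard_normal((n, dim))
noisy = clean + 0.3 * rng.standard_normal((n, dim))

# One-hidden-layer denoising autoencoder: noisy features in, clean targets out.
W1 = 0.1 * rng.standard_normal((dim, hidden)); b1 = np.zeros(hidden)
W2 = 0.1 * rng.standard_normal((hidden, dim)); b2 = np.zeros(dim)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2

lr, losses = 0.05, []
for _ in range(300):
    h, out = forward(noisy)
    err = out - clean                       # gradient of MSE w.r.t. the output
    losses.append(float((err ** 2).mean()))
    dh = (err @ W2.T) * (1.0 - h ** 2)      # backprop through the tanh layer
    W2 -= lr * (h.T @ err) / n;   b2 -= lr * err.mean(axis=0)
    W1 -= lr * (noisy.T @ dh) / n; b1 -= lr * dh.mean(axis=0)

enhanced = forward(noisy)[1]  # denoised features for the downstream ASR model
```

Joint training then treats the autoencoder and acoustic model as one network and backpropagates the recognition loss through both, using the small labeled set.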
