Acoustic modeling of speech waveform based on multi-resolution, neural network signal processing

Citation Author(s):
Zoltán Tüske, Ralf Schlüter, Hermann Ney
Submitted by:
Zoltan Tuske
Last updated:
2 May 2018 - 3:00pm
Document Type:
Presentation Slides
Document Year:
2018
Event:
Presenters:
Zoltán Tüske
Recently, several papers have demonstrated that neural networks (NNs) can perform feature extraction as part of the acoustic model. Motivated by the Gammatone feature extraction pipeline, in this paper we extend the waveform-based NN model with a second level of time-convolutional elements. The proposed extension generalizes the envelope extraction block and allows the model to learn multi-resolution representations. Automatic speech recognition (ASR) experiments show a significant word error rate reduction over our previous best acoustic model trained directly in the signal domain. Although we use only 250 hours of speech, the data-driven, NN-based speech signal processing performs nearly on par with traditional handcrafted feature extractors. In additional experiments, we also test segment-level feature normalization techniques on the NN-derived features, which improve the results further. However, porting the speech representations derived by a feed-forward NN to an LSTM back-end model indicates that the NN front-end is much less robust than the standard feature extractors. Analysis of the weights in the proposed new layer reveals that the NN prefers both multi-resolution and modulation spectrum representations.
