On Training the Recurrent Neural Network Encoder-Decoder for Large Vocabulary End-to-end Speech Recognition
- Submitted by:
- Liang Lu
- Last updated:
- 18 March 2016 - 12:52pm
- Document Type:
- Presentation Slides
- Document Year:
- 2016
- Presenters:
- Liang Lu
Recently, there has been an increasing interest in end-to-end speech
recognition using neural networks, with no reliance on hidden
Markov models (HMMs) for sequence modelling as in the standard
hybrid framework. The recurrent neural network (RNN) encoder-decoder
is such a model, performing sequence-to-sequence mapping
without any predefined alignment. This model first transforms the
input sequence into a fixed-length vector representation, from which
the decoder recovers the output sequence. In this paper, we extend
our previous work on this model for large vocabulary end-to-end
speech recognition. We first present a more effective stochastic gradient
descent (SGD) learning rate schedule that can significantly improve
the recognition accuracy. We then extend the decoder with
long memory by introducing another recurrent layer that performs
implicit language modelling. Finally, we demonstrate that using
multiple recurrent layers in the encoder can reduce the word error
rate. Our experiments were carried out on the Switchboard corpus
using a training set of around 300 hours of transcribed audio
data, and we have achieved significantly higher recognition accuracy,
thereby reducing the gap to the hybrid baseline.
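The encode-then-decode mapping the abstract describes — an encoder RNN compressing the input sequence into a single fixed-length vector, and a decoder RNN generating output symbols from it — can be sketched as below. This is a minimal, untrained NumPy illustration only, not the paper's model: all dimensions, parameter names, and the greedy decoding loop are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(x, h, Wx, Wh, b):
    # One step of a simple (Elman) tanh RNN cell.
    return np.tanh(x @ Wx + h @ Wh + b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative dimensions (not from the paper).
d_in, d_h, n_out, T_in, T_out = 13, 32, 10, 8, 4

# Encoder parameters.
We_x = rng.normal(0, 0.1, (d_in, d_h))
We_h = rng.normal(0, 0.1, (d_h, d_h))
be = np.zeros(d_h)

# Decoder parameters (the decoder consumes its previous output label).
Wd_x = rng.normal(0, 0.1, (n_out, d_h))
Wd_h = rng.normal(0, 0.1, (d_h, d_h))
bd = np.zeros(d_h)
Wo = rng.normal(0, 0.1, (d_h, n_out))

# --- Encoder: compress the whole input sequence into one vector ---
x_seq = rng.normal(size=(T_in, d_in))   # e.g. acoustic feature frames
h = np.zeros(d_h)
for x in x_seq:
    h = rnn_step(x, h, We_x, We_h, be)
context = h                             # fixed-length summary of the input

# --- Decoder: recover an output sequence from the context vector ---
y_prev = np.zeros(n_out)                # start symbol (all-zero one-hot)
s = context                             # initialise decoder state
outputs = []
for _ in range(T_out):
    s = rnn_step(y_prev, s, Wd_x, Wd_h, bd)
    p = softmax(s @ Wo)                 # distribution over output labels
    y = int(p.argmax())                 # greedy decoding, for illustration
    outputs.append(y)
    y_prev = np.eye(n_out)[y]

print(outputs)
```

Feeding the decoder its own previous output is what gives it the "implicit language modelling" role mentioned in the abstract; the paper's extension adds a further recurrent layer in the decoder to strengthen that memory.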