
Previous research on applying deliberation networks to automatic speech recognition has achieved excellent results. The attention-decoder-based deliberation model typically works as a rescorer to improve first-pass recognition results, and requires the full first-pass hypothesis before second-pass deliberation can begin. In this work, we propose a transducer-based streaming deliberation model. The joint network of a transducer decoder normally receives inputs from the encoder and the prediction network; we propose attention over the first-pass text hypothesis as a third input to the joint network.
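As a rough illustration of the three-input joint network described above, the following sketch (assumed shapes and parameter names, not the authors' implementation) concatenates the encoder output, prediction-network output, and a per-label attention context over the first-pass hypothesis before the usual tanh projection:

```python
import numpy as np

def deliberation_joint(enc, pred, ctx, W, b, W_out, b_out):
    """Sketch of a transducer joint network with a third input: an attention
    context over the first-pass text hypothesis (all names are illustrative).
    enc:  (T, enc_dim)  encoder outputs per frame
    pred: (U, pred_dim) prediction-network outputs per label
    ctx:  (U, ctx_dim)  attention context over the first-pass hypothesis
    Returns logits of shape (T, U, vocab_size)."""
    T, U = enc.shape[0], pred.shape[0]
    # Broadcast the three inputs onto the (T, U) time-label grid.
    e = np.broadcast_to(enc[:, None, :], (T, U, enc.shape[1]))
    p = np.broadcast_to(pred[None, :, :], (T, U, pred.shape[1]))
    c = np.broadcast_to(ctx[None, :, :], (T, U, ctx.shape[1]))
    # Concatenate encoder, prediction, and attention-context features.
    x = np.concatenate([e, p, c], axis=-1)
    h = np.tanh(x @ W + b)        # joint hidden layer, (T, U, joint_dim)
    return h @ W_out + b_out      # output logits, (T, U, vocab_size)
```

The only change relative to a standard transducer joint network is the extra `ctx` operand in the concatenation.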


Most automatic speech recognition (ASR) neural network models are unsuitable for mobile devices because of their large model sizes. The model size must therefore be reduced to fit the limited hardware resources. In this study, we investigate sequence-level knowledge distillation techniques for compressing self-attention ASR models.
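The core idea of sequence-level (as opposed to frame-level) distillation can be sketched as follows: the student is trained to match the teacher's distribution over whole hypotheses from an n-best list, rather than per-frame posteriors. This is a generic formulation, not the specific variant studied in the paper:

```python
import numpy as np

def seq_kd_loss(teacher_logp, student_logp):
    """Sketch of a sequence-level knowledge distillation loss.
    teacher_logp, student_logp: log P(y|x) under teacher and student for the
    same n-best list of hypotheses y (arrays of equal length).
    The teacher scores are renormalized over the n-best list to form a
    distribution; the loss is the cross-entropy of the student under it."""
    t = np.exp(teacher_logp - np.logaddexp.reduce(teacher_logp))
    return -np.sum(t * student_logp)
```

Minimizing this loss pushes the student's sequence-level scores toward the teacher's, which is what allows a much smaller student model to retain accuracy.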


This work proposes a new neural network framework that simultaneously ranks multiple hypotheses generated by one or more automatic speech recognition (ASR) engines for a speech utterance. Features fed into the framework include not only those calculated from the ASR output, but also natural language understanding (NLU) related features, such as trigger features capturing long-distance constraints between word/slot pairs and BLSTM features representing intent-sensitive sentence embeddings.
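A minimal sketch of such a ranker (assumed single-hidden-layer architecture; the paper's actual network and feature set may differ) scores each hypothesis from its concatenated ASR and NLU feature vector and sorts by score:

```python
import numpy as np

def rank_hypotheses(features, W1, b1, w2, b2):
    """Sketch of a hypothesis-ranking network. Each row of `features` is a
    per-hypothesis feature vector combining ASR-derived features (scores,
    etc.) with NLU-derived features (e.g. trigger features, BLSTM sentence
    embeddings). Returns hypothesis indices sorted best-first."""
    h = np.maximum(0.0, features @ W1 + b1)  # ReLU hidden layer
    scores = h @ w2 + b2                     # one scalar score per hypothesis
    return np.argsort(-scores)               # best-scoring hypothesis first
```

In a real system the weights would be trained with a ranking or classification loss over hypotheses whose reference-level correctness is known.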


Confidences are integral to ASR systems and are applied to data selection, adaptation, hypothesis ranking, arbitration, etc. A hybrid ASR system is inherently a match between pronunciations and AM+LM evidence, but current confidence features lack pronunciation information. We develop pronunciation embeddings to represent and factorize the acoustic score in relevant bases, and demonstrate an 8-10% relative reduction in false alarms (FA) on large-scale tasks. We generalize to standard NLP embeddings such as GloVe, and show a 16% relative FA reduction in combination with GloVe.
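A simple way to picture how embedding features augment a confidence model (an illustrative setup, not the paper's exact classifier) is a logistic regression whose input concatenates standard confidence features with pronunciation and word embeddings:

```python
import numpy as np

def word_confidence(conf_feats, pron_emb, word_emb, w, b):
    """Sketch of an embedding-augmented confidence classifier.
    conf_feats: standard confidence features (e.g. normalized scores)
    pron_emb:   pronunciation embedding for the hypothesized word
    word_emb:   NLP word embedding (e.g. GloVe) for the hypothesized word
    Returns P(word is correct) from a logistic regression over the
    concatenated feature vector (weights `w`, bias `b` are assumed trained)."""
    x = np.concatenate([conf_feats, pron_emb, word_emb])
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))
```

Lowering false alarms then corresponds to the classifier assigning lower confidence to incorrectly recognized words once pronunciation evidence is available to it.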


Finding visual features and suitable models for lipreading tasks that are more complex than a well-constrained vocabulary has proven challenging. This paper explores state-of-the-art Deep Neural Network architectures for lipreading based on a Sequence-to-Sequence Recurrent Neural Network. We report results for both hand-crafted and 2D/3D Convolutional Neural Network visual front-ends, online monotonic attention, and a joint Connectionist Temporal Classification/Sequence-to-Sequence loss.


This paper presents methods to accelerate recurrent neural network based language models (RNNLMs) for online speech recognition systems.
First, lossy compression of the past hidden-layer outputs (the history vector), combined with caching, is introduced to reduce the number of LM queries.
Next, RNNLM computations are deployed in a CPU-GPU hybrid manner, computing each layer of the model on the more advantageous platform.
The overhead added by data exchanges between the CPU and GPU is compensated for through a frame-wise batching strategy.
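The caching idea in the first step can be sketched as follows (our reading of the abstract; the quantization scheme and step size are assumptions): quantizing the history vector before using it as a cache key makes near-identical histories collide, so one LM query serves many lookups:

```python
import numpy as np

class RNNLMCache:
    """Sketch of RNNLM query caching with lossy history compression.
    `lm_step(word, hidden) -> (logp, new_hidden)` is the expensive RNNLM call;
    the cache key quantizes the hidden state so that sufficiently similar
    histories share one cached result. `quant` (step size) is an assumption."""

    def __init__(self, lm_step, quant=0.5):
        self.lm_step = lm_step
        self.quant = quant
        self.cache = {}
        self.hits = 0

    def _key(self, word, hidden):
        # Lossy compression: round each hidden dimension to a coarse grid.
        q = np.round(hidden / self.quant).astype(int)
        return (word, q.tobytes())

    def score(self, word, hidden):
        k = self._key(word, hidden)
        if k in self.cache:
            self.hits += 1          # cache hit: no LM query needed
            return self.cache[k]
        out = self.lm_step(word, hidden)
        self.cache[k] = out
        return out
```

Coarser quantization trades LM accuracy for a higher hit rate, i.e. fewer RNNLM evaluations during decoding.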


The paper presents a new approach to extracting useful information from out-of-vocabulary (OOV) speech regions in ASR system output. The system makes use of a hybrid decoding network with both words and sub-word units. In the decoded lattices, candidates for OOV regions are identified.
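The identification step can be pictured as follows (a purely illustrative sketch: the sub-word marker convention and the grouping rule are our assumptions, not the paper's method). In a hybrid word/sub-word decoding output, runs of consecutive sub-word units signal regions the word vocabulary could not cover:

```python
def find_oov_regions(tokens, subword_prefix="+"):
    """Sketch: group consecutive sub-word units (marked here with a
    hypothetical prefix) in a decoded token sequence into candidate
    OOV regions, returned as half-open (start, end) index pairs."""
    regions, start = [], None
    for i, tok in enumerate(tokens):
        if tok.startswith(subword_prefix):
            if start is None:
                start = i          # open a new candidate region
        elif start is not None:
            regions.append((start, i))   # close the region at a word token
            start = None
    if start is not None:
        regions.append((start, len(tokens)))
    return regions
```

In the lattice setting described by the paper, the same grouping would be applied to lattice paths rather than to a single one-best token sequence.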


Lattice rescoring is a common approach to exploiting recurrent neural language models in ASR: a word lattice is generated by first-pass decoding and then rescored with a neural model, usually with an n-gram approximation method to limit the search space. In this work, we describe a pruned lattice-rescoring algorithm for ASR that improves on the n-gram approximation method. The pruned algorithm further limits the search space and uses heuristic search to pick better histories when expanding the lattice.
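The n-gram approximation and the pruning it enables can be sketched as follows (our illustration under simplifying assumptions, not the authors' exact algorithm): hypotheses reaching a lattice node are merged whenever their last n-1 words agree, keeping only the better-scoring history, and at most `beam` merged states survive per node:

```python
def rescore_lattice(arcs, lm_score, n=3, beam=5):
    """Sketch of pruned lattice rescoring with an n-gram approximation.
    arcs: dict node -> list of (word, next_node, acoustic_score); node ids are
          assumed topologically ordered, with node 0 as the start state.
    lm_score(history, word): log LM probability (here it stands in for the
          neural LM; a real rescorer would also carry the LM's hidden state).
    Returns dict node -> {suffix: (total_score, history)}."""
    states = {0: {(): (0.0, [])}}
    for node in sorted(arcs):
        for suffix, (score, hist) in states.get(node, {}).items():
            for word, nxt, ac in arcs[node]:
                new_hist = hist + [word]
                new_suffix = tuple(new_hist[-(n - 1):])
                new_score = score + ac + lm_score(tuple(hist), word)
                slot = states.setdefault(nxt, {})
                # n-gram approximation: merge hypotheses sharing the last
                # n-1 words, keeping the better-scoring full history.
                if new_suffix not in slot or new_score > slot[new_suffix][0]:
                    slot[new_suffix] = (new_score, new_hist)
        # Pruning: keep only the top `beam` merged states per node.
        for nxt in list(states):
            if nxt > node and len(states[nxt]) > beam:
                top = sorted(states[nxt].items(),
                             key=lambda kv: -kv[1][0])[:beam]
                states[nxt] = dict(top)
    return states
```

Keeping the best-scoring history per merged state is the "pick better histories" heuristic; tightening `beam` and lowering `n` both shrink the search space at some accuracy cost.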