Connectionist temporal classification (CTC) is a popular sequence prediction approach for automatic speech recognition that is typically used with models based on recurrent neural networks (RNNs). We explore whether deep convolutional neural networks (CNNs) can be used effectively instead of RNNs as the "encoder" in CTC. CNNs lack an explicit representation of the entire sequence, but have the advantage that they are much faster to train. We present an exploration of CNNs as encoders for CTC models, in the context of character-based (lexicon-free) automatic speech recognition.
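
As a rough illustration of how a CTC model's frame-level outputs become a character sequence, the sketch below applies the standard CTC collapsing rule (merge repeated labels, then drop blanks) after picking the most probable label per frame. It is not taken from the paper: the blank symbol `_`, the function names, and the toy alphabet are assumptions made for this example, and the encoder (CNN or RNN) that would produce the per-frame probabilities is left out.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol, chosen only for this illustration


def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the CTC collapsing rule: merge repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


def greedy_decode(frame_probs, alphabet, blank=BLANK):
    """Pick the most probable label in each frame, then collapse the path.

    frame_probs: one list of probabilities per frame, aligned with `alphabet`.
    """
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    return ctc_collapse(best_path, blank)
```

For example, `ctc_collapse("hh_e_ll_l_oo")` yields `"hello"`: the blank between the two runs of `l` is what allows a doubled character to survive the merge step.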

For speech recognition in voice search, it would be
advantageous to take into account the application information associated
with speech queries. In practice, however, the vast majority
of queries lack such annotations, which makes it challenging to
train domain-specific language models (LMs). To obtain robust domain
LMs, an LM pre-trained on general
data is typically adapted to specific domains. We propose four adaptation
schemes to improve the domain performance of long short-term

This paper explores the possibility of using grapheme-based word and sub-word models in the task of spoken term detection (STD). Using grapheme models eliminates the need for expert-prepared pronunciation lexicons (which are often far from complete) and/or trainable grapheme-to-phoneme (G2P) algorithms, which are frequently rather inaccurate, especially for rare words (words coming from a different language). Moreover, the G2P conversion of the search terms, which needs to be performed on-line, can substantially increase the response time of the STD system.

This paper establishes CTC-based systems for the Chinese Mandarin ASR task. Three levels of output units are explored: characters, context-independent phonemes, and context-dependent phonemes. To make training stable, we propose the Newbob-Trn strategy; furthermore, a blank label prior cost is proposed to improve performance. We also establish the CTC-trained UniLSTM-RC model, which meets the real-time requirement of an online system while bringing a performance gain on the Chinese Mandarin ASR task.

Traditional hybrid DNN-HMM ASR systems for keyword spotting, which model HMM states, are not flexible enough to optimize for a specific language. In this paper, we construct an end-to-end acoustic-model-based ASR system for keyword spotting in Mandarin. The model is built from an LSTM-RNN and trained with the connectionist temporal classification (CTC) objective. The input of the network is a sequence of features, and the output is the probabilities of the initials and finals of Mandarin syllables.

Improved speech recognition performance can often be obtained by combining multiple systems.
Joint decoding, where scores from multiple systems are combined during decoding rather
than combining hypotheses, is one efficient approach to system combination. In standard joint
decoding, the frame log-likelihoods from each system are used as the scores. These scores are then
weighted and summed to yield the final score for a frame. The system combination weights for this
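
The weighted-sum scoring used in standard joint decoding can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: each system is assumed to supply one log-likelihood per frame for the same hypothesis, and the function name, argument layout, and fixed weights are hypothetical.

```python
def combine_frame_scores(system_log_likelihoods, weights):
    """Return one combined score per frame as a weighted sum over systems.

    system_log_likelihoods: one list per system, each holding one
        log-likelihood per frame (assumed layout for this sketch).
    weights: one combination weight per system.
    """
    if len(system_log_likelihoods) != len(weights):
        raise ValueError("need exactly one weight per system")
    n_frames = len(system_log_likelihoods[0])
    if any(len(scores) != n_frames for scores in system_log_likelihoods):
        raise ValueError("all systems must score the same number of frames")
    # Final score for frame t = sum over systems of weight * log-likelihood.
    return [
        sum(w * scores[t] for w, scores in zip(weights, system_log_likelihoods))
        for t in range(n_frames)
    ]
```

With two systems and equal weights, `combine_frame_scores([[-1.0, -2.0], [-3.0, -4.0]], [0.5, 0.5])` averages the log-likelihoods of each frame.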

In the past 5 years, significant advances in Large Vocabulary Speech Recognition (LVSR), Deep Learning (DL) and Spoken Language Understanding (SLU), along with the explosive growth of wireless network bandwidth, have given rise to three compelling Conversational AI agents that are available on Android, iOS and Microsoft smartphones. Conversational AI agents such as Google Now, Apple Siri and Microsoft Cortana are now the most preferred way to perform mobile web search and command and control of the various smartphone apps.
