We present a bidirectional unsupervised model pre-training (UPT) method and apply it to children’s automatic speech recognition (ASR). An obstacle to improving child ASR is the scarcity of child speech databases. A common approach to alleviating this problem is model pre-training on adult speech data. Pre-training can be done with supervised (SPT) or unsupervised methods, depending on the availability of annotations, and SPT typically performs better. In this paper, we focus on UPT to address situations in which the pre-training data are unlabeled.

Deep Neural Network-based speech recognition systems are widely used in most speech processing applications. To achieve better robustness and accuracy, these networks are built with millions of parameters, making them storage- and compute-intensive. In this paper, we propose Bin & Quant (B&Q), a compression technique with which we reduced the Deep Speech 2 speech recognition model size by a factor of 7 for a negligible loss in accuracy.
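The core idea behind this style of compression can be illustrated with a small sketch. The code below is a hypothetical, simplified take on binning-plus-quantization (not the paper's actual algorithm): weights are grouped into uniform bins and each weight is replaced by its bin's centroid, so only a short codebook plus low-bit bin indices need to be stored.

```python
# Hypothetical sketch of weight binning and quantization in the spirit of
# Bin & Quant: group weights into uniform bins, share one centroid per bin.
def bin_and_quant(weights, n_bins=4):
    """Quantize a flat list of weights down to n_bins shared values."""
    lo, hi = min(weights), max(weights)
    width = (hi - lo) / n_bins or 1.0
    # Assign each weight to a bin (clamp the maximum into the last bin).
    indices = [min(int((w - lo) / width), n_bins - 1) for w in weights]
    # Codebook entry = mean of the weights that fell into that bin.
    codebook = []
    for b in range(n_bins):
        members = [w for w, i in zip(weights, indices) if i == b]
        centre = lo + (b + 0.5) * width
        codebook.append(sum(members) / len(members) if members else centre)
    return indices, codebook

def dequantize(indices, codebook):
    """Reconstruct approximate weights from indices and codebook."""
    return [codebook[i] for i in indices]

weights = [0.1, 0.12, 0.5, 0.52, -0.3, -0.28, 0.9, 0.88]
indices, codebook = bin_and_quant(weights, n_bins=4)
approx = dequantize(indices, codebook)
```

With 4 bins, each weight is stored as a 2-bit index, which is where the storage reduction comes from; the real method's compression ratio also depends on how indices and codebooks are encoded.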

We propose a method for zero-resource domain adaptation of DNN acoustic models, for use in low-resource situations where the only in-language training data available may be poorly matched to the intended target domain. Our method uses a multi-lingual model in which several DNN layers are shared between languages. This architecture enables domain adaptation transforms learned for one well-resourced language to be applied to an entirely different low-resource language.
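A minimal sketch of the shared-layer idea, under assumptions of ours rather than the paper's exact design: two languages share hidden layers and keep language-specific output layers, and a linear input transform estimated on well-resourced target-domain data is reused unchanged for the low-resource language, because it acts below the shared layers. All dimensions and the identity-like transform here are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Shared hidden layers (jointly trained across languages; random stand-ins).
W_shared = [rng.standard_normal((40, 64)), rng.standard_normal((64, 64))]
# Language-specific output layers (different phone inventories).
W_out = {"well_resourced": rng.standard_normal((64, 100)),
         "low_resource": rng.standard_normal((64, 42))}
# Domain-adaptation input transform estimated on the well-resourced language
# (a stand-in for a learned feature-space transform).
A_domain = np.eye(40) * 0.9

def forward(features, language, adapt=False):
    h = features @ A_domain if adapt else features
    for W in W_shared:
        h = relu(h @ W)
    return h @ W_out[language]  # logits over that language's phones

x = rng.standard_normal((5, 40))  # 5 frames of 40-dim acoustic features
# The transform plugs in unchanged for the low-resource language.
out = forward(x, "low_resource", adapt=True)
```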

We present a novel lipreading system that improves on the task of speaker-independent word recognition by decoupling motion and content dynamics. We achieve this with a deep learning architecture that uses two distinct pipelines to process motion and content and subsequently merges them, yielding an end-to-end trainable system that fuses independently learned representations. We obtain an average relative word accuracy improvement of ≈6.8% on unseen speakers and of ≈3.3% on known speakers, with respect to a baseline that uses a standard architecture.
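The two-pipeline fusion can be sketched in miniature. This is an illustrative toy, not the paper's network: frame differences stand in for the motion stream, raw frame features for the content stream, each passes through its own (random, untrained) encoder, and the representations are merged by concatenation before a fusion layer produces word scores.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def encoder(x, W):
    """Stand-in for a learned per-pipeline encoder."""
    return relu(x @ W)

frames = rng.standard_normal((10, 256))   # 10 video frames, 256-dim features
motion = np.diff(frames, axis=0)          # frame differences ≈ motion stream
content = frames[:-1]                     # appearance ≈ content stream

W_motion = rng.standard_normal((256, 64))
W_content = rng.standard_normal((256, 64))
W_fuse = rng.standard_normal((128, 500))  # hypothetical 500-word vocabulary

# Merge the independently computed representations, then classify.
fused = np.concatenate([encoder(motion, W_motion),
                        encoder(content, W_content)], axis=1)
word_scores = relu(fused @ W_fuse).mean(axis=0)  # pooled over time
```

Late fusion of this kind lets each pipeline specialize before the merge, which is the design choice the abstract credits for the accuracy gains.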

Most automatic speech recognition (ASR) neural network models are not suitable for mobile devices due to their large model sizes. The model size must therefore be reduced to fit within limited hardware resources. In this study, we investigate sequence-level knowledge distillation techniques for compressing self-attention ASR models.
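The distinction that matters here is sequence-level versus frame-level distillation: rather than matching the teacher's per-frame posteriors, the student is trained on whole hypotheses decoded by the teacher. A minimal, hypothetical sketch of the data-preparation step (the `teacher_decode` stand-in replaces an actual beam search over a trained self-attention teacher):

```python
# Hypothetical sketch of sequence-level knowledge distillation targets:
# the teacher decodes each utterance, and its best hypothesis becomes the
# pseudo-label sequence the student is trained on.
def teacher_decode(utterance):
    # Stand-in for beam search with a large self-attention teacher model.
    return utterance.upper().split()

def distillation_targets(utterances):
    """Build (input, pseudo-label sequence) pairs for student training."""
    return [(u, teacher_decode(u)) for u in utterances]

pairs = distillation_targets(["hello world", "speech recognition"])
```

The student then optimizes an ordinary sequence loss against these pseudo-labels, which transfers the teacher's sequence-level behaviour without requiring its per-frame distributions at training time.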

Mismatched-crowdsourcing-based probabilistic human transcription has recently been proposed for training and adapting acoustic models for zero-resourced languages, where no native transcriptions are available. This paper describes a machine-transcription-based phone recognition system for recognizing zero-resourced languages and compares it with baseline systems using MAP adaptation and semi-supervised self-training.

Deep learning has significantly advanced the state of the art in speech recognition in the past few years. However, compared to conventional Gaussian mixture acoustic models, neural network models are usually much larger and are therefore harder to deploy on embedded devices. Previously, we investigated a compact highway deep neural network (HDNN) for acoustic modelling, which is a type of depth-gated feedforward neural network. We have shown that HDNN-based acoustic models can achieve comparable recognition
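The depth-gating that defines a highway layer can be written down compactly: a transform gate T(x) mixes the layer's nonlinear transform H(x) with the carried input, y = T(x) * H(x) + (1 - T(x)) * x. The sketch below is an illustrative single layer with random weights, not the paper's trained model; the negative gate bias is a common initialization that starts the layer close to an identity mapping.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway (depth-gated) layer: gated mix of transform and carry."""
    H = np.tanh(x @ W_h + b_h)     # candidate transform of the input
    T = sigmoid(x @ W_t + b_t)     # transform gate, elementwise in (0, 1)
    return T * H + (1.0 - T) * x   # y = T*H(x) + (1-T)*x

rng = np.random.default_rng(2)
x = rng.standard_normal((3, 32))   # 3 frames, 32-dim hidden representation
W_h = rng.standard_normal((32, 32))
W_t = rng.standard_normal((32, 32))
b_h = np.zeros(32)
b_t = np.full(32, -2.0)            # bias the gate toward carrying the input
y = highway_layer(x, W_h, b_h, W_t, b_t)
```

Because the carry path passes the input through unchanged wherever the gate closes, stacks of such layers stay trainable at depths where plain feedforward networks struggle, which is what allows the HDNN to stay compact.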
