Sorry, you need to enable JavaScript to visit this website.

Language models (LM) have been widely deployed in modern ASR systems. The LM is often trained by minimizing its perplexity on speech transcript. However, few studies try to discriminate a "gold" reference against inferior hypotheses. In this work, we propose a large margin language model (LMLM). LMLM is a general framework that enforces an LM to assign a higher score to the "gold" reference, and a lower one to the inferior hypothesis. The general framework is applied to three pretrained LM architectures: left-to-right LSTM, transformer encoder, and transformer decoder.


Neural network language model (NNLM) is an essential component of industrial ASR systems. One important challenge of training an NNLM is to leverage between scaling the learning process and handling big data. Conventional approaches such as block momentum provides a blockwise model update filtering (BMUF) process and achieves almost linear speedups with no performance degradation for speech recognition.


Building multilingual and crosslingual models help bring different languages together in a language universal space. It allows models to share parameters and transfer knowledge across languages, enabling faster and better adaptation to a new language. These approaches are particularly useful for low resource languages. In this paper, we propose a phoneme-level language model that can be used multilingually and for crosslingual adaptation to a target language.


The subjectivity and variability exist in the human emotion perception differs from person to person. In this work, we propose a framework that models the majority of emotion annotation integrated with modeling of subjectivity in improving emotion categorization performances. Our method achieves a promising accuracy of 61.48% on a four-class emotion recognition task. To the best of our knowledge, while there are many works in studying annotator subjectivity, this is one of the first works that have explicitly modeled jointly the consensus with individuality in emotion perception to demonstrate its improvement in classifying emotion in a benchmark corpus.

In our immediate future work, we will evaluate the proposed framework on other public large-scaled emotional database with multiple annotators, e.g., NNIME, to further justify its robustness. We also plan to extend our framework to includ other behavior attributes, e.g., lexical content and body movements. Furthermore, the subjective nature of emotion perception has been shown to be related to the rater personality, a joint modeling of rater’s characteristics with his/her subjectivity in emotion perception may lead to further advancement in robust emotion recognition.


Recurrent neural network language models (RNNLMs) have shown superior performance across a range of speech recognition tasks. At the heart of all RNNLMs, the activation functions play a vital role to control the information flows and tracking longer history contexts that are useful for predicting the following words. Long short-term memory (LSTM) units are well known for such ability and thus widely used in current RNNLMs. However, the deterministic parameter estimates in LSTM RNNLMs are prone to over-fitting and poor generalization when given limited training data.


In this paper we describe an extension of the Kaldi software toolkit to support neural-based language modeling, intended for use in automatic speech recognition (ASR) and related tasks. We combine the use of subword features (letter ngrams) and one-hot encoding of frequent words so that the models can handle large vocabularies containing infrequent words. We propose a new objective function that allows for training of unnormalized probabilities. An importance sampling based method is supported to speed up training when the vocabulary is large.


In this paper, we present a pruning technique for maximum en- tropy (MaxEnt) language models. It is based on computing the exact entropy loss when removing each feature from the model, and it ex- plicitly supports backoff features by replacing each removed feature with its backoff. The algorithm computes the loss on the training data, so it is not restricted to models with n-gram like features, al- lowing models with any feature, including long range skips, triggers, and contextual features such as device location.