
The paper introduces a hierarchy-aware loss function for a deep neural network applied to an audio event detection task with a bi-level, tree-structured label space. The goal is not only to improve audio event detection performance at all levels of the label hierarchy, but also to produce better audio embeddings. We exploit the label tree structure and preserve that information in the hierarchy-aware loss function. Two different loss functions are employed separately. First, a triplet loss with probabilistic multi-level batch mining is introduced.
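To make the idea of a triplet loss over a two-level label tree concrete, here is a minimal sketch. It is not the paper's method: the standard hinge form of the triplet loss is used, and the mining rule (choose the fine or coarse level at random with probability `p_fine`, then pick a positive and a negative relative to that level) is an assumption made for illustration. The names `triplet_loss`, `sample_triplet`, and `p_fine` are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def same_class(label_a, label_b, level):
    """level 0 compares coarse labels; level 1 compares (coarse, fine) pairs."""
    if level == 0:
        return label_a[0] == label_b[0]
    return label_a == label_b


def sample_triplet(labels, anchor_idx, p_fine=0.5, rng=random):
    """Pick (anchor, positive, negative) indices at a randomly chosen level.

    labels is a list of (coarse, fine) tuples. The level-selection rule is an
    illustrative assumption, not taken from the paper. Returns None when no
    valid positive or negative exists at the chosen level.
    """
    level = 1 if rng.random() < p_fine else 0
    anchor_label = labels[anchor_idx]
    positives = [j for j, lab in enumerate(labels)
                 if j != anchor_idx and same_class(anchor_label, lab, level)]
    negatives = [j for j, lab in enumerate(labels)
                 if not same_class(anchor_label, lab, level)]
    if not positives or not negatives:
        return None
    return anchor_idx, rng.choice(positives), rng.choice(negatives)
```

With `p_fine=0.0` every triplet separates coarse classes only; with `p_fine=1.0` samples that share a coarse parent but differ in fine label can also serve as negatives.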


Cloud-based speech recognition APIs give developers and enterprises an easy way to add speech-enabled features to their applications. However, sending audio that contains personal or company-internal information to the cloud raises privacy and security concerns. The recognition results generated in the cloud may also reveal sensitive information. This paper proposes a deep polynomial network (DPN) that can be applied to encrypted speech as an acoustic model. It allows clients to send their data to the cloud in encrypted form so that the data remains confidential, while the DPN can still make frame-level predictions over the encrypted speech and return them in encrypted form. One good property of the DPN is that it can be trained on unencrypted speech features in the traditional way. To keep the raw audio and recognition results away from the cloud, a cloud-local joint decoding framework is also proposed. We demonstrate the effectiveness of the model and framework on the Switchboard and Cortana voice assistant tasks, with a small performance degradation and latency increase compared with traditional cloud-based DNNs.
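As a rough illustration of why a polynomial network suits encrypted inputs, the sketch below builds a tiny feed-forward network out of additions and multiplications only. It rests on two assumptions that the abstract does not state: that the encryption scheme supports just those two operations on ciphertexts, and that squaring is used as the activation. The code runs on plain numbers; `dense`, `square`, and `poly_net` are hypothetical names, not the paper's API.

```python
def dense(x, weights, bias):
    """Affine layer: each output is a weighted sum of inputs plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def square(v):
    """Polynomial activation (assumed here): element-wise z * z."""
    return [z * z for z in v]


def poly_net(x, layers):
    """Forward pass using only + and *.

    layers is a list of (weights, bias) pairs. The square activation is
    applied after every layer except the last, so the whole network is a
    polynomial in the input features.
    """
    for i, (weights, bias) in enumerate(layers):
        x = dense(x, weights, bias)
        if i < len(layers) - 1:
            x = square(x)
    return x
```

Because no comparison, exponential, or division appears in the forward pass, the same arithmetic could in principle be evaluated on ciphertexts by a scheme that supports addition and multiplication, while training still happens on unencrypted features.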


In this paper, we propose speaker characterization using speaker embeddings from time-delay neural networks combined with long short-term memory neural networks (TDNN-LSTM). Three types of front-end feature extraction are investigated to find good features for speaker embedding. Three kinds of data augmentation are used to increase the amount and diversity of the training data. The proposed methods are evaluated on the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) tasks.


360-degree cameras have recently become popular because they can capture the whole 360-degree scene, and a large number of related applications have been springing up. In this paper, we propose a deep-learning-based object detector that can be applied directly to 360-degree images. The proposed detector is based on modifications of the Faster R-CNN model. Three modification schemes are proposed: (1) distortion data augmentation, (2) multi-kernel layers that improve accuracy when detecting distorted objects, and (3) adding position information to the model so that it can learn spatial information.


Lattice decoders constructed with neural networks are presented. Firstly, we show how the fundamental parallelotope is used as a compact set for the approximation by a neural lattice decoder. Secondly, we introduce the notion of Voronoi reduced lattice basis. As a consequence, a first optimal neural lattice decoder is built from Boolean equations and the facets of the Voronoi cell. This decoder needs no learning. Finally, we present two neural decoders with learning. It is shown
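The role of the fundamental parallelotope as a compact input set can be illustrated with a small two-dimensional sketch. Every point of the plane is the sum of a lattice point and a remainder inside the parallelotope spanned by the basis vectors, so a decoder only has to handle remainders. The function names and the example basis are hypothetical, and this shows only the reduction step, not any of the decoders from the paper.

```python
import math


def coords_in_basis(y, b1, b2):
    """Solve y = a1 * b1 + a2 * b2 for real (a1, a2) using Cramer's rule."""
    det = b1[0] * b2[1] - b1[1] * b2[0]
    if det == 0:
        raise ValueError("basis vectors are linearly dependent")
    a1 = (y[0] * b2[1] - y[1] * b2[0]) / det
    a2 = (b1[0] * y[1] - b1[1] * y[0]) / det
    return a1, a2


def reduce_to_parallelotope(y, b1, b2):
    """Split y into a lattice point and a remainder in the parallelotope.

    Returns (k, r): k holds the integer coefficients of the lattice point
    k[0] * b1 + k[1] * b2, and r = y minus that lattice point, which has
    basis coordinates in [0, 1) up to floating-point rounding.
    """
    a1, a2 = coords_in_basis(y, b1, b2)
    k1, k2 = math.floor(a1), math.floor(a2)
    r = (y[0] - k1 * b1[0] - k2 * b2[0],
         y[1] - k1 * b1[1] - k2 * b2[1])
    return (k1, k2), r
```

For the basis `b1 = (2, 0)`, `b2 = (1, 2)` and the point `(5.5, 3.0)`, the basis coordinates are `(2.0, 1.5)`, so the lattice point is `2 * b1 + 1 * b2 = (5, 2)` and the remainder is `(0.5, 1.0)`.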


Here, a novel approach is proposed to generate age progression (i.e., future looks) and regression (i.e., previous looks) of persons based on their face images. The proposed method addresses face aging as an unsupervised image-to-image translation problem where the goal is to translate a face image belonging to an age class to an image of

