Sorry, you need to enable JavaScript to visit this website.

ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. The ICASSP 2021 conference will feature world-class presentations by internationally renowned speakers, cutting-edge session topics and provide a fantastic opportunity to network with like-minded professionals from around the world. Visit website.

Hypothesis-level combination between multiple models can often yield gains in speech recognition. However, all models in the ensemble are usually restricted to use the same audio segmentation times. This paper proposes to generalise hypothesis-level combination, allowing the use of different audio segmentation times between the models, by splitting and re-joining the hypothesised N-best lists in time. A hypothesis tree method is also proposed to distribute hypothesis posteriors among the constituent words, to facilitate such splitting when per-word scores are not available.

Categories:
7 Views

Due to the wide use of multi-sensor technology, analysis of multiple sets of data is at the heart of many challenging engineering problems. Independent vector analysis (IVA), a recent generalization of independent component analysis (ICA), enables the joint analysis of datasets and extraction of latent sources through the use of a simple yet effective generative model. However, the success of IVA is tied to proper estimation of the probability density function (PDF) of the multivariate latent sources; information that is generally unknown.

Categories:
86 Views

Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word level representation.

Categories:
12 Views

We propose a novel pitch estimation technique called DeepF0, which leverages the available annotated data to directly learns from the raw audio in a data-driven manner. F0 estimation is important in various speech processing and music information retrieval applications. Existing deep learning models for pitch estimations have relatively limited learning capabilities due to their shallow receptive field. The proposed model addresses this issue by extending the receptive field of a network by introducing the dilated convolutional blocks into the network.

Categories:
120 Views

Aerial image classification is challenging for current deep learning models due to the varied geo-spatial object scales and the complicated scene spatial arrangement. Thus, it is necessary to stress the key local feature response from a variety of scales so as to represent discriminative convolutional features. In this paper, we propose a deep multi-scale multiple instance learning (DMSMIL) framework to tackle the above challenges. Firstly, we develop a differential multi-scale dilated convolution feature extractor to exploit the different patterns from different scales.

Categories:
15 Views

Pages