IEEE ICASSP 2023, the IEEE International Conference on Acoustics, Speech and Signal Processing, is the world's largest and most comprehensive technical conference focused on signal processing and its applications. The ICASSP 2023 conference will feature world-class presentations by internationally renowned speakers and cutting-edge session topics, and will provide a fantastic opportunity to network with like-minded professionals from around the world.

Passive acoustic monitoring (PAM) is a non-intrusive approach to studying the underwater behaviors of vocalizing marine organisms that would otherwise remain unexplored. In this paper, we propose a data processing chain that can detect and track multiple whales in 3-D from passively recorded underwater acoustic signals. In particular, time-difference-of-arrival (TDOA) measurements of echolocation clicks are extracted from a volumetric hydrophone array's acoustic data by using a noise-whitening cross-correlation.
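A common way to realize a noise-whitening cross-correlation for TDOA extraction is the phase-transform (PHAT) weighting, which normalizes away the magnitude spectrum so only phase alignment drives the peak. The sketch below is a generic GCC-PHAT estimator for two hydrophone channels; the function name and parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the TDOA (in seconds) between two channels using a
    phase-transform (PHAT) weighted cross-correlation, a standard
    noise-whitening variant. Positive output means `sig` lags `ref`."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # Whitening step: keep only the phase, discard the magnitude.
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-center so index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)
```

With TDOAs from several hydrophone pairs of a volumetric array, the 3-D source position can then be triangulated.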


Automatic Synthesizer Programming is the task of recovering, from an audio signal generated by a virtual instrument, the synthesizer parameters that would reproduce that signal. Previous work has addressed this task only for a single virtual instrument at a time. In this paper, we expand the current literature by exploring approaches to automatic synthesizer programming for multiple virtual instruments. We present two different approaches to multi-task automatic synthesizer programming and find that the joint-decoder approach performs best.


Generative deep neural networks are widely used for speech synthesis, but most existing models directly generate waveforms or spectral outputs. Humans, however, produce speech by controlling articulators, so speech sounds arise from the physical properties of sound propagation. We introduce the Articulatory Generator, a new unsupervised generative model of speech production/synthesis within the Generative Adversarial Network paradigm.


In this work we present Low-rank Deconvolution, a powerful framework for low-level feature-map learning for efficient signal representation, with application to signal recovery. Its multi-linear-algebra formulation inherits properties from both convolutional sparse coding and low-rank approximation methods: in this setting, signals are decomposed into a set of filters convolved with a set of low-rank tensors. We show its advantages by learning compressed video representations and solving image in-painting problems.
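The decomposition described above, a signal approximated as a sum of filters convolved with low-rank coefficient tensors, can be sketched for the 2-D, rank-1 case as follows. All names and shapes here are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def conv2d_full(x, k):
    """Plain 2-D full convolution (no external dependencies)."""
    H, W = x.shape
    h, w = k.shape
    out = np.zeros((H + h - 1, W + w - 1))
    for i in range(h):
        for j in range(w):
            out[i:i + H, j:j + W] += k[i, j] * x
    return out

def lrd_reconstruct(filters, factors):
    """Reconstruct a 2-D signal as a sum of filters convolved with
    rank-1 coefficient maps Z_k = u_k v_k^T, the basic structure behind
    a low-rank deconvolution representation (sketch, assumed shapes)."""
    out = None
    for f, (u, v) in zip(filters, factors):
        Z = np.outer(u, v)           # low-rank (here rank-1) tensor
        term = conv2d_full(Z, f)     # filter convolved with the map
        out = term if out is None else out + term
    return out
```

Learning the filters and factors (e.g. by alternating minimization) is the hard part; the sketch only shows the forward synthesis model, which is what makes recovery tasks like in-painting tractable.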


Audio-visual learning helps to comprehensively understand the world by fusing complementary information from multiple modalities. However, recent studies show that the imbalanced optimization of uni-modal encoders in a joint-learning model is a bottleneck to enhancing the model's performance. We further find that current imbalance-mitigating methods fail on some fine-grained audio-visual tasks, which demand a more distinguishable feature distribution.


This work proposes a model for continual learning on tasks involving temporal sequences, specifically human motions. It improves on a recently proposed brain-inspired replay model (BI-R) by building a biologically inspired conditional temporal variational autoencoder (BI-CTVAE), which instantiates a latent mixture of Gaussians for class representation. We investigate a novel continual-learning-to-generate (CL2Gen) scenario in which the model generates motion sequences of different classes. The generative accuracy of the model is tested over a set of tasks.


Echo and background noise are the major obstacles to today's user sound experience on devices such as speakerphones and video bars. We propose a real-time, perceptually motivated, neural-network-based method for echo control and noise reduction. The demonstrated method relies on a linear acoustic echo canceller (LAEC) combined with a neural network post-filter that incorporates perceptual mapping in both the feature representation and the loss function. The LAEC stage takes the microphone and far-end signals as inputs, while the post-filter takes the LAEC output, the microphone signal, and the echo estimate.
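A linear AEC stage of this kind is classically built around a normalized-LMS (NLMS) adaptive filter that estimates the echo path from the far-end signal. The sketch below is a generic toy canceller, not the demonstrated system; its two outputs correspond to the residual and echo estimate that would feed a post-filter, and all names and parameter values are assumptions.

```python
import numpy as np

def nlms_aec(mic, far_end, filt_len=128, mu=0.5, eps=1e-8):
    """Toy linear acoustic echo canceller: an NLMS adaptive filter
    that models the echo path from the far-end signal and subtracts
    the predicted echo from the microphone signal."""
    w = np.zeros(filt_len)            # adaptive echo-path estimate
    out = np.zeros_like(mic)          # residual after linear cancellation
    echo_hat = np.zeros_like(mic)     # predicted echo (a post-filter input)
    x_buf = np.zeros(filt_len)        # recent far-end samples, newest first
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        y = w @ x_buf                 # predicted echo sample
        e = mic[n] - y                # cancellation error
        w += mu * e * x_buf / (x_buf @ x_buf + eps)  # NLMS update
        out[n] = e
        echo_hat[n] = y
    return out, echo_hat
```

A linear stage like this removes the bulk of the echo but leaves nonlinear residual echo and noise, which is exactly what the neural post-filter is there to suppress.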


The fusion of multiple probability densities has important applications in many fields, including, for example, multi-sensor signal processing, robotics, and smart environments. In this paper, we demonstrate that deep learning-based methods can be used to fuse multi-object densities. Given a scenario with several sensors with possibly different fields of view, tracking is performed locally in each sensor by a tracker, which produces random-finite-set multi-object densities.


The paper introduces the jazznet Dataset, a dataset of fundamental jazz piano music patterns for developing machine learning (ML) algorithms in music information retrieval (MIR). The dataset contains 162,520 labeled piano patterns, including chords, arpeggios, scales, and chord progressions with their inversions, resulting in more than 26k hours of audio and a total size of 95 GB.