
ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. The ICASSP 2021 conference will feature world-class presentations by internationally renowned speakers and cutting-edge session topics, and will provide an opportunity to network with like-minded professionals from around the world.

Recent research shows that end-to-end ASR systems can recognize overlapped speech from multiple speakers. However, all published works have assumed no latency constraints during inference, which does not hold for most voice assistant interactions. This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T), which has been shown to provide high recognition accuracy in a low-latency online recognition regime.


Reverberation time, T60, directly influences the amount of reverberation
in a signal, and its direct estimation may help with
dereverberation. Traditionally, T60 estimation has been done
using signal processing or probabilistic approaches; only recently
have deep-learning approaches been developed.
Unfortunately, the appropriate loss function for training the
network has not been adequately determined. In this paper,
we propose a composite classification- and regression-based


Nowadays, living environments are characterized by networks of interconnected sensing devices that accomplish different tasks, e.g., video surveillance of an environment by a network of CCTV cameras. A malicious user could gather sensitive details on people’s activities by eavesdropping on the exchanged data packets. To overcome this problem, video streams are protected by encryption systems, but even secured channels may still leak some information.


Diversity smoothing has been widely developed for angle estimation with bistatic multiple-input multiple-output (MIMO) radar in the presence of coherent targets, for which parameter identifiability is an important issue. In this paper, we establish more accurate identifiability conditions by studying the positive definiteness of the smoothed target covariance matrix. The required antenna numbers of the transmit and receive arrays are derived as functions of the target number and target structure. We show that the new results improve upon previous ones and recover them in special cases.


We propose a generalized thinned coprime array by introducing flexible inter-element spacings, of which the conventional array can be seen as a special case. We derive a closed-form expression for the range of consecutive lags as a function of the antenna numbers and inter-element spacings. We show that, after optimization, the proposed array can achieve more consecutive lags than other coprime arrays. In particular, the optimized results also provide the minimum number of antenna pairs with small separation.
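To make the notion of "consecutive lags" concrete, the following minimal sketch computes the difference coarray of a conventional (not the proposed generalized thinned) coprime array and counts how far the lags extend from zero without a gap. The layout used here (N sensors at spacing M plus 2M sensors at spacing N, positions in units of the base spacing) and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
def conventional_coprime_positions(M, N):
    """Sensor positions (in units of the base spacing) of an assumed
    conventional coprime layout: N sensors at spacing M and 2M sensors
    at spacing N, sharing the sensor at position 0."""
    sub_a = {M * n for n in range(N)}
    sub_b = {N * m for m in range(2 * M)}
    return sorted(sub_a | sub_b)


def difference_lags(positions):
    """All pairwise position differences (the difference coarray)."""
    return {a - b for a in positions for b in positions}


def max_consecutive_lag(lags):
    """Largest k such that every lag 0, 1, ..., k is present."""
    k = 0
    while (k + 1) in lags:
        k += 1
    return k


positions = conventional_coprime_positions(2, 3)
lags = difference_lags(positions)
reach = max_consecutive_lag(lags)
print(positions)  # [0, 2, 3, 4, 6, 9]
print(reach)      # 7
```

For M = 2 and N = 3, the six sensors yield every lag from -7 to 7; lag 8 is missing, so the consecutive range stops there. A generalized design with adjustable spacings would be evaluated with the same two helper functions, with only the position generator changed.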


End-to-end acoustic speech recognition has quickly gained widespread popularity and shows promising results in many studies. Specifically, the joint transformer/CTC model provides very good performance in many tasks. However, under noisy and distorted conditions, the performance still degrades notably. While audio-visual speech recognition can significantly improve the recognition rate of end-to-end models in such poor conditions, it is not obvious how best to utilize available information on acoustic and visual signal quality and reliability in these models.


Wireless channels are considered that change over time but remain constant for a certain (coherence) period. This behavior is perfectly captured by block fading channels and affects the performance of the corresponding wireless communication systems. Desired closed-form characterizations of optimal transmission schemes remain unknown in many cases. This paper approaches this issue from a fundamental, algorithmic point of view by studying whether or not it is in principle possible to construct or find such optimal transmission


We investigate a set of techniques for RNN Transducers (RNN-Ts) that were instrumental in lowering the word error rate on three different tasks (Switchboard 300 hours, conversational Spanish 780 hours and conversational Italian 900 hours). The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and the general training recipe. First, we introduce a novel multiplicative integration of the encoder and prediction network vectors in the joint network (as opposed to additive).
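The difference between additive and multiplicative integration in a joint network can be sketched as follows. This is an illustrative assumption about the general form (project both vectors to a common dimension, combine them element-wise, add a bias, apply tanh); the function names, the tanh nonlinearity, and the toy weights are hypothetical and are not taken from the paper.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def joint_hidden(enc, pred, W_enc, W_pred, bias, multiplicative=True):
    """Hidden activation of a toy joint network.

    enc, pred: encoder and prediction network vectors.
    W_enc, W_pred: projections to a shared hidden dimension (assumed).
    multiplicative=True combines the projections by element-wise
    product; False uses the element-wise sum.
    """
    a = matvec(W_enc, enc)
    b = matvec(W_pred, pred)
    if multiplicative:
        combined = [ai * bi for ai, bi in zip(a, b)]
    else:
        combined = [ai + bi for ai, bi in zip(a, b)]
    return [math.tanh(c + bi) for c, bi in zip(combined, bias)]


identity = [[1.0, 0.0], [0.0, 1.0]]
enc_vec, pred_vec, zero_bias = [1.0, 2.0], [3.0, 4.0], [0.0, 0.0]

mult = joint_hidden(enc_vec, pred_vec, identity, identity, zero_bias, True)
add = joint_hidden(enc_vec, pred_vec, identity, identity, zero_bias, False)
```

With identity projections and zero bias, the multiplicative variant applies tanh to the element-wise products (3 and 8), while the additive variant applies it to the sums (4 and 6). In a full transducer, an output layer and softmax over the vocabulary would follow this hidden activation.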


Wireless communication systems are inherently vulnerable to adversarial attacks, since malevolent jammers might intentionally jam and disrupt the legitimate transmission. Of particular interest are so-called denial-of-service (DoS) attacks, in which the jammer is able to completely disrupt the communication. Accordingly, it is of crucial interest for the legitimate users to detect such DoS attacks. Turing machines provide the fundamental limits of today’s digital computers and therefore of traditional signal processing. It has been

