ICASSP 2021

ICASSP 2021, the IEEE International Conference on Acoustics, Speech and Signal Processing, is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. The conference will feature world-class presentations by internationally renowned speakers and cutting-edge session topics, and will provide an opportunity to network with like-minded professionals from around the world.

Towards an ASR Approach Using Acoustic and Language Models for Speech Enhancement

Recent work has shown that deep-learning-based speech enhancement performs best when a time-frequency mask is estimated. Unlike speech, these masks have a small range of values, which better facilitates regression-based learning. The question remains whether neural-network-based speech estimation should be treated as a regression problem. In this work, we propose to modify the speech estimation process by treating speech enhancement as a classification problem in an ASR-style manner.
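To make the regression-versus-classification distinction concrete, the sketch below turns continuous values into discrete class labels by uniform quantization and then maps the labels back to representative values. It is only an illustration under stated assumptions: the abstract does not describe the quantization scheme the authors use, and the function names, the uniform bins, and the [0, 1] value range are all hypothetical.

```python
def quantize(values, num_bins, lo, hi):
    """Map each value to an integer class index in [0, num_bins - 1].

    Uniform bins over [lo, hi] are an assumption made for illustration.
    """
    step = (hi - lo) / num_bins
    indices = []
    for v in values:
        clipped = min(max(v, lo), hi)          # keep the value inside [lo, hi]
        idx = min(int((clipped - lo) / step), num_bins - 1)
        indices.append(idx)
    return indices


def dequantize(indices, num_bins, lo, hi):
    """Map class indices back to the centre of their bin."""
    step = (hi - lo) / num_bins
    return [lo + (i + 0.5) * step for i in indices]


# Hypothetical normalized spectral magnitudes for three time-frequency bins.
magnitudes = [0.0, 0.5, 1.0]
classes = quantize(magnitudes, num_bins=4, lo=0.0, hi=1.0)
recovered = dequantize(classes, num_bins=4, lo=0.0, hi=1.0)
```

A classifier would be trained to predict the entries of `classes` instead of regressing the raw values. The gap between `magnitudes` and `recovered` is the quantization error that such a formulation accepts in exchange.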

nayem_ICASSP2021_final.pdf

Presentation slides of ASR style quantized spectral model based SE, presented at ICASSP 2021. (162)

Categories: Speech Enhancement (SPE-ENHA)

42 Views

Image coding for machines: an end-to-end learned approach

Over recent years, deep learning-based computer vision systems have been applied to images at an ever-increasing pace, oftentimes representing the only type of consumption for those images. Given the dramatic explosion in the number of images generated per day, a question arises: how much better would an image codec targeting machine consumption perform compared with state-of-the-art codecs targeting human consumption? In this paper, we propose an image codec for machines which is neural network (NN) based and end-to-end learned.

ICASSP-Poster.pdf

ICASSP-Poster.pdf (464)

Categories: Image/Video Coding

4 Views

Subjective and objective evaluation of deepfake videos

korshunov-3955.pdf

Poster (135)

deepfakes-humans-machines.pdf

Presentation slides (151)

Categories: Multimedia Forensics

12 Views

Processing pipelines for efficient, physically-accurate simulation of microphone array signals in dynamic sound scenes

Multichannel acoustic signal processing is predicated on the fact that the interchannel relationships between the received signals can be exploited to infer information about the acoustic scene. Recently there has been increasing interest in algorithms which are applicable in dynamic scenes, where the source(s) and/or microphone array may be moving. Simulating such scenes has particular challenges which are exacerbated when real-time, listener-in-the-loop evaluation of algorithms is required.

icassp-2021-pipelines_final.pdf

Presentation (183)

ICASSP_2021_corrected_with_watermark.pdf

Corrected manuscript (163)

Categories: Audio and Acoustic Signal Processing

8 Views

CNR-IEMN: a deep learning based approach to recognise Covid-19 from CT-scan

SPGC_posterf.pptx

Poster (122)

Categories: Pattern recognition and classification (MLR-PATT)

11 Views

TCLA Array: A New Sparse Array Design with Less Mutual Coupling

TCLA_array_Slides_ICASSP_2021.pdf

TCLA array presentation slides - ICASSP 2021 (138)

Categories: Sensor Array Processing

11 Views

JOINT MASKED CPC AND CTC TRAINING FOR ASR

Self-supervised learning (SSL) has shown promise in learning representations of audio that are useful for automatic speech recognition (ASR). However, training SSL models like wav2vec 2.0 requires a two-stage pipeline. In this paper we demonstrate single-stage training of ASR models that can utilize both unlabeled and labeled data. During training, we alternately minimize two losses: an unsupervised masked Contrastive Predictive Coding (CPC) loss and the supervised audio-to-text alignment loss, Connectionist Temporal Classification (CTC).
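The alternating schedule described above can be sketched as a single training loop that switches objectives from step to step. This is a toy illustration, not the authors' implementation: the scalar parameter, the two quadratic stand-in losses, and the strict one-to-one alternation are all assumptions, since the abstract does not state how often each loss is used.

```python
def train_alternating(param, grad_unsup, grad_sup, steps, lr):
    """Run gradient descent, alternating between two objectives.

    Even steps use the unsupervised gradient and odd steps the supervised one.
    The 1:1 ratio is an assumption made for illustration.
    """
    history = []
    for step in range(steps):
        if step % 2 == 0:
            grad, name = grad_unsup(param), "unsupervised"
        else:
            grad, name = grad_sup(param), "supervised"
        param -= lr * grad
        history.append(name)
    return param, history


# Hypothetical stand-ins for the masked CPC and CTC losses:
# (w - 1)^2 and (w - 3)^2, given here by their gradients.
def toy_unsup_grad(w):
    return 2.0 * (w - 1.0)


def toy_sup_grad(w):
    return 2.0 * (w - 3.0)


final_param, schedule = train_alternating(
    param=0.0, grad_unsup=toy_unsup_grad, grad_sup=toy_sup_grad, steps=200, lr=0.1
)
```

In this toy setting the parameter settles between the two individual minima, which is the kind of compromise a shared model has to find when both objectives update the same weights.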

4384_Joint_CPC_CTC_poster_ICASSP2021.pdf

Poster (179)

4384_Joint_CPC_CTC_presentation_ICASSP2021.pdf

Presentation (216)

Categories: Audio and Acoustic Signal Processing

120 Views

ATTACK ON PRACTICAL SPEAKER VERIFICATION SYSTEM USING UNIVERSAL ADVERSARIAL PERTURBATIONS

5375slide.pdf

slides (148)

Categories: Applications; Speaker Recognition and Characterization (SPE-SPKR)

13 Views

FastDCTTS: Efficient Deep Convolutional Text-to-Speech

We propose an end-to-end speech synthesizer, Fast DCTTS, that synthesizes speech in real time on a single CPU thread. The proposed model is composed of a carefully tuned lightweight network designed by applying multiple network reduction and fidelity improvement techniques. In addition, we propose a novel group highway activation that trades off computational efficiency against the regularization effect of the gating mechanism. We also introduce a new metric called Elastic mel-cepstral distortion (EMCD) to measure the fidelity of the output mel-spectrogram.

IEEE-ICASSP2021_FastDCTTS(4829)_final_wo_video.pdf

FastDCTTS presentation slide (124)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

35 Views

Federated Learning With Local Differential Privacy: Trade-Offs Between Privacy, Utility, and Communication

Federated learning (FL) allows models to be trained on massive amounts of data privately, thanks to its decentralized structure. Stochastic gradient descent (SGD) is commonly used for FL due to its good empirical performance, but sensitive user information can still be inferred from weight updates shared during FL iterations. We consider Gaussian mechanisms to preserve local differential privacy (LDP) of user data in the FL model with SGD. The trade-offs between user privacy, global utility, and transmission rate are proved by defining appropriate metrics for FL with LDP.
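As a rough illustration of a Gaussian mechanism applied on the user side, the sketch below perturbs a local update with zero-mean Gaussian noise before it is shared. Several details are assumptions rather than the paper's method: clipping the update to a maximum L2 norm is a common way to bound sensitivity but is not mentioned in the abstract, the function names are hypothetical, and `sigma` is left as a free parameter because the abstract does not say how the noise scale is calibrated to a privacy budget.

```python
import math
import random


def clip_l2(vector, max_norm):
    """Scale the vector down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= max_norm or norm == 0.0:
        return list(vector)
    scale = max_norm / norm
    return [x * scale for x in vector]


def privatize_update(update, max_norm, sigma, rng):
    """Clip a local update, then add independent Gaussian noise per coordinate.

    sigma is a free parameter here; relating it to a privacy guarantee
    is outside the scope of this sketch.
    """
    clipped = clip_l2(update, max_norm)
    return [x + rng.gauss(0.0, sigma) for x in clipped]


rng = random.Random(0)          # seeded only to make the sketch repeatable
local_update = [3.0, 4.0]       # hypothetical gradient from one user
noisy_update = privatize_update(local_update, max_norm=1.0, sigma=0.5, rng=rng)
```

Only `noisy_update` would leave the device. A larger `sigma` hides the user's contribution more strongly but degrades the aggregated model, which is the privacy-utility tension the abstract refers to.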