ICASSP 2022

ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. The ICASSP 2022 conference will feature world-class presentations by internationally renowned speakers and cutting-edge session topics, and will provide an opportunity to network with like-minded professionals from around the world.

DISCOURSE-LEVEL PROSODY MODELING WITH A VARIATIONAL AUTOENCODER FOR NON-AUTOREGRESSIVE EXPRESSIVE SPEECH SYNTHESIS

To address the issue of one-to-many mapping from phoneme sequences to acoustic features in expressive speech synthesis, this paper proposes a method of discourse-level prosody modeling with a variational autoencoder (VAE) based on the non-autoregressive architecture of FastSpeech. In this method, phone-level prosody codes are extracted from prosody features by combining the VAE with FastSpeech, and are predicted using discourse-level text features together with BERT embeddings. The continuous wavelet transform (CWT) used in FastSpeech2 for F0 representation is no longer necessary.

ppt.pptx

presentation slides (238)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

32 Views

Tackling Data Scarcity in Speech Translation Using Zero-Shot Multilingual Machine Translation Techniques

Recently, end-to-end speech translation (ST) has gained significant attention as it avoids error propagation. However, the approach suffers from data scarcity. It depends heavily on direct ST data and makes less efficient use of speech transcription and text translation data, which are often more readily available. In the related field of multilingual text translation, several techniques have been proposed for zero-shot translation. A main idea is to increase the similarity of semantically similar sentences in different languages.

MultiModalST-ICASSP2022-Slides.pdf

Presentation slides (276)

Categories: Machine Translation of Speech (SLP-SSMT)

62 Views

Speech Recognition Using Biologically-Inspired Neural Networks

Automatic speech recognition (ASR) systems, such as the recurrent neural network transducer (RNN-T), have reached close to human-like performance and are deployed in commercial applications. However, their core operations depart from those of their powerful biological counterpart, the human brain. On the other hand, current biologically-inspired ASR models lag behind in terms of accuracy and focus primarily on small-scale applications.

Poster_page1.pdf (262)

Categories: General Topics in Speech Recognition (SPE-GASR)

18 Views

Towards Robust Visual Transformer Networks via K-Sparse Attention

Transformer networks, originally developed in the machine translation community to eliminate the sequential nature of recurrent neural networks, have shown impressive results in other natural language processing and machine vision tasks. Self-attention, the core module behind visual transformers, globally mixes the image information. This module drastically reduces the intrinsic inductive bias imposed by CNNs, such as locality, but it is insufficiently robust against some adversarial attacks.

ICASSP2022_Presentation.pdf (271)

Categories: Neural network learning (MLR-NNLR)

35 Views

THE EFFECT OF PARTIAL TIME-FREQUENCY MASKING OF THE DIRECT SOUND ON THE PERCEPTION OF REVERBERANT SPEECH

The perception of sound in real-life acoustic environments, such as enclosed rooms or open spaces with reflective objects, is affected by reverberation. Hence, reverberation is extensively studied in the context of auditory perception, with many studies highlighting the importance of the direct sound for perception. Based on this insight, speech processing methods often use time-frequency (TF) analysis to detect TF bins that are dominated by the direct sound, and then use the detected bins to reproduce or enhance the speech signals.

ICASSP 2022 - poster shorter.pdf

Poster presented at ICASSP 2022 (243)

Categories: Speech Perception and Psychoacoustics (SPE-SPER)

12 Views

Customer Satisfaction Estimation using Unsupervised Representation Learning with Multi-Format Prediction Loss

2204_ICASSP22_CSE_Unsupervised_v3.pdf (322)

Categories: Speech Processing

15 Views

Deep Hashing With Hash Center Update for Efficient Image Retrieval

In this paper, we propose an approach for learning binary hash codes for image retrieval. Canonical Correlation Analysis (CCA) is used to design two loss functions for training a neural network such that the correlation between the two views to CCA is maximum. The main motivation for using CCA for feature space learning is that dimensionality reduction is possible and short binary codes could be generated. The first loss maximizes the correlation between the hash centers and the learned hash codes. The second loss maximizes

4514-2.pdf (271)

Categories: Multimodal signal processing

16 Views

LEARNING SPARSE GRAPHS WITH A CORE-PERIPHERY STRUCTURE

In this paper, we focus on learning sparse graphs with a core-periphery structure. We propose a generative model for data associated with core-periphery structured networks to model the dependence of node attributes on core scores of the nodes of a graph through a latent graph structure. Using the proposed model, we jointly infer a sparse graph and nodal core scores that induce dense (sparse) connections in core (respectively, peripheral) parts of the network.

icassp_22_poster_v2.pdf

Poster (231)

icassp_2022_slides_v2.pdf

Presentation Slides (241)

Categories: Machine Learning for Signal Processing

14 Views

MTAF: SHOPPING GUIDE MICRO-VIDEOS POPULARITY PREDICTION USING MULTIMODAL AND TEMPORAL ATTENTION FUSION APPROACH

Predicting the popularity of shopping guide micro-videos incorporating merchandise is crucial for online advertising. What are the significant factors affecting the popularity of a micro-video? How can multiple modalities be extracted and effectively fused for micro-video popularity prediction? These questions urgently need to be answered to provide better insights for advertisers. In this paper, we propose a Multimodal and Temporal Attention Fusion (MTAF) framework to represent and combine multi-modal features.

icassp2022-8471-ppt.pdf (246)

Categories: Machine Learning for Signal Processing

30 Views

PSEUDO-LABEL TRANSFER FROM FRAME-LEVEL TO NOTE-LEVEL IN A TEACHER-STUDENT FRAMEWORK FOR SINGING TRANSCRIPTION FROM POLYPHONIC MUSIC

The lack of large-scale note-level labeled data is the major obstacle to singing transcription from polyphonic music. We address the issue by using pseudo labels obtained from vocal pitch estimation models on unlabeled data. The proposed method first converts the frame-level pseudo labels to note-level labels through pitch and rhythm quantization steps. Then, it further improves the label quality through self-training in a teacher-student framework.

poster_ICASSP22.pdf (264)

presentation_ICASSP22.pdf (346)

Categories: Music Signal Processing

22 Views
