![](https://sigport.org/sites/default/files/styles/medium/public/icassp24_logo.jpg?itok=fMlObe3v)
IEEE ICASSP 2024 - The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. ICASSP 2024 will feature world-class presentations by internationally renowned speakers and cutting-edge session topics, and will provide an opportunity to network with like-minded professionals from around the world.
- Tunisian Code Switched ASR Presentation
Crafting an effective Automatic Speech Recognition (ASR) solution for dialects demands innovative approaches that not only address data scarcity but also navigate the intricacies of linguistic diversity. In this paper, we address this challenge for the Tunisian dialect. First, textual and audio data are collected and, in some cases, annotated.
- TALKNCE: IMPROVING ACTIVE SPEAKER DETECTION WITH TALK-AWARE CONTRASTIVE LEARNING
The goal of this work is Active Speaker Detection (ASD), the task of determining whether a person is speaking in a series of video frames.
Previous works have approached the task by exploring network architectures, while learning effective representations has been less explored.
In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is applied only to the parts of each segment in which a person on screen is actually speaking.
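A minimal sketch of how a contrastive loss can be restricted to speaking frames. This is not the authors' TalkNCE formulation, which the summary above does not spell out: it assumes an InfoNCE-style loss between per-frame audio and visual embeddings, and the function name `masked_info_nce`, the `temperature` value, and the input layout are all hypothetical.

```python
import math


def _cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def masked_info_nce(audio, visual, speaking, temperature=0.1):
    """InfoNCE-style loss computed over speaking frames only (illustrative).

    audio, visual: lists of per-frame embedding vectors (assumed inputs).
    speaking: list of booleans, True where the on-screen person is talking.
    """
    # Keep only the frames flagged as speaking; the rest do not contribute.
    active = [i for i, flag in enumerate(speaking) if flag]
    if len(active) < 2:
        return 0.0  # no negatives available, so nothing to contrast

    total = 0.0
    for i in active:
        # Similarity of visual frame i to every active audio frame.
        logits = [_cosine(visual[i], audio[j]) / temperature for j in active]
        # The time-aligned audio frame is the positive pair.
        positive = _cosine(visual[i], audio[i]) / temperature
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - positive
    return total / len(active)
```

With this layout, time-aligned audio and visual embeddings give a loss near zero, mismatched ones give a larger loss, and frames marked as non-speaking have no effect on the result.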
- SCORE-BASED DIFFUSION MODELS FOR PHOTOACOUSTIC TOMOGRAPHY IMAGE RECONSTRUCTION
Photoacoustic tomography (PAT) is a rapidly evolving medical imaging modality that combines optical absorption contrast with ultrasound imaging depth. One challenge in PAT is image reconstruction from inadequate acoustic signals, caused by limited sensor coverage or by the density of the transducer array. Such cases call for solving an ill-posed inverse reconstruction problem. In this work, we use score-based diffusion models to solve the inverse problem of reconstructing an image from limited PAT measurements.
- MUSICLDM: ENHANCING NOVELTY IN TEXT-TO-MUSIC GENERATION USING BEAT-SYNCHRONOUS MIXUP STRATEGIES
Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to the limited availability of music data and sensitive issues related to copyright and plagiarism. To tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts the Stable Diffusion and AudioLDM architectures to the music domain.
- Channel Estimation in Underdetermined Systems Utilizing Variational Autoencoders
In this work, we propose to utilize a variational autoencoder (VAE) for channel estimation (CE) in underdetermined (UD) systems. The method builds on a recently proposed concept in which a VAE is trained on channel state information (CSI) data and used to parameterize an approximation to the mean squared error (MSE)-optimal estimator. The contributions of this work extend the existing framework from fully-determined (FD) to UD systems, which are of high practical relevance.
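For reference, a sketch of the standard linear MMSE estimator for an underdetermined linear observation model, the kind of closed form that a conditionally Gaussian channel model makes available. The observation model, the symbols, and the assumption that the VAE supplies a latent-dependent mean $\mu(z)$ and covariance $C(z)$ are illustrative; the summary above does not give the paper's exact expressions.

```latex
\begin{align}
  y &= A h + n, \qquad A \in \mathbb{C}^{m \times N},\ m < N, \qquad
       n \sim \mathcal{N}_{\mathbb{C}}(0, \sigma^2 I), \\
  h \mid z &\sim \mathcal{N}_{\mathbb{C}}\bigl(\mu(z),\, C(z)\bigr), \\
  \hat{h}(z) &= \mu(z) + C(z) A^{\mathrm{H}}
       \bigl( A\, C(z) A^{\mathrm{H}} + \sigma^2 I \bigr)^{-1}
       \bigl( y - A \mu(z) \bigr).
\end{align}
```

Here $m < N$ expresses the underdetermined case: there are fewer observations than channel coefficients, so the prior information carried by $\mu(z)$ and $C(z)$ is what makes the estimate well defined.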
- MDX-GAN: ENHANCING PERCEPTUAL QUALITY IN MULTI-CLASS SOURCE SEPARATION VIA ADVERSARIAL TRAINING
Audio source separation aims to extract individual sound sources from an audio mixture. Recent studies on source separation focus primarily on minimizing signal-level distance, typically measured by source-to-distortion ratio (SDR). However, scant attention has been given to the perceptual quality of the separated tracks. In this paper, we propose MDX-GAN, an efficient and high-fidelity audio source separator based on MDX-Net for multiple sound classes. We leverage different training objectives to enhance the perceptual quality of audio source separation.
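As a small illustration of the signal-level metric mentioned above, the sketch below computes the basic energy-ratio form of SDR, 10·log10(‖s‖² / ‖s − ŝ‖²). This is an assumption about which variant is meant: evaluation toolkits define several SDR variants, and the function name `sdr_db` is hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Basic SDR in dB: reference energy over the energy of the error."""
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal is silent; SDR is undefined")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy usage: an estimate that is the reference scaled by 0.9.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
score = sdr_db(reference, estimate)
```

A metric like this rewards sample-level agreement only, which is why a separator can score well on it while still sounding artificial.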
- DIFFUSION-BASED SPEECH ENHANCEMENT IN MATCHED AND MISMATCHED CONDITIONS USING A HEUN-BASED SAMPLER
Diffusion models are a new class of generative models that have recently been applied successfully to speech enhancement. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models. However, this was investigated with a single database for training and another one for testing, which makes the results highly dependent on the particular databases. Moreover, recent developments from the image generation literature remain largely unexplored for speech enhancement.
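The title refers to a Heun-based sampler, which this summary does not describe. As a rough illustration only, the sketch below shows generic Heun's method (a second-order predictor-corrector step for ordinary differential equations) on a toy problem; `heun_solve` and the test equation are hypothetical and are not the paper's sampler.

```python
import math


def heun_solve(f, x0, t0, t1, num_steps):
    """Integrate dx/dt = f(t, x) from t0 to t1 with Heun's method."""
    h = (t1 - t0) / num_steps
    x, t = x0, t0
    for _ in range(num_steps):
        slope_start = f(t, x)                 # slope at the current point
        x_predicted = x + h * slope_start     # Euler predictor
        slope_end = f(t + h, x_predicted)     # slope at the predicted point
        x = x + 0.5 * h * (slope_start + slope_end)  # trapezoidal corrector
        t += h
    return x


# Toy usage: dx/dt = -x with x(0) = 1 has the exact solution exp(-t).
approx = heun_solve(lambda t, x: -x, 1.0, 0.0, 1.0, 100)
error = abs(approx - math.exp(-1.0))
```

Each step costs two evaluations of `f` instead of one, in exchange for second-order accuracy; in a diffusion sampler the expensive evaluation would be a network call, so that trade-off is the main design consideration.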
![](https://sigport.org/sites/default/files/styles/list/public/poster-final.jpg?itok=Y03Bbz49)
- SPIKING STRUCTURED STATE SPACE MODEL FOR MONAURAL SPEECH ENHANCEMENT
Speech enhancement seeks to extract clean speech from noisy signals. Traditional deep learning methods face two challenges: making efficient use of information in long speech sequences and coping with high computational costs. To address these, we introduce the Spiking Structured State Space Model (Spiking-S4). This approach merges the energy efficiency of Spiking Neural Networks (SNN) with the long-range sequence modeling capabilities of Structured State Space Models (S4).
- TD-GPT: Target Protein-Specific Drug Molecule Generation GPT
Drug discovery faces challenges due to the vast chemical space and complex drug-target interactions. This paper proposes a novel deep learning framework, TD-GPT, for targeted drug molecule generation. TD-GPT comprises a linear Transformer for drug-target affinity prediction, an affinity-enhanced protein encoder using sequences, and a target-specific attention module in the molecular Transformer decoder. Experiments demonstrate TD-GPT’s efficiency in generating valid, novel molecules with high affinity and specificity for desired targets without target fine-tuning.