IEEE ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. The IEEE ICASSP 2024 conference will feature world-class presentations by internationally renowned speakers, cutting-edge session topics and provide a fantastic opportunity to network with like-minded professionals from around the world. Visit the website.
- Read more about MOTION TRANSFER-DRIVEN INTRA-CLASS DATA AUGMENTATION FOR FINGER VEIN RECOGNITION
- Log in to post comments
Finger vein recognition (FVR) has emerged as a secure biometric technique because of the confidentiality of vascular bio-information. Recently, deep learning-based FVR has gained increased popularity and achieved promising performance. However, the limited size of public vein datasets has caused overfitting issues and greatly limits the recognition performance.
- Categories:
- Read more about Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study
- Log in to post comments
In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average relative WER improvement across all languages of 10.8% on FLEURS and 3.3% on YouTube captioning.
- Categories:
- Read more about Multicast Transmission Design With Enhanced DOF For Mimo Coded Caching Systems
- Log in to post comments
Integrating coded caching (CC) into multi-input multi-output (MIMO) setups significantly enhances the achievable degrees of freedom (DoF). We consider a cache-aided MIMO configuration with a CC gain t, where a server with L Tx-antennas communicates with K users, each equipped with G Rx-antennas. Similar to existing works, we also extend a core CC approach, designed initially for multi-input single-output (MISO) scenarios, to the MIMO setup.
- Categories:
- Read more about SGT: SELF-GUIDED TRANSFORMER FOR FEW-SHOT SEMANTIC SEGMENTATION
- Log in to post comments
For the few-shot segmentation (FSS) task, existing methods
attempt to capture the diversity of new classes by fully uti-
lizing the limited support images, such as cross-attention and
prototype matching. However, they often overlook the fact
that there is variability in different regions of the same ob-
ject, and intra-image similarity is higher than inter-image sim-
ilarity.To address these limitations, a Self-Guided Trans-
former (SGT) is proposed by leveraging intra-image similar-
- Categories:
- Read more about Scaling NVIDIA’s Multi-Speaker Multi-Lingual TTS Systems with Zero-Shot TTS to Indic Languages
- Log in to post comments
In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions.
- Categories:
- Read more about Learning Graphs and Simplicial Complexes from Data
- Log in to post comments
Graphs are widely used to represent complex information and signal domains with irregular support. Typically, the underlying graph topology is unknown and must be estimated from the available data. Common approaches assume pairwise node interactions and infer the graph topology based on this premise. In contrast, our novel method not only unveils the graph topology but also identifies three-node interactions, referred to in the literature as second-order simplicial complexes (SCs).
- Categories:
Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, drums, and more. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is still limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoPE Transformer (called BS-RoFormer).
- Categories:
- Read more about LEVERAGING EFFECTIVE LANGUAGE AND SPEAKER CONDITIONING IN INDIC TTS FOR LIMMITS 2024 CHALLENGE
- Log in to post comments
In this paper, we explain the model that was developed by the NLP\_POSTECH team for the LIMMITS 2024 Grand Challenge. Among the three tracks, we focus on Track 1, which necessitates the creation of a few-shot text-to-speech (TTS) system that generates natural speech across diverse languages. Towards this end, to realize multi-lingual capability, we incorporate a learnable language embedding. In addition, for precise imitation of target speaker voices, we leverage an inductive speaker bias conditioning methodology.
- Categories:
- Read more about GENERATING PERSONA-AWARE EMPATHETIC RESPONSES WITH RETRIEVAL-AUGMENTED PROMPT LEARNING
- 2 comments
- Log in to post comments
Empathetic response generation requires perceiving and un- derstanding the user’s emotion to deliver a suitable response. However, existing models generally remain oblivious of an interlocutor’s persona, which has been shown to play a vital role in expressing appropriate empathy to different users. To address this problem, we propose a novel Transformer-based architecture that incorporates retrieval-augmented prompt learning to generate persona-aware empathetic responses.
- Categories:
- Read more about TEN-GUARD: TENSOR DECOMPOSITION FOR BACKDOOR ATTACK DETECTION IN DEEP NEURAL NETWORKS
- Log in to post comments
As deep neural networks and the datasets used to train them get larger, the default approach to integrating them into re-
search and commercial projects is to download a pre-trained model and fine tune it. But these models can have uncertain
provenance, opening up the possibility that they embed hidden malicious behavior such as trojans or backdoors, where
small changes to an input (triggers) can cause the model toproduce incorrect outputs (e.g., to misclassify). This paper
- Categories: