Although diffusion models have become a popular choice for text-to-speech due to their strong generative ability, the intrinsic complexity of sampling from them harms their efficiency. As an alternative, we propose VoiceFlow, an acoustic model that utilizes the rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates mel-spectrogram generation as an ordinary differential equation conditioned on text inputs, whose vector field is then estimated.
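
As a rough sketch of the sampling procedure implied above (not VoiceFlow's actual code; the vector_field model and its signature are assumptions), generation from such an ODE can be done with a few fixed-step Euler iterations:

    import numpy as np

    def sample_mel(vector_field, text_cond, shape, n_steps=10, rng=None):
        """Euler integration of a learned ODE vector field (flow matching).

        vector_field(x, t, text_cond) -> dx/dt is a hypothetical trained model.
        """
        rng = rng or np.random.default_rng()
        x = rng.standard_normal(shape)      # start from Gaussian noise at t = 0
        dt = 1.0 / n_steps
        for i in range(n_steps):
            t = i * dt
            x = x + dt * vector_field(x, t, text_cond)  # x_{t+dt} = x_t + dt * v(x_t, t)
        return x                            # approximate mel-spectrogram at t = 1

Rectified trajectories are close to straight lines, which is what makes so few Euler steps viable in the first place.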

While modern deep learning-based models have significantly outperformed traditional methods in speech enhancement, they often require a large number of parameters and extensive computational power, making them impractical to deploy on edge devices in real-world applications. In this paper, we introduce the Grouped Temporal Convolutional Recurrent Network (GTCRN), which incorporates grouped strategies to efficiently simplify a competitive model, DPCRN. Additionally, it leverages subband feature extraction modules and temporal recurrent attention modules to enhance its performance.
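
A minimal illustration of why grouped strategies shrink a model (generic PyTorch, not the GTCRN implementation): splitting a convolution's channels into groups divides its weight count by the number of groups.

    import torch.nn as nn

    full    = nn.Conv1d(64, 64, kernel_size=3)            # 64*64*3 + 64 = 12,352 params
    grouped = nn.Conv1d(64, 64, kernel_size=3, groups=4)  # 64*16*3 + 64 =  3,136 params

    n_full    = sum(p.numel() for p in full.parameters())
    n_grouped = sum(p.numel() for p in grouped.parameters())
    print(n_full, n_grouped)  # the grouped layer uses roughly 1/4 of the weights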

In recent years, voice pathology detection (VPD) has received considerable attention because of the increasing risk of voice problems. Several methods, such as support vector machines and convolutional neural network-based models, achieve good VPD performance. To further improve it, we use a self-supervised pretrained model as the feature representation instead of explicit speech features. When the pretrained model is fine-tuned for VPD, however, an overfitting problem occurs due to the domain shift from conversational speech to the VPD task.
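
As an illustrative sketch of the feature-extraction setup (the checkpoint name and the freezing scheme are assumptions, not the paper's recipe; freezing the pretrained encoder is one common way to counter such overfitting):

    import torch
    from transformers import Wav2Vec2Model

    ssl = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    for p in ssl.parameters():          # freeze the pretrained encoder to reduce
        p.requires_grad = False         # overfitting under the domain shift

    classifier = torch.nn.Linear(ssl.config.hidden_size, 2)  # healthy vs. pathological

    wave = torch.randn(1, 16000)        # one second of 16 kHz audio (dummy input)
    feats = ssl(wave).last_hidden_state           # (batch, frames, hidden)
    logits = classifier(feats.mean(dim=1))        # mean-pool over time, then classify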

Grapheme-to-Phoneme (G2P) conversion is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually ARPABET or IPA, which might not be the optimal way to represent phonemes for all languages. Secondly, the man-hours required to produce such an expert lexicon are very high.
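
As a toy illustration of what such a hand-crafted lexicon amounts to (the two entries follow the ARPABET convention used by CMUdict; real lexicons contain over a hundred thousand of them):

    LEXICON = {
        "speech": ["S", "P", "IY1", "CH"],
        "text":   ["T", "EH1", "K", "S", "T"],
    }

    def g2p(word):
        # Out-of-vocabulary words are exactly where lexicon-based G2P breaks down.
        return LEXICON.get(word.lower())

    print(g2p("speech"))  # ['S', 'P', 'IY1', 'CH']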

By using Voice Activity Detection (VAD) as a preprocessing step, hardware-efficient implementations become possible for speech applications that need to run continuously in severely resource-constrained environments. For this purpose, we propose TinyVAD, a new convolutional neural network (CNN) model that executes extremely efficiently with a small memory footprint. TinyVAD uses an input pixel matrix partitioning method, termed patchify, to downscale the resolution of the input spectrogram.
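
A toy sketch of what a patchify-style partitioning can look like (the tile size and exact layout are assumptions, not TinyVAD's specification): non-overlapping tiles of the spectrogram are folded into a feature axis, shrinking the spatial resolution the CNN has to process.

    import numpy as np

    def patchify(spec, patch=4):
        """Split a (freq, time) spectrogram into patch x patch tiles and fold
        each tile into the last axis, reducing spatial resolution by `patch`.
        """
        f, t = spec.shape
        f, t = f - f % patch, t - t % patch          # crop to a multiple of patch
        x = spec[:f, :t].reshape(f // patch, patch, t // patch, patch)
        return x.transpose(0, 2, 1, 3).reshape(f // patch, t // patch, patch * patch)

    spec = np.random.randn(64, 100)                  # (freq bins, frames)
    print(patchify(spec).shape)                      # (16, 25, 16)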

Quasi-continuous system identification of time-variant linear acoustic systems can be applied in various audio signal processing applications in which numerous acoustic transfer functions must be measured. A prominent application is measuring head-related transfer functions. We treat the underlying multiple-input-multiple-output (MIMO) system identification problem in a state-space model as a joint estimation problem for the states, which represent impulse responses, and the state-space model parameters, using the expectation-maximization (EM) algorithm.
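
For concreteness, a linear state-space model of the kind referred to above typically reads (the notation here is an assumption, not taken from the paper):

    \begin{aligned}
    \mathbf{x}_{k+1} &= \mathbf{A}\,\mathbf{x}_k + \mathbf{w}_k, & \mathbf{w}_k &\sim \mathcal{N}(\mathbf{0},\mathbf{Q}),\\
    \mathbf{y}_k     &= \mathbf{C}_k\,\mathbf{x}_k + \mathbf{v}_k, & \mathbf{v}_k &\sim \mathcal{N}(\mathbf{0},\mathbf{R}),
    \end{aligned}

where the state \mathbf{x}_k stacks the impulse responses to be identified, the observation matrix \mathbf{C}_k is built from the known input signals, and \mathbf{y}_k collects the microphone observations. The EM algorithm then alternates between smoothing the state trajectory under the current parameters (E-step) and re-estimating parameters such as \mathbf{Q} and \mathbf{R} from the smoothed states (M-step).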

The problem of audio-to-audio (A2A) style transfer involves replacing the style features of the source audio with those from the target audio while preserving the content-related attributes of the source audio. In this paper, we propose an efficient approach, termed Zero-shot Emotion Style Transfer (ZEST), that replaces the emotional content present in the given source audio with that embedded in the target audio while retaining the speaker and speech content from the source.
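
Purely as a conceptual sketch of the decomposition described above (every encoder, the decoder, and their names are hypothetical placeholders, not ZEST's published architecture):

    def emotion_style_transfer(source_wav, target_wav,
                               content_enc, speaker_enc, emotion_enc, decoder):
        content = content_enc(source_wav)   # linguistic content: kept from source
        speaker = speaker_enc(source_wav)   # speaker identity:  kept from source
        emotion = emotion_enc(target_wav)   # emotion style: taken from the target
        return decoder(content, speaker, emotion)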

We present Nkululeko, a template-based system that lets users perform machine learning experiments in the speaker characteristics domain. It is mainly targeted at users who are not familiar with machine learning or computer programming, to be used as a teaching tool or a simple entry-level tool for the field of artificial intelligence.

We discuss the influence of random splicing on the perception of emotional expression in speech signals.
Random splicing is the randomized reconstruction of short audio snippets with the aim of obfuscating the speech content.
A part of the German parliament recordings has been randomly spliced, and both versions -- the original and the scrambled one -- were manually labeled with respect to the arousal, valence, and dominance dimensions.
Additionally, we run a state-of-the-art transformer-based pre-trained emotion model on the data.
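
A minimal sketch of the random splicing procedure (the snippet length and boundary handling are assumptions): the signal is cut into short snippets whose order is then shuffled, so emotional prosody largely survives while the spoken content becomes unintelligible.

    import random

    def random_splice(samples, snippet_len=8000, seed=None):
        """Cut a sequence of audio samples into fixed-length snippets,
        shuffle their order, and concatenate the result.
        """
        rng = random.Random(seed)
        snippets = [samples[i:i + snippet_len]
                    for i in range(0, len(samples), snippet_len)]
        rng.shuffle(snippets)
        return [s for snip in snippets for s in snip]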
