Audio and Acoustic Signal Processing

Tunisian Code Switched ASR Presentation

Crafting an effective Automatic Speech Recognition (ASR) solution for dialects demands innovative approaches that not only address the data scarcity issue but also navigate the intricacies of linguistic diversity. In this paper, we address this ASR challenge, focusing on the Tunisian dialect. First, textual and audio data are collected and, in some cases, annotated.

Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition.pdf

Categories: Audio and Acoustic Signal Processing

25 Views

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this work, we strive to improve the performance of seq2seq AAC models by extensively leveraging pretrained models and large language models (LLMs).

aac_icassp_talk_240418_slides.pptx

monday version (125)

Categories: Audio and Acoustic Signal Processing

23 Views

ICASSP2024 Poster: Sound Field Interpolation for Rotation-Invariant Multichannel Array Signal Processing

In this paper, we present a sound field interpolation for array signal processing (ASP) that is robust to rotation of a circular microphone array (CMA), and we evaluate beamforming as one of its applications. Most ASP methods assume a time-invariant acoustic transfer system (ATS) from sources to the microphone array. This assumption makes it challenging to perform ASP in real situations where sources and the microphone array can move. Therefore, considering a time-variant ATS is an essential task for the use of ASP.
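
To make the idea of interpolating a sound field on a rotated circular array concrete, here is a minimal sketch in plain Python. It is not the method from the poster. It assumes equally spaced microphones and a field that is band-limited in angle, and it uses a generic DFT-domain phase shift (periodic band-limited interpolation). The function name `rotate_circular_samples` and the parameter `shift` are hypothetical and chosen only for illustration.

```python
import cmath
import math


def rotate_circular_samples(samples, shift):
    """Estimate the field at positions rotated by `shift` microphone spacings.

    samples[m] is the value observed at angle 2*pi*m/M on the circle.
    The result approximates the values at angles 2*pi*(m + shift)/M,
    assuming the field is band-limited in angle (an assumption of this sketch).
    """
    M = len(samples)
    # Forward DFT over the microphone index, i.e. along the angular direction.
    spectrum = [
        sum(samples[m] * cmath.exp(-2j * math.pi * k * m / M) for m in range(M))
        for k in range(M)
    ]
    rotated = []
    for m in range(M):
        acc = 0j
        for k in range(M):
            # Map DFT bin k to a signed angular order so that fractional
            # shifts are applied symmetrically to positive and negative orders.
            order = k if k <= M // 2 else k - M
            acc += spectrum[k] * cmath.exp(2j * math.pi * order * (m + shift) / M)
        rotated.append(acc / M)
    return rotated
```

In a multichannel setting, such a function would typically be applied independently to each time-frequency bin across the microphone channels. A shift equal to a whole number of microphone spacings reduces to a circular permutation of the channels, while a fractional shift relies on the band-limit assumption.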

202404ICASSP_poster_wakabayashi.pdf

ICASSP2024 SPS journal presentation (144)

Categories: Audio and Acoustic Signal Processing

7 Views

PHASE RECONSTRUCTION IN SINGLE CHANNEL SPEECH ENHANCEMENT BASED ON PHASE GRADIENTS AND ESTIMATED CLEAN-SPEECH AMPLITUDES

Phase gradients can help enforce phase consistency across time and frequency, further improving the output of speech enhancement approaches. Recently, neural networks were used to estimate the phase gradients from the short-term amplitude spectra of clean speech. These were then used to synthesise phase to reconstruct a plausible time-domain signal. However, using purely synthetic phase in speech enhancement yields unnatural-sounding output. Therefore we derive a closed-form phase estimate that combines the synthetic phase with that of the enhanced speech, yielding more natural output.

icassp2024_poster_YS.pdf

Categories: Audio and Acoustic Signal Processing

68 Views

icassp2024 poster

ICASSP2024 poster

poster-08a.pdf

Categories: Audio and Acoustic Signal Processing

48 Views

THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE: AUDIO-VISUAL TARGET SPEAKER EXTRACTION

Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges.

misp2023ppt.pptx

ppt (159)

Categories: Audio and Acoustic Signal Processing, Other

49 Views

A Machine-Learning Model for Detecting Depression, Anxiety, and Stress from Speech

Predicting mental health conditions from speech has been widely explored in recent years. Most studies rely on a single sample from each subject to detect indicators of a particular disorder. These studies ignore two important facts: certain mental disorders tend to co-exist, and their severity tends to vary over time.

poster_8111_final.pdf

Categories: Audio and Acoustic Signal Processing

45 Views

CORN: Co-Trained Full- And No-Reference Speech Quality Assessment

Perceptual evaluation constitutes a crucial aspect of various audio-processing tasks. Full reference (FR) or similarity-based metrics rely on high-quality reference recordings, to which lower-quality or corrupted versions of the recording may be compared for evaluation. In contrast, no-reference (NR) metrics evaluate a recording without relying on a reference. Both the FR and NR approaches exhibit advantages and drawbacks relative to each other.
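
The practical difference between the two families is visible in their inputs: an FR metric takes a reference and a degraded signal, while an NR metric takes only the signal under test. The sketch below illustrates this with two deliberately simple stand-ins that are not taken from the paper: a signal-to-noise ratio as a toy FR measure and a clipping ratio as a toy NR indicator. The function names and the `threshold` parameter are hypothetical.

```python
import math


def snr_db(reference, degraded):
    """Toy full-reference measure: SNR in dB of `degraded` against `reference`."""
    signal_energy = sum(r * r for r in reference)
    noise_energy = sum((r - d) ** 2 for r, d in zip(reference, degraded))
    if noise_energy == 0:
        return math.inf  # identical signals: no measurable distortion
    return 10 * math.log10(signal_energy / noise_energy)


def clipping_ratio(signal, threshold=0.99):
    """Toy no-reference indicator: fraction of samples at or above `threshold`
    in absolute value. No reference recording is needed."""
    if not signal:
        return 0.0
    clipped = sum(1 for s in signal if abs(s) >= threshold)
    return clipped / len(signal)
```

Neither function models perception. They only show that the FR call cannot be made without a clean reference, whereas the NR call can be made on any recording.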

2310.09388.pdf

Paper (169)

Categories: Audio and Acoustic Signal Processing

28 Views

DELVING DEEPER INTO VULNERABLE SAMPLES IN ADVERSARIAL TRAINING

Recently, vulnerable samples have been shown to be crucial for improving adversarial training performance. Our analysis of existing vulnerable-sample mining methods indicates that they have two problems: 1) valuable connections among different pairs of natural samples and their adversarial counterparts are ignored; 2) some of the vulnerable samples are left unconsidered. To better leverage vulnerable samples, we propose INter PAir ConstrainT (INPACT) and Vulnerable Aware

poster(3).pptx

Categories: Audio and Acoustic Signal Processing

24 Views

SUB-BAND AND FULL-BAND INTERACTIVE U-NET WITH DPRNN FOR DEMIXING CROSS-TALK STEREO MUSIC

This paper presents a detailed description of our proposed methods for the ICASSP 2024 Cadenza Challenge. Experimental results show that the proposed system can achieve better performance than official baselines.

oral-GC-L9.4.pptx

Categories: Audio and Acoustic Signal Processing

18 Views
