Audio and Acoustic Signal Processing

Natural Sound Rendering for Headphones: Integration of signal processing techniques (slides)

With the strong growth of assistive and personal listening devices, natural sound rendering over headphones is becoming a necessity for prolonged listening in multimedia and virtual reality applications. The aim of natural sound rendering is to recreate sound scenes with spatial and timbral quality that is as natural as possible, so as to achieve a truly immersive listening experience. However, rendering natural sound over headphones poses many challenges. This tutorial article presents signal processing techniques that tackle these challenges to assist human listening.

SPM15slides_Natural Sound Rendering for Headphones.pdf

Categories: Spatial and Multichannel Audio; Audio and Acoustic Signal Processing

94 Views

[poster] Improving Design of Input Condition Invariant Speech Enhancement

Building a single universal speech enhancement (SE) system that can handle arbitrary input is an in-demand but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio durations, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as “input condition invariant SE”. Such a model was recently proposed and showed promising performance; however, its multi-channel performance degraded severely in real conditions.

poster_USES2.pdf

Categories: Audio and Acoustic Signal Processing

39 Views

[slides] Generation-Based Target Speech Extraction with Speech Discretization and Vocoder

Target speech extraction (TSE) is the task of isolating the speech of a specific target speaker from an audio mixture, with the help of an auxiliary recording of that target speaker. Most existing TSE methods employ discrimination-based models to estimate the target speaker’s proportion in the mixture, but they often fail to compensate for missing or highly corrupted frequency components in the speech signal. In contrast, generation-based methods can naturally handle such scenarios via speech resynthesis.

slides_icassp_discrete_tse_oral.pdf

Categories: Audio and Acoustic Signal Processing

47 Views

Poster

This paper introduces BWSNet, a model that can be trained from raw human judgements obtained through a Best-Worst scaling (BWS) experiment. It maps sound samples into an embedded space that represents the perception of a studied attribute. To this end, we propose a set of cost functions and constraints, interpreting trial-wise ordinal relations as distance comparisons in a metric learning task. We tested our proposal on data from two BWS studies investigating the perception of speech social attitudes and timbral qualities.
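As a rough illustration of how one BWS trial might be turned into distance comparisons, the sketch below uses a hinge loss. It is not the cost function from the paper: the function names, the margin value, and the specific constraint (the best and worst items should be farther apart than either is from any other item in the trial) are assumptions made for this example.

```python
import math


def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bws_trial_loss(embeddings, best, worst, margin=0.1):
    """Hinge loss for a single best-worst scaling trial.

    embeddings: dict mapping item id -> embedding (list of floats)
    best, worst: ids of the items judged best and worst in the trial

    Assumed constraint (hypothetical, for illustration only): the
    best-worst distance should exceed, by at least `margin`, the
    distance from each extreme to every remaining item.
    """
    d_bw = dist(embeddings[best], embeddings[worst])
    loss = 0.0
    for item, vec in embeddings.items():
        if item in (best, worst):
            continue
        loss += max(0.0, margin + dist(embeddings[best], vec) - d_bw)
        loss += max(0.0, margin + dist(embeddings[worst], vec) - d_bw)
    return loss


# Three items on a line; "c" is chosen as best and "a" as worst.
trial = {"a": [0.0], "b": [0.5], "c": [1.0]}
print(bws_trial_loss(trial, best="c", worst="a"))
```

In this toy trial the loss is zero when the chosen extremes are the two outermost points, and positive when a middle point is labelled as an extreme, which is the behaviour a metric learning objective of this kind would penalise.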

poster_ready_ICASSP.pdf

Categories: Audio and Acoustic Signal Processing

70 Views

ENHANCING MULTILINGUAL TTS WITH VOICE CONVERSION BASED DATA AUGMENTATION AND POSTERIOR EMBEDDING

This paper proposes a multilingual, multi-speaker (MM) TTS system by using a voice conversion (VC)-based data augmentation method. Creating an MM-TTS model is challenging, owing to the difficulties of collecting polyglot data from multiple speakers. To address this problem, we adopt a cross-lingual, multi-speaker VC model trained with multiple speakers’ monolingual databases. As this model effectively transfers acoustic attributes while retaining the content information, it is possible to generate each speaker’s polyglot corpora.

Poster_Final_v2.pdf

Categories: Audio and Acoustic Signal Processing

47 Views

https://sigport.org/events/documents/ieee-icassp-2024

We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with improvements in both speech intelligibility from 28.9 to 5.6 and objective quality from 2.68 to 3.90.
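To make the idea of a joint text and speech vocabulary with task-marking special tokens more concrete, here is a minimal sketch. The special token names, the `<unit_k>` naming for speech units, and the sequence layouts per task are all hypothetical choices for this example, not the format used by VoxtLM.

```python
# Hypothetical special tokens marking modality and what to generate next.
SPECIAL = ["<speech>", "<text>", "<gen_speech>", "<gen_text>"]


def build_joint_vocab(text_tokens, num_speech_units, special_tokens):
    """Map special, text, and discrete speech tokens into one id space."""
    vocab = {}
    for tok in special_tokens:
        vocab[tok] = len(vocab)
    for tok in text_tokens:
        vocab[tok] = len(vocab)
    # Speech units are integers (for example, cluster indices); give each a
    # distinct string form so it does not collide with a text token.
    for unit in range(num_speech_units):
        vocab[f"<unit_{unit}>"] = len(vocab)
    return vocab


def encode(task, text, units, vocab):
    """Lay out one training sequence for a decoder-only model (assumed layout)."""
    t = [vocab[tok] for tok in text]
    s = [vocab[f"<unit_{u}>"] for u in units]
    if task == "asr":       # speech in, text out
        return [vocab["<speech>"]] + s + [vocab["<gen_text>"]] + t
    if task == "tts":       # text in, speech out
        return [vocab["<text>"]] + t + [vocab["<gen_speech>"]] + s
    if task == "textlm":    # text generation
        return [vocab["<text>"]] + t
    if task == "speechlm":  # speech continuation
        return [vocab["<speech>"]] + s
    raise ValueError(f"unknown task: {task}")


vocab = build_joint_vocab(["a", "b"], num_speech_units=3, special_tokens=SPECIAL)
print(encode("asr", ["a", "b"], [2, 0], vocab))
```

Because every task becomes a single token sequence over one vocabulary, the same next-token objective can in principle be applied to all four tasks.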

Presentation.pptx

Categories: Audio and Acoustic Signal Processing

16 Views

Can Large-scale Vocoded Spoofed Data Improve Speech Spoofing Countermeasure with a Self-supervised Front End?

A speech spoofing countermeasure (CM) that discriminates between unseen spoofed and bona fide data requires diverse training data. While many datasets use spoofed data generated by speech synthesis systems, it was recently found that data vocoded by neural vocoders are also effective as spoofed training data. Since many neural vocoders are fast to build and fast at generation, this study used multiple neural vocoders and created more than 9,000 hours of vocoded data on the basis of the VoxCeleb2 corpus.

ICASSP24-SLP.L20.2.pdf

Categories: Audio and Acoustic Signal Processing

36 Views

TIA: A TEACHING INTONATION ASSESSMENT DATASET IN REAL TEACHING SITUATIONS

Intonation is one of the important factors affecting the teaching language arts, so evaluating teachers’ intonation with artificial intelligence technology is a pressing problem. However, the lack of an intonation assessment dataset has hindered the development of the field. To this end, this paper constructs, for the first time, a Teaching Intonation Assessment (TIA) dataset recorded in real teaching situations. The dataset covers 9 disciplines and 396 teachers, with a total of 11,444 utterance samples, each 15 seconds long.

TIA_icassp24.pptx

Categories: Signal Processing Education; Audio and Acoustic Signal Processing

28 Views

SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation

We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST.