Speech Processing

Towards Transferable Speech Emotion Representation: On Loss Functions For Cross-Lingual Latent Representations

In recent years, speech emotion recognition (SER) has been used in wide-ranging applications, from healthcare to the commercial sector. In addition to signal processing approaches, methods for SER now also use deep learning techniques, which open up possibilities for transfer learning. However, generalizing over languages, corpora and recording conditions is still an open challenge. In this work we address this gap by exploring loss functions that aid in transferability, specifically to non-tonal languages.

icassp_2022_presentation.pdf (302)

Categories: Speech Processing

31 Views

Customer Satisfaction Estimation using Unsupervised Representation Learning with Multi-Format Prediction Loss

2204_ICASSP22_CSE_Unsupervised_v3.pdf (320)

Categories: Speech Processing

15 Views

A transfer learning approach to pronunciation scoring

Phone-level pronunciation scoring is a challenging task, with performance far from that of human annotators. Standard systems generate a score for each phone in a phrase using models trained for automatic speech recognition (ASR) with native data only. Better performance has been shown when using systems that are trained specifically for the task using non-native data. Yet, such systems face the challenge that datasets labelled for this task are scarce and usually small.

2022-ICASSP (1).pdf

GOP-FT (308)

Categories: Speech Processing

37 Views

THE VICOMTECH AUDIO DEEPFAKE DETECTION SYSTEM BASED ON WAV2VEC2 FOR THE 2022 ADD CHALLENGE

This paper describes the systems we submitted to tracks 1 and 2 of the 2022 ADD challenge. Our approach combines a pre-trained wav2vec2 feature extractor with a downstream classifier to detect spoofed audio. This method exploits the contextualized speech representations at the different transformer layers to fully capture discriminative information. Furthermore, the classification model is adapted to the application scenario using different data augmentation techniques.

poster_CHAL_5_6.pdf

Poster (262)

presentacion_ICASSP_22.pdf

Presentation (397)

Categories: Speech Processing

65 Views

CARINA – A CORPUS OF ALIGNED GERMAN READ SPEECH INCLUDING ANNOTATIONS

This paper presents the semi-automatically created Corpus of Aligned Read Speech Including Annotations (CARInA), a speech corpus based on the German Spoken Wikipedia Corpus (GSWC). CARInA tokenizes, consolidates and organizes the vast, but rather unstructured material contained in GSWC. The contents are grouped by annotation completeness, and extended by canonic, morphosyntactic and prosodic annotations. The annotations are provided in BPF and TextGrid format.

CARInA_Poster.pdf

Poster CARInA (281)

Categories: Speech Processing

30 Views

CURRICULUM OPTIMIZATION FOR LOW-RESOURCE SPEECH RECOGNITION

Curriculum_Learning_Optimization.pdf (259)

Categories: Speech Processing

14 Views

FAST-RIR: FAST NEURAL DIFFUSE ROOM IMPULSE RESPONSE GENERATOR

We present a neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment. FAST-RIR takes rectangular room dimensions, listener and speaker positions, and reverberation time as inputs and generates the specular and diffuse reflections for that environment. It can generate RIRs for a given input reverberation time with an average error of 0.02 s.

slides.pptx

Presentation Slides (285)

Categories: Audio and Acoustic Signal Processing, Speech Processing

70 Views

Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping

Automatic Speech Recognition (ASR) systems are known to exhibit difficulties when transcribing children's speech. This can mainly be attributed to the absence of large children’s speech corpora to train robust ASR models and the resulting domain mismatch when decoding children’s speech with systems trained on adult data. In this paper, we propose multiple enhancements to alleviate these issues. First, we propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.

Transfer_Learning_for_Robust_Low_Resource_Child_Speech_ASR_with_Transformers_and_Source_Filter_Warping__COPY2_-4.pdf (210)

icassp_2022_poster_v2.pdf (362)

Categories: Speech Processing

15 Views

Characterizing the adversarial vulnerability of speech self-supervised learning

The Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard that benchmarks the performance of a shared self-supervised learning (SSL) speech model across various downstream speech tasks with minimal modification of architectures and a small amount of data. It has fueled research on speech representation learning. SUPERB demonstrates that speech SSL upstream models improve the performance of various downstream tasks with only minimal adaptation.

AttackSSL-poster.pdf (225)

Categories: Speech Processing

8 Views

Adversarial sample detection for speaker verification by neural vocoders

Automatic speaker verification (ASV), one of the most important technologies for biometric identification, has been widely adopted in security-critical applications. However, ASV is seriously vulnerable to recently emerged adversarial attacks, and effective countermeasures against them are limited. In this paper, we adopt neural vocoders to spot adversarial samples for ASV.

Vocoder-report.pdf (224)

Categories: Speech Processing

19 Views