Audio Processing Systems

Self-supervised Speaker Verification Employing a Novel Clustering Algorithm

Clustering is an unsupervised learning technique that leverages large amounts of unlabeled data to learn cluster-wise representations from speech. One of the most popular self-supervised approaches to training a speaker verification system is to generate pseudo-labels with a clustering algorithm and then train the speaker embedding network on those pseudo-labels in a discriminative manner. The performance of pseudo-label-driven self-supervised speaker verification systems therefore relies heavily on the accuracy of the adopted clustering algorithm.
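The pseudo-labeling step described above can be sketched as follows. This is a minimal illustration that assumes plain k-means over precomputed speaker embeddings; it is not the novel clustering algorithm the paper proposes, and the function name `kmeans_pseudo_labels` and its parameters are hypothetical.

```python
import numpy as np

def kmeans_pseudo_labels(embeddings, num_clusters, num_iters=20, seed=0):
    """Assign a pseudo-label (cluster index) to each embedding with plain k-means.

    Assumption: generic k-means stands in for whatever clustering
    algorithm a real system would use.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen, distinct embeddings.
    init_idx = rng.choice(len(embeddings), size=num_clusters, replace=False)
    centroids = embeddings[init_idx].copy()

    def assign(points, centers):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(num_iters):
        labels = assign(embeddings, centroids)
        for k in range(num_clusters):
            members = embeddings[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    # Final assignment against the last centroid update.
    return assign(embeddings, centroids)

# Toy usage: two well-separated groups of synthetic "utterance embeddings".
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0.0, 0.1, (50, 8)), rng.normal(5.0, 0.1, (50, 8))])
pseudo_labels = kmeans_pseudo_labels(toy, num_clusters=2)
```

The resulting `pseudo_labels` would then serve as classification targets for the speaker embedding network, which is why clustering errors propagate directly into training.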

ICASSP2024_poster_CAMSAT.pdf (289)

Categories:: Audio Processing Systems

83 Views

“IT IS OKAY TO BE UNCOMMON”: QUANTIZING SOUND EVENT DETECTION NETWORKS ON HARDWARE ACCELERATORS WITH UNCOMMON SUB-BYTE SUPPORT

If our noise-canceling headphones can understand our audio environments, they can then inform us of important sound events, tune equalization based on the types of content we listen to, and dynamically adjust noise cancellation parameters based on audio scenes to further reduce distraction. However, running multiple audio understanding models on headphones with a limited energy budget and on-chip memory remains a challenging task.

Yushu-et-al-Poster-v2.pdf (187)

Categories:: Audio Processing Systems

10 Views

Hybrid Packet Loss Concealment for Real-Time Networked Music Applications

Real-time audio communications over IP have become essential to our daily lives. Packet-switched networks, however, are inherently prone to jitter and data losses, thus creating a strong need for effective packet loss concealment (PLC) techniques.

poster_icassp2024.pdf

PARCnet - IEEE ICASSP 2024 poster (220)

Categories:: Audio Processing Systems

20 Views

Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages

We introduce a new zero-resource code-switched speech benchmark designed to directly assess the code-switching capabilities of self-supervised speech encoders. We showcase a baseline system of language modeling on discrete units to demonstrate how the code-switching abilities of speech encoders can be assessed in a zero-resource manner. Our experiments encompass a variety of well-known speech encoders, including Wav2vec 2.0, HuBERT, and XLSR. We examine the impact of pre-training languages and model size on benchmark performance.

2310.03018.pdf

https://arxiv.org/abs/2310.03018 (228)

Categories:: Audio Processing Systems
Multilingual Recognition and Identification (SPE-MULT)
Language Modeling, for Speech and SLP (SLP-LANG)

20 Views

TF-SepNet: An Efficient 1D Kernel Design in CNNs for Low-Complexity Acoustic Scene Classification

Recent studies focus on developing efficient systems for acoustic scene classification (ASC) using convolutional neural networks (CNNs), which typically consist of consecutive kernels. This paper highlights the benefits of using separate kernels as a more powerful and efficient design approach in ASC tasks. Inspired by the time-frequency nature of audio signals, we propose TF-SepNet, a CNN architecture that separates the feature processing along the time and frequency dimensions. Features resulting from the separate paths are then merged by channels and forwarded directly to the classifier.
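The idea of processing time and frequency along separate 1D paths and merging the results by channels can be illustrated with a small NumPy sketch. It is not the TF-SepNet architecture itself: the function names, the single-channel input, and the fixed kernels are assumptions made for illustration only.

```python
import numpy as np

def conv1d_along_axis(feature_map, kernel, axis):
    """Apply a 1D kernel along one axis of a 2D map, keeping the input size.

    np.convolve flips the kernel (true convolution); for the symmetric
    kernels used here this equals the cross-correlation used in CNNs.
    """
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, feature_map
    )

def tf_separate_block(spec, freq_kernel, time_kernel):
    """Run separate frequency and time paths, then merge them as channels.

    spec has shape (freq_bins, time_frames); the result has shape
    (2, freq_bins, time_frames), one channel per path.
    """
    freq_path = conv1d_along_axis(spec, freq_kernel, axis=0)  # along frequency
    time_path = conv1d_along_axis(spec, time_kernel, axis=1)  # along time
    return np.stack([freq_path, time_path], axis=0)

# Toy usage on a hypothetical 40-bin, 100-frame spectrogram.
spec = np.random.default_rng(0).random((40, 100))
smoothing = np.array([0.25, 0.5, 0.25])
merged = tf_separate_block(spec, smoothing, smoothing)
```

One reason such a design can be cheaper: a pair of 3-tap 1D kernels holds 6 weights, whereas a single 3x3 2D kernel holds 9.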

ICASSP2024_Poster_1.pdf (187)

Categories:: Audio Processing Systems

24 Views

INVERTIBLE VOICE CONVERSION WITH PARALLEL DATA

This paper introduces an innovative deep learning framework for parallel voice conversion to mitigate inherent risks associated with such systems. Our approach focuses on developing an invertible model capable of countering potential spoofing threats. Specifically, we present a conversion model that allows for the retrieval of source voices, thereby facilitating the identification of the source speaker. This framework is constructed using a series of invertible modules composed of affine coupling layers to ensure the reversibility of the conversion process.
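Why affine coupling layers make a model reversible can be shown in a few lines. The sketch below is a generic affine coupling layer, not the paper's conversion model; `scale_net` and `shift_net` are hypothetical stand-ins for arbitrary (and not necessarily invertible) sub-networks.

```python
import numpy as np

def coupling_forward(x, scale_net, shift_net):
    """Transform the second half of x conditioned on the untouched first half."""
    x_a, x_b = np.split(x, 2, axis=-1)
    log_s = scale_net(x_a)
    t = shift_net(x_a)
    y_b = x_b * np.exp(log_s) + t
    return np.concatenate([x_a, y_b], axis=-1)

def coupling_inverse(y, scale_net, shift_net):
    """Recover x exactly: the first half passes through unchanged, so the
    same scale and shift can be recomputed and undone."""
    y_a, y_b = np.split(y, 2, axis=-1)
    log_s = scale_net(y_a)
    t = shift_net(y_a)
    x_b = (y_b - t) * np.exp(-log_s)
    return np.concatenate([y_a, x_b], axis=-1)

# Toy usage with random linear maps standing in for the sub-networks.
rng = np.random.default_rng(0)
w_scale = rng.normal(size=(4, 4))
w_shift = rng.normal(size=(4, 4))
scale_net = lambda a: np.tanh(a @ w_scale)  # bounded log-scale for stability
shift_net = lambda a: a @ w_shift

source = rng.normal(size=(3, 8))
converted = coupling_forward(source, scale_net, shift_net)
recovered = coupling_inverse(converted, scale_net, shift_net)
```

Because each layer has an exact inverse, a stack of such layers is also invertible, which is the property that allows a source signal to be retrieved from a converted one.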

zexin_poster.pdf (240)

Categories:: Audio Processing Systems

82 Views

The Design of Ultra-high Quality Feedback Amplifiers

Although dated, this student thesis is re-published because the proposed negative feedback topology and the current-mode arrangement of silicon bipolar junction transistors are rarely elaborated in the many excellent contemporary books on audio power amplifier design.

BSEE Thesis Tulane April 1974.pdf

High Performance Audio Power Amplifier (248)

Categories:: Other
Audio Processing Systems

78 Views

AERO: Audio Super Resolution in the Spectral Domain

ICASSP_presentation.pdf (326)

Categories:: Audio Processing Systems

73 Views

Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

Audio Spectrogram Transformer models dominate the field of audio tagging, outperforming the previously dominant Convolutional Neural Networks (CNNs). Their superiority stems from their ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers.
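A distillation objective of this kind can be sketched as a weighted sum of a label loss and a teacher-matching loss. The formulation below is a common generic one, assumed here for illustration; the binary cross-entropy terms, the `kd_weight` value, and all function names are assumptions rather than details confirmed by the abstract. In an offline setup, the teacher logits would be computed once and stored rather than recomputed during student training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be hard labels or soft probabilities."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(np.mean(-(targets * np.log(probs)
                           + (1.0 - targets) * np.log(1.0 - probs))))

def distillation_loss(student_logits, teacher_logits, labels, kd_weight=0.9):
    """Weighted sum of the label loss and the teacher-matching loss.

    Assumption: multi-label tagging, so each class gets an independent sigmoid.
    """
    student_probs = sigmoid(student_logits)
    teacher_probs = sigmoid(teacher_logits)  # soft targets from the teacher
    label_loss = binary_cross_entropy(student_probs, labels)
    kd_loss = binary_cross_entropy(student_probs, teacher_probs)
    return (1.0 - kd_weight) * label_loss + kd_weight * kd_loss

# Toy usage: one clip, three classes, precomputed (stored) teacher logits.
labels = np.array([[1.0, 0.0, 0.0]])
teacher_logits = np.array([[3.0, -2.0, 0.5]])
student_logits = np.array([[1.0, -1.0, 0.0]])
loss = distillation_loss(student_logits, teacher_logits, labels)
```

The soft teacher targets carry information that the hard labels lack, such as which classes the teacher considers partially present.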

Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation.pdf (258)

Categories:: Audio Processing Systems

32 Views

MASKED MODELING DUO: LEARNING REPRESENTATIONS BY ENCOURAGING BOTH NETWORKS TO MODEL THE INPUT

The Masked Autoencoder is a simple yet powerful self-supervised learning method. However, it learns representations indirectly by reconstructing masked input patches. Several methods learn representations directly by predicting representations of masked patches; however, we think using all patches to encode training signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M2D), that learns representations directly while obtaining training signals using only masked patches.
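The patch split that such masked-modeling methods rely on can be sketched as below. This shows only the partition into disjoint visible and masked patch sets; how M2D routes each set through its networks is not specified in the abstract, so the comments on that point are an assumed reading. The function name `split_patches` and the mask ratio are hypothetical.

```python
import numpy as np

def split_patches(num_patches, mask_ratio, rng):
    """Randomly partition patch indices into disjoint visible and masked sets."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    masked_idx = np.sort(perm[:num_masked])
    visible_idx = np.sort(perm[num_masked:])
    return visible_idx, masked_idx

# Toy usage: a spectrogram cut into 10 hypothetical patches of 16 features each.
rng = np.random.default_rng(0)
patches = rng.random((10, 16))
visible_idx, masked_idx = split_patches(len(patches), mask_ratio=0.6, rng=rng)

# Assumed reading of the abstract: one network encodes only the visible
# patches, while the training signal is derived from the masked patches alone.
visible_patches = patches[visible_idx]
masked_patches = patches[masked_idx]
```

Keeping the two sets disjoint means the training signal never includes the patches that the prediction side has already seen.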

20230607-ICASSP2023-M2D-Poster.pptx (268)

Categories:: Audio Processing Systems
Neural network learning (MLR-NNLR)

24 Views
