General Topics in Speech Recognition (SPE-GASR)

FOLLOWING THE EMBEDDING: IDENTIFYING TRANSITION PHENOMENA IN WAV2VEC 2.0 REPRESENTATIONS OF SPEECH AUDIO

Although transformer-based models have improved the state-of-the-art in speech recognition, it is still not well understood what information from the speech signal these models encode in their latent representations. This study investigates the potential of using labelled data (TIMIT) to probe wav2vec 2.0 embeddings for insights into the encoding and visualisation of speech signal information at phone boundaries. Our experiment involves training probing models to detect phone-specific articulatory features in the hidden layers based on IPA classifications.

ICASSP2024_poster_follwing_the_embedding.pdf (274)

Categories: Speech Analysis (SPE-ANLS)
General Topics in Speech Recognition (SPE-GASR)
Other

81 Views

BRAVEn: Improving Self-supervised Pre-training for Visual and Auditory Speech Recognition

Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour by increasing the amount of unlabelled data well beyond other self-supervised works.

icassp slides.pptx (173)

Categories: General Topics in Speech Recognition (SPE-GASR)

28 Views

CIF-RNNT: Streaming ASR via Acoustic Word Embeddings with Continuous Integrate-and-Fire and RNN-Transducers

This paper introduces CIF-RNNT, a model that incorporates Continuous Integrate-and-Fire (CIF) into RNN-Transducers (RNNTs) for streaming ASR via acoustic word embeddings (AWEs). CIF can dynamically compress long sequences into shorter ones, while RNNTs can produce multiple symbols given an input vector. We demonstrate that our model can not only segment acoustic information and produce AWEs in a streaming fashion, but also recover the represented word using a fixed set of output tokens with a shorter decoding time.
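To illustrate the compression step mentioned in the abstract, here is a minimal sketch of an integrate-and-fire accumulation in plain Python. It is not the authors' implementation: the function name `cif`, the firing threshold of 1.0, and the assumption that every per-frame weight is below the threshold (so a frame triggers at most one firing) are illustrative choices that the listing does not specify.

```python
from typing import List


def cif(frames: List[List[float]], alphas: List[float],
        threshold: float = 1.0) -> List[List[float]]:
    """Integrate weighted frames and emit one vector each time the
    accumulated weight reaches `threshold` (assumed: each alpha < threshold)."""
    dim = len(frames[0]) if frames else 0
    fired: List[List[float]] = []
    acc_weight = 0.0
    acc_vec = [0.0] * dim
    for h, a in zip(frames, alphas):
        if acc_weight + a < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += a
            acc_vec = [v + a * x for v, x in zip(acc_vec, h)]
        else:
            # Fire: use only the part of this frame's weight that completes the unit.
            used = threshold - acc_weight
            fired.append([v + used * x for v, x in zip(acc_vec, h)])
            # The leftover weight starts the next accumulation.
            rest = a - used
            acc_weight = rest
            acc_vec = [rest * x for x in h]
    return fired


# Four 1-dimensional frames are compressed into two integrated vectors.
example = cif([[1.0], [3.0], [2.0], [4.0]], [0.5, 0.5, 0.25, 0.75])
print(example)  # prints [[2.0], [3.5]]
```

Any weight left over at the end of the sequence is simply dropped in this sketch; a real streaming system would need a policy for that tail.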

ICASSP2024 Poster v240410.pdf

CIF-RNNT: Streaming ASR via Acoustic Word Embeddings with Continuous Integrate-and-Fire and RNN-Transducers (266)

Categories: General Topics in Speech Recognition (SPE-GASR)

86 Views

INVESTIGATING THE CLUSTERS DISCOVERED BY PRE-TRAINED AV-HUBERT

Self-supervised models, such as HuBERT and its audio-visual version AV-HuBERT, have demonstrated excellent performance on various tasks. The main factor for their success is the pre-training procedure, which requires only raw data without human transcription. During the self-supervised pre-training phase, HuBERT is trained to discover latent clusters in the training data, but these clusters are discarded, and only the last hidden layer is used by the conventional finetuning step.

ICASSP_2024_AV_HuBERT.pdf (149)

Categories: General Topics in Speech Recognition (SPE-GASR)

32 Views

USM-Lite: Quantization And Sparsity Aware Fine-Tuning For Speech Recognition With Universal Speech Models

End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios.

[ICASSP 2024] USM-Lite.pdf (126)

Categories: General Topics in Speech Recognition (SPE-GASR)

17 Views

Unimodal Aggregation for CTC-based Speech Recognition

This paper addresses non-autoregressive automatic speech recognition. A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token, and thus to learn better feature representations for text tokens. The frame-wise features and weights are both derived from an encoder. The feature frames with unimodal weights are then integrated and further processed by a decoder. Connectionist temporal classification (CTC) loss is applied for training.
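The segment-and-integrate idea can be sketched in a few lines of plain Python. The abstract does not say how segments are delimited or how frames are combined, so this sketch assumes that a new segment starts at each local minimum ("valley") of the frame weights and that the frames of a segment are merged by a weighted average. The function name `uma_aggregate` is hypothetical, and the weights are assumed to be strictly positive.

```python
from typing import List


def uma_aggregate(frames: List[List[float]],
                  weights: List[float]) -> List[List[float]]:
    """Split frames at weight valleys (assumption) and merge each segment
    with a weighted average (assumption). Weights must be positive."""
    if not frames:
        return []
    dim = len(frames[0])
    # A valley is a frame whose weight is lower than its left neighbour
    # and not higher than its right neighbour; it opens a new segment.
    starts = [0]
    for t in range(1, len(weights) - 1):
        if weights[t] < weights[t - 1] and weights[t] <= weights[t + 1]:
            starts.append(t)
    bounds = starts + [len(frames)]
    merged: List[List[float]] = []
    for s, e in zip(bounds, bounds[1:]):
        total = sum(weights[s:e])
        merged.append([
            sum(weights[t] * frames[t][d] for t in range(s, e)) / total
            for d in range(dim)
        ])
    return merged


# Six frames whose weights rise and fall twice collapse into two vectors.
feats = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
w = [0.25, 0.5, 0.25, 0.125, 0.5, 0.375]
print(uma_aggregate(feats, w))  # prints [[2.0], [5.25]]
```

In this sketch each rise-and-fall of the weights yields one output vector, which is the sense in which the weights within a segment are unimodal.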

fangying_UMA_poster4.0.pdf

UMA Poster for ICASSP 2024 (201)

Categories: General Topics in Speech Recognition (SPE-GASR)

29 Views

ENABLING ON-DEVICE TRAINING OF SPEECH RECOGNITION MODELS WITH FEDERATED DROPOUT

Federated learning can be used to train machine learning models on the edge on local data that never leave devices, providing privacy by default. This presents a challenge pertaining to the communication and computation costs associated with clients’ devices. These costs are strongly correlated with the size of the model being trained, and are significant for state-of-the-art automatic speech recognition models. We propose using federated dropout to reduce the size of client models while training a full-size model server-side.
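As a rough illustration of training smaller client models against a full-size server model, the sketch below works on a single weight matrix in plain Python. It is a sketch under stated assumptions, not the authors' method: the helper names (`sample_kept`, `extract`, `aggregate`), the uniform random choice of kept units, and the per-coordinate averaging rule are all hypothetical, since the abstract does not describe how sub-models are chosen or merged.

```python
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def sample_kept(n: int, keep_rate: float, rng: random.Random) -> List[int]:
    """Pick which of n units a client keeps (assumed: uniform sampling)."""
    k = max(1, round(n * keep_rate))
    return sorted(rng.sample(range(n), k))


def extract(full: Matrix, rows: Sequence[int], cols: Sequence[int]) -> Matrix:
    """Build the smaller client matrix from the kept rows and columns."""
    return [[full[r][c] for c in cols] for r in rows]


def aggregate(full: Matrix,
              results: List[Tuple[Sequence[int], Sequence[int], Matrix]]) -> Matrix:
    """Average client sub-matrices back into the full matrix.
    Coordinates that no client trained keep their previous server value."""
    n_rows, n_cols = len(full), len(full[0])
    sums = [[0.0] * n_cols for _ in range(n_rows)]
    counts = [[0] * n_cols for _ in range(n_rows)]
    for rows, cols, sub in results:
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                sums[r][c] += sub[i][j]
                counts[r][c] += 1
    return [
        [sums[r][c] / counts[r][c] if counts[r][c] else full[r][c]
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


# One round: a client receives a half-size sub-matrix of a 4x4 server matrix.
rng = random.Random(0)
server = [[float(4 * r + c) for c in range(4)] for r in range(4)]
rows, cols = sample_kept(4, 0.5, rng), sample_kept(4, 0.5, rng)
client = extract(server, rows, cols)  # 2x2 matrix sent to the client
# Local training would update `client` here; this sketch sends it back unchanged.
server = aggregate(server, [(rows, cols, client)])
```

The communication saving comes from `extract`: each client sends and receives only the kept rows and columns, while the server still holds the full matrix.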

[Poster] Enabling On-Device Training of Speech Recognition Models with Federated Dropout (1).pdf (491)

Categories: General Topics in Speech Recognition (SPE-GASR)
Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)

40 Views

ENABLING ON-DEVICE TRAINING OF SPEECH RECOGNITION MODELS WITH FEDERATED DROPOUT

[Presentation] ICASSP 2022 Federated Dropout.pdf (315)

Categories: General Topics in Speech Recognition (SPE-GASR)

31 Views

SPE-89.4: UNSUPERVISED DATA SELECTION FOR SPEECH RECOGNITION WITH CONTRASTIVE LOSS RATIOS

This paper proposes an unsupervised data selection method by using a submodular function based on contrastive loss ratios of target and training data sets. A model using a contrastive loss function is trained on both sets. Then the ratio of frame-level losses for each model is used by a submodular function. By using the submodular function, a training set for automatic speech recognition matching the target data set is selected.

park_ICASSP2022_in_person_poster.pdf

Poster for the in-person conference of ICASSP 2022 (253)

Presentation slides for the in-person conference of ICASSP 2022.pdf

Presentation slides for the in-person conference of ICASSP 2022 (311)

Categories: General Topics in Speech Recognition (SPE-GASR)

41 Views

Speech Recognition Using Biologically-Inspired Neural Networks

Automatic speech recognition (ASR) systems, such as the recurrent neural network transducer (RNN-T), have reached close to human-like performance and are deployed in commercial applications. However, their core operations depart from those of their powerful biological counterpart, the human brain. On the other hand, current biologically-inspired ASR models lag behind in terms of accuracy and focus primarily on small-scale applications.

Poster_page1.pdf (228)

Categories: General Topics in Speech Recognition (SPE-GASR)

18 Views
