Language Modeling, for Speech and SLP (SLP-LANG)

VOXTLM: UNIFIED DECODER-ONLY MODELS FOR CONSOLIDATING SPEECH RECOGNITION, SYNTHESIS AND SPEECH, TEXT CONTINUATION TASKS

We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with improvements in both speech intelligibility from 28.9 to 5.6 and objective quality from 2.68 to 3.90.

voxtlm_icassp2024_presentation.pptx

voxtlm_icassp2024_presentation.pptx (187)

Categories:: Language Modeling, for Speech and SLP (SLP-LANG)
Speech Synthesis and Generation, including TTS (SPE-SYNT)

18 Views

VOXTLM: UNIFIED DECODER-ONLY MODELS FOR CONSOLIDATING SPEECH RECOGNITION, SYNTHESIS AND SPEECH, TEXT CONTINUATION TASKS

Presentation.pptx

Presentation.pptx (161)

Categories:: Speech Synthesis and Generation, including TTS (SPE-SYNT)
Language Modeling, for Speech and SLP (SLP-LANG)

29 Views

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average relative WER improvement across all languages of 10.8% on FLEURS and 3.3% on YouTube captioning.

ICASSP2024_slides.pdf

Slides (119)

Categories:: Language Modeling, for Speech and SLP (SLP-LANG)

35 Views

Towards a World-English Language Model for On-Device Virtual Assistants

Read more about Towards a World-English Language Model for On-Device Virtual Assistants
Log in to post comments

Neural Network Language Models (NNLMs) for Virtual Assistants (VAs) are generally language-, region-, and in some cases, device-dependent, which increases the effort to scale and maintain them. Combining NNLMs for one or more of the categories is one way to improve scalability. In this work, we combine regional variants of English to build a ``World English'' NNLM for on-device VAs. In particular, we investigate the application of adapter bottlenecks to model dialect-specific characteristics in our existing production NNLMs and enhance the multi-dialect baselines.

ICASSP'24 Poster - Vertical -2023 copy 2.pdf

ICASSP'24 Poster - Vertical -2023 copy 2.pdf (115)

Categories:: Language Modeling, for Speech and SLP (SLP-LANG)

22 Views

MULTIPLE REPRESENTATION TRANSFER FROM LARGE LANGUAGE MODELS TO END-TO-END ASR SYSTEMS

Read more about MULTIPLE REPRESENTATION TRANSFER FROM LARGE LANGUAGE MODELS TO END-TO-END ASR SYSTEMS
Log in to post comments

Transferring the knowledge of large language models (LLMs) is a promising technique to incorporate linguistic knowledge into end-to-end automatic speech recognition (ASR) systems. However, existing works only transfer a single representation of LLM (e.g. the last layer of pretrained BERT), while the representation of a text is inherently non-unique and can be obtained variously from different layers, contexts and models. In this work, we explore a wide range of techniques to obtain and transfer multiple representations of LLMs into a transducer-based ASR system.

icassp2023_poster.pdf

icassp2023_poster.pdf (227)

Categories:: Language Modeling, for Speech and SLP (SLP-LANG)

10 Views

Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages

We introduce a new zero resource code-switched speech benchmark designed to directly assess the code-switching capabilities of self-supervised speech encoders. We showcase a baseline system of language modeling on discrete units to demonstrate how the code-switching abilities of speech encoders can be assessed in a zero-resource manner. Our experiments encompass a variety of well-known speech encoders, including Wav2vec 2.0, HuBERT, XLSR, etc. We examine the impact of pre-training languages and model size on benchmark performance.

2310.03018.pdf

https://arxiv.org/abs/2310.03018 (170)

Categories:: Audio Processing Systems
Multilingual Recognition and Identification (SPE-MULT)
Language Modeling, for Speech and SLP (SLP-LANG)

20 Views

Interpreting Intermediate Convolutional Layers of Generative CNNs Trained on Waveforms

This paper presents a technique to interpret and visualize intermediate layers in generative CNNs trained on raw speech data in an unsupervised manner. We argue that averaging over feature maps after ReLU activation in each transpose convolutional layer yields interpretable time-series data. This technique allows for acoustic analysis of intermediate layers that parallels the acoustic analysis of human speech data: we can extract F0, intensity, duration, formants, and other acoustic properties from intermediate layers in order to test where and how CNNs encode various types of information.

begus ICASSP 2023 talk.pdf

begus ICASSP 2023 talk.pdf (280)

Categories:: Neural network learning (MLR-NNLR)
Cognitive information processing (MLR-COGP)
Speech Synthesis and Generation, including TTS (SPE-SYNT)
Human Spoken Language Acquisition, Development and Learning (SLP-LADL)
Language Modeling, for Speech and SLP (SLP-LANG)

38 Views

Articulation GAN: Unsupervised Modeling of Articulatory Learning

Read more about Articulation GAN: Unsupervised Modeling of Articulatory Learning
Log in to post comments

Generative deep neural networks are widely used for speech synthesis, but most existing models directly generate waveforms or spectral outputs. Humans, however, produce speech by controlling articulators, which results in the production of speech sounds through physical properties of sound propagation. We introduce the Articulatory Generator to the Generative Adversarial Network paradigm, a new unsupervised generative model of speech production/synthesis.

Begus Zhou Wu Anumanchipalli 5406 Articulation GAN ICASSP 2023.pdf

Begus Zhou Wu Anumanchipalli 5406 Articulation GAN ICASSP 2023.pdf (286)

Categories:: Speech Production (SPE-SPRD)
Speech Synthesis and Generation, including TTS (SPE-SYNT)
Human Spoken Language Acquisition, Development and Learning (SLP-LADL)
Language Modeling, for Speech and SLP (SLP-LANG)
Bioacoustics and Medical Acoustics

78 Views

Integrating multiple ASR systems into NLP backend with attention fusion

Read more about Integrating multiple ASR systems into NLP backend with attention fusion
Log in to post comments

ICASSP_kano_v3.pdf

ICASSP_kano_v3.pdf (227)

Categories:: Machine Translation of Speech (SLP-SSMT)
Language Modeling, for Speech and SLP (SLP-LANG)

27 Views

SELF SUPERVISED REPRESENTATION LEARNING WITH DEEP CLUSTERING FOR ACOUSTIC UNIT DISCOVERY FROM RAW SPEECH

The automatic discovery of acoustic sub-word units from raw speech, without any text or labels, is a growing field of research. The key challenge is to derive representations of speech that can be categorized into a small number of phoneme-like units which are speaker invariant and can broadly capture the content variability of speech. In this work, we propose a novel neural network paradigm that uses the deep clustering loss along with the autoregressive con- trastive predictive coding (CPC) loss. Both the loss functions, the CPC and the clustering loss, are self-supervised.

ICASSP_Varun.pptx

ICASSP_Varun.pptx (182)

Categories:: Language Modeling, for Speech and SLP (SLP-LANG)

26 Views

Language Modeling, for Speech and SLP (SLP-LANG)

Pages