Masked Language Modeling (MLM) is widely used to pretrain language models. However, the standard random masking strategy in MLM biases pre-trained language models (PLMs) towards high-frequency tokens: representation learning of rare tokens is poor, which limits PLM performance on downstream tasks. To alleviate this frequency bias, we propose two simple and effective Weighted Sampling strategies for masking tokens, based on token frequency and training loss respectively. We apply these two strategies to BERT and obtain Weighted-Sampled BERT (WSBERT).
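
As a rough illustration of the first strategy (a hypothetical sketch, not the paper's exact formulation), the snippet below samples mask positions with probability inversely related to corpus frequency, so rare tokens are masked more often; the helper name `sample_mask_positions` and the exponent `alpha` are illustrative assumptions.

```python
import numpy as np

def sample_mask_positions(token_ids, token_freqs, mask_rate=0.15, alpha=0.5):
    """Pick positions to mask, weighting inversely by corpus frequency.

    token_ids:   sequence of token ids for one training example
    token_freqs: dict mapping token id -> corpus frequency (precomputed)
    alpha:       exponent controlling how strongly rarity is favored (assumed)
    """
    freqs = np.array([token_freqs.get(int(t), 1) for t in token_ids], dtype=float)
    weights = freqs ** (-alpha)                 # rare tokens -> larger weight
    probs = weights / weights.sum()
    n_mask = max(1, round(mask_rate * len(token_ids)))
    return np.random.choice(len(token_ids), size=n_mask, replace=False, p=probs)

# Token 7 is rare in the corpus, so its position is chosen far more often.
positions = sample_mask_positions([3, 3, 3, 7, 3, 3], {3: 1_000_000, 7: 50})
```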

This paper introduces neural architecture search (NAS) for the automatic discovery of end-to-end keyword spotting (KWS) models in limited-resource environments. We employ a differentiable NAS approach to optimize the structure of convolutional neural networks (CNNs) operating on raw audio waveforms. Once a suitable KWS model is found with NAS, we quantize its weights and activations to reduce the memory footprint. We conduct extensive experiments on the Google Speech Commands dataset.
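
To make the differentiable-NAS idea concrete, here is a minimal DARTS-style "mixed operation" sketch in PyTorch: each candidate op is weighted by a softmax over learnable architecture parameters, so the structure can be optimized by gradient descent. The candidate set and channel count are illustrative assumptions, not the paper's search space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softmax-weighted sum of candidate ops; the architecture parameters
    `alpha` are trained jointly with the network weights."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2),
            nn.Identity(),                       # skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

x = torch.randn(1, 16, 16000)   # (batch, channels, raw-audio samples)
y = MixedOp(16)(x)
```

After the search, only the op with the largest architecture weight is typically kept at each position, and the weights and activations of the resulting network can then be quantized (e.g., to 8 bits) to shrink the memory footprint.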

This paper describes a method for voice conversion (VC) based on sequence-to-sequence (Seq2Seq) learning with attention and a context preservation mechanism. Seq2Seq models have excelled at numerous sequence modeling tasks, such as speech synthesis and recognition, machine translation, and image captioning.
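
For readers unfamiliar with the attention step, the sketch below shows plain dot-product attention over encoder states, a simplified stand-in for the paper's attention and context preservation mechanism; shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    """decoder_state: (batch, dim); encoder_states: (batch, time, dim).
    Returns a context vector: an attention-weighted sum of encoder states."""
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)                       # (batch, time)
    return torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)

context = attend(torch.randn(2, 256), torch.randn(2, 100, 256))
```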

Voice-controlled household devices, like Amazon Echo or Google Home, face the problem of performing speech recognition of device-directed speech in the presence of interfering background speech; i.e., background noise and interfering speech from another person or a nearby media device must be ignored. We propose two end-to-end models that tackle this problem using information extracted from the “anchored segment”.
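
The abstract does not spell out the architecture, but one plausible reading (purely an assumption, sketched below) is to summarize the anchored segment into an embedding and condition the recognizer's encoder on it at every frame, so speech unlike the anchor can be suppressed; all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class AnchorConditionedEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.anchor_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.main_rnn = nn.GRU(feat_dim + hidden, hidden, batch_first=True)

    def forward(self, anchor_feats, utterance_feats):
        # Summarize the anchored (wake-word) segment into one embedding.
        _, h = self.anchor_rnn(anchor_feats)
        anchor_emb = h[-1]                                    # (batch, hidden)
        # Broadcast the anchor embedding to every utterance frame.
        tiled = anchor_emb.unsqueeze(1).expand(-1, utterance_feats.size(1), -1)
        out, _ = self.main_rnn(torch.cat([utterance_feats, tiled], dim=-1))
        return out

enc = AnchorConditionedEncoder()
out = enc(torch.randn(2, 50, 80), torch.randn(2, 300, 80))  # anchor, utterance
```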

Robustness to errors produced by automatic speech recognition (ASR) is essential for Spoken Language Understanding (SLU). Traditional robust SLU typically requires ASR hypotheses with semantic annotations for training. However, semantic annotation is very expensive, and the corresponding ASR system may change frequently. Here, we propose a novel unsupervised ASR-error adaptation method that obviates the need for annotated ASR hypotheses.

In this paper, we present a novel deep multimodal framework to predict human emotions from sentence-level spoken language. Our architecture has two distinctive characteristics. First, it extracts high-level features from both text and audio via a hybrid deep multimodal structure, which considers spatial information from text, temporal information from audio, and high-level associations from low-level handcrafted features.
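
A minimal sketch of such a hybrid structure might look as follows: a 1-D CNN over word embeddings captures the "spatial" text information, an LSTM over audio frames captures temporal information, and the two are concatenated with low-level handcrafted features before classification. All dimensions and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultimodalEmotionNet(nn.Module):
    def __init__(self, emb_dim=300, audio_dim=40, hand_dim=32, n_classes=4):
        super().__init__()
        self.text_cnn = nn.Conv1d(emb_dim, 64, kernel_size=3, padding=1)
        self.audio_lstm = nn.LSTM(audio_dim, 64, batch_first=True)
        self.classifier = nn.Linear(64 + 64 + hand_dim, n_classes)

    def forward(self, text_emb, audio_feats, handcrafted):
        # text_emb: (batch, words, emb_dim); Conv1d wants (batch, emb_dim, words)
        t = self.text_cnn(text_emb.transpose(1, 2)).max(dim=-1).values
        _, (h, _) = self.audio_lstm(audio_feats)              # h: (1, batch, 64)
        fused = torch.cat([t, h[-1], handcrafted], dim=-1)
        return self.classifier(fused)

net = MultimodalEmotionNet()
logits = net(torch.randn(2, 20, 300), torch.randn(2, 150, 40), torch.randn(2, 32))
```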

Bidirectional long short-term memory (BLSTM) recurrent neural networks (RNNs) have recently outperformed other state-of-the-art approaches, such as i-vectors and deep neural networks (DNNs), in automatic language identification (LID), particularly when testing on very short utterances (∼3 s). Mismatched conditions between training and test data, e.g., speaker, channel, duration, and environmental noise, are a major source of performance degradation for LID.
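
As a simple illustration of such a model, the sketch below classifies a short utterance with a BLSTM followed by mean pooling over time; the layer sizes and pooling readout are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BLSTMLanguageID(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_langs=8):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_langs)

    def forward(self, feats):
        # feats: (batch, time, feat_dim), e.g. ~300 frames for a 3 s utterance
        h, _ = self.blstm(feats)
        return self.out(h.mean(dim=1))    # average over time, then classify

model = BLSTMLanguageID()
scores = model(torch.randn(2, 300, 40))   # (batch, n_langs) language scores
```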
