CIF-RNNT: Streaming ASR via Acoustic Word Embeddings with Continuous Integrate-and-Fire and RNN-Transducers

Citation Author(s):
Yasuhiro Minami
Submitted by:
Wen Shen Teo
Last updated:
14 April 2024 - 11:35pm
Document Type:
Document Year:
Teo Wen Shen

This paper introduces CIF-RNNT, a model that incorporates Continuous Integrate-and-Fire (CIF) into RNN-Transducers (RNNTs) for streaming ASR via acoustic word embeddings (AWEs). CIF dynamically compresses long sequences into shorter ones, while RNNTs can produce multiple symbols from a single input vector. We demonstrate that our model can not only segment acoustic information and produce AWEs in a streaming fashion, but also recover the word each AWE represents using a fixed set of output tokens, with a shorter decoding time. Moreover, we improve CIF with new mechanisms that outperform conventional ones when evaluated on Japanese and English ASR datasets. As the first attempt at combining CIF with RNNT, this paper advances our understanding of how CIF's dynamic compression can be applied to obtain AWEs for streaming ASR, and paves the way for speech and text integration via words instead of architecturally confined tokens.
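
To make the compression step concrete, the following is a minimal sketch of the conventional CIF accumulation rule, not the improved mechanisms proposed in the paper. It assumes per-frame weights are already given (in a real model they would be predicted by the network) and that each weight lies between 0 and the firing threshold. The function name `cif`, the threshold of 1.0, and the toy inputs are illustrative assumptions.

```python
def cif(frames, weights, threshold=1.0):
    """Conventional CIF sketch: integrate weighted frames and fire one
    output vector each time the accumulated weight reaches the threshold.

    Assumes 0 <= weight <= threshold, so a frame triggers at most one fire.
    """
    dim = len(frames[0])
    acc_weight = 0.0            # weight integrated since the last fire
    acc_vector = [0.0] * dim    # weighted sum of frames since the last fire
    outputs = []
    for frame, weight in zip(frames, weights):
        if acc_weight + weight < threshold:
            # Below threshold: keep integrating.
            acc_weight += weight
            acc_vector = [v + weight * x for v, x in zip(acc_vector, frame)]
        else:
            # Threshold reached: spend just enough weight to fire,
            # then carry the remainder into the next segment.
            used = threshold - acc_weight
            outputs.append([v + used * x for v, x in zip(acc_vector, frame)])
            rest = weight - used
            acc_weight = rest
            acc_vector = [rest * x for x in frame]
    return outputs


# Toy input (hypothetical values): three 2-dimensional frames.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.75, 0.5, 0.75]
embeddings = cif(frames, weights)
print(len(frames), "frames ->", len(embeddings), "embeddings")
```

In this sketch the second frame straddles a boundary: part of its weight completes the first output and the remainder starts the second, which is how CIF can place segment boundaries inside a frame while still processing the input left to right.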
