CIF-RNNT: Streaming ASR via Acoustic Word Embeddings with Continuous Integrate-and-Fire and RNN-Transducers

This paper introduces CIF-RNNT, a model that incorporates Continuous Integrate-and-Fire into RNN-Transducers (RNNTs) for streaming ASR via acoustic word embeddings (AWEs). CIF can dynamically compress long sequences into shorter ones, while RNNTs can produce multiple symbols given an input vector. We demonstrate that our model can not only streamingly segment acoustic information and produce AWEs, but also recover the represented word using a fixed set of output tokens with a shorter decoding time. Moreover, we improved CIF with new mechanisms that outperformed conventional ones when evaluated on Japanese and English ASR datasets. As the first attempt at combining CIF with RNNT, this paper advances our understanding of applying CIF’s dynamic compression capabilities to obtain AWEs for streaming ASR and paves the way for speech and text integration via words instead of architecturally confined tokens.

ICASSP2024 Poster v240410.pdf

CIF-RNNT: Streaming ASR via Acoustic Word Embeddings with Continuous Integrate-and-Fire and RNN-Transducers (552)

Thumbs Up

CITE

Documents

Poster

CIF-RNNT: Streaming ASR via Acoustic Word Embeddings with Continuous Integrate-and-Fire and RNN-Transducers

ICASSP2024 Poster v240410.pdf

QUESTIONS?