Investigating End-to-end ASR Architectures for Long form Audio Transcription
- DOI:
- 10.60864/s9sh-5f97
- Submitted by:
- Somshubra Majumdar
- Last updated:
- 6 June 2024 - 10:21am
- Document Type:
- Presentation Slides
- Document Year:
- 2024
- Presenters:
- Somshubra Majumdar
- Paper Code:
- SS-L22.5
This paper presents an overview and evaluation of end-to-end ASR models on long-form audio. We study three categories of Automatic Speech Recognition (ASR) models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation, and (3) convolutional with attention. We selected one ASR model from each category and evaluated Word Error Rate (WER), maximum audio length, and real-time factor for each model on a variety of long-form audio benchmarks: Earnings-21, Earnings-22, CORAAL, and TED-LIUM 3. The model using self-attention with local attention and a global token achieves the best accuracy compared to the other architectures. We also compared models with CTC and RNNT decoders and showed that CTC-based models are more robust and efficient than RNNT on long-form audio.
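For reference, the Word Error Rate reported above is the word-level edit distance (substitutions + deletions + insertions) between the model's transcript and the reference, divided by the number of reference words. A minimal sketch of this computation (an illustration, not code from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (subs + dels + ins) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

The real-time factor mentioned alongside WER is simply processing time divided by audio duration, so values below 1.0 mean faster-than-real-time transcription.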