SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation
- DOI:
- 10.60864/q632-wm31
- Submitted by:
- Subhankar Ghosh
- Last updated:
- 6 June 2024 - 10:24am
- Document Type:
- Presentation Slides
- Document Year:
- 2024
- Presenters:
- Zhehuai Chen
- Paper Code:
- SS-L11.1
We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, speech-supervised in-context training is proposed to bridge the gap between LLM training and downstream speech tasks, further boosting the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
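To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of a SALM-style forward pass: a pretrained audio encoder feeds a modality adapter whose projected speech frames are concatenated, after the task-instruction prompt embeddings, into the frozen LLM's input sequence. This is an illustrative sketch, not the NeMo implementation; the module names and shapes are our own assumptions, and the LoRA layers the paper inserts into the LLM are omitted for brevity.

```python
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Hypothetical adapter: projects audio-encoder frames into the LLM embedding space."""

    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: [batch, frames, audio_dim] -> [batch, frames, llm_dim]
        return self.proj(audio_feats)


class SALMSketch(nn.Module):
    """SALM-style wrapper: frozen text LLM + audio encoder + modality adapter.

    `audio_encoder` and `llm` stand in for pretrained modules (e.g., a
    Conformer encoder and a decoder-only LLM operating on embeddings).
    The paper additionally adds LoRA layers inside the LLM; omitted here.
    """

    def __init__(self, audio_encoder: nn.Module, llm: nn.Module,
                 audio_dim: int, llm_dim: int):
        super().__init__()
        self.audio_encoder = audio_encoder
        self.adapter = ModalityAdapter(audio_dim, llm_dim)
        self.llm = llm
        # Keep the text LLM frozen; only the adapter (and, in the paper,
        # LoRA layers) receives gradient updates.
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, audio: torch.Tensor,
                prompt_embeds: torch.Tensor) -> torch.Tensor:
        # Encode speech and map it into the LLM token-embedding space.
        speech_embeds = self.adapter(self.audio_encoder(audio))
        # Prepend the task-instruction prompt embeddings, then let the
        # LLM decode text conditioned on the speech input.
        inputs = torch.cat([prompt_embeds, speech_embeds], dim=1)
        return self.llm(inputs)
```

Keeping the LLM frozen and training only the lightweight speech-side modules is what lets a model like this retain the text LLM's instruction-following and in-context learning abilities (e.g., keyword boosting via the prompt) while learning to consume speech.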