Integration of Pre-trained Networks with Continuous Token Interface For End-to-End Spoken Language Understanding
- Submitted by:
- Seunghyun Seo
- Last updated:
- 9 May 2022 - 11:07pm
- Document Type:
- Poster
- Document Year:
- 2022
- Presenters:
- Seunghyun Seo
- Paper Code:
- 2481
Most End-to-End (E2E) Spoken Language Understanding (SLU) networks leverage pre-trained Automatic Speech Recognition (ASR) networks but still lack the capability to understand the semantics of utterances, which is crucial for the SLU task. To solve this, recently proposed studies use pre-trained Natural Language Understanding (NLU) networks. However, it is not trivial to fully utilize both pre-trained networks; many solutions have been proposed, such as Knowledge Distillation (KD), cross-modal shared embeddings, and network integration with an interface. We propose a simple and robust integration method for the E2E SLU network built around a novel interface, the Continuous Token Interface (CTI): the junctional representation between the ASR and NLU networks when both are pre-trained with the same vocabulary. Because this representation stays continuous, we can train the SLU network in an E2E manner without additional modules such as Gumbel-Softmax. We evaluate our model on SLURP, a challenging SLU dataset, and achieve state-of-the-art scores on both the intent classification and slot filling tasks. We also verify that the NLU network, pre-trained with a Masked Language Model (MLM) objective, can exploit the noisy textual representation delivered by CTI. Moreover, training our model with extra data, SLURP-Synth, yields further improvements.
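The core idea is that, because the ASR and NLU networks share a vocabulary, the ASR output can be handed to the NLU network as a continuous, differentiable token representation rather than a discretized transcript. The sketch below shows one plausible realization in PyTorch; the module names (`asr`, `nlu_encoder`, `intent_head`) and the choice of a probability-weighted mixture of NLU token embeddings are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class CTISLUModel(nn.Module):
    """Minimal sketch of E2E SLU with a Continuous Token Interface (CTI).

    Assumed components: `asr` is a pre-trained ASR network whose decoder
    emits per-step logits over a shared token vocabulary, and the NLU side
    (embedding table, encoder, intent head) is pre-trained with an MLM
    objective over the *same* vocabulary.
    """

    def __init__(self, asr: nn.Module, nlu_embedding: nn.Embedding,
                 nlu_encoder: nn.Module, intent_head: nn.Module):
        super().__init__()
        self.asr = asr
        self.nlu_embedding = nlu_embedding   # shared-vocabulary token embeddings
        self.nlu_encoder = nlu_encoder
        self.intent_head = intent_head

    def forward(self, speech: torch.Tensor) -> torch.Tensor:
        # ASR produces logits over the shared vocabulary: (B, T, V)
        asr_logits = self.asr(speech)

        # CTI (as sketched here): instead of discretizing with argmax or a
        # Gumbel-Softmax relaxation, feed the softmax distribution directly
        # as an expected (continuous) token embedding.
        token_probs = asr_logits.softmax(dim=-1)                 # (B, T, V)
        cont_tokens = token_probs @ self.nlu_embedding.weight    # (B, T, D)

        # The MLM-pre-trained NLU encoder consumes this noisy continuous
        # token sequence; gradients flow end-to-end through the softmax.
        hidden = self.nlu_encoder(cont_tokens)                   # (B, T, D)
        return self.intent_head(hidden.mean(dim=1))              # intent logits
```

In this sketch the interface stays differentiable because the full softmax distribution, not a hard token choice, is passed to the NLU network, which is why no additional relaxation module such as Gumbel-Softmax is needed for end-to-end training.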