Poster
A STUDY OF MULTICHANNEL SPATIOTEMPORAL FEATURES AND KNOWLEDGE DISTILLATION ON ROBUST TARGET SPEAKER EXTRACTION
- DOI:
- 10.60864/nrtq-a875
- Citation Author(s):
- Submitted by:
- YICHI WANG
- Last updated:
- 6 June 2024 - 10:28am
- Document Type:
- Poster
Target speaker extraction (TSE) based on direction of arrival (DOA) has a wide range of applications, e.g., remote conferencing, hearing aids, and in-car speech interaction. Due to inherent phase uncertainty, existing TSE methods usually suffer from speaker confusion within specific frequency bands. Imprecise DOA measurements, caused by, e.g., the calibration of the microphone array and ambient noise, can also degrade TSE performance. To improve the robustness of TSE, in this work we propose several new multichannel spatiotemporal features that represent the discriminability of the target speaker. A narrow-band Conformer model is applied in combination with the proposed features to facilitate extraction of the target speaker. In addition, we consider knowledge distillation to improve model robustness, particularly in the presence of DOA mismatch. Experimental results on a public dataset verify the efficacy of the proposed method.