
A STUDY OF MULTICHANNEL SPATIOTEMPORAL FEATURES AND KNOWLEDGE DISTILLATION ON ROBUST TARGET SPEAKER EXTRACTION

DOI: 10.60864/nrtq-a875
Submitted by: YICHI WANG
Last updated: 6 June 2024 - 10:28am
Document Type: Poster
 

Target speaker extraction (TSE) based on direction of arrival (DOA) has a wide range of applications, such as remote conferencing, hearing aids, and in-car speech interaction. Due to the inherent phase uncertainty, existing TSE methods usually suffer from speaker confusion within specific frequency bands. Imprecise DOA measurements, caused for example by microphone array calibration and ambient noise, can also degrade TSE performance. To improve the robustness of TSE, we propose several new multichannel spatiotemporal features that represent the discriminability of the target speaker. A narrow-band Conformer model is applied in combination with the proposed features to facilitate extraction of the target speaker. In addition, we consider knowledge distillation to improve model robustness, particularly in the presence of DOA mismatch. Experimental results on a public dataset verify the efficacy of the proposed method.
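
The phase-uncertainty issue mentioned above can be illustrated with a minimal sketch. This is not the poster's method or its proposed features. It assumes a far-field, two-microphone model, a hypothetical spacing of 0.1 m, and a speed of sound of 343 m/s. The function names `wrapped_ipd` and `directional_similarity` are illustrative only. The sketch shows that the wrapped inter-channel phase difference of two distinct DOAs can coincide at certain frequencies, which is one way a direction cue can become ambiguous in specific bands.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the poster


def wrapped_ipd(freq_hz, doa_deg, spacing_m):
    """Far-field inter-channel phase difference of a two-mic pair, wrapped to (-pi, pi]."""
    # Time delay between the two microphones for a plane wave from doa_deg.
    delay_s = spacing_m * math.cos(math.radians(doa_deg)) / SPEED_OF_SOUND
    phase = 2.0 * math.pi * freq_hz * delay_s
    # Wrapping discards whole multiples of 2*pi, which is the source of ambiguity.
    return cmath.phase(cmath.exp(1j * phase))


def directional_similarity(freq_hz, doa_a_deg, doa_b_deg, spacing_m):
    """Cosine of the wrapped-IPD gap between two directions.

    A value near 1.0 means the two directions look identical to this
    microphone pair at this frequency; a value near -1.0 means they are
    maximally separated.
    """
    gap = wrapped_ipd(freq_hz, doa_a_deg, spacing_m) - wrapped_ipd(
        freq_hz, doa_b_deg, spacing_m
    )
    return math.cos(gap)
```

Under these assumed values, `directional_similarity(3430.0, 0.0, 90.0, 0.1)` is close to 1.0, so sources at 0° and 90° are indistinguishable to this pair near 3430 Hz, while `directional_similarity(1715.0, 0.0, 90.0, 0.1)` is close to -1.0, so the same two sources are well separated near 1715 Hz.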
