Statistics Pooling Time Delay Neural Network Based on X-vector for Speaker Verification

Citation Author(s):
Qian-Bei Hong, Chung-Hsien Wu, Hsin-Min Wang, Chien-Lin Huang
Submitted by:
Chung-Hsien Wu
Last updated:
15 May 2020 - 11:55pm
Document Type:
Presentation Slides

This paper aims to improve the x-vector speaker embedding so that it captures more detailed information for speaker verification. We propose a statistics pooling time delay neural network (TDNN), in which statistics pooling is integrated into each layer of the TDNN structure to account for the variation of temporal context in the frame-level transformation. The proposed feature vector, named the stats-vector, is compared with the baseline x-vector on the VoxCeleb dataset and the Speakers in the Wild (SITW) dataset for speaker verification. The experimental results show that the proposed stats-vector with score fusion achieved the best performance on the VoxCeleb1 dataset. Furthermore, for recordings that contain interference from other speakers, we found that the proposed stats-vector effectively reduced the interference and improved speaker verification performance on the SITW dataset.
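
To make the idea of pooling statistics at each TDNN layer concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that statistics pooling means the per-dimension mean and standard deviation over time, as in the standard x-vector recipe, and that the per-layer statistics are simply concatenated; the abstract does not say how they are combined. The function names `stats_pool`, `tdnn_layer`, and `pooled_layer_stats`, as well as the toy weights, are hypothetical and chosen only for illustration.

```python
import math


def stats_pool(frames):
    """Pool a variable-length sequence into a fixed-size vector.

    frames: list of T frames, each a list of D floats.
    Returns 2*D floats: per-dimension means followed by (population)
    standard deviations over the time axis.
    """
    t = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / t for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / t)
        for d in range(dim)
    ]
    return means + stds


def tdnn_layer(frames, weights, bias, context):
    """One time-delay layer: splice frames at the given offsets, then affine + ReLU.

    context: frame offsets relative to the current frame, e.g. [-1, 0, 1].
    weights: one row per output unit, each of length len(context) * D.
    Only positions where every offset falls inside the sequence are kept.
    """
    start = max(0, -min(context))
    stop = len(frames) - max(0, max(context))
    out = []
    for t in range(start, stop):
        spliced = [x for c in context for x in frames[t + c]]
        out.append([
            max(0.0, sum(w * x for w, x in zip(row, spliced)) + b)
            for row, b in zip(weights, bias)
        ])
    return out


def pooled_layer_stats(frames, layers):
    """Run the layers in order and pool statistics after each one.

    ASSUMPTION: per-layer statistics are concatenated; the source abstract
    does not specify the combination rule.
    layers: list of (weights, bias, context) tuples.
    """
    pooled = []
    hidden = frames
    for weights, bias, context in layers:
        hidden = tdnn_layer(hidden, weights, bias, context)
        pooled.extend(stats_pool(hidden))
    return pooled


if __name__ == "__main__":
    # Toy input: 3 frames of 2-dimensional features.
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [0])   # context of one frame
    wide = ([[1.0] * 6], [0.0], [-1, 0, 1])                  # context of three frames
    print(pooled_layer_stats(frames, [identity, wide]))
```

In this sketch the output length is fixed by the layer widths rather than by the number of input frames, which is the property that lets utterances of different durations be compared by a verification back end.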

Dataset Files

20200419_ICASSP_Experiment 1.pdf