TB-RESNET: BRIDGING THE GAP FROM TDNN TO RESNET IN AUTOMATIC SPEAKER VERIFICATION WITH TEMPORAL-BOTTLENECK ENHANCEMENT

DOI:
10.60864/bm93-ae63
Citation Author(s):
Sunmook Choi, Sanghyeok Chung, Seungeun Lee, Soyul Han, Taein Kang, Jaejin Seo, Il-Youp Kwak, Seungsang Oh
Submitted by:
Seungeun Lee
Last updated:
6 June 2024 - 10:28am
Document Type:
Poster
Document Year:
2024
Event:
Presenters:
Seungeun Lee
Paper Code:
SLP-P24.9

This paper focuses on the transition of automatic speaker verification systems from time delay neural networks (TDNN) to ResNet-based networks. TDNN-based systems use a statistics pooling layer to aggregate temporal information, which is well suited to their two-dimensional feature tensors. Even though ResNet-based models produce three-dimensional tensors, they continue to incorporate the statistics pooling layer. However, convolution operations in ResNet reduce the spatial dimensions, including the temporal axis, raising concerns about temporal information loss and compatibility with statistics pooling. To address this, we introduce Temporal-Bottleneck ResNet (TB-ResNet), a ResNet-based system that exploits the nature of statistics pooling more effectively by capturing and retaining frame-level contexts through a temporal bottleneck configuration in its building blocks. TB-ResNets outperform their original ResNet counterparts on VoxCeleb1, achieving significant reductions in both the equal error rate and the minimum detection cost function.
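As a rough illustration of the compatibility issue the abstract describes, the minimal PyTorch sketch below shows how a statistics pooling layer is commonly applied to a TDNN-style two-dimensional feature map versus a ResNet-style three-dimensional feature map whose channel and frequency axes are flattened before pooling. This is not the authors' implementation of TB-ResNet; the module name, tensor shapes, and layer sizes are illustrative assumptions only.

```python
import torch
import torch.nn as nn


class StatisticsPooling(nn.Module):
    """Aggregate frame-level features over the temporal axis by
    concatenating their mean and standard deviation (illustrative sketch)."""

    def forward(self, x):
        # x: (batch, features, time)
        mean = x.mean(dim=-1)
        std = x.std(dim=-1)
        return torch.cat([mean, std], dim=1)  # (batch, 2 * features)


pool = StatisticsPooling()

# TDNN-style frame-level output: a 2-D feature map per utterance,
# (batch, channels, frames); statistics pooling applies directly.
# Shapes here are arbitrary examples, not values from the paper.
tdnn_frames = torch.randn(8, 512, 200)
tdnn_pooled = pool(tdnn_frames)  # (8, 1024)

# ResNet-style output: a 3-D feature map per utterance,
# (batch, channels, frequency, frames). Strided convolutions have
# already shrunk the frame axis, so fewer temporal positions remain
# once the channel and frequency axes are flattened for pooling.
resnet_map = torch.randn(8, 256, 10, 25)
resnet_pooled = pool(resnet_map.flatten(1, 2))  # (8, 5120)
```

The gap between the 200 temporal positions available to the TDNN pooling and the 25 remaining after ResNet downsampling is the kind of temporal information loss the temporal bottleneck configuration in TB-ResNet is designed to mitigate.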
