TB-RESNET: BRIDGING THE GAP FROM TDNN TO RESNET IN AUTOMATIC SPEAKER VERIFICATION WITH TEMPORAL-BOTTLENECK ENHANCEMENT
- DOI:
- 10.60864/bm93-ae63
- Citation Author(s):
- Submitted by:
- Seungeun Lee
- Last updated:
- 6 June 2024 - 10:28am
- Document Type:
- Poster
- Document Year:
- 2024
- Event:
- Presenters:
- Seungeun Lee
- Paper Code:
- SLP-P24.9
- Categories:
- Keywords:
This paper focuses on the transition of automatic speaker verification systems from time delay neural networks (TDNN) to ResNet-based networks. TDNN-based systems aggregate temporal information with a statistics pooling layer, which is suited to two-dimensional tensors. ResNet-based models produce three-dimensional tensors, yet they continue to incorporate the statistics pooling layer. However, convolution operations in ResNet reduce the spatial dimensions, including the temporal axis, which raises concerns about temporal information loss and about compatibility with statistics pooling. To address this, we introduce Temporal-Bottleneck ResNet (TB-ResNet), a ResNet-based system that exploits statistics pooling more effectively by capturing and retaining frame-level contexts through a temporal bottleneck configuration in its building blocks. TB-ResNets outperform their original ResNet counterparts on VoxCeleb1, achieving a significant reduction in both the equal error rate and the minimum detection cost function.
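
To make the tensor-shape argument concrete, the following is a minimal sketch of statistics pooling in plain Python. It is not the authors' implementation. It assumes the common formulation in which the pooled vector is the per-channel mean concatenated with the per-channel (population) standard deviation over time, and in which a three-dimensional (channels, frequency, time) map is flattened to two dimensions before pooling. The function names, the 200-frame input, and the kernel, stride, and padding values are hypothetical and chosen only for illustration; the abstract does not specify TB-ResNet's configuration.

```python
import math


def stats_pool_2d(frames):
    """Statistics pooling over a (channels, time) feature map.

    frames: list of channels, each a list of T per-frame values.
    Returns per-channel means followed by per-channel standard
    deviations, i.e. a vector of length 2 * channels.
    """
    means, stds = [], []
    for channel in frames:
        t = len(channel)
        mean = sum(channel) / t
        var = sum((x - mean) ** 2 for x in channel) / t  # population variance
        means.append(mean)
        stds.append(math.sqrt(var))
    return means + stds


def stats_pool_3d(fmap):
    """Statistics pooling over a (channels, freq, time) feature map.

    Assumption: channel and frequency axes are flattened into a single
    axis so that the 2-D pooling above can be reused unchanged.
    """
    flattened = [row for channel in fmap for row in channel]
    return stats_pool_2d(flattened)


def conv_out_len(t, kernel=3, stride=1, padding=1):
    """Output length along one axis of a standard strided convolution."""
    return (t + 2 * padding - kernel) // stride + 1


# Illustrative only: three stride-2 stages applied to a 200-frame input.
frames_left = 200
for _ in range(3):
    frames_left = conv_out_len(frames_left, stride=2)
# frames_left now holds the number of frames that reach the pooling layer.
```

With these assumed settings, each stride-2 stage roughly halves the temporal axis, so the pooling layer computes its mean and standard deviation over far fewer frames than the input contained. This is the temporal information loss the abstract raises.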