Sorry, you need to enable JavaScript to visit this website.

RSR2015 (Robust Speaker Recognition 2015) is the largest publicly available speech corpus for text-dependent robust speaker recognition. The current release includes 151 hours of short duration utterances spoken by 300 speakers. RSR2015 is developed by the Human Language Technology (HLT) department at Institute for Infocomm Research (I2R) in Singapore. This newsletter describes RSR2015 corpus that addresses the reviving interest of text-dependent speaker recognition.


This paper describes the multi-query multi-head attention (MQMHA)
pooling and inter-topK penalty methods which were first proposed in
our submitted system description for VoxCeleb speaker recognition
challenge (VoxSRC) 2021. Most multi-head attention pooling mechanisms either attend to the whole feature through multiple heads or
attend to several split parts of the whole feature. Our proposed
MQMHA combines both these two mechanisms and gain more
diversified information. The margin-based softmax loss functions


Despite achieving satisfactory performance in speaker verification using deep neural networks, variable-duration utterances remain a challenge that threatens the robustness of systems. To deal with this issue, we propose a speaker verification system called RawNeXt that can handle input raw waveforms of arbitrary length by employing the following two components: (1) A deep layer aggregation strategy enhances speaker information by iteratively and hierarchically aggregating features of various time scales and spectral channels output from blocks.


The objective of this paper is to combine multiple frame-level features into a single utterance-level representation considering pair wise relationships. For this purpose, we propose a novel graph attentive feature aggregation module by interpreting each frame-level feature as a node of a graph. The inter-relationship between all possible pairs of features, typically exploited indirectly, can be directly modeled using a graph. The module comprises a graph attention layer and a graph pooling layer followed by a readout operation.


In this paper, we conduct a cross-dataset study on parametric and non-parametric raw-waveform based speaker embeddings through speaker verification experiments. In general, we observe a more significant performance degradation of these raw-waveform systems compared to spectral based systems. We then propose two strategies to improve the performance of raw-waveform based systems on cross-dataset tests. The first strategy is to change the real-valued filters into analytic filters to ensure shift-invariance.