
A novel speaker segmentation approach based on deep neural networks is proposed and investigated. This approach uses deep speaker vectors (d-vectors) to represent speaker characteristics and to find speaker change points. The d-vector is a frame-level speaker-discriminative feature; its discriminative training objective matches the goal of distinguishing a speaker change point from a single-speaker speech segment within a short time window.
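The change-detection idea described above can be sketched as a sliding-window comparison of d-vectors. The window size, threshold, and cosine distance below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def detect_change_points(dvectors, win=10, threshold=0.5):
    """Scan frame-level d-vectors with a sliding window: frame t is a
    candidate speaker change point when the mean d-vectors of the
    windows to its left and right are far apart in cosine distance."""
    changes = []
    for t in range(win, len(dvectors) - win):
        left = dvectors[t - win:t].mean(axis=0)
        right = dvectors[t:t + win].mean(axis=0)
        if cosine_distance(left, right) > threshold:
            changes.append(t)
    return changes
```

On synthetic d-vectors where the first 20 frames cluster around one direction and the next 20 around an orthogonal one, the detector flags frames near the true boundary at t = 20.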


Universal speech attributes for speaker verification (SV) are addressed in this paper. The aim of this work is to exploit fundamental characteristics shared across different speakers within the deep neural network (DNN)/i-vector framework. The manner and place of articulation form the fundamental speech attribute unit inventory, and new attribute units for acoustic modelling are generated by a two-step automatic clustering method. The DNN based on universal attribute units is used to generate posterior


The widely adopted i-vector performs well in text-independent speaker verification with long speech durations. How to integrate the state-of-the-art i-vector framework into text-prompted speaker verification is addressed in this paper. To take advantage of the lexical information and enhance speaker verification performance with random digit sequences, this paper proposes to extract a set of digit-dependent local i-vectors from the utterance instead of extracting a single i-vector. The digit-dependent local i-vector is considered
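One way to use such digit-dependent local i-vectors at scoring time is to compare enrolment and test i-vectors digit by digit and average the per-digit scores. This is a hedged sketch of that idea, not necessarily the paper's scoring rule; cosine similarity is assumed as the back-end:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_digit_sequence(enrol, test, digits):
    """Average per-digit cosine score between enrolment and test
    digit-dependent local i-vectors.

    enrol, test: dict mapping digit -> local i-vector (np.ndarray)
    digits:      the prompted digit sequence
    """
    scores = [cosine(enrol[d], test[d])
              for d in digits if d in enrol and d in test]
    return sum(scores) / len(scores)
```

When enrolment and test share identical local i-vectors for every prompted digit, the score is exactly 1.0, the cosine maximum.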


The i-vector has become the state-of-the-art representation for text-independent speaker recognition. Most related works treat i-vector extraction as a black box, using open-source software (e.g. Kaldi, ALIZÉ), and focus on vector-based back-end algorithms such as length normalization, WCCN, or PLDA. In this paper, we study the variational method and present a concise derivation of the i-vector. Based on the proposed methods, three derivation criteria are compared: maximum likelihood (ML), maximum a posteriori (MAP) and maximum
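For concreteness, the standard MAP (posterior-mean) i-vector point estimate from Baum-Welch statistics can be written in a few lines. The sketch below assumes identity UBM covariances to keep the algebra minimal; the shapes and the simplification are illustrative assumptions:

```python
import numpy as np

def ivector_map(N, F_tilde, T):
    """MAP (posterior-mean) i-vector from Baum-Welch statistics,
    assuming identity UBM covariances for brevity.

    N:       (C,)      zero-order statistics
    F_tilde: (C, D)    mean-centred first-order statistics
    T:       (C, D, M) total variability matrix, one (D, M) block per component
    """
    C, D, M = T.shape
    L = np.eye(M)                   # prior precision (standard normal prior)
    b = np.zeros(M)
    for c in range(C):
        L += N[c] * T[c].T @ T[c]   # posterior precision of the i-vector
        b += T[c].T @ F_tilde[c]
    return np.linalg.solve(L, b)    # posterior mean = MAP estimate
```

With a Gaussian prior and Gaussian likelihood the posterior is Gaussian, so the MAP estimate coincides with the posterior mean; the ML estimate drops the identity (prior) term from `L`.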


The application of universal speech attributes to speaker verification (SV) is addressed in this paper. The manner and place of articulation form the universal attribute unit inventory, and a deep neural network (DNN) is used as the acoustic model.


High-performance diarisation is a necessity for a variety of applications, and the task has been studied extensively in the context of broadcast news and meeting processing. Upon introduction of the task in NIST-led evaluations, diarisation error rate (DER) was introduced as the standard metric for evaluation, and it has been consistently used to compare systems ever since. DER is a frame-based metric that does not penalise systems for producing many short segments. However, practical systems
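The frame-based nature of DER can be illustrated with a toy computation. The sketch below assumes per-frame speaker labels with 0 meaning non-speech, hypothesis labels already optimally mapped to reference labels (the real metric solves this mapping itself), and no forgiveness collar:

```python
def frame_der(ref, hyp):
    """Toy frame-level diarisation error rate.

    ref, hyp: sequences of per-frame speaker labels (0 = non-speech).
    DER = (missed speech + false alarm + speaker confusion)
          / total reference speech frames.
    """
    miss = fa = conf = 0
    for r, h in zip(ref, hyp):
        if r != 0 and h == 0:
            miss += 1                    # speech frame marked as non-speech
        elif r == 0 and h != 0:
            fa += 1                      # non-speech frame marked as speech
        elif r != 0 and h != 0 and r != h:
            conf += 1                    # speech attributed to wrong speaker
    speech = sum(1 for r in ref if r != 0)
    return (miss + fa + conf) / speech
```

Because every frame is scored independently, splitting a correctly labelled region into many short segments changes nothing in this sum, which is exactly the property noted above.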


In this paper, automatic speaker verification using normal and whispered speech is explored. Typically, for speaker verification systems with varying vocal effort inputs, standard solutions such as feature mapping or addition of data during parameter estimation (training) and enrollment stages result in a trade-off between accuracy gains with whispered test data and accuracy losses (up to 70% in equal error rate, EER) with normal test data. To overcome this shortcoming, this paper proposes two innovations.