
In this paper, we propose a new loss function called generalized end-to-end (GE2E) loss, which makes the training of speaker verification models more efficient than our previous tuple-based end-to-end (TE2E) loss function. Unlike TE2E, the GE2E loss function updates the network in a way that emphasizes examples that are difficult to verify at each step of the training process. Additionally, the GE2E loss does not require an initial stage of example selection.
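As an illustration only, the softmax variant of the GE2E loss as it is commonly formulated can be sketched in plain Python. The abstract gives no formula, so everything here is an assumption for the sketch: the helper names, the batch layout (`batch[j][i]` is the embedding of utterance `i` from speaker `j`), and the fixed scale `w` and offset `b`, which would be learned during real training. Each utterance is scored against every speaker centroid, and its own speaker's centroid is computed without that utterance.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def ge2e_softmax_loss(batch, w=10.0, b=-5.0):
    """Sum of per-utterance softmax losses over a batch.

    batch[j][i] is the embedding of utterance i from speaker j.
    Every speaker needs at least two utterances.
    w and b are illustrative constants (assumption); in training they are learned.
    """
    centroids = [mean_vector(utts) for utts in batch]
    total = 0.0
    for j, utts in enumerate(batch):
        for i, e in enumerate(utts):
            scores = []
            for k, c in enumerate(centroids):
                if k == j:
                    # Exclude the utterance itself from its own speaker's centroid.
                    c = mean_vector(utts[:i] + utts[i + 1:])
                scores.append(w * cosine(e, c) + b)
            # Softmax cross-entropy: -S[j] + log(sum_k exp(S[k])), computed stably.
            m = max(scores)
            log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
            total += log_sum - scores[j]
    return total
```

A batch whose embeddings cluster by speaker should give a loss close to zero, while a batch whose embeddings are mixed across speakers should give a much larger one. Because every utterance in the batch is scored against every centroid in one pass, no separate tuple-selection stage is needed in this sketch.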


A novel learnable dictionary encoding layer is proposed in this paper for end-to-end language identification. It is in line with the conventional GMM i-vector approach both theoretically and practically. We imitate the mechanism of the traditional GMM training and supervector encoding procedure on top of a CNN. The proposed layer can accumulate high-order statistics from a variable-length input sequence and generate an utterance-level, fixed-dimensional vector representation.


A novel interpretable end-to-end learning scheme for language identification is proposed. It is in line with the classical GMM i-vector methods both theoretically and practically. In the end-to-end pipeline, a general encoding layer is employed on top of the front-end CNN, so that it can automatically encode the variable-length input sequence into an utterance-level vector. After comparing with state-of-the-art GMM i-vector methods, we give insights into the CNN and reveal its role and effect in the whole pipeline.


This paper deals with far-field speaker recognition. On a corpus of NIST SRE 2010 data retransmitted in a real room with multiple microphones, we first demonstrate how room acoustics cause significant degradation of a state-of-the-art i-vector based speaker recognition system. We then investigate several techniques to improve performance, ranging from probabilistic linear discriminant analysis (PLDA) re-training, through dereverberation, to beamforming.


Recently, hierarchical language identification systems have shown significant improvement over single-level systems in both closed-set and open-set language identification tasks. However, developing such a system requires the feature and classifier selection at each node in the hierarchical structure to be hand-crafted. Motivated by the superior ability of end-to-end deep neural network architectures to jointly optimize the feature extraction and classification process, we propose a novel approach to developing an end-to-end hierarchical language identification system.


Performance estimation is crucial to the assessment of novel algorithms and systems. In detection error trade-off (DET) diagrams, discrimination performance is assessed for a single target application only, whereas cross-application performance considers the risks resulting from decisions, which depend on application constraints. To make research results interchangeable across different application constraints, we propose to augment DET curves by depicting systems with regard to the security and convenience levels they support.
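For readers unfamiliar with DET diagrams, the following minimal sketch shows how the underlying operating points are typically computed from trial scores. It covers only the standard DET ingredients, not the augmentation proposed in the paper. The function name, the score lists, and the convention that a trial is accepted when its score is at or above the threshold are assumptions made for illustration.

```python
def det_points(target_scores, nontarget_scores):
    """Return (threshold, false_alarm_rate, miss_rate) for each candidate threshold.

    A trial is accepted when its score is >= threshold (assumed convention).
    Candidate thresholds are the distinct observed scores, in ascending order.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for t in thresholds:
        # Miss: a target trial that falls below the threshold and is rejected.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial at or above the threshold that is accepted.
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        points.append((t, false_alarm, miss))
    return points
```

Raising the threshold lowers the false alarm rate (more security) at the cost of a higher miss rate (less convenience). A DET diagram plots these two rates against each other, conventionally on normal-deviate axes.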


For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization.


Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning, due to their ability to summarize relevant information that spans the entire length of an input sequence. In this paper, we analyze the use of attention mechanisms for the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies of the attention layer and their variants, and compare different pooling methods on the attention weights.
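The general idea of summarizing a sequence with attention can be sketched as a weighted average of frame-level vectors. This is a generic illustration, not the specific topologies or pooling methods studied in the paper. The function name and the single scoring vector `v` are hypothetical; in a trained system the scores would come from learned parameters.

```python
import math


def attention_pool(frames, v):
    """Summarize a sequence of frame-level vectors into one vector.

    frames: list of T vectors of equal dimension.
    v: hypothetical scoring vector; each frame's score is its dot product with v.
    Returns (pooled_vector, attention_weights).
    """
    scores = [sum(a * b for a, b in zip(h, v)) for h in frames]
    # Softmax over time, with the maximum subtracted for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [x / z for x in exps]
    dim = len(frames[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights
```

With an all-zero scoring vector every frame gets the same weight, so the result reduces to plain average pooling. Non-uniform scores shift the summary toward the frames that the scoring function rates as most relevant.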


The objective of this paper is to extract robust features for detecting replay spoof attacks on text-independent speaker verification systems. In a replay attack, a prerecorded utterance of the target speaker is played to the automatic speaker verification (ASV) system to gain unauthorized access. In such a scenario, the speech signal also carries the characteristics of the intermediate recording device. In the proposed approach, the characteristics of the intermediate de-

