In this paper, we address the problem of speaker recognition in challenging acoustic conditions using a novel method to extract robust speaker-discriminative speech representations. We adopt a recently proposed unsupervised adversarial invariance architecture to train a network that maps speaker embeddings extracted using a pre-trained model onto two lower dimensional embedding spaces. The embedding spaces are learnt to disentangle speaker-discriminative information from all other information present in the audio recordings, without supervision about the acoustic conditions.
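
The exact unsupervised adversarial invariance objective used in the paper is more involved; as a rough, hypothetical illustration of the splitting idea only, a PyTorch sketch might look like the following, where all dimensions, layer choices and loss terms are assumptions rather than the authors' design:

```python
# Minimal sketch (not the paper's exact architecture): split a pre-trained
# speaker embedding into two lower-dimensional codes, h1 (speaker-related)
# and h2 (everything else), with an adversary discouraging speaker
# information from leaking into h2. Dimensions and losses are assumptions.
import torch
import torch.nn as nn

EMB_DIM, H1_DIM, H2_DIM, N_SPK = 512, 128, 128, 1000

encoder1 = nn.Linear(EMB_DIM, H1_DIM)      # speaker-discriminative code
encoder2 = nn.Linear(EMB_DIM, H2_DIM)      # residual / nuisance code
spk_head = nn.Linear(H1_DIM, N_SPK)        # speaker classifier on h1
adv_head = nn.Linear(H2_DIM, N_SPK)        # adversary: predicts speaker from h2

def losses(x_vec, spk_label):
    h1, h2 = encoder1(x_vec), encoder2(x_vec)
    ce = nn.functional.cross_entropy
    task_loss = ce(spk_head(h1), spk_label)   # keep speaker info in h1
    adv_loss = ce(adv_head(h2), spk_label)    # adversary trained to minimise this
    # the encoders would be trained to *maximise* adv_loss (e.g. via gradient
    # reversal or alternating updates), which this sketch omits
    return task_loss, adv_loss

x = torch.randn(8, EMB_DIM)                   # batch of pre-trained embeddings
y = torch.randint(0, N_SPK, (8,))
print([l.item() for l in losses(x, y)])
```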


This paper presents an improved deep embedding learning method based on a convolutional neural network (CNN) for text-independent speaker verification. Two improvements are proposed for x-vector embedding learning: (1) a multiscale convolution (MSCNN) is adopted in the frame-level layers to capture complementary speaker information in different receptive fields; (2) a Baum-Welch statistics attention (BWSA) mechanism is applied in the temporal pooling layer to integrate more useful long-term speaker characteristics into the pooled representation.
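
As a rough illustration of the first idea only, a multiscale frame-level block can be sketched as parallel 1-D convolutions with different kernel sizes whose outputs are concatenated; the kernel sizes and channel counts below are assumptions, and the BWSA pooling is not reproduced here:

```python
# Sketch of the multiscale frame-level idea: parallel 1-D convolutions with
# different kernel sizes (receptive fields) over the same frame sequence,
# concatenated along the channel axis.
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    def __init__(self, in_ch=40, out_ch=128, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):                 # x: (batch, feat_dim, frames)
        return torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)

feats = torch.randn(4, 40, 300)           # e.g. 40-dim features, 300 frames
out = MultiScaleConv()(feats)
print(out.shape)                          # (4, 384, 300): three scales concatenated
```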


In this contribution, we introduce convolutional neural network architectures aimed at end-to-end detection of attacks on voice biometrics systems, i.e. the model provides scores corresponding to the likelihood of attack given general-purpose time-frequency features obtained from speech. Microphone-level attacks based on speech synthesis and voice conversion techniques are considered, along with presentation replay attacks.
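
For illustration only, a minimal CNN detector of this kind might map a time-frequency input to a single attack-likelihood score; the layer sizes below are assumptions and not the architectures studied in the paper:

```python
# Illustrative sketch: a small CNN that maps a time-frequency input, e.g. a
# log-spectrogram, to a single logit interpreted as the likelihood that the
# utterance is an attack.
import torch
import torch.nn as nn

detector = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),          # collapse time and frequency
    nn.Flatten(),
    nn.Linear(32, 1),                 # attack score (higher = more likely spoof)
)

spec = torch.randn(2, 1, 257, 400)    # (batch, channel, freq bins, frames)
score = torch.sigmoid(detector(spec))
print(score.squeeze(1))
```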


In this paper, the importance of the analytic phase of the speech signal for automatic speaker verification systems is demonstrated in the context of replay spoof attacks. To detect replay spoof attacks accurately, effective feature representations of speech signals are required that capture the distortion, convolutive in nature, introduced by the intermediate playback/recording devices.
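
For reference, the analytic phase itself can be obtained from the analytic signal given by the Hilbert transform; the short sketch below shows only this basic step, not the paper's full feature extraction pipeline:

```python
# Minimal sketch of extracting the analytic (instantaneous) phase of a signal
# via the Hilbert transform.
import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)         # stand-in for a speech waveform

analytic = hilbert(x)                   # x + j * Hilbert(x)
phase = np.unwrap(np.angle(analytic))   # analytic (instantaneous) phase
inst_freq = np.diff(phase) * fs / (2 * np.pi)
print(phase[:5], inst_freq[:3])
```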


The speech signal contains intrinsic and extrinsic variations such as accent, emotion, dialect, phoneme content, speaking manner, noise, music, and reverberation. Some of these variations are unnecessary for characterising the speaker and act as unspecified factors of variation, which increase the variability of speaker representations. In this paper, we assume that such unspecified factors of variation exist in speaker representations, and we attempt to minimize the resulting variability.


Speaker diarisation systems often cluster audio segments using speaker embeddings such as i-vectors and d-vectors. Since different types of embeddings are often complementary, this paper proposes a generic framework to improve performance by combining them into a single embedding, referred to as a c-vector. This combination uses a 2-dimensional (2D) self-attentive structure, which extends the standard self-attentive layer by averaging not only across time but also across different types of embeddings.
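
A rough sketch of the combination idea (not the paper's exact 2D self-attentive layer) is to score every (embedding type, time step) pair and take a softmax-weighted average over both axes; the dimensions and the scoring network below are assumptions:

```python
# Hypothetical illustration of a 2-D attentive combination: one scalar score
# per (embedding type, time step), normalised jointly, then a weighted
# average over both axes yields a single combined vector.
import torch
import torch.nn as nn

class TwoDAttentiveCombiner(nn.Module):
    def __init__(self, dim=128, hidden=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, embs):                      # embs: (batch, types, time, dim)
        b, k, t, d = embs.shape
        w = self.score(embs).view(b, k * t)       # one score per (type, time)
        w = torch.softmax(w, dim=1).view(b, k, t, 1)
        return (w * embs).sum(dim=(1, 2))         # weighted average over both axes

combined = TwoDAttentiveCombiner()(torch.randn(2, 2, 50, 128))  # two embedding types
print(combined.shape)                             # (2, 128): one combined vector each
```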


An attacker may use a variety of techniques to fool an automatic speaker verification system into accepting them as a genuine user. Anti-spoofing methods meanwhile aim to make the system robust against such attacks. The ASVspoof 2017 Challenge focused specifically on replay attacks, with the intention of measuring the limits of replay attack detection as well as developing countermeasures against them.


We propose a Denoising Autoencoder (DAE) for speaker recognition, trained to map each individual i-vector to the mean of all i-vectors belonging to that particular speaker. The aim of this DAE is to compensate for inter-session variability and increase the discriminative power of the i-vectors prior to PLDA scoring. We test the proposed approach on the MCE 2018 1st Multi-target speaker detection and identification Challenge Evaluation. This evaluation presents a call-center fraud detection scenario: given a speech segment, detect whether it belongs to any of the speakers on a blacklist.
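
A minimal sketch of this training objective, with the i-vector dimensionality and network sizes assumed rather than taken from the paper, might look as follows:

```python
# Sketch of the denoising-autoencoder idea described above: train a small
# network to map each i-vector to the mean i-vector of its speaker, so that
# session variability is reduced before PLDA scoring.
import torch
import torch.nn as nn

IVEC_DIM = 400
dae = nn.Sequential(
    nn.Linear(IVEC_DIM, 256), nn.Tanh(),
    nn.Linear(256, IVEC_DIM),
)
opt = torch.optim.Adam(dae.parameters(), lr=1e-3)

def train_step(ivectors, speaker_mean_ivectors):
    # the target for every i-vector is the mean of all i-vectors of its speaker
    opt.zero_grad()
    loss = nn.functional.mse_loss(dae(ivectors), speaker_mean_ivectors)
    loss.backward()
    opt.step()
    return loss.item()

x = torch.randn(16, IVEC_DIM)
print(train_step(x, x.mean(dim=0, keepdim=True).expand_as(x)))  # toy targets
```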


This paper aims to improve the widely used deep speaker embedding x-vector model. We propose the following improvements: (1) a hybrid neural network structure using both a time delay neural network (TDNN) and a long short-term memory (LSTM) network to generate complementary speaker information at different levels; (2) a multi-level pooling strategy to collect speaker information from both the TDNN and LSTM layers; (3) a regularization scheme on the speaker embedding extraction layer to make the extracted embeddings suitable for the following fusion step.
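
A rough, hypothetical sketch of the hybrid TDNN/LSTM idea with multi-level statistics pooling is given below; the layer counts and dimensions are assumptions, and the regularization scheme is omitted:

```python
# Sketch: TDNN-style dilated 1-D convolutions followed by an LSTM, with
# mean/std statistics pooled from both levels and concatenated before the
# embedding layer.
import torch
import torch.nn as nn

class HybridXVector(nn.Module):
    def __init__(self, feat_dim=40, hid=256, emb_dim=192):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, hid, 5, dilation=1, padding=2), nn.ReLU(),
            nn.Conv1d(hid, hid, 3, dilation=2, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.embed = nn.Linear(4 * hid, emb_dim)   # mean+std from two levels

    @staticmethod
    def stats(x):                                  # x: (batch, time, hid)
        return torch.cat([x.mean(dim=1), x.std(dim=1)], dim=1)

    def forward(self, feats):                      # feats: (batch, feat_dim, frames)
        t = self.tdnn(feats).transpose(1, 2)       # (batch, frames, hid)
        l, _ = self.lstm(t)
        return self.embed(torch.cat([self.stats(t), self.stats(l)], dim=1))

emb = HybridXVector()(torch.randn(2, 40, 300))
print(emb.shape)                                   # (2, 192)
```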

