End-to-end Detection of Attacks to Automatic Speaker Recognizers with Time-attentive Light Convolutional Neural Networks

In this contribution, we introduce convolutional neural network architectures for end-to-end detection of attacks on voice biometric systems, i.e., the model outputs scores corresponding to the likelihood of an attack given general-purpose time-frequency features obtained from speech. Microphone-level attacks based on speech synthesis and voice conversion techniques are considered, along with presentation replay attacks. While the convolutional models yield a sequence of representations corresponding to different parts of the input at varying time steps, concatenated first- and second-order statistics pooled from the outputs of a self-attention layer are used as fixed-dimension representations of utterances of varying length, which are then fed into a set of fully connected layers to yield the final scores. The proposed framework is evaluated on data from the ASVspoof 2019 challenge, yielding relative improvements of more than one order of magnitude in equal error rate over the two baseline systems provided by the ASVspoof 2019 organizers, and significant improvements over the benchmark systems we evaluated.
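The pooling step described above, which turns a variable-length sequence of frame-level representations into a fixed-dimension utterance vector, can be illustrated with a minimal NumPy sketch. It reflects one plausible reading of the abstract, not the authors' implementation: the attention here is a single hypothetical scoring vector `w` (learned in practice), and the first- and second-order statistics are taken to be the attention-weighted mean and standard deviation. The function name, its parameters, and the `eps` constant are assumptions made for illustration.

```python
import numpy as np


def attentive_stats_pooling(frames, w, eps=1e-8):
    """Pool a (T, D) sequence into a fixed 2*D vector.

    frames: frame-level representations, shape (T, D), T may vary.
    w:      hypothetical attention scoring vector, shape (D,).
    eps:    small constant for numerical stability (assumption).
    """
    # One scalar attention score per time step.
    scores = frames @ w                          # shape (T,)
    # Softmax over time, shifted by the max for numerical stability.
    scores = scores - scores.max()
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()                  # weights sum to 1

    # First-order statistic: attention-weighted mean over time.
    mean = alpha @ frames                        # shape (D,)
    # Second-order statistic: attention-weighted standard deviation.
    var = alpha @ (frames - mean) ** 2           # shape (D,)
    std = np.sqrt(np.maximum(var, 0.0) + eps)

    # Concatenate both statistics into one utterance-level vector.
    return np.concatenate([mean, std])           # shape (2 * D,)
```

Whatever the number of frames `T`, the output has length `2 * D`, so utterances of different durations map to vectors of the same size that a stack of fully connected layers can score. With `w` set to all zeros the attention weights become uniform and the function reduces to plain mean and standard-deviation pooling.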
