
MDRT: MULTI-DOMAIN SYNTHETIC SPEECH LOCALIZATION

Citation Author(s):
Kratika Bhagtani, Sriram Baireddy, Paolo Bestagini, Stefano Tubaro, Edward J. Delp
Submitted by:
Amit Kumar Sing...
Last updated:
15 April 2024 - 4:34pm
Document Type:
Poster
Document Year:
2024
Event:
Presenters:
Amit Kumar Singh Yadav
Paper Code:
SLP-P38.3

With recent advances in speech synthesis, tools that generate high-quality synthetic speech impersonating any human speaker are easily available. Several reported incidents involve the misuse of such speech for spreading misinformation and for large-scale financial fraud. Many methods have been proposed for detecting synthetic speech; however, there is limited work on localizing the synthetic segments within a speech signal. In this work, our goal is to localize the synthetic segments in a partially synthetic speech signal. Most existing methods for synthetic speech localization obtain features from either the time-domain waveform or the spectrogram representation of the speech signal. We propose the Multi-Domain ResNet Transformer (MDRT), which obtains multi-domain features from both the time-domain waveform and the spectrogram representation of a speech signal to localize synthetic speech segments. MDRT uses transformer neural networks to obtain the multi-domain features and processes them with a ResNet-style neural network. We use the PartialSpoof dataset to evaluate MDRT on localizing synthetic speech segments of varying duration. Our results show that MDRT outperforms several existing synthetic speech localization methods.
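The abstract describes the high-level architecture: transformer encoders extract features from the time-domain waveform and from the spectrogram, and a ResNet-style network fuses them to produce per-frame real/synthetic decisions. The paper itself is not included here, so the sketch below is a hypothetical PyTorch illustration of that general two-branch design; all layer sizes, frame lengths, and module names (e.g. `MDRTSketch`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MDRTSketch(nn.Module):
    """Illustrative two-branch localizer (NOT the paper's exact model):
    one transformer encoder over time-domain frames, another over
    spectrogram frames; fused features pass through ResNet-style
    residual conv blocks to score each frame as real (0) or synthetic (1).
    """

    def __init__(self, frame_len=320, n_mels=64, d_model=128):
        super().__init__()
        # Project each domain's frames to a common embedding size.
        self.wave_proj = nn.Linear(frame_len, d_model)
        self.spec_proj = nn.Linear(n_mels, d_model)

        def encoder():
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=4, dim_feedforward=256,
                batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)

        self.wave_enc = encoder()  # time-domain branch
        self.spec_enc = encoder()  # spectrogram branch

        # ResNet-style residual blocks over the fused feature sequence.
        self.res_blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(2 * d_model, 2 * d_model, 3, padding=1),
                nn.BatchNorm1d(2 * d_model), nn.ReLU(),
                nn.Conv1d(2 * d_model, 2 * d_model, 3, padding=1),
                nn.BatchNorm1d(2 * d_model))
            for _ in range(2)])
        self.head = nn.Linear(2 * d_model, 1)  # per-frame logit

    def forward(self, wave_frames, spec_frames):
        # wave_frames: (B, T, frame_len); spec_frames: (B, T, n_mels)
        w = self.wave_enc(self.wave_proj(wave_frames))   # (B, T, d)
        s = self.spec_enc(self.spec_proj(spec_frames))   # (B, T, d)
        x = torch.cat([w, s], dim=-1).transpose(1, 2)    # (B, 2d, T)
        for block in self.res_blocks:
            x = torch.relu(block(x) + x)                 # residual add
        return self.head(x.transpose(1, 2)).squeeze(-1)  # (B, T) logits


# Usage: 2 utterances, 50 frames each; output is one score per frame.
model = MDRTSketch()
logits = model(torch.randn(2, 50, 320), torch.randn(2, 50, 64))
print(logits.shape)  # torch.Size([2, 50])
```

Per-frame logits above a threshold would mark a frame as synthetic, which is what "localization" means here; the PartialSpoof dataset provides the matching frame-level labels for training such a model.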
