This paper introduces an innovative deep learning framework for parallel voice conversion to mitigate inherent risks associated with such systems. Our approach focuses on developing an invertible model capable of countering potential spoofing threats. Specifically, we present a conversion model that allows for the retrieval of source voices, thereby facilitating the identification of the source speaker. This framework is constructed using a series of invertible modules composed of affine coupling layers to ensure the reversibility of the conversion process.
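
The reversibility that affine coupling layers provide can be illustrated with a minimal sketch. It is not the authors' model: the function names and the toy `scale_fn` and `shift_fn` below are assumptions chosen for illustration, standing in for the learned networks a real system would use. The point is only that the untouched half of the input is enough to undo the transform applied to the other half.

```python
import math

def coupling_forward(x, scale_fn, shift_fn):
    """Affine coupling: pass the first half through, transform the second half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s = scale_fn(x_a)   # log-scale, computed from the untouched half only
    t = shift_fn(x_a)       # shift, computed from the untouched half only
    y_b = [xb * math.exp(ls) + ti for xb, ls, ti in zip(x_b, log_s, t)]
    return x_a + y_b

def coupling_inverse(y, scale_fn, shift_fn):
    """Exact inverse: the first half is unchanged, so scale and shift can be recomputed."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s = scale_fn(y_a)
    t = shift_fn(y_a)
    x_b = [(yb - ti) * math.exp(-ls) for yb, ls, ti in zip(y_b, log_s, t)]
    return y_a + x_b

# Hypothetical stand-ins for learned networks (assumption, for illustration only).
toy_scale = lambda a: [math.tanh(v) for v in a]
toy_shift = lambda a: [0.5 * v + 1.0 for v in a]

source = [0.3, -1.2, 2.0, 0.7]            # even length so both halves match
converted = coupling_forward(source, toy_scale, toy_shift)
recovered = coupling_inverse(converted, toy_scale, toy_shift)
```

Here `recovered` matches `source` up to floating-point error, which is the property that would let a source voice be retrieved from a converted one.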

Although dated, this student thesis is re-published because the proposed negative feedback topology and the current mode arrangement of silicon bipolar junction transistors are rarely elaborated in the many excellent contemporary books on audio power amplifier design.

Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers.
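
As a rough illustration of offline Knowledge Distillation, the sketch below mixes a loss against the ground-truth labels with a loss against teacher predictions computed ahead of time. The abstract does not state the loss, so everything here is an assumption: the multi-label binary cross-entropy form, the function names, and the `label_weight` value are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be hard labels or soft probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)

def distillation_loss(student_logits, labels, teacher_probs, label_weight=0.1):
    """Weighted sum of label loss and teacher-matching loss (assumed form)."""
    student_probs = [sigmoid(z) for z in student_logits]
    label_loss = bce(student_probs, labels)
    teacher_loss = bce(student_probs, teacher_probs)
    return label_weight * label_loss + (1.0 - label_weight) * teacher_loss

# "Offline" here means the teacher outputs are precomputed and stored,
# so the transformer does not need to run while the CNN is trained.
labels = [1, 0, 0]
teacher_probs = [0.9, 0.1, 0.6]
loss_close = distillation_loss([2.2, -2.2, 0.4], labels, teacher_probs)
loss_far = distillation_loss([-2.2, 2.2, -0.4], labels, teacher_probs)
```

A student whose outputs track the teacher (`loss_close`) is penalized less than one that contradicts it (`loss_far`).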

Masked Autoencoders is a simple yet powerful self-supervised learning method. However, it learns representations indirectly by reconstructing masked input patches. Several methods learn representations directly by predicting representations of masked patches; however, we think using all patches to encode training signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M2D), that learns representations directly while obtaining training signals using only masked patches.

Coherent processing of signals captured by a wireless acoustic sensor network (WASN) requires estimating parameters such as the sampling-rate offset (SRO) and sampling-time offset (STO). The asynchronous signals acquired by such a WASN exhibit an accumulating time drift (ATD) that grows linearly with time and depends on the SRO and STO values. In our demonstration, we present a real WASN based on Raspberry Pi computers, where SRO and ATD values are estimated using the recently proposed double-cross-correlation processor with phase transform (DXCP-PhaT).
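
The linear growth of the drift can be made concrete with a small sketch. It assumes a simplified model in which the drift at time t equals the STO plus the SRO (in parts per million) times t. The sign convention, units, and function names are assumptions. The least-squares fit is a stand-in for illustration and is not the DXCP-PhaT estimator.

```python
def accumulated_time_drift(t_seconds, sro_ppm, sto_seconds):
    """Assumed linear model: drift = STO + SRO * t, with SRO given in ppm."""
    return sto_seconds + sro_ppm * 1e-6 * t_seconds

def fit_sro_sto(times, drifts):
    """Ordinary least-squares line fit; returns (SRO in ppm, STO in seconds)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(drifts) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, drifts))
    var = sum((t - mean_t) ** 2 for t in times)
    slope = cov / var
    intercept = mean_d - slope * mean_t
    return slope * 1e6, intercept

# Synthetic drift observations for a 50 ppm SRO and a 2 ms STO (made-up values).
times = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
drifts = [accumulated_time_drift(t, 50.0, 0.002) for t in times]
sro_est, sto_est = fit_sro_sto(times, drifts)
```

Under this model, a 50 ppm offset alone accumulates 3 ms of drift per minute, which is why the offset has to be estimated and compensated before coherent processing.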

While end-to-end models have shown great success on the Automatic Speech Recognition task, performance degrades severely when target sentences are long-form. The previously proposed methods, overlapping inference and partial overlapping inference, have been shown to be effective for long-form decoding. For both methods, word error rate (WER) decreases monotonically when the overlapping percentage decreases. Setting aside computational cost, the setup with 50% overlapping during inference can achieve the best performance. However, a lower overlapping percentage has the advantage of faster inference.
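
The cost side of this trade-off can be sketched by counting the segments a long recording is split into at different overlap percentages. The function name and the fixed-length segmentation scheme below are assumptions for illustration. The sketch covers only segmentation, not how hypotheses from overlapping segments are merged.

```python
def segment_bounds(total_len, seg_len, overlap):
    """Split [0, total_len) into fixed-length segments sharing a fraction `overlap`."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    hop = max(1, int(round(seg_len * (1.0 - overlap))))  # step between segment starts
    bounds = []
    start = 0
    while True:
        end = min(start + seg_len, total_len)
        bounds.append((start, end))
        if end >= total_len:
            break
        start += hop
    return bounds

half_overlap = segment_bounds(100, 40, 0.5)
no_overlap = segment_bounds(100, 40, 0.0)
```

With 50% overlap the 100-frame input yields four segments, against three with no overlap, so each reduction in overlap means fewer decoder passes.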

Various attention mechanisms are being widely applied to acoustic scene classification. However, we empirically found that the attention mechanism can excessively discard potentially valuable information, despite improving performance. We propose the attentive max feature map that combines two effective techniques, attention and a max feature map, to further elaborate the attention mechanism and mitigate the above-mentioned phenomenon. We also explore various joint training methods, including multi-task learning, that allocate additional abstract labels for each audio recording.
