Real-Time Target Sound Extraction

Citation Author(s): Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota
Submitted by: Bandhav Veluri
Last updated: 23 May 2023 - 1:32am
Document Type: Research Manuscript

Categories: Source separation (MLR-SSEP)

We present the first neural network model to achieve real-time, streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions to cover large receptive fields in a computationally efficient manner, while also leveraging the generalization performance of transformer-based architectures. Our evaluations show as much as a 2.2–3.3 dB improvement in SI-SNRi over prior models for this task, with a 1.2–4x smaller model size and a 1.5–2x lower runtime. We provide code, dataset, and audio samples: https://waveformer.cs.washington.edu/.
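To illustrate why a stack of dilated causal convolutions covers a large receptive field cheaply, the following is a minimal pure-Python sketch, not the Waveformer implementation. The function names (`causal_dilated_conv`, `run_stack`, `receptive_field`), the kernel size of 3, and the doubling dilation schedule are assumptions chosen for illustration and are not taken from the abstract. Each layer looks only at the current and past samples, so the stack can run in a streaming setting, and the receptive field grows with the sum of the dilations while the per-sample cost grows only with the number of layers.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution over a list of floats.

    weights[0] multiplies the current sample and weights[k] multiplies the
    sample k * dilation steps in the past. Samples before t = 0 are treated
    as zeros, so output[t] never depends on any future input.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def run_stack(x, weights, dilations):
    """Apply one causal dilated convolution per entry in `dilations`.

    For simplicity every layer shares the same weights and there are no
    nonlinearities; a real encoder would learn separate weights per layer.
    """
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical configuration: kernel size 3, dilations 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)]
print(receptive_field(3, dilations))  # 1 + 2 * 1023 = 2047
```

In this sketch, ten layers with three taps each (30 multiply-adds per output sample) reach back 2047 samples, whereas a single undilated causal filter with the same reach would need 2047 taps.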

SemAudioCamReady.pdf
