
[slides] Generation-Based Target Speech Extraction with Speech Discretization and Vocoder

DOI: 10.60864/07jf-nk30
Citation Author(s):
Submitted by: Wangyou Zhang
Last updated: 6 June 2024 - 10:21am
Document Type: Presentation Slides
Document Year: 2024
Event:
Presenters: Wangyou Zhang
Paper Code: SLP-L23.6

Target speech extraction (TSE) aims to isolate the speech of a specific target speaker from an audio mixture with the help of an auxiliary recording of that speaker. Most existing TSE methods employ discrimination-based models to estimate the target speaker’s proportion in the mixture, but they often fail to compensate for missing or highly corrupted frequency components in the speech signal. In contrast, generation-based methods can naturally handle such scenarios via speech resynthesis. In this paper, we propose a novel discrete-token-based TSE approach that combines state-of-the-art speech discretization and vocoder techniques. By predicting a sequence of discrete tokens with the help of the auxiliary audio and employing a vocoder that takes discrete tokens as input, the target speech can be effectively resynthesized while eliminating interference. Experiments on the WSJ0-2mix and Libri2Mix datasets demonstrate that the proposed method yields high-quality target speech without interference.
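
As a rough illustration of why resynthesis from discrete tokens does not carry residual interference, the toy sketch below replaces the speech discretization model with nearest-neighbor vector quantization and the vocoder with a table lookup. Both are stand-ins chosen for brevity, not the models used in the paper. The token predictor that uses the auxiliary audio is omitted entirely, and the codebook size, feature dimension, and function names are hypothetical.

```python
import numpy as np


def discretize(features, codebook):
    """Map each frame of `features` (T, D) to the index of its nearest
    entry in `codebook` (K, D). Stand-in for a speech discretization model."""
    # Squared Euclidean distance between every frame and every codebook vector.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def resynthesize(tokens, codebook):
    """Toy 'vocoder': emit one codebook vector per token. The output depends
    only on the token sequence, never on the original input signal."""
    return codebook[tokens]


rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))            # hypothetical: 8 tokens, 4-dim frames
target_tokens = np.array([3, 3, 5, 1, 7, 0])  # hypothetical target token sequence

clean = codebook[target_tokens]
noisy = clean + 0.01 * rng.normal(size=clean.shape)  # small corruption of the frames

tokens = discretize(noisy, codebook)
output = resynthesize(tokens, codebook)
```

Because the synthesizer only sees token indices, any corruption that does not change the chosen tokens is absent from `output`. In a real system the quality then hinges on how accurately the target speaker's tokens are predicted from the mixture.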
