[slides] Generation-Based Target Speech Extraction with Speech Discretization and Vocoder
- DOI: 10.60864/07jf-nk30
- Citation Author(s):
- Submitted by: Wangyou Zhang
- Last updated: 6 June 2024 - 10:21am
- Document Type: Presentation Slides
- Document Year: 2024
- Event:
- Presenters: Wangyou Zhang
- Paper Code: SLP-L23.6
- Categories:
Target speech extraction (TSE) aims to isolate the speech of a specific target speaker from an audio mixture with the help of an auxiliary recording of that speaker. Most existing TSE methods employ discrimination-based models to estimate the target speaker’s proportion in the mixture, but they often fail to compensate for missing or highly corrupted frequency components in the speech signal. In contrast, generation-based methods can naturally handle such scenarios via speech resynthesis. In this paper, we propose a novel discrete-token-based TSE approach that combines state-of-the-art speech discretization and vocoder techniques. By predicting a sequence of discrete tokens with the help of the auxiliary audio and employing a vocoder that takes discrete tokens as input, the target speech can be effectively resynthesized while eliminating interference. Our experiments on the WSJ0-2mix and Libri2mix datasets demonstrate that the proposed method yields high-quality target speech without interference.
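
To make the idea of speech discretization concrete, the following is a minimal sketch of nearest-centroid quantization, which maps each feature frame to the index of its closest codebook vector. The abstract does not say which tokenizer or vocoder is used, so this is only an illustrative assumption: the names `discretize`, `lookup`, `codebook`, and `features` are hypothetical, the toy codebook is hand-written rather than learned, and neither the token-prediction model nor the vocoder is implemented here.

```python
import numpy as np


def discretize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame (row of `features`) to the index of its nearest codebook vector.

    features: array of shape (T, D), one D-dimensional feature vector per frame.
    codebook: array of shape (K, D), one centroid per discrete token (hypothetical).
    Returns an integer array of shape (T,) holding token indices in [0, K).
    """
    # Squared Euclidean distance between every frame and every centroid: shape (T, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def lookup(tokens: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Recover the quantized frames by indexing the codebook with the tokens."""
    return codebook[tokens]


# Toy example: four 2-D centroids and four slightly perturbed frames.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = np.array([[0.9, 0.1], [0.1, 0.8], [0.05, 0.0], [1.1, 0.9]])

tokens = discretize(features, codebook)
quantized = lookup(tokens, codebook)
print(tokens.tolist())  # [1, 2, 0, 3]
```

In a pipeline of the kind the abstract outlines, a token sequence like `tokens` would be what the extraction model predicts for the target speaker, and a vocoder that accepts discrete tokens would turn that sequence back into a waveform.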