[slides] Generation-Based Target Speech Extraction with Speech Discretization and Vocoder

Target speech extraction (TSE) aims to isolate the speech of a specific target speaker from an audio mixture, with the help of an auxiliary recording of that speaker. Most existing TSE methods employ discrimination-based models to estimate the target speaker’s proportion in the mixture, but they often fail to compensate for missing or highly corrupted frequency components in the speech signal. In contrast, generation-based methods can naturally handle such scenarios via speech resynthesis. In this paper, we propose a novel discrete-token-based TSE approach that combines state-of-the-art speech discretization and vocoder techniques. By predicting a sequence of discrete tokens with the help of the auxiliary audio and employing a vocoder that takes discrete tokens as input, the target speech can be effectively resynthesized while eliminating interference. Our experiments on the WSJ0-2mix and Libri2Mix datasets demonstrate that the proposed method yields high-quality target speech without interference.
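
The discretize-then-resynthesize idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the method from the slides: the abstract does not say how speech is discretized, so nearest-centroid quantization against a fixed codebook is used as a stand-in, the `discretize` and `toy_resynthesize` functions are hypothetical names, and the "vocoder" is a plain codebook lookup rather than a neural waveform generator. The token prediction model conditioned on the auxiliary audio is not modeled at all.

```python
import numpy as np

def discretize(features, codebook):
    """Map each frame (row of `features`) to the index of its nearest codebook vector."""
    # Squared Euclidean distance between every frame and every codebook entry.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def toy_resynthesize(tokens, codebook):
    """Stand-in for a vocoder (assumption): return the codebook vector for each token."""
    return codebook[tokens]

# Hypothetical codebook: 8 tokens, 4-dimensional features, entries spaced 1.0 apart.
codebook = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))

# Frames standing in for slightly corrupted features of the target speech.
clean_tokens = np.array([3, 3, 1, 7, 0])
features = codebook[clean_tokens] + 0.1

tokens = discretize(features, codebook)      # discrete token sequence
recon = toy_resynthesize(tokens, codebook)   # resynthesized (clean) features
```

Because the output is regenerated from tokens rather than masked out of the mixture, small corruptions in the input frames do not carry over to `recon`, which is the intuition behind the generation-based approach.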

slides_icassp_discrete_tse_oral.pdf
