
TIME-DOMAIN AUDIO-VISUAL SPEECH SEPARATION ON LOW QUALITY VIDEOS

Citation Author(s):
Yifei Wu, Chenda Li, Jinfeng Bai, Zhongqin Wu, Yanmin Qian
Submitted by:
Yifei Wu
Last updated:
4 May 2022 - 11:05pm
Document Type:
Poster
Document Year:
2022
Presenters:
Yifei Wu
Paper Code:
AUD-6.6

Incorporating visual information is a promising approach to improving the performance of speech separation, and many related works have reported inspiring results. However, low quality videos are common in real scenarios and may significantly degrade the performance of a conventional audio-visual speech separation system. In this paper, we propose a new structure for fusing the audio and visual features, in which the audio features select the relevant visual features through an attention mechanism. A Conv-TasNet based model is combined with the proposed attention-based multi-modal fusion, trained with proper data augmentation, and evaluated on three categories of low quality videos. The experimental results show that our system outperforms a baseline that simply concatenates the audio and visual features, both when training with normal and with low quality data, and that it is robust to low quality video inputs at inference time.
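The abstract does not spell out the fusion in detail; as a minimal sketch, the audio-driven selection of visual features can be pictured as cross-attention in which the audio features act as queries over the visual key/value sequence. The PyTorch module below is an illustrative assumption of such a fusion block; the dimensions, layer names, and the concatenate-and-project output are hypothetical, not the authors' exact architecture.

# Illustrative sketch of attention-based audio-visual fusion (PyTorch).
# All sizes and layer names are assumptions for demonstration, not the
# authors' implementation: the audio stream queries the visual stream,
# so degraded video frames can simply receive low attention weights.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, audio_dim=256, visual_dim=512, num_heads=4):
        super().__init__()
        # Project visual features into the audio feature space.
        self.visual_proj = nn.Linear(visual_dim, audio_dim)
        self.attn = nn.MultiheadAttention(audio_dim, num_heads,
                                          batch_first=True)
        # Fuse each audio frame with its attended visual context.
        self.out = nn.Linear(2 * audio_dim, audio_dim)

    def forward(self, audio_feat, visual_feat):
        # audio_feat:  (batch, T_audio, audio_dim)   -- queries
        # visual_feat: (batch, T_video, visual_dim)  -- keys/values
        v = self.visual_proj(visual_feat)
        selected, _ = self.attn(query=audio_feat, key=v, value=v)
        return self.out(torch.cat([audio_feat, selected], dim=-1))

# Cross-attention tolerates the different frame rates of the two
# modalities, so no explicit up-sampling of the video is required:
fusion = AttentionFusion()
audio = torch.randn(2, 100, 256)   # 100 audio frames
video = torch.randn(2, 25, 512)    # 25 video frames (lower rate)
print(fusion(audio, video).shape)  # torch.Size([2, 100, 256])

One appeal of this kind of fusion for the low quality setting is that, unlike plain concatenation, the attention weights give the model a mechanism to discount visual frames that carry little reliable information.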
