Toward Visual Voice Activity Detection for Unconstrained Videos

Abstract: 

Prevalent audio-based Voice Activity Detection (VAD) systems are challenged by ambient noise and are sensitive to variations in the type of noise. Information from the visual modality, when available, can help overcome some of the problems of audio-based VAD. Existing visual-VAD systems, however, do not operate directly on the whole image; they require intermediate face detection, face landmark detection, and subsequent facial feature extraction from the lip region. In this work, we present an end-to-end trainable Hierarchical Context Aware (HiCA) architecture for visual-VAD on videos obtained in unconstrained environments, which can be trained with videos as input and audio speech labels as output. The network is designed to account for both local and global temporal information in a video sequence. In contrast to existing visual-VAD systems, our proposed approach does not rely on face detection and subsequent facial feature extraction. It obtains a VAD accuracy of 66% on a dataset of Hollywood movie videos using visual information alone. Further analysis of the representations learned by our visual-VAD system shows that the network learns to localize human faces, and sometimes speaking human faces specifically. Our quantitative analysis of the effectiveness of face localization shows that our system performs better than sound localization networks designed for unconstrained videos.
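
To make the training setup and the reported metric concrete, the following is a minimal sketch of how audio-derived speech intervals might be turned into per-segment labels for video, and how segment-level VAD accuracy could then be computed. It is not the authors' pipeline: the function names, the fixed segment length, and the 50% overlap threshold are illustrative assumptions, and the speech intervals are assumed to be non-overlapping.

```python
import math
from typing import List, Tuple


def segment_labels(
    speech_intervals: List[Tuple[float, float]],
    clip_duration: float,
    segment_len: float,
    min_overlap: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[int]:
    """Label each fixed-length video segment 1 (speech) or 0 (non-speech).

    A segment is labeled 1 when the audio speech intervals cover at least
    `min_overlap` of its duration. Intervals are (start, end) in seconds
    and are assumed not to overlap each other.
    """
    n_segments = math.ceil(clip_duration / segment_len)
    labels = []
    for i in range(n_segments):
        start = i * segment_len
        end = min(start + segment_len, clip_duration)
        # Total time within this segment that is covered by speech.
        covered = sum(
            max(0.0, min(end, e) - max(start, s)) for s, e in speech_intervals
        )
        labels.append(1 if covered / (end - start) >= min_overlap else 0)
    return labels


def vad_accuracy(predicted: List[int], reference: List[int]) -> float:
    """Fraction of segments whose predicted label matches the reference."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label sequences must be non-empty and equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical example: a 4-second clip with speech at 0-1.5 s and 3-4 s.
reference = segment_labels([(0.0, 1.5), (3.0, 4.0)], 4.0, 1.0)
predicted = [1, 0, 0, 1]  # made-up output of a visual-VAD model
score = vad_accuracy(predicted, reference)
```

Under these assumptions the labels require no manual annotation of the video: they come entirely from the audio track, which is what allows training with videos as input and audio speech labels as output.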

Paper Details

Authors:
Rahul Sharma, Krishna Somandepalli and Shrikanth Narayanan
Submitted On:
19 September 2019 - 11:55am
Short Link:
Type:
Poster
Event:
Presenter's Name:
Rahul Sharma
Paper Code:
3434
Document Year:
2019
Cite

[1] Rahul Sharma, Krishna Somandepalli and Shrikanth Narayanan, "Toward Visual Voice Activity Detection for Unconstrained Videos", IEEE SigPort, 2019. [Online]. Available: http://sigport.org/4741. Accessed: Oct. 18, 2019.
@article{4741-19,
url = {http://sigport.org/4741},
author = {Rahul Sharma and Krishna Somandepalli and Shrikanth Narayanan},
publisher = {IEEE SigPort},
title = {Toward Visual Voice Activity Detection for Unconstrained Videos},
year = {2019} }