
Video-Driven Speech Reconstruction - Show & Tell Demo

Abstract: 

This demo showcases our video-to-audio model, which reconstructs speech from short videos of spoken statements. The model operates in a completely end-to-end manner: raw audio is generated directly from the input video, bypassing the need for separate lip-reading and text-to-speech models. This approach does not require large transcribed datasets, and because it avoids intermediate representations such as text, which strip intonation and emotional content from the speech, it can preserve both.

The demo shows, for the first time, the feasibility of end-to-end video-driven speech reconstruction for unseen speakers. The model is based on generative adversarial networks and achieves state-of-the-art performance for seen speakers on the GRID dataset in terms of word error rate, speech quality, and intelligibility. It is also the first model to generate high-quality, intelligible speech for unseen speakers, and the first to produce intelligible speech when trained and tested on LRW, an 'in the wild' dataset containing thousands of utterances taken from television broadcasts.

The demo will be interactive, involving live video recorded from a new participant. The previously unseen speaker will be asked to utter a short sentence in front of the camera, with no audio recorded. The video will then be fed into the model, which will, in only a few seconds, produce a new version of the same video featuring the speech generated by our end-to-end model.

The proposed model can have a significant impact on videoconferencing by alleviating common issues such as noisy environments, gaps in the audio, and unvoiced syllables. The demo is a first step in demonstrating the potential of this technology, which we believe will be attractive and relevant to the ICASSP audience. Samples of our work can be found at https://sites.google.com/view/speech-synthesis/home/extension.
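To make the end-to-end idea concrete, below is a minimal PyTorch sketch of a video-to-waveform generator of the kind described above: a video encoder extracts per-frame features from the silent clip, and a decoder upsamples them directly to raw audio samples, with a critic network (omitted here) supplying the adversarial training signal. All module names, layer sizes, and the 25 fps / 16 kHz rates are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a GAN-style video-to-waveform generator (PyTorch).
# Architecture details below are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Encodes a clip of mouth-region frames into one feature vector per frame."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # 3D convolutions capture short-range lip motion across frames.
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool spatial dims, keep time
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, video):                          # video: (B, 3, T, H, W)
        h = self.conv(video)                           # (B, 64, T, 1, 1)
        h = h.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        return self.proj(h)                            # (B, T, feat_dim)

class WaveformDecoder(nn.Module):
    """Upsamples per-frame features to raw audio: 8 * 8 * 10 = 640 samples
    per 25 fps video frame, i.e. 16 kHz output (assumed rates)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(feat_dim, 128, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=20, stride=10, padding=5),
            nn.Tanh(),                                 # waveform in [-1, 1]
        )

    def forward(self, feats):                          # feats: (B, T, feat_dim)
        return self.net(feats.transpose(1, 2))         # (B, 1, T * 640)

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = VideoEncoder()
        self.decoder = WaveformDecoder()

    def forward(self, video):
        return self.decoder(self.encoder(video))

# Inference on a 3-second silent clip: 75 frames at 25 fps -> 48,000 samples at 16 kHz.
if __name__ == "__main__":
    g = Generator()
    clip = torch.randn(1, 3, 75, 64, 64)  # one batch of mouth-region frames
    audio = g(clip)
    print(audio.shape)                    # torch.Size([1, 1, 48000])
```

The design choice the abstract emphasizes is visible in the decoder: it emits raw samples directly rather than text or phonemes, so prosodic cues carried by the lip motion are never discarded by an intermediate representation.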


Paper Details

Authors: Konstantinos Vougioukas, Stavros Petridis, Björn Schuller, Maja Pantic
Submitted On: 15 May 2020, 11:55 a.m.
Type: Presentation Slides
Presenter's Name: Rodrigo Mira
Paper Code: 6236
Document Year: 2020

Cite

Document Files

Video-driven Speech Reconstruction using Generative Adversarial Networks Show & Tell Demo.pdf



@article{5351-20,
  url = {http://sigport.org/5351},
  author = {Konstantinos Vougioukas and Stavros Petridis and Björn Schuller and Maja Pantic},
  publisher = {IEEE SigPort},
  title = {Video-Driven Speech Reconstruction - Show & Tell Demo},
  year = {2020}
}