Video-Driven Speech Reconstruction - Show & Tell Demo

Citation Author(s):
Konstantinos Vougioukas, Stavros Petridis, Björn Schuller, Maja Pantic
Submitted by:
Rodrigo Mira
Last updated:
15 May 2020 - 11:55am
Document Type:
Presentation Slides
Document Year:
Presenters Name:
Rodrigo Mira
Paper Code:



This demo will showcase our video-to-audio model, which attempts to reconstruct speech from short videos of spoken statements. The model works in a completely end-to-end manner, generating raw audio directly from the input video. This approach bypasses the need for separate lip-reading and text-to-speech models. Its advantage is that it does not require large transcribed datasets and does not rely on intermediate representations such as text, which strip intonation and emotional content from the speech. This demo will show, for the first time, the feasibility of end-to-end video-driven speech reconstruction for unseen speakers. The model is based on generative adversarial networks and achieves state-of-the-art performance for seen speakers on the GRID dataset in terms of word error rate, speech quality and intelligibility. It is also the first model that can generate high-quality, intelligible speech for unseen speakers. Additionally, it is the first model to produce intelligible speech when trained and tested on LRW, an 'in the wild' dataset containing thousands of utterances taken from television broadcasts. The demo will be interactive and will involve recording live video of a new participant. This previously unseen speaker will be asked to utter a short sentence in front of the camera, but no audio will be recorded. The video will then be fed into the model, which will produce, in only a few seconds, a new version of the same video featuring the speech generated by our end-to-end model. The proposed model could have a significant impact on videoconferencing by alleviating common issues such as noisy environments, gaps in the audio and unvoiced syllables. The demo will be a first step in demonstrating the potential of this technology, which we believe will be very attractive and relevant to the ICASSP audience. Samples of our work can be found on .

Dataset Files

Video-driven Speech Reconstruction using Generative Adversarial Networks Show & Tell Demo.pdf