
Audio for Multimedia

Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks


Speech is a rich biometric signal that contains information about the identity, gender, and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) on raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g., a reference image or a one-hot encoding).
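
The pipeline described above pairs a speech encoder with an image decoder trained adversarially. Below is a minimal sketch of what such a speech-conditioned generator can look like in PyTorch; the layer sizes, kernel widths, and module names are illustrative assumptions, not the exact Wav2Pix architecture, and the discriminator and adversarial losses are omitted.

```python
# Sketch of a speech-conditioned GAN generator (PyTorch).
# All hyperparameters are illustrative, not the published Wav2Pix values.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Embeds a raw waveform into a fixed-size conditioning vector."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=32, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=32, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=32, stride=4), nn.LeakyReLU(0.2),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, wav):                   # wav: (batch, samples)
        h = self.conv(wav.unsqueeze(1))       # (batch, 128, frames)
        return self.proj(h.mean(dim=2))       # average over time -> (batch, embed_dim)

class Generator(nn.Module):
    """Decodes the speech embedding plus noise into a 64x64 RGB face."""
    def __init__(self, embed_dim=128, noise_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(embed_dim + noise_dim, 512, 4, 1, 0), nn.ReLU(),
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, speech_embed, noise):
        z = torch.cat([speech_embed, noise], dim=1)
        return self.net(z[:, :, None, None])  # (batch, 3, 64, 64)

enc, gen = SpeechEncoder(), Generator()
wav = torch.randn(4, 16384)                   # about 1 s of 16 kHz raw speech
faces = gen(enc(wav), torch.randn(4, 100))    # (4, 3, 64, 64) generated faces
```

In the adversarial phase, a discriminator would score real and generated faces together with the speech embedding, as in a standard conditional GAN.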

Paper Details

Authors: Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano, Kevin McGuinness, Jordi Torres, Xavier Giro-i-Nieto
Submitted on: 10 May 2019
Document file: slides

Cite:

[1] Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano, Kevin McGuinness, Jordi Torres, Xavier Giro-i-Nieto, "Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks", IEEE SigPort, 2019. [Online]. Available: http://sigport.org/4376. Accessed: Jun. 26, 2019.

A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling


Sound event detection (SED) entails two subtasks: recognizing what types of sound events are present in an audio stream (audio tagging), and pinpointing their onset and offset times (localization). In the popular multiple instance learning (MIL) framework for SED with weak labeling, an important component is the pooling function. This paper compares five types of pooling functions both theoretically and experimentally, with a special focus on their localization performance.
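
For concreteness, here is a minimal sketch of the five pooling variants this line of work typically compares (max, average, linear softmax, exponential softmax, and attention), each mapping a vector of frame-level event probabilities to a single clip-level probability; treat the exact set as an assumption rather than a restatement of the paper.

```python
# Five MIL pooling functions: frame-level probabilities y in [0, 1]
# (shape: frames,) -> one clip-level probability.
import torch

def max_pool(y):             # clip score = the single most confident frame
    return y.max()

def average_pool(y):         # unweighted mean over all frames
    return y.mean()

def linear_softmax(y):       # each frame weighted by its own probability
    return (y * y).sum() / y.sum()

def exp_softmax(y):          # each frame weighted by exp(probability)
    w = torch.exp(y)
    return (y * w).sum() / w.sum()

def attention_pool(y, w):    # weights w produced by a learned attention head
    return (y * w).sum() / w.sum()

y = torch.rand(100)          # toy frame-level probabilities
w = torch.rand(100)          # stand-in for learned attention weights
for f in (max_pool, average_pool, linear_softmax, exp_softmax):
    print(f.__name__, float(f(y)))
print("attention_pool", float(attention_pool(y, w)))
```

The choice matters for localization because the pooling function determines which frames receive gradient during training: max pooling updates only one frame per clip, while average pooling spreads the update uniformly.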

Paper Details

Authors: Yun Wang, Juncheng Li, Florian Metze
Submitted on: 9 May 2019
Document file: 2019.05 Slides for ICASSP.pdf

Cite:

[1] Yun Wang, Juncheng Li, Florian Metze, "A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling", IEEE SigPort, 2019. [Online]. Available: http://sigport.org/4224. Accessed: Jun. 26, 2019.

Connectionist Temporal Localization for Sound Event Detection with Sequential Labeling


Research on sound event detection (SED) with weak labeling has mostly focused on presence/absence labeling, which provides no temporal information at all about the event occurrences. In this paper, we consider SED with sequential labeling, which specifies the temporal order of the event boundaries. The conventional connectionist temporal classification (CTC) framework, when applied to SED with sequential labeling, does not localize long events well due to a "peak clustering" problem.
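
As background, the conventional CTC setup the paper starts from can be sketched as follows: the network's frame-level outputs are aligned against an ordered sequence of event-boundary tokens, with no timestamps in the targets. The token inventory and shapes here are illustrative assumptions.

```python
# CTC applied to sequential labeling: targets are ordered boundary tokens.
import torch
import torch.nn as nn

num_tokens = 5                    # blank + onset/offset tokens for 2 event types
frames, batch = 200, 1

logits = torch.randn(frames, batch, num_tokens, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

# Event 1 starts, event 2 starts, event 2 ends, event 1 ends --
# only the order of the boundaries is specified, never their times.
targets = torch.tensor([[1, 3, 4, 2]])
input_lengths = torch.tensor([frames])
target_lengths = torch.tensor([4])

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()                   # gradients flow back to the frame-level logits
```

Because CTC only requires each token to be emitted once, the trained network tends to fire in sharp, isolated spikes rather than spanning full event durations, which is consistent with the "peak clustering" behavior the abstract mentions.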

Paper Details

Authors: Yun Wang, Florian Metze
Submitted on: 8 May 2019
Document file: Poster.pdf

Cite:

[1] Yun Wang, Florian Metze, "Connectionist Temporal Localization for Sound Event Detection with Sequential Labeling", IEEE SigPort, 2019. [Online]. Available: http://sigport.org/4144. Accessed: Jun. 26, 2019.

Foreground Harmonic Noise Reduction for Robust Audio Fingerprinting


Audio fingerprinting systems are often well designed to cope with a range of broadband noise types; however, they cope less well when presented with additive noise containing sinusoidal components. This is largely because, in a short-time signal representation (over periods of ≈ 20 ms), these noise components are largely indistinguishable from salient components of the desired signal that is to be fingerprinted.
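
One way to picture the problem and a possible remedy (a hedged illustration, not necessarily the method proposed in the paper): tonal noise occupies the same frequency bins over many consecutive frames, so a long temporal median per bin can estimate the stationary tonal floor and drive an attenuation gain that spares short-lived fingerprint peaks.

```python
# Hypothetical stationary-tone suppression: attenuate spectrogram bins whose
# energy is explained by a long temporal median (i.e. persistent sinusoids).
import numpy as np
from scipy import signal

def suppress_stationary_tones(x, sr, win=1024, hop=256, depth=2.0):
    f, t, Z = signal.stft(x, sr, nperseg=win, noverlap=win - hop)
    mag = np.abs(Z)
    # Median over ~31 frames per bin: transients vanish, steady tones remain.
    floor = signal.medfilt2d(mag, kernel_size=(1, 31))
    # Bins dominated by the stationary floor are attenuated; others pass.
    gain = np.clip(1.0 - depth * floor / (mag + 1e-12), 0.1, 1.0)
    _, y = signal.istft(Z * gain, sr, nperseg=win, noverlap=win - hop)
    return y

sr = 16000
x = np.random.randn(2 * sr)        # stand-in for a noisy query recording
clean = suppress_stationary_tones(x, sr)
```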

Paper Details

Submitted on: 30 April 2018
Document file: Draft_v2.pdf

[1] , "FOREGROUND HARMONIC NOISE REDUCTION FOR ROBUST AUDIO FINGERPRINTING", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/3197. Accessed: Jun. 26, 2019.
@article{3197-18,
url = {http://sigport.org/3197},
author = { },
publisher = {IEEE SigPort},
title = {FOREGROUND HARMONIC NOISE REDUCTION FOR ROBUST AUDIO FINGERPRINTING},
year = {2018} }
TY - EJOUR
T1 - FOREGROUND HARMONIC NOISE REDUCTION FOR ROBUST AUDIO FINGERPRINTING
AU -
PY - 2018
PB - IEEE SigPort
UR - http://sigport.org/3197
ER -
. (2018). FOREGROUND HARMONIC NOISE REDUCTION FOR ROBUST AUDIO FINGERPRINTING. IEEE SigPort. http://sigport.org/3197
, 2018. FOREGROUND HARMONIC NOISE REDUCTION FOR ROBUST AUDIO FINGERPRINTING. Available at: http://sigport.org/3197.
. (2018). "FOREGROUND HARMONIC NOISE REDUCTION FOR ROBUST AUDIO FINGERPRINTING." Web.
1. . FOREGROUND HARMONIC NOISE REDUCTION FOR ROBUST AUDIO FINGERPRINTING [Internet]. IEEE SigPort; 2018. Available from : http://sigport.org/3197

Depression Speaks: Automatic Discrimination Between Depressed and Non-Depressed Speakers Based on Nonverbal Speech Features


This article proposes an automatic approach, based on nonverbal speech features, for discriminating between depressed and non-depressed speakers. The experiments were performed over one of the largest corpora collected for such a task in the literature (62 patients diagnosed with depression and 54 healthy control subjects), especially among corpora where the depressed speakers have been diagnosed as such by professional psychiatrists.
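
A hedged sketch of the general pipeline such a study implies: extract nonverbal descriptors (pitch, energy, spectral statistics) from each recording and train a classifier on them. The specific features, file names, and classifier below are placeholders, not the authors' configuration.

```python
# Illustrative nonverbal-feature pipeline for depressed/control classification.
import numpy as np
import librosa
from sklearn.svm import SVC

def nonverbal_features(path):
    y, sr = librosa.load(path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)       # pitch contour
    rms = librosa.feature.rms(y=y)[0]                   # energy contour
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # spectral shape
    # Summarize each contour with per-recording statistics.
    return np.concatenate([
        [f0.mean(), f0.std(), rms.mean(), rms.std()],
        mfcc.mean(axis=1), mfcc.std(axis=1),
    ])

# Placeholder file names and labels (1 = depressed, 0 = control).
paths = ["patient_01.wav", "patient_02.wav", "control_01.wav", "control_02.wav"]
labels = [1, 1, 0, 0]

X = np.stack([nonverbal_features(p) for p in paths])
clf = SVC(kernel="rbf").fit(X, labels)
```

In a real evaluation the train/test split would be speaker-independent, so that the classifier cannot exploit speaker identity instead of depression-related cues.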

Paper Details

Authors: F. Scibelli, G. Roffo, M. Tayarani, L. Bartoli, G. De Mattia, A. Esposito, A. Vinciarelli
Submitted on: 19 April 2018
Document file: icassp.pdf

Cite:

[1] F. Scibelli, G. Roffo, M. Tayarani, L. Bartoli, G. De Mattia, A. Esposito, A. Vinciarelli, "Depression Speaks: Automatic Discrimination Between Depressed and Non-Depressed Speakers Based on Nonverbal Speech Features", IEEE SigPort, 2018. [Online]. Available: http://sigport.org/2992. Accessed: Jun. 26, 2019.

A First Attempt at Polyphonic Sound Event Detection Using Connectionist Temporal Classification


Sound event detection is the task of detecting the type, starting time, and ending time of sound events in audio streams. Recently, recurrent neural networks (RNNs) have become the mainstream solution for sound event detection. Because RNNs make a prediction at every frame, it is necessary to provide exact starting and ending times of the sound events in the training data, making data annotation an extremely time-consuming process.
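
To make the annotation burden concrete: before a frame-level loss can be applied, every annotated (onset, offset, class) triple must be rasterized into per-frame binary targets, as in the small illustration below (the hop size and events are made up).

```python
# Turning strong (timestamped) labels into the per-frame targets an RNN needs.
import numpy as np

def frame_targets(events, n_frames, hop_s=0.01):
    """events: list of (onset_s, offset_s, class_idx) -> (n_frames, n_classes)."""
    n_classes = 1 + max(c for _, _, c in events)
    y = np.zeros((n_frames, n_classes), dtype=np.float32)
    for onset, offset, c in events:
        y[int(onset / hop_s):int(np.ceil(offset / hop_s)), c] = 1.0
    return y

# Two overlapping (polyphonic) events; the annotator must supply every time.
y = frame_targets([(0.20, 1.50, 0), (1.00, 2.30, 1)], n_frames=300)
print(y.sum(axis=0))   # active frames per class: [130. 130.]
```

Connectionist temporal classification, by contrast, is trained on the token sequence alone and marginalizes over all alignments internally, which removes the need for the timestamps above.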

Paper Details

Authors: Florian Metze
Submitted on: 27 February 2017
Document file: 2017.03 Poster for ICASSP.pdf

Cite:

[1] Florian Metze, "A First Attempt at Polyphonic Sound Event Detection Using Connectionist Temporal Classification", IEEE SigPort, 2017. [Online]. Available: http://sigport.org/1451. Accessed: Jun. 26, 2019.

Natural Sound Rendering for Headphones: Integration of signal processing techniques


With the strong growth of assistive and personal listening devices, natural sound rendering over headphones is becoming a necessity for prolonged listening in multimedia and virtual reality applications. The aim of natural sound rendering is to recreate sound scenes with spatial and timbral quality that is as natural as possible, so as to achieve a truly immersive listening experience. However, rendering natural sound over headphones encounters many challenges. This tutorial article presents signal processing techniques that tackle these challenges to assist human listening.
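
As a small taste of the signal processing involved, the sketch below shows one of the most basic building blocks of such rendering: binaural virtualization of a mono source by convolution with head-related impulse responses (HRIRs). The HRIRs here are random placeholders; in practice they would come from a measured HRTF set for the desired source direction.

```python
# Binaural virtualization: mono source -> two-ear signal via HRIR convolution.
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono, hrir_left, hrir_right):
    """Render a mono signal at the direction encoded by the HRIR pair."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right])          # (2, samples) headphone feed

# Placeholder 256-tap HRIRs (a real pair encodes interaural time and
# level differences plus spectral cues for one azimuth/elevation).
hl = np.random.randn(256) * np.hanning(256)
hr = np.random.randn(256) * np.hanning(256)
out = binauralize(np.random.randn(16000), hl, hr)
```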

Paper Details

Authors: Kaushik Sunder, Ee-Leng Tan
Submitted on: 23 February 2016
Document file: SPM2015manuscript-Natural Sound Rendering for Headphones.pdf

Cite:

[1] Kaushik Sunder, Ee-Leng Tan, "Natural Sound Rendering for Headphones: Integration of signal processing techniques", IEEE SigPort, 2015. [Online]. Available: http://sigport.org/166. Accessed: Jun. 26, 2019.