A First Attempt at Polyphonic Sound Event Detection Using Connectionist Temporal Classification

Citation Author(s):
Florian Metze
Submitted by:
Yun Wang
Last updated:
27 February 2017 - 5:12pm
Document Type:
Document Year:
Yun Wang

Sound event detection is the task of detecting the type, starting time, and ending time of sound events in audio streams. Recently, recurrent neural networks (RNNs) have become the mainstream solution for sound event detection. Because RNNs make a prediction at every frame, it is necessary to provide exact starting and ending times of the sound events in the training data, making data annotation an extremely time-consuming process. Connectionist temporal classification (CTC), as a sequence-to-sequence model, can relax this constraint, because it suffices to provide ordered sequences of sound events without exact starting and ending times.
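To illustrate why CTC does not need exact timing, the following minimal sketch shows the standard CTC collapsing rule: consecutive repeats are merged and blank symbols are dropped, so many different frame-level paths map to the same ordered label sequence. The labels "dog" and "car", the blank symbol "-", and the function name ctc_collapse are hypothetical choices made for this illustration and do not come from the paper.

```python
from itertools import groupby
from typing import List

BLANK = "-"  # hypothetical blank symbol chosen for this illustration


def ctc_collapse(path: List[str], blank: str = BLANK) -> List[str]:
    """Map a frame-level path to a label sequence using the CTC rule:
    first merge consecutive repeats, then remove blanks."""
    merged = [tok for tok, _ in groupby(path)]
    return [tok for tok in merged if tok != blank]


# Two frame-level paths that place the (hypothetical) events at different times
path_a = ["-", "dog", "dog", "-", "-", "car", "-"]
path_b = ["dog", "-", "-", "-", "car", "car", "car"]

# Both collapse to the same ordered sequence, so the training target
# only has to state the order of events, not when each one occurs.
seq_a = ctc_collapse(path_a)
seq_b = ctc_collapse(path_b)
print(seq_a, seq_b)
```

During training, CTC sums the probability of every frame-level path that collapses to the target sequence, which is why an ordered list of events is sufficient supervision.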

This paper presents a first attempt at using CTC for sound event detection. In the polyphonic situation, sound events may overlap with each other, making it hard to define ordered sequences of sound events. We propose to use the boundaries (i.e., starts and ends) of the sound events as tokens for CTC. We show that, given only rough hints about the positions of the boundaries, CTC is able to locate them on a very noisy corpus of consumer-generated content. The CTC approach seems to be particularly suited to detecting short and transient sounds, which have traditionally been the hardest to detect.
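The boundary-token idea can be sketched as follows. This is an illustrative sketch, not the authors' code: the event labels, the "_start" and "_end" token naming, and the tie-breaking rule for simultaneous boundaries are all assumptions made for the example.

```python
from typing import List, Tuple

# (label, start time in seconds, end time in seconds); example values are made up
Event = Tuple[str, float, float]


def boundary_tokens(events: List[Event]) -> List[str]:
    """Turn possibly overlapping events into one ordered sequence of
    start/end boundary tokens, sorted by time."""
    boundaries = []
    for label, start, end in events:
        boundaries.append((start, 0, f"{label}_start"))
        boundaries.append((end, 1, f"{label}_end"))
    # Sort by time; at equal times, starts come before ends (an assumed tie-break).
    boundaries.sort(key=lambda b: (b[0], b[1]))
    return [token for _, _, token in boundaries]


# Three overlapping events: the events themselves have no single natural order,
# but their boundaries do.
events = [
    ("speech", 0.0, 3.0),
    ("dog_bark", 1.2, 1.5),
    ("music", 2.0, 6.0),
]
tokens = boundary_tokens(events)
print(tokens)
```

Although the timestamps are used here to establish the order, only the resulting token sequence would serve as the CTC target, so an annotator who knows just the order of the boundaries could write it down directly.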

2017.03 Poster for ICASSP.pdf