Audio-Based Multimedia Event Detection Using Deep Recurrent Neural Networks
- Submitted by: Yun Wang
- Last updated: 17 March 2016 - 4:13pm
- Document Type: Presentation Slides
- Document Year: 2016
- Presenters: Yun Wang
Multimedia event detection (MED) is the task of detecting given events (e.g. birthday party, making a sandwich) in a large collection of video clips. While visual features and automatic speech recognition typically provide the strongest evidence for this task, non-speech audio can also contribute useful information, such as crowds cheering, engine noises, or animal sounds.
MED is typically formulated as a two-stage process: the first stage generates clip-level feature representations, often by aggregating frame-level features; the second stage performs binary or multi-class classification to decide whether a given event occurs in a video clip. Both stages are usually performed "statically", i.e. using only local temporal information or order-insensitive bag-of-words models.
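As a rough illustration of such a static two-stage baseline (not the system proposed here), the sketch below mean-pools frame-level audio features into a fixed-length clip vector and trains an SVM on it. The feature dimension, number of clips, and labels are invented purely for the example.

```python
import numpy as np
from sklearn.svm import SVC

def clip_representation(frame_features: np.ndarray) -> np.ndarray:
    """Stage 1: aggregate frame-level features (num_frames x feature_dim)
    into a fixed-length clip-level vector by mean pooling over time."""
    return frame_features.mean(axis=0)

# Hypothetical data: each clip is a (num_frames x 40) array of frame-level
# audio features, with a binary label for one target event.
rng = np.random.default_rng(0)
train_clips = [rng.normal(size=(int(rng.integers(50, 200)), 40)) for _ in range(20)]
train_labels = np.array([i % 2 for i in range(20)])  # made-up event labels

X_train = np.stack([clip_representation(c) for c in train_clips])

# Stage 2: binary classification of the clip-level representation.
clf = SVC(kernel="rbf", probability=True)
clf.fit(X_train, train_labels)

test_clip = rng.normal(size=(120, 40))
score = clf.predict_proba(clip_representation(test_clip)[None, :])[0, 1]
print(f"Event confidence: {score:.3f}")
```

Because both pooling and the SVM ignore the order of frames, this kind of pipeline discards the longer-range temporal structure that the approach below tries to exploit.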
In this paper, we introduce longer-range temporal information with deep recurrent neural networks (RNNs) for both stages. We classify each audio frame among a set of semantic units called "noisemes"; the sequence of frame-level confidence distributions is used as a variable-length clip-level representation. Such confidence vector sequences are then fed into long short-term memory (LSTM) networks for clip-level classification. We observe improvements in both frame-level and clip-level performance compared to SVM and feed-forward neural network baselines.
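The sketch below illustrates the clip-level stage of such a pipeline: an LSTM consumes a variable-length sequence of frame-level noiseme confidence vectors and produces per-event scores. The noiseme vocabulary size, number of events, hidden size, and training details are assumptions for illustration and are not taken from the abstract.

```python
import torch
import torch.nn as nn

NUM_NOISEMES = 17   # assumed size of the noiseme vocabulary
NUM_EVENTS = 10     # assumed number of target events

class ClipLSTM(nn.Module):
    """Clip-level classifier: reads a sequence of frame-level noiseme
    confidence distributions and predicts event detection scores."""
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(NUM_NOISEMES, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, NUM_EVENTS)

    def forward(self, confidences: torch.Tensor) -> torch.Tensor:
        # confidences: (batch, num_frames, NUM_NOISEMES)
        _, (h_n, _) = self.lstm(confidences)
        # Use the final hidden state as the clip-level summary.
        return self.out(h_n[-1])

model = ClipLSTM()
# One clip of 300 frames, each a confidence distribution over noisemes
# (here produced by softmax over random logits as a stand-in).
clip = torch.softmax(torch.randn(1, 300, NUM_NOISEMES), dim=-1)
event_logits = model(clip)                   # shape: (1, NUM_EVENTS)
event_scores = torch.sigmoid(event_logits)   # per-event detection scores
```

In this sketch the final hidden state summarizes the whole clip, which is one simple way to let the classifier use temporal context beyond a single frame.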