Audio-Based Multimedia Event Detection Using Deep Recurrent Neural Networks
- Submitted by: Yun Wang
- Last updated: 17 March 2016 - 4:13pm
- Document Type: Presentation Slides
- Document Year: 2016
- Presenters: Yun Wang
Multimedia event detection (MED) is the task of detecting given events (e.g. birthday party, making a sandwich) in a large collection of video clips. While visual features and automatic speech recognition typically provide the strongest evidence for this task, non-speech audio can also contribute useful information, such as crowds cheering, engine noises, or animal sounds.
MED is typically formulated as a two-stage process: the first stage generates clip-level feature representations, often by aggregating frame-level features; the second stage performs binary or multi-class classification to decide whether a given event occurs in a video clip. Both stages are usually performed "statically", i.e. using only local temporal information or order-insensitive bag-of-words models.
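As a rough illustration of such a static two-stage baseline (not the system proposed here), the sketch below mean-pools frame-level audio features into a fixed-length clip vector and trains an SVM on it. The feature dimension, number of clips, and labels are invented purely for the example.

```python
import numpy as np
from sklearn.svm import SVC

def clip_representation(frame_features: np.ndarray) -> np.ndarray:
    """Stage 1: aggregate frame-level features (num_frames x feature_dim)
    into a fixed-length clip-level vector by mean pooling over time."""
    return frame_features.mean(axis=0)

# Hypothetical data: each clip is a (num_frames x 40) array of frame-level
# audio features, with a binary label for one target event.
rng = np.random.default_rng(0)
train_clips = [rng.normal(size=(int(rng.integers(50, 200)), 40)) for _ in range(20)]
train_labels = np.array([i % 2 for i in range(20)])  # made-up event labels

X_train = np.stack([clip_representation(c) for c in train_clips])

# Stage 2: binary classification of the clip-level representation.
clf = SVC(kernel="rbf", probability=True)
clf.fit(X_train, train_labels)

test_clip = rng.normal(size=(120, 40))
score = clf.predict_proba(clip_representation(test_clip)[None, :])[0, 1]
print(f"Event confidence: {score:.3f}")
```

Because both pooling and the SVM ignore the order of frames, this kind of pipeline discards the longer-range temporal structure that the approach below tries to exploit.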
In this paper, we introduce longer-range temporal information with deep recurrent neural networks (RNNs) for both stages. We classify each audio frame among a set of semantic units called "noisemes"; the sequence of frame-level confidence distributions is used as a variable-length clip-level representation. Such confidence vector sequences are then fed into long short-term memory (LSTM) networks for clip-level classification. We observe improvements in both frame-level and clip-level performance compared to SVM and feed-forward neural network baselines.
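The sketch below illustrates the clip-level stage of such a pipeline: an LSTM consumes a variable-length sequence of frame-level noiseme confidence vectors and produces per-event scores. The noiseme vocabulary size, number of events, hidden size, and training details are assumptions for illustration and are not taken from the abstract.

```python
import torch
import torch.nn as nn

NUM_NOISEMES = 17   # assumed size of the noiseme vocabulary
NUM_EVENTS = 10     # assumed number of target events

class ClipLSTM(nn.Module):
    """Clip-level classifier: reads a sequence of frame-level noiseme
    confidence distributions and predicts event detection scores."""
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(NUM_NOISEMES, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, NUM_EVENTS)

    def forward(self, confidences: torch.Tensor) -> torch.Tensor:
        # confidences: (batch, num_frames, NUM_NOISEMES)
        _, (h_n, _) = self.lstm(confidences)
        # Use the final hidden state as the clip-level summary.
        return self.out(h_n[-1])

model = ClipLSTM()
# One clip of 300 frames, each a confidence distribution over noisemes
# (here produced by softmax over random logits as a stand-in).
clip = torch.softmax(torch.randn(1, 300, NUM_NOISEMES), dim=-1)
event_logits = model(clip)                   # shape: (1, NUM_EVENTS)
event_scores = torch.sigmoid(event_logits)   # per-event detection scores
```

In this sketch the final hidden state summarizes the whole clip, which is one simple way to let the classifier use temporal context beyond a single frame.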