Multi-Task Joint-Learning for Robust Voice Activity Detection

Model-based VAD approaches have been widely used and have
achieved success in practice. These approaches usually cast
VAD as a frame-level classification problem and employ statistical
classifiers, such as a Gaussian Mixture Model (GMM) or a
Deep Neural Network (DNN), to assign a speech/silence label
to each frame. Because each frame is classified independently,
the VAD results tend to be fragile. To address this
problem, this paper proposes a new structured multi-frame prediction
DNN approach to improve segment-level
VAD performance. During DNN training, the VAD labels of multiple
consecutive frames are concatenated as targets, and the network is
jointly trained with a speech enhancement task to achieve robustness
under noisy conditions. During testing, the VAD label
for each frame is obtained by merging the prediction results
from neighbouring frames. Experiments on the Aurora 4
dataset showed that conventional DNN-based VAD has poor
and unstable prediction performance, while the proposed multi-task
trained VAD is much more robust.
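
The multi-frame idea can be illustrated with a minimal sketch. The abstract
does not state the window size or the exact merging rule, so the code below
assumes a symmetric window of `context` frames on each side and merges by
averaging every prediction made for a frame before thresholding. The names
`stack_targets` and `merge_predictions` are hypothetical, the DNN itself and
the speech enhancement task are omitted, and this is not the authors'
implementation.

```python
import numpy as np


def stack_targets(labels, context=2):
    """Build multi-frame targets: row t holds the labels of frames
    t-context .. t+context (edges are padded by repeating the end labels)."""
    labels = np.asarray(labels, dtype=float)
    n_frames = len(labels)
    padded = np.pad(labels, context, mode="edge")
    columns = [padded[k:k + n_frames] for k in range(2 * context + 1)]
    return np.stack(columns, axis=1)


def merge_predictions(window_probs, context=2, threshold=0.5):
    """Merge overlapping window predictions into one label per frame.

    window_probs[t, k] is the predicted speech probability for frame
    t + k - context, as estimated from the window centred on frame t.
    Assumption: merging is a plain average followed by a threshold.
    """
    window_probs = np.asarray(window_probs, dtype=float)
    n_frames, width = window_probs.shape
    if width != 2 * context + 1:
        raise ValueError("window width must equal 2 * context + 1")

    total = np.zeros(n_frames)
    count = np.zeros(n_frames)
    for t in range(n_frames):
        for k in range(width):
            frame = t + k - context
            if 0 <= frame < n_frames:  # ignore predictions for padded frames
                total[frame] += window_probs[t, k]
                count[frame] += 1

    # Every frame receives at least its own centre prediction, so count > 0.
    return (total / count >= threshold).astype(int)
```

Under these assumptions, a single erroneous prediction is outvoted by the
neighbouring windows that cover the same frame, which is one way the merged
output can be more stable than independent per-frame decisions.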

zhuang-iscslp16-slides.pdf
