MULTI-VIEW AUDIO-ARTICULATORY FEATURES FOR PHONETIC RECOGNITION ON RTMRI-TIMIT DATABASE
- Submitted by: Ioannis Douros
- Last updated: 13 April 2018 - 2:13pm
- Document Type: Poster
- Presenters: Ioannis Douros
- Paper Code: 3783
In this paper, we investigate the use of articulatory information, specifically real-time Magnetic Resonance Imaging (rtMRI) data of the vocal tract, to improve speech recognition performance. For our experiments, we use data from the rtMRI-TIMIT database. First, Scale-Invariant Feature Transform (SIFT) features are extracted from each video frame. The SIFT descriptors of each frame are then aggregated into a single histogram per image using the Bag of Visual Words methodology. Since this kind of articulatory information is difficult to acquire in typical speech recognition setups, we consider it available only during the training phase. We therefore adopt a multi-view approach, applying Canonical Correlation Analysis (CCA) to the visual and audio data. Using the transformation matrix learned during training, we transform both training and test audio data to produce MFCC-articulatory features, which form the input to the recognition system. Experimental results demonstrate improvements in phone recognition over the audio-only baseline.
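As a rough illustration of the visual feature pipeline (SIFT descriptors per frame, quantized into Bag of Visual Words histograms), below is a minimal sketch in Python. The library choices (OpenCV, scikit-learn), the 256-word codebook size, and the helper names are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

def extract_sift_descriptors(frames):
    """Return one array of 128-dim SIFT descriptors per grayscale rtMRI frame."""
    sift = cv2.SIFT_create()
    per_frame = []
    for frame in frames:
        _, desc = sift.detectAndCompute(frame, None)
        per_frame.append(desc if desc is not None else np.empty((0, 128), np.float32))
    return per_frame

def build_codebook(per_frame_descriptors, n_words=256, seed=0):
    """Learn a visual-word codebook with k-means over all training descriptors."""
    all_desc = np.vstack(per_frame_descriptors)
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(all_desc)

def bovw_histogram(descriptors, codebook):
    """Quantize one frame's descriptors into a normalized visual-word histogram."""
    n_words = codebook.n_clusters
    if len(descriptors) == 0:
        return np.zeros(n_words)
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / hist.sum()
```

Each rtMRI frame thus becomes a fixed-length histogram, which makes the visual view directly comparable, frame by frame, with the audio features in the CCA step sketched next.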
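The multi-view step can be sketched similarly: a CCA model is fitted on paired (MFCC, BoVW) training frames, and at test time only the audio-side projection is applied, which is why the articulatory view is needed during training alone. scikit-learn's CCA, the feature dimensionalities, and the choice of 20 canonical components are hypothetical, for illustration only.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Hypothetical shapes: 1000 paired training frames, 39-dim MFCCs
# (13 coefficients plus deltas) and 256-bin BoVW histograms.
X_train = rng.standard_normal((1000, 39))   # audio view (MFCC)
Y_train = rng.random((1000, 256))           # visual view (BoVW histograms)
X_test = rng.standard_normal((200, 39))     # test audio; no video available

# Fit CCA on the paired training views to learn a shared subspace.
cca = CCA(n_components=20)
cca.fit(X_train, Y_train)

# Passing only X applies the audio-side transformation matrix, producing
# the "MFCC-articulatory" features for both training and test audio.
mfcc_artic_train = cca.transform(X_train)
mfcc_artic_test = cca.transform(X_test)
print(mfcc_artic_test.shape)  # (200, 20)
```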