MULTI-VIEW AUDIO-ARTICULATORY FEATURES FOR PHONETIC RECOGNITION ON RTMRI-TIMIT DATABASE

Citation Author(s): Ioannis Douros, Athanasios Katsamanis, Petros Maragos
Submitted by: Ioannis Douros
Last updated: 13 April 2018 - 2:13pm
Document Type: Poster
Presenters: Ioannis Douros
Paper Code: 3783

In this paper, we investigate the use of articulatory information, specifically real-time Magnetic Resonance Imaging (rtMRI) data of the vocal tract, to improve speech recognition performance. For our experiments, we use data from the rtMRI-TIMIT database. First, Scale-Invariant Feature Transform (SIFT) features are extracted from each video frame. The SIFT descriptors of each frame are then summarized into a single histogram per image using the Bag of Visual Words methodology. Since this kind of articulatory information is difficult to acquire in typical speech recognition setups, we consider it available only during the training phase. We therefore adopt a multi-view approach, applying Canonical Correlation Analysis (CCA) to the visual and audio data. Using the transformation matrix learned during training, we transform both the training and test audio data to produce MFCC-articulatory features, which form the input to the recognition system. Experimental results demonstrate improvements in phone recognition over the audio-based baseline.
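A minimal sketch of the pipeline described in the abstract is given below. It is not the authors' code: it assumes scikit-learn's KMeans and CCA implementations, and the function names, feature dimensions, and parameters (e.g. n_visual_words, n_cca_components) are illustrative assumptions rather than values reported in the paper.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import CCA

def bovw_histograms(sift_descriptors_per_frame, n_visual_words=64):
    """Quantize per-frame SIFT descriptors into one Bag-of-Visual-Words
    histogram per rtMRI video frame (illustrative codebook size)."""
    all_desc = np.vstack(sift_descriptors_per_frame)      # pool descriptors from all frames
    codebook = KMeans(n_clusters=n_visual_words, n_init=10).fit(all_desc)
    hists = []
    for desc in sift_descriptors_per_frame:
        words = codebook.predict(desc)                     # assign each descriptor to a visual word
        hist, _ = np.histogram(words, bins=np.arange(n_visual_words + 1))
        hists.append(hist / max(hist.sum(), 1))            # normalized histogram for this frame
    return np.array(hists)

def fit_multiview_cca(mfcc_train, bovw_train, n_cca_components=20):
    """Learn the audio-visual projection on frame-aligned training data.
    mfcc_train: (n_frames, n_mfcc); bovw_train: (n_frames, n_visual_words)."""
    cca = CCA(n_components=n_cca_components)
    cca.fit(mfcc_train, bovw_train)                        # correlate the two views
    return cca

def mfcc_articulatory_features(cca, mfcc):
    """At recognition time only audio is available, so both train and test
    MFCCs are mapped through the audio-side CCA projection."""
    return cca.transform(mfcc)

The key design point is that the rtMRI view is used only to learn the CCA projection; afterwards, the audio-side transform alone produces the MFCC-articulatory features fed to the recognizer for both training and test utterances.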
