Multi-modality in Language

End-to-End Audio-Visual Speech Recognition with Conformers


In this work, we present a hybrid CTC/attention model based on a modified ResNet-18 and a convolution-augmented transformer (Conformer) that can be trained end-to-end. In particular, the visual and audio encoders learn to extract features directly from raw pixels and audio waveforms, respectively. These features are fed to Conformers, and the two streams are then fused by a multi-layer perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism.
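The fusion and training objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the two Conformer outputs are frame-aligned, that fusion concatenates them per frame before a two-layer MLP, and that the CTC and attention losses are mixed by a convex weight, as is common in hybrid CTC/attention systems. The names `mlp_fuse`, `hybrid_loss`, `init_linear`, and `ctc_weight`, and all dimensions, are hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b to one feature vector (plain lists of floats)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


def init_linear(rng, in_dim, out_dim):
    """Random weights and zero bias; an illustrative initialisation only."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def mlp_fuse(audio_feats, visual_feats, layer1, layer2):
    """Concatenate per-frame audio and visual features, then apply a 2-layer MLP.

    Assumption: both streams have already been brought to the same frame rate.
    """
    if len(audio_feats) != len(visual_feats):
        raise ValueError("audio and visual streams must have the same number of frames")
    fused = []
    for a, v in zip(audio_feats, visual_feats):
        hidden = relu(linear(a + v, *layer1))  # a + v is list concatenation
        fused.append(linear(hidden, *layer2))
    return fused


def hybrid_loss(ctc_loss, attention_loss, ctc_weight):
    """Convex combination of the two losses; the weight value is an assumption."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Toy usage: 5 frames, 4-dimensional features per modality.
rng = random.Random(0)
frames, dim = 5, 4
audio = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(frames)]
visual = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(frames)]
layer1 = init_linear(rng, 2 * dim, 8)
layer2 = init_linear(rng, 8, dim)
fused = mlp_fuse(audio, visual, layer1, layer2)
```

In this sketch `fused` holds one vector per frame, which would then be passed to the CTC and attention decoders. In a real system the two loss terms would come from those decoders rather than being supplied as plain numbers.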

conformers_poster.pdf


Categories: Multimodal signal processing
