Multimodal signal processing

TIME-LAG AWARE MULTI-MODAL VARIATIONAL AUTOENCODER USING BASEBALL VIDEOS AND TWEETS FOR PREDICTION OF IMPORTANT SCENES

This paper presents a novel method for the prediction of important scenes based on a time-lag aware multi-modal variational autoencoder (Tl-MVAE-PIS), using baseball videos and tweets posted on Twitter. The paper makes two technical contributions. First, to use heterogeneous data effectively for the prediction of important scenes, we transform the textual, visual and audio features obtained from tweets and videos into latent features. Tl-MVAE-PIS can then flexibly express the relationships between them in the constructed latent space.

ICIP2021_hirasawa_submit.pdf (263)

Categories: Multimodal signal processing

9 Views

End-to-End Audio-Visual Speech Recognition with Conformers

In this work, we present a hybrid CTC/Attention model based on a modified ResNet-18 and a Convolution-augmented transformer (Conformer) that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from audio waveforms and raw pixels, respectively; these features are fed to Conformers, and fusion then takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism.
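The two ideas named in this abstract, MLP fusion of the two streams and a combined CTC/attention objective, can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how the streams are joined before the MLP (concatenation is assumed here), and the function names, toy weights and the `ctc_weight` value are hypothetical, not taken from the paper.

```python
def mlp_fuse(audio_feat, visual_feat, w1, b1, w2, b2):
    """Fuse two modality feature vectors with a two-layer MLP.

    Assumption: the streams are joined by concatenation before the MLP.
    w1/w2 are weight matrices given as lists of rows; b1/b2 are bias lists.
    """
    x = audio_feat + visual_feat  # list concatenation of the two streams
    # Hidden layer with ReLU activation.
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    # Linear output layer.
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def hybrid_loss(ctc_loss, attention_loss, ctc_weight=0.1):
    """Interpolate a CTC loss and an attention (sequence-to-sequence) loss.

    ctc_weight is a hypothetical default; the abstract does not give one.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


if __name__ == "__main__":
    fused = mlp_fuse(
        [1.0], [2.0],
        w1=[[1.0, 1.0], [-1.0, -1.0]], b1=[0.0, 0.0],
        w2=[[1.0, 1.0]], b2=[0.5],
    )
    print(fused)
    print(hybrid_loss(2.0, 4.0, ctc_weight=0.5))
```

In a real system the two loss values would come from a CTC head and an attention decoder that share the fused encoder output; here they are plain numbers so the weighting itself is visible.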

conformers_poster.pdf (271)

Categories: Multimodal signal processing

34 Views

An Adaptive Multi-Scale and Multi-Level Features Fusion Network with Perceptual Loss for Change Detection

Change detection plays a vital role in monitoring and analyzing temporal changes in Earth observation tasks. This paper proposes a novel adaptive multi-scale and multi-level features fusion network for change detection in very-high-resolution bi-temporal remote sensing images. The proposed approach has three advantages. Firstly, it excels in abstracting high-level representations empowered by a highly effective feature extraction module.

MFPNet_Slides.pdf

Presentation (216)

MFPNet_poster.pdf

Poster (234)

Categories: Multimodal signal processing, Image/Video Processing

27 Views

ADAPTIVE RE-BALANCING NETWORK WITH GATE MECHANISM FOR LONG-TAILED VISUAL QUESTION ANSWERING

ARN.pptx (202)

Categories: Multimodal signal processing

9 Views

Multi-Layer Content Interaction Through Quaternion Product for Visual Question Answering

Multi-modality fusion technologies have greatly improved the performance of neural network-based Video Description/Caption, Visual Question Answering (VQA) and Audio Visual Scene-aware Dialog (AVSD) over the recent years. Most previous approaches only explore the last layers of multiple layer feature fusion while omitting the importance of intermediate layers. To solve the issue for the intermediate layers, we propose an efficient Quaternion Block Network (QBN) to learn interaction not only for the last layer but also for all intermediate layers simultaneously.
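For readers unfamiliar with the quaternion product in the title, the following is a minimal sketch of the standard Hamilton product of two quaternions. It illustrates the mathematical operation only; how QBN maps layer features onto quaternion components is not described in this abstract, so the function name and the tuple layout `(real, i, j, k)` are assumptions made for illustration.

```python
def hamilton_product(p, q):
    """Return the Hamilton product p * q of two quaternions.

    Each quaternion is a 4-tuple (real, i, j, k). The product is not
    commutative: hamilton_product(p, q) generally differs from
    hamilton_product(q, p).
    """
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


if __name__ == "__main__":
    i = (0, 1, 0, 0)
    j = (0, 0, 1, 0)
    print(hamilton_product(i, j))  # i * j = k
    print(hamilton_product(j, i))  # j * i = -k
```

Because each output component mixes all four components of both inputs, a single product lets every part of one feature group interact with every part of the other, which is one plausible reason the operation is attractive for fusing features.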

Icassp2020.pdf

Icassp2020_Multi-Layer_Content_Interaction_Through_Quaternion_Product_for_Visual_Question_Answering (304)

Categories: Multimodal signal processing

15 Views

WHAT MAKES THE SOUND?: A DUAL-MODALITY INTERACTING NETWORK FOR AUDIO-VISUAL EVENT LOCALIZATION

The presence of auditory and visual senses enables humans to obtain a profound understanding of the real-world scenes. While audio and visual signals are capable of providing scene knowledge individually, the combination of both offers a better insight about the underlying event. In this paper, we address the problem of audio-visual event localization where the goal is to identify the presence of an event that is both audible and visible in a video, using fully or weakly supervised learning.

What_makes_the_sound_ICASSP2020.pdf (336)

Categories: Multimodal signal processing

67 Views

Spectrogram Analysis Via Self-Attention for Realizing Cross-Model Visual-Audio Generation

Human cognition is supported by the combination of multi-modal information from different sources of perception. The two most important modalities are visual and audio. Cross-modal visual-audio generation enables the synthesis of data from one modality following the acquisition of data from another. This brings about the full experience that can only be achieved through the combination of the two. In this paper, the Self-Attention mechanism is applied to cross-modal visual-audio generation for the first time.
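As a rough illustration of the self-attention mechanism this abstract refers to, the sketch below applies scaled dot-product self-attention to a short sequence of feature vectors, such as spectrogram time frames. It is a simplification under stated assumptions: queries, keys and values are the input frames themselves (no learned projections), and the names `self_attention` and `frames` are hypothetical; the paper's actual architecture is not described here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with Q = K = V = frames.

    frames: list of equal-length feature vectors (e.g. spectrogram frames).
    Returns (outputs, weights), where weights[t] is the attention
    distribution of frame t over all frames.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    weights = []
    for query in frames:
        scores = [
            sum(qi * ki for qi, ki in zip(query, key)) / scale
            for key in frames
        ]
        weights.append(softmax(scores))
    # Each output frame is a weighted average of all input frames.
    outputs = [
        [sum(w * value[j] for w, value in zip(row, frames)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


if __name__ == "__main__":
    outputs, weights = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    for row in weights:
        print([round(w, 3) for w in row])
```

Each output frame can draw on every other frame regardless of distance in time, which is the property that makes self-attention useful for capturing long-range structure in a spectrogram.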

SA-CMGAN poster.pdf

SA-CMGAN (294)

Categories: Multimodal signal processing

55 Views

Intra Prediction in the Emerging VVC Video Coding Standard

193-Intra-prediction-VVC-poster.pdf

Intra Prediction in the Emerging VVC Video Coding Standard (370)

Categories: Multimodal signal processing

63 Views

Lightweight Deep Convolutional Neural Networks for Facial Expression Recognition

MMSP_poster_A0_v3.2.pdf (693)

Categories: Multimodal signal processing

35 Views

An Occlusion Probability Model for Improving the Rendering Quality of Views

Occlusion, a common phenomenon on object surfaces, can seriously affect the collection of light field information. To visualize light field data sets, most prior light field rendering (LFR) algorithms idealize or neglect occlusions. However, occlusion discontinuities can cause incorrect samples to be captured, so the 3D spatial structure of some features may be missing. To solve this problem, we propose an occlusion probability (OCP) model that improves both the captured information and the rendering quality of views with occlusion in LFR.

occlusion_MMSP2019.pdf (381)

Categories: Multimodal signal processing

23 Views
