ET: Explain to Train: Leveraging Explanations to Enhance the Training of A Multimodal Transformer

DOI:
10.60864/c4gj-6283
Citation Author(s):
Jenny Benois-Pineau, Akka Zemmari
Submitted by:
Meghna Ayyar
Last updated:
12 November 2024 - 6:05am
Document Type:
Presentation Slides
Document Year:
2024
Event:
Presenters:
Meghna P Ayyar
Paper Code:
1673
 

Explainable Artificial Intelligence (XAI) has become increasingly vital for improving the transparency and reliability of neural network decisions. Transformer architectures are the state of the art for tasks across single modalities such as video, language, and signals, as well as for multimodal approaches. Although XAI methods for transformers are available, their potential impact during model training remains underexplored. We therefore propose Explanation-guided Training (ET), which leverages an XAI method to identify salient input regions and guides the model to focus on them during training. We develop ET in a typical multimodal analysis framework using a multimodal transformer that operates on videos and signals: for the video modality, ET masks the non-salient regions of the input, while for the sensor modality it re-weights the signals using explanation scores. We benchmark ET on the publicly available UCF50 video dataset and on a risk detection corpus comprising egocentric videos and wearable sensor data, where it consistently outperforms both baseline vanilla training and the state-of-the-art XAI-based IFI method.
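To make the two explanation-guided input transformations concrete, the following is a minimal sketch of what they could look like. It is not the authors' implementation: the function names, the tensor shapes, the keep_ratio parameter, and the (1 + w) weighting scheme are all illustrative assumptions; only the general idea (masking non-salient video regions, re-weighting signals by explanation scores) comes from the abstract.

```python
import torch

def mask_non_salient_video(video, saliency, keep_ratio=0.5):
    """Zero out non-salient pixels of a video clip.

    video:    (T, C, H, W) clip tensor
    saliency: (T, H, W) per-pixel explanation scores from an XAI method
    keep_ratio is a hypothetical knob: the fraction of pixels kept per frame.
    """
    # Per-frame quantile threshold so that `keep_ratio` of pixels survive.
    thresh = torch.quantile(saliency.flatten(1), 1.0 - keep_ratio, dim=1)
    mask = (saliency >= thresh.view(-1, 1, 1)).float()  # (T, H, W)
    return video * mask.unsqueeze(1)  # broadcast mask over channels

def weight_signal(signal, scores):
    """Enhance a sensor sequence with explanation-based weights.

    signal: (T, D) sensor sequence
    scores: (T,) per-timestep or (T, D) per-channel explanation scores
    """
    w = scores / (scores.max() + 1e-8)  # normalize scores to [0, 1]
    if w.dim() == 1:
        w = w.unsqueeze(-1)  # broadcast over the feature dimension
    # Assumed scheme: amplify salient parts while keeping the rest intact.
    return signal * (1.0 + w)
```

In a training loop, these transforms would be applied to each batch after computing explanations with the chosen XAI method, and the masked video and weighted signal would then be fed to the multimodal transformer in place of the raw inputs.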
