ET: Explain to Train: Leveraging Explanations to Enhance the Training of A Multimodal Transformer
- DOI: 10.60864/c4gj-6283
- Submitted by: Meghna Ayyar
- Last updated: 12 November 2024 - 6:05am
- Document Type: Presentation Slides
- Document Year: 2024
- Presenters: Meghna P Ayyar
- Paper Code: 1673
Explainable Artificial Intelligence (XAI) has become increasingly vital for improving the transparency and reliability of neural network decisions. Transformer architectures are the state of the art both for single modalities such as video, language, and signals and for multimodal approaches. Although XAI methods for transformers are available, their potential impact during model training remains underexplored. We therefore propose Explanation-guided Training (ET), which leverages an XAI method to identify salient input regions and guides the model to focus on them during training. We develop ET in a typical multimodal analysis framework with a multimodal transformer operating on videos and signals: for the video modality, ET masks the non-salient regions of the input; for the sensor modality, it reweights the signals with weights derived from the explanation scores. We benchmark ET on the publicly available UCF50 video dataset and, for multimodal evaluation, on a risk-detection corpus of egocentric videos and wearable sensor data; in both settings, ET consistently outperforms baseline vanilla training and the state-of-the-art XAI-based IFI method.
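The abstract describes the core ET mechanism: an XAI method scores the saliency of each input, the video is masked down to its salient regions, the signal is reweighted by its explanation scores, and the model is then trained on these explanation-enhanced inputs. The sketch below illustrates one such training step under stated assumptions; it is not the paper's implementation. The names `model`, `explain`, and the threshold `tau`, as well as the hard-masking and soft-weighting rules, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def et_training_step(model, explain, video, signal, labels, optimizer, tau=0.5):
    """One Explanation-guided Training (ET) step -- illustrative sketch only.

    `explain` is assumed to return per-element relevance scores in [0, 1]
    for each modality (e.g., from a gradient- or attention-based XAI method).
    """
    # 1) Obtain saliency maps for both modalities from the XAI method.
    #    Detached from the graph so explanations do not receive gradients.
    with torch.no_grad():
        video_sal, signal_sal = explain(model, video, signal)

    # 2) Video modality: mask out non-salient regions
    #    (hard binary mask at an assumed threshold tau).
    video_masked = video * (video_sal >= tau).float()

    # 3) Sensor modality: enhance the signal with weights
    #    derived from the explanation scores (soft weighting).
    signal_weighted = signal * signal_sal

    # 4) Standard supervised update on the explanation-enhanced inputs.
    optimizer.zero_grad()
    logits = model(video_masked, signal_weighted)
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```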