
Supplementary Material

Citation Author(s):
Submitted by: Anonymous for s...
Last updated: 5 February 2025 - 11:17pm
Document Type: Research Manuscript

Although many deepfake detection methods have been proposed to combat the severe misuse of generative AI, none provide detailed, human-interpretable explanations beyond simple real/fake responses. This limitation makes it difficult for humans to assess the accuracy of detection results, especially when models encounter unseen deepfakes. To address this issue, we propose a novel deepfake detector based on a large Vision-Language Model (VLM) that is capable of explaining which facial regions have been manipulated. We frame deepfake detection as Visual Question Answering (VQA) and perform visual instruction tuning to train the model on our collected Explainable Deepfake Face (ExDF) dataset. The dataset consists of fake images from diverse generative adversarial networks (GANs) and diffusion models (DMs), with explanations produced by GPT-4o guided by the corresponding ground-truth masks of the manipulated regions. Moreover, a facial mask encoder is introduced to guide the model to focus on key facial features, thereby improving both detection and explanation performance. Extensive experiments demonstrate that training the proposed model on the full ExDF dataset not only enhances detection accuracy compared to baseline methods but also provides detailed, human-interpretable explanations. To our knowledge, ExDF is the first explainable deepfake face dataset covering both GANs and DMs with comprehensive descriptions of altered facial regions.
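To illustrate the VQA framing described above, the sketch below assembles one instruction-tuning sample in a LLaVA-style conversation format. All field names, the question wording, and the `build_vqa_sample` helper are illustrative assumptions, not the actual ExDF schema or pipeline.

```python
import json

def build_vqa_sample(image_path, is_fake, manipulated_regions, explanation):
    """Assemble one hypothetical instruction-tuning sample that pairs a
    real/fake question with a region-level explanation as the answer.
    Field names follow a common LLaVA-style layout, not the ExDF spec."""
    label = "fake" if is_fake else "real"
    answer = f"This face is {label}."
    if is_fake and manipulated_regions:
        answer += (" Manipulated regions: "
                   + ", ".join(manipulated_regions)
                   + ". " + explanation)
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": "Is this face real or fake? "
                      "If fake, explain which facial regions were altered."},
            {"from": "gpt", "value": answer},
        ],
    }

# Example: a fake image whose ground-truth mask covers the eyes and mouth.
sample = build_vqa_sample(
    "faces/000123.png",
    is_fake=True,
    manipulated_regions=["eyes", "mouth"],
    explanation="The eye reflections are inconsistent and the lip texture is overly smooth.",
)
print(json.dumps(sample, indent=2))
```

In this framing, detection and explanation are learned jointly: the model's answer carries both the binary label and the region-level description derived from the ground-truth mask.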

