Contextual Human Object Interaction Understanding From Pre-Trained Large Language Model
- Submitted by:
- JIANJUN GAO
- Last updated:
- 2 April 2024 - 3:59am
- Document Type:
- Presentation Slides
- Document Year:
- 2024
- Paper Code:
- SS-L16.3
Existing human-object interaction (HOI) detection methods have introduced zero-shot learning techniques to recognize unseen interactions, but they remain limited in understanding contextual information and in comprehensive reasoning. To overcome these limitations, we propose ContextHOI, a novel HOI learning framework that serves as an effective contextual HOI detector with enhanced contextual understanding and zero-shot reasoning ability. The main contributions of ContextHOI are a novel context-mining decoder and a powerful interaction reasoning large language model (LLM). The context-mining decoder extracts linguistic contextual information from a pre-trained vision-language model. Based on the extracted contextual information, the interaction reasoning LLM further enhances zero-shot reasoning ability by leveraging rich linguistic knowledge. Extensive evaluation demonstrates that our framework outperforms existing zero-shot methods on the HICO-DET and SWIG-HOI datasets, achieving up to 19.34% mAP on unseen interactions.
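The sketch below is a minimal, illustrative approximation of the two-stage idea summarized in the abstract: a context-mining decoder that cross-attends learnable context queries to frozen vision-language-model (VLM) features, followed by a reasoning step over the mined context. It is not the authors' implementation; all module names, dimensions, and the stand-in scoring function (used here in place of the paper's LLM-based reasoning) are assumptions for illustration only.

```python
# Illustrative sketch only; not the ContextHOI release. Sizes and names are assumed.
import torch
import torch.nn as nn


class ContextMiningDecoder(nn.Module):
    """Cross-attends learnable context queries to frozen VLM image features."""

    def __init__(self, dim: int = 512, num_queries: int = 16, num_layers: int = 2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, vlm_features: torch.Tensor) -> torch.Tensor:
        # vlm_features: (batch, num_patches, dim) from a frozen VLM image encoder
        batch = vlm_features.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Returns (batch, num_queries, dim) mined context tokens
        return self.decoder(tgt=queries, memory=vlm_features)


def reason_interactions(context_tokens: torch.Tensor,
                        verb_embeddings: torch.Tensor) -> torch.Tensor:
    """Stand-in for the LLM reasoning step: match pooled context tokens
    against text embeddings of candidate verbs (e.g. from the VLM text encoder).
    The paper instead leverages a large language model for this reasoning."""
    pooled = nn.functional.normalize(context_tokens.mean(dim=1), dim=-1)   # (batch, dim)
    verbs = nn.functional.normalize(verb_embeddings, dim=-1)               # (num_verbs, dim)
    return pooled @ verbs.t()                                              # (batch, num_verbs)


if __name__ == "__main__":
    decoder = ContextMiningDecoder()
    fake_vlm_features = torch.randn(2, 196, 512)   # placeholder for frozen VLM features
    fake_verb_texts = torch.randn(117, 512)        # placeholder for verb text embeddings
    context = decoder(fake_vlm_features)
    scores = reason_interactions(context, fake_verb_texts)
    print(scores.shape)  # torch.Size([2, 117])
```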