Video-Language Graph Convolutional Network for Human Action Recognition

DOI: 10.60864/3ztw-x820
Citation Author(s): Rui Zhang, Xiaoran Yan
Submitted by: Rui Zhang
Last updated: 6 June 2024 - 10:32am
Document Type: Presentation Slides
Document Year: 2024
Presenters: Rui Zhang
Paper Code: MMSP-L4.4
 

Transferring vision-language models (VLMs) from the image domain to the video domain has recently yielded great success on human action recognition tasks. However, standard recognition paradigms overlook fine-grained action parsing knowledge that could enhance recognition accuracy. In this paper, we propose a novel method that leverages both coarse-grained and fine-grained knowledge to recognize human actions in videos. Our method consists of a video-language graph convolutional network that integrates and fuses multi-modal knowledge in a progressive manner. We evaluate our method on Kinetics-TPS, a large-scale action parsing dataset, and demonstrate that it outperforms state-of-the-art methods by a significant margin. Moreover, our method achieves better results with less training data and at a competitive computational cost compared with existing methods, showing the effectiveness and efficiency of using fine-grained knowledge for video-based human action recognition.
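The page does not include an implementation, so as a rough illustration only, the sketch below shows a generic graph-convolution fusion step of the kind the abstract describes: video-clip nodes and language nodes (e.g., action-part phrases) share one graph and exchange messages before classification. All names here (GraphConvFusion, the 256-dimensional features, the random stand-in adjacency) are hypothetical placeholders, not the authors' actual architecture.

    import torch
    import torch.nn as nn

    class GraphConvFusion(nn.Module):
        """One graph-convolution step over a joint video-language graph.

        Nodes carry coarse-grained video features and fine-grained
        text (part/phrase) features; the adjacency encodes their relations.
        """
        def __init__(self, dim: int):
            super().__init__()
            self.proj = nn.Linear(dim, dim)
            self.norm = nn.LayerNorm(dim)

        def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # nodes: (N, dim) node features; adj: (N, N) row-normalized adjacency.
            h = adj @ self.proj(nodes)                 # aggregate neighbor messages
            return self.norm(torch.relu(h) + nodes)   # residual update + normalization

    # Hypothetical usage: 4 video-clip nodes plus 6 text-phrase nodes.
    video_feats = torch.randn(4, 256)
    text_feats = torch.randn(6, 256)
    nodes = torch.cat([video_feats, text_feats], dim=0)   # (10, 256) joint graph
    adj = torch.softmax(torch.randn(10, 10), dim=-1)      # stand-in normalized adjacency
    layer = GraphConvFusion(256)
    fused = layer(nodes, adj)                             # (10, 256) fused features

Stacking several such layers, each operating on progressively fused features, is one plausible reading of the "progressive" multi-modal integration the abstract mentions; the actual graph construction and fusion schedule are specified in the paper itself.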
