
DETECTS: Deep Clustering of Temporal Skeletons for Graph-based Segmentation

Citation Author(s):
Submitted by:
Vipul Baghel
Last updated:
18 September 2025 - 5:32am
Document Type:
Supplementary Material
Document Year:
2025
Paper Code:
14907

Unsupervised Temporal Action Localization (UTAL) aims to segment untrimmed videos into semantically coherent actions without using temporal annotations. Existing UTAL methods rely on contrastive pretext tasks or shallow clustering pipelines that decouple representation learning from segmentation, limiting their ability to capture fine-grained temporal transitions. In this work, we propose a unified deep clustering framework for skeleton-based UTAL that formulates motion segmentation as a spatio-temporal graph separation problem in the embedding space. Specifically, we use an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) to encode local 3D skeletal pose sequences, followed by transformer-based adaptive spatial pooling to obtain frame-level embeddings. We then apply DBSCAN clustering over these temporal embeddings, with the neighborhood radius (eps) computed using a geometric ball-covering algorithm to ensure density-aware segmentation. To train the model end-to-end, we introduce a dual-branch objective comprising a self-supervised reconstruction loss and two clustering regularizers based on silhouette score and intra-cluster variance. Our method does not rely on downstream fine-tuning, handcrafted decoders, or pseudo-label propagation. Experiments on the BABEL benchmark demonstrate that our approach outperforms prior state-of-the-art UTAL methods by a large margin, achieving an average mAP of 51.53%. Our framework provides a scalable and annotation-free solution for discovering motion primitives in 3D skeleton videos.
