
DETECTS: Deep Clustering of Temporal Skeletons for Graph-based Segmentation

Citation Author(s):
Submitted by:
Vipul Baghel
Last updated:
18 September 2025 - 5:32am
Document Type:
Supplementary Material
Document Year:
2025
Paper Code:
14907

Unsupervised Temporal Action Localization (UTAL) aims to segment untrimmed videos into semantically coherent actions without using temporal annotations. Existing UTAL methods rely on contrastive pretext tasks or shallow clustering pipelines that decouple representation learning from segmentation, limiting their ability to capture fine-grained temporal transitions. In this work, we propose a unified deep clustering framework for skeleton-based UTAL that formulates motion segmentation as a spatio-temporal graph separation problem in the embedding space. Specifically, we use an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) to encode local 3D skeletal pose sequences, followed by transformer-based adaptive spatial pooling to obtain frame-level embeddings. We then apply DBSCAN clustering over these temporal embeddings, with the neighborhood radius (eps) computed using a geometric ball-covering algorithm to ensure density-aware segmentation. To train the model end-to-end, we introduce a dual-branch objective comprising a self-supervised reconstruction loss and two clustering regularizers based on silhouette score and intra-cluster variance. Our method does not rely on downstream fine-tuning, handcrafted decoders, or pseudo-label propagation. Experiments on the BABEL benchmark demonstrate that our approach outperforms prior state-of-the-art UTAL methods by a large margin, achieving an average mAP of 51.53%. Our framework provides a scalable and annotation-free solution for discovering motion primitives in 3D skeleton videos.
