SweepMM: A High-Quality Multimodal Dataset For Sweeping Robots In Home Scenarios For Vision-Language Model
- DOI: 10.60864/0fza-ms21
- Submitted by: Weichen Xu
- Last updated: 6 June 2024 - 10:28am
- Document Type: Presentation Slides
X-ray security inspection aims to identify restricted items in order to protect public safety. Because unsupervised learning has received little attention in this field, downstream tasks typically rely on models pre-trained on natural images, which leads to suboptimal results. Previous pre-training methods also lose the relative positional relationships between patches, which is detrimental for X-ray images that lack texture and rely on shape. In this paper, we propose the jigsaw-style MAE (J-MAE), which preserves relative position information by shuffling the position encodings of the visible patches. This forces the network to perform semantic reasoning to understand the shape and composition of X-ray objects. We also propose the Incremental Shuffling Module (ISM) and the Permute Predicting Module (PPM) to stabilize training and accelerate convergence. The proposed method consistently outperforms other methods on three downstream X-ray security inspection datasets.
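The position-shuffling step described above can be illustrated with a small sketch. This is not the authors' implementation: it only shows one plausible way to select visible patches and permute the position indices whose encodings they receive. The function name `jigsaw_mask` and the parameters `mask_ratio` and `shuffle_ratio` are hypothetical, and `shuffle_ratio` in particular is an assumption of this sketch rather than something taken from the description of ISM.

```python
import random


def jigsaw_mask(num_patches, mask_ratio=0.75, shuffle_ratio=1.0, seed=None):
    """Select visible patches and shuffle the position indices assigned to them.

    Returns (visible, assigned):
      visible[i]  - true index of the i-th visible patch
      assigned[i] - position index whose encoding that patch would receive
    All names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)

    # MAE-style random masking: keep only a fraction of the patches.
    num_visible = max(1, round(num_patches * (1 - mask_ratio)))
    visible = sorted(rng.sample(range(num_patches), num_visible))

    # Choose which visible slots take part in the shuffle (assumed knob).
    num_shuffled = round(num_visible * shuffle_ratio)
    slots = sorted(rng.sample(range(num_visible), num_shuffled))

    # Permute the position indices among the chosen slots only.
    permuted = slots[:]
    rng.shuffle(permuted)
    assigned = visible[:]
    for dst, src in zip(slots, permuted):
        assigned[dst] = visible[src]

    return visible, assigned


if __name__ == "__main__":
    visible, assigned = jigsaw_mask(num_patches=16, seed=0)
    print("true positions:    ", visible)
    print("assigned positions:", assigned)
```

In this sketch the pair `(visible, assigned)` defines a permutation that a prediction head could be trained to recover, which is the kind of target a module such as PPM might use. How the paper actually defines that target is not stated in the abstract.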