MAP-YNET: LEARNING FROM FOUNDATION MODELS FOR REAL-TIME, MULTI-TASK SCENE PERCEPTION

Submitted by:
Ammar Qammaz
Last updated:
4 February 2025 - 8:26am
Document Type:
Supplementary Material

This page provides video and qualitative supplementary material for ICIP 2025.

We present MAP-YNet, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. In a single network evaluation, MAP-YNet simultaneously predicts depth, surface normals, human pose, and semantic segmentation, and generates multi-label captions. To achieve this, we adopt a multi-teacher, single-student training paradigm, in which task-specific foundation models supervise the network's learning, enabling it to distill their capabilities into an architecture suitable for real-time applications. MAP-YNet exhibits strong generalization, simplicity, and computational efficiency, making it well suited for robotics and other practical scenarios. To support future research, we will release our code publicly.
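The multi-teacher, single-student paradigm described above can be sketched in a few lines. The task names, tensor shapes, and the plain L2 objective below are illustrative assumptions for a minimal example, not the paper's actual teachers or losses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task teacher outputs for one RGB frame (shapes assumed):
# each task-specific foundation model provides a dense supervision target.
teacher_outputs = {
    "depth": rng.random((64, 64)),              # per-pixel depth map
    "normals": rng.random((64, 64, 3)),         # per-pixel surface normals
    "segmentation": rng.random((64, 64, 21)),   # per-pixel class scores
}

# Stand-in for the student's predictions from a single forward pass of the
# shared network (here: the teacher targets plus small noise).
student_outputs = {k: v + 0.01 * rng.standard_normal(v.shape)
                   for k, v in teacher_outputs.items()}

def distillation_loss(student, teachers, weights=None):
    """Weighted sum of per-task mean-squared errors against the teachers."""
    weights = weights or {k: 1.0 for k in teachers}
    return sum(weights[k] * float(np.mean((student[k] - teachers[k]) ** 2))
               for k in teachers)

loss = distillation_loss(student_outputs, teacher_outputs)
print(loss)
```

In practice each teacher would be a frozen pretrained model, the student a single shared backbone with task-specific heads, and the per-task losses would be tailored to each output (e.g. cosine loss for normals, cross-entropy for segmentation); the weighted-sum structure is the core idea.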
