- Immersive Optical-See-Through Augmented Reality (Keynote Talk)
Augmented Reality has been getting ready for the last 20 years, and is finally becoming real, powered by progress in enabling technologies such as graphics, vision, sensors, and displays. In this talk I’ll provide a personal retrospective on my journey, working on all those enablers, getting ready for the coming AR revolution. At Meta, we are working on an immersive optical-see-through AR headset, as well as the full software stack. We’ll discuss the differences of optical vs.
- Retrieval-Augmented Natural Language Reasoning for Explainable Visual Question Answering
The Visual Question Answering with Natural Language Explanation (VQA-NLE) task is challenging due to its high demand for reasoning-based inference. Recent VQA-NLE studies focus on enhancing model networks to amplify the model’s reasoning capability, but this approach is resource-intensive and unstable. In this work, we introduce a new VQA-NLE model, ReRe (Retrieval-augmented natural language Reasoning), which leverages retrieval information from memory to aid in generating accurate answers and persuasive explanations without relying on complex networks or extra datasets.
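As a rough illustration of what "retrieval information from memory" can mean, the sketch below ranks stored examples by cosine similarity to a query feature and returns the closest ones. The abstract does not say how ReRe retrieves or encodes anything, so the similarity measure, the memory layout, and the names `cosine`, `retrieve`, and `memory` are all assumptions made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, memory, k=2):
    """Return the k memory entries whose features are most similar to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query, item["feature"]), reverse=True)
    return ranked[:k]

# Hypothetical memory: each entry pairs a feature vector with a stored
# answer and explanation (toy values, not taken from any dataset).
memory = [
    {"feature": [1.0, 0.0], "answer": "red", "explanation": "the sign is painted red"},
    {"feature": [0.0, 1.0], "answer": "two", "explanation": "there are two dogs on the grass"},
    {"feature": [0.9, 0.1], "answer": "red", "explanation": "the car body is red"},
]

# The retrieved answers and explanations would then be passed to the
# generator as extra context alongside the image and question.
neighbours = retrieve([1.0, 0.05], memory, k=2)
```

In a real system the feature vectors would come from a vision-language encoder and the memory would likely be built from training samples; both choices are outside what the abstract states.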
- FEATURE-CONSTRAINED AND ATTENTION-CONDITIONED DISTILLATION LEARNING FOR VISUAL ANOMALY DETECTION
Visual anomaly detection in computer vision is an essential one-class classification and segmentation problem. The student-teacher (S-T) approach has proven effective in addressing this challenge. However, previous studies based on S-T underutilize the feature representations learned by the teacher network, which restricts anomaly detection performance.
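For readers unfamiliar with the student-teacher idea, a common generic formulation scores each spatial location by how much the student's feature disagrees with the frozen teacher's feature, since the student is trained only on normal data. The sketch below uses cosine distance for that comparison. It is a minimal illustration of the general S-T scheme, not the feature-constrained, attention-conditioned method of this paper; the function names and the toy feature grids are assumptions.

```python
import math

def cosine_distance(t, s):
    """1 - cosine similarity between a teacher and a student feature vector."""
    dot = sum(a * b for a, b in zip(t, s))
    norm_t = math.sqrt(sum(a * a for a in t))
    norm_s = math.sqrt(sum(b * b for b in s))
    if norm_t == 0 or norm_s == 0:
        return 1.0  # treat a degenerate feature as maximally dissimilar
    return 1.0 - dot / (norm_t * norm_s)

def anomaly_map(teacher_feats, student_feats):
    """Per-location anomaly score for two H x W grids of feature vectors."""
    return [
        [cosine_distance(t, s) for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(teacher_feats, student_feats)
    ]

def image_score(amap):
    """Image-level score: the largest per-location discrepancy."""
    return max(max(row) for row in amap)

# Toy 2 x 2 feature grids: the student matches the teacher everywhere
# except the bottom-right location, which stands in for an anomalous region.
teacher = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 0.0]]]
student = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 1.0]]]

amap = anomaly_map(teacher, student)
score = image_score(amap)
```

The per-location map serves the segmentation side of the problem, and the maximum over the map gives a simple image-level decision score for the classification side.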
- MULTI-MODALITY ACTION RECOGNITION BASED ON DUAL FEATURE SHIFT IN VEHICLE CABIN MONITORING
Driver Action Recognition (DAR) is crucial in vehicle cabin monitoring systems. In real-world applications, it is common for vehicle cabins to be equipped with cameras featuring different modalities. However, multi-modality fusion strategies for the DAR task within car cabins have rarely been studied. In this paper, we propose a novel yet efficient multi-modality driver action recognition method based on dual feature shift, named DFS. DFS first integrates complementary features across modalities by performing modality feature interaction.
- MULTILINGUAL AUDIO-VISUAL SPEECH RECOGNITION WITH HYBRID CTC/RNN-T FAST CONFORMER
Humans are adept at leveraging visual cues from lip movements to recognize speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy conditions. In this work, we present a multilingual AVSR model incorporating several enhancements to improve performance and audio noise robustness. Notably, we adapt the recently proposed Fast Conformer model to process both audio and visual modalities using a novel hybrid CTC/RNN-T architecture.
- AUDIO-VISUAL ACTIVE SPEAKER EXTRACTION FOR SPARSELY OVERLAPPED MULTI-TALKER SPEECH
Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario accounts for only a small percentage of real-world conversations. In this paper, we target sparsely overlapped scenarios, in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech.
- HMNet: Hierarchical Microscale-aware Network for Infrared Small Target Detection
Compared with detection in natural images, infrared target detection faces more challenges because the targets are extremely small and low-contrast, especially when obscured by clutter and noise. Traditional solutions are susceptible to noise interference, which yields suboptimal results lacking contour and texture details. Meanwhile, due to the spatial invariance of convolutional layers, most deep learning-based methods locate small targets loosely during feature extraction, leading to serious omissions.
- Robust Lightweight Depth Estimation Model via Data-free Distillation
Existing Monocular Depth Estimation (MDE) methods often use large and complex neural networks. Although these methods perform well, we focus on efficiency and generalization for practical applications with limited resources. In our paper, we present an efficient transformer-based monocular relative depth estimation network and train it with a diverse depth dataset to obtain good generalization performance.