Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture, as specified by an auxiliary reference. Most studies focus on the scenario in which the target speech heavily overlaps with the interfering speech. However, this scenario accounts for only a small fraction of real-world conversations. In this paper, we address sparsely overlapped scenarios, in which the auxiliary reference must perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech.
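
The excerpt does not describe how the two roles of the reference are realized. As a rough illustration of the general pattern only (not this paper's architecture), the sketch below conditions a separator on an enrollment speaker embedding and lets a per-frame activity head gate the extracted speech; all module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SparseTSESketch(nn.Module):
    """Sketch: a reference speaker embedding conditions both a frame-level
    target-activity detector and an extraction mask; the detected activity
    gates the output so that silence is produced when the target speaker is
    inactive. Illustrative only; not the paper's architecture."""

    def __init__(self, n_freq=257, spk_dim=128, hidden=256):
        super().__init__()
        self.mix_enc = nn.LSTM(n_freq, hidden, batch_first=True, bidirectional=True)
        self.fuse = nn.Linear(2 * hidden + spk_dim, hidden)
        self.mask_head = nn.Linear(hidden, n_freq)   # per-TF-bin extraction mask
        self.vad_head = nn.Linear(hidden, 1)         # per-frame target activity

    def forward(self, mix_mag, spk_emb):
        # mix_mag: (B, T, F) mixture magnitude; spk_emb: (B, spk_dim) from enrollment.
        h, _ = self.mix_enc(mix_mag)                              # (B, T, 2*hidden)
        spk = spk_emb.unsqueeze(1).expand(-1, h.size(1), -1)      # (B, T, spk_dim)
        z = torch.tanh(self.fuse(torch.cat([h, spk], dim=-1)))    # (B, T, hidden)
        mask = torch.sigmoid(self.mask_head(z))                   # (B, T, F)
        activity = torch.sigmoid(self.vad_head(z))                # (B, T, 1)
        return mix_mag * mask * activity, activity.squeeze(-1)
```

In a sparsely overlapped conversation, it is the activity gate that allows the model to output silence over long stretches where the target speaker says nothing, rather than leaking interfering speech.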

Humans are adept at leveraging visual cues from lip movements to recognize speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy conditions. In this work, we present a multilingual AVSR model incorporating several enhancements that improve performance and robustness to audio noise. Notably, we adapt the recently proposed Fast Conformer model to process both audio and visual modalities using a novel hybrid CTC/RNN-T architecture.
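
The excerpt only names the hybrid CTC/RNN-T design. Below is a minimal PyTorch sketch of that general pattern: a shared encoder over fused audio and visual features feeding both a CTC head and a transducer prediction/joint network. It is not the Fast Conformer implementation; the fusion scheme, module names, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class HybridAVSketch(nn.Module):
    """Sketch of a hybrid CTC/RNN-T audio-visual model: fuse the two
    modalities, encode once, decode with both heads. Illustrative only."""

    def __init__(self, audio_dim=80, video_dim=512, d_model=256, vocab=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Stand-in for a (Fast) Conformer stack: a plain Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # CTC branch: frame-wise token posteriors.
        self.ctc_head = nn.Linear(d_model, vocab)
        # RNN-T branch: prediction network over previous tokens plus joint net.
        self.embed = nn.Embedding(vocab, d_model)
        self.pred_net = nn.LSTM(d_model, d_model, batch_first=True)
        self.joint = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Tanh(),
                                   nn.Linear(d_model, vocab))

    def forward(self, audio, video, prev_tokens):
        # audio: (B, T, audio_dim); video: (B, T, video_dim), frame-aligned.
        fused = self.audio_proj(audio) + self.video_proj(video)
        enc = self.encoder(fused)                          # (B, T, d_model)
        ctc_logits = self.ctc_head(enc)                    # (B, T, vocab)
        pred, _ = self.pred_net(self.embed(prev_tokens))   # (B, U, d_model)
        # Joint network combines every encoder frame with every label step.
        joint_in = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
             pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1)
        rnnt_logits = self.joint(joint_in)                 # (B, T, U, vocab)
        return ctc_logits, rnnt_logits
```

Training such a model would typically combine nn.CTCLoss on the CTC logits with a transducer loss (e.g., torchaudio's rnnt_loss) on the joint-network logits, weighted against each other.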

Compared to detection in natural images, infrared target detection faces greater challenges because targets are extremely small and low-contrast, especially when obscured by clutter and noise. Traditional solutions are susceptible to noise interference, yielding suboptimal results that lack contour and texture detail. Meanwhile, owing to the spatial invariance of convolutional layers, most deep learning-based methods localize small targets only loosely during feature extraction, leading to serious omissions.

Existing Monocular Depth Estimation (MDE) methods often rely on large and complex neural networks. Despite the strong performance of these methods, we focus on efficiency and generalization for practical applications with limited resources. In this paper, we present an efficient transformer-based monocular relative depth estimation network and train it with a diverse depth dataset to obtain good generalization performance.
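
The excerpt does not spell out the training objective. When mixing diverse depth data to learn relative (scale-free) depth, a common choice, used for example in MiDaS-style training, is a scale-and-shift-invariant loss; the sketch below shows that generic loss, which is not necessarily the one used in this paper.

```python
import torch

def scale_shift_invariant_loss(pred, target, mask):
    """Align the prediction to the target with a per-image least-squares
    scale and shift, then take a masked L1 error. pred/target/mask: (B, H, W);
    mask marks pixels with valid ground-truth depth. Illustrative only."""
    b = pred.size(0)
    p = pred.view(b, -1)
    t = target.view(b, -1)
    m = mask.view(b, -1).float()
    n = m.sum(dim=1).clamp(min=1.0)
    # Closed-form least squares for s, o minimizing ||s*p + o - t||^2 on valid pixels.
    mp = (m * p).sum(dim=1) / n
    mt = (m * t).sum(dim=1) / n
    var = (m * (p - mp[:, None]) ** 2).sum(dim=1) / n
    cov = (m * (p - mp[:, None]) * (t - mt[:, None])).sum(dim=1) / n
    s = cov / var.clamp(min=1e-6)
    o = mt - s * mp
    aligned = s[:, None] * p + o[:, None]
    return ((m * (aligned - t).abs()).sum(dim=1) / n).mean()
```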

Video summarization involves creating a succinct overview by merging the valuable parts of a video. Existing video summarization methods approach this task as a keyframe-selection problem, using frame- and shot-level techniques with unimodal or bimodal information. Besides underestimating the inter-relations between different configurations of modality embedding spaces, current methods are also limited in their ability to maintain the semantic integrity of a summary segment. To address these issues,

Recently, many algorithms have employed image-adaptive lookup tables (LUTs) to achieve real-time image enhancement. Nonetheless, existing methods have typically formed image-adaptive LUTs as linear combinations of basic LUTs, which limits their generalization ability. To address this limitation, we propose AttentionLut, a novel framework for real-time image enhancement that uses the attention mechanism to generate image-adaptive LUTs. The proposed framework consists of three lightweight modules.
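
The generation modules themselves are not described in this excerpt. As background on LUT-based enhancement: however the adaptive LUT is produced, applying it to an image is a single trilinear lookup, which is what makes these methods real-time. Below is a minimal sketch of that lookup using grid_sample; the LUT layout and tensor shapes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def apply_3d_lut(img, lut):
    """Apply an image-adaptive 3D LUT with trilinear interpolation.
    img: (B, 3, H, W) RGB in [0, 1]; lut: (B, 3, S, S, S), where entry
    [:, :, b, g, r] holds the output color for input (r, g, b).
    Illustrative sketch only."""
    b, _, h, w = img.shape
    # grid_sample expects coordinates in [-1, 1], ordered (x, y, z) = (r, g, b).
    grid = img.permute(0, 2, 3, 1) * 2.0 - 1.0      # (B, H, W, 3)
    grid = grid.view(b, 1, h, w, 3)                 # (B, D_out=1, H, W, 3)
    out = F.grid_sample(lut, grid, mode='bilinear',
                        padding_mode='border', align_corners=True)
    return out.view(b, 3, h, w)
```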

Recurrently refining the optical flow based on a single high-resolution feature has demonstrated high performance. We exploit the strength of this strategy to build a novel architecture for jointly learning optical flow and depth. We further adapt the architecture to training on unlabeled data, which is extremely challenging. The loss is computed over the iterations carried out on the single high-resolution feature, where the reconstruction loss fails to optimize accuracy, particularly in occluded regions.
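
In generic unsupervised flow learning, the reconstruction loss referred to here is a photometric warping loss; the sketch below shows one common formulation with an occlusion mask, which also illustrates why occluded pixels cannot be supervised this way (their true correspondence is missing from the other frame). This is a generic sketch, not the paper's exact objective; the mask and the Charbonnier penalty are assumptions.

```python
import torch
import torch.nn.functional as F

def photometric_loss(img1, img2, flow, valid_mask):
    """Warp img2 back to img1 with the predicted flow and penalize the
    photometric difference on pixels the mask marks as non-occluded.
    img1/img2: (B, 3, H, W); flow: (B, 2, H, W); valid_mask: (B, 1, H, W)."""
    b, _, h, w = img1.shape
    # Build a pixel-coordinate grid and displace it by the flow.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=flow.device, dtype=flow.dtype),
        torch.arange(w, device=flow.device, dtype=flow.dtype),
        indexing='ij')
    x_new = xs[None] + flow[:, 0]                       # (B, H, W)
    y_new = ys[None] + flow[:, 1]
    # Normalize coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([2 * x_new / (w - 1) - 1,
                        2 * y_new / (h - 1) - 1], dim=-1)   # (B, H, W, 2)
    warped = F.grid_sample(img2, grid, mode='bilinear',
                           padding_mode='border', align_corners=True)
    # Charbonnier penalty, averaged over non-occluded pixels only.
    diff = torch.sqrt((img1 - warped) ** 2 + 1e-6)
    mask = valid_mask.float()
    return (diff * mask).sum() / (mask.sum() * img1.size(1) + 1e-6)
```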

Deploying style transfer methods on resource-constrained devices is challenging, which limits their real-world applicability. To tackle this issue, we propose using pruning techniques to accelerate various visual style transfer methods. We argue that typical pruning methods may not be well suited to style transfer and present an iterative correlation-based channel pruning (ICCP) strategy for encoder-transform-decoder-based image/video style transfer models.
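
ICCP is only named in this excerpt, not defined. The sketch below illustrates just the generic idea of a correlation-based redundancy criterion: channels whose activations are highly correlated with already-kept channels are marked as prunable. The threshold, the greedy ordering, and the function name are assumptions, not the authors' procedure.

```python
import torch

def correlation_redundant_channels(feats, threshold=0.9):
    """Sketch of correlation-based channel selection. feats: (B, C, H, W)
    feature maps collected from one layer. Greedily keep channels, and mark
    a channel as prunable if its activation correlates strongly with a
    channel already kept. Illustrative only."""
    b, c, h, w = feats.shape
    # Flatten each channel's responses across batch and space, then z-score.
    x = feats.permute(1, 0, 2, 3).reshape(c, -1)
    x = (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + 1e-6)
    corr = (x @ x.t()) / x.size(1)           # (C, C) Pearson correlations
    keep, prune = [], []
    for i in range(c):
        if any(abs(corr[i, j].item()) > threshold for j in keep):
            prune.append(i)                  # redundant with a kept channel
        else:
            keep.append(i)
    return keep, prune
```

In an encoder-transform-decoder stylization network, the prunable channels would then be removed from the producing convolution and from the input side of the following layer, followed by fine-tuning; repeating this selection layer by layer gives an iterative procedure in the spirit of the name, though the paper's exact algorithm may differ.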

In weakly supervised video anomaly detection, it has been verified that anomaly predictions can be biased by background noise. Previous works attempted to focus on local regions to exclude irrelevant information. However, abnormal events in different scenes vary in size, and current methods struggle to consider local events of different scales concurrently. To this end, we propose a multi-scale integrated perception
