Cross modal compression (CMC) was recently proposed to compress highly redundant visual data into a compact, common, human-comprehensible domain (such as text) that preserves semantic fidelity for semantic-related applications. However, CMC achieves only a certain level of semantic fidelity at a constant rate, and the model optimizes the probability of the ground truth text rather than semantic fidelity directly. To tackle these problems, we propose a novel scheme named rate-distortion optimized CMC (RDO-CMC).

Rate-distortion (RD) theory is fundamental to lossy image compression: it addresses compressing the original images to a specified bitrate with minimal signal distortion, an essential metric in practical applications. Moreover, with the development of visual analysis applications (such as classification, detection, and segmentation), the semantic distortion in compressed images is also an important dimension in the theoretical analysis of lossy image compression.
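
For reference, the trade-off described in this abstract is commonly written as a constrained minimization and its Lagrangian relaxation. This is a generic textbook sketch, not the formulation from the paper itself; the symbols $x$, $\hat{x}$, $d$, $R$, $R_{\text{target}}$, and $\lambda$ are assumptions introduced here for illustration.

```latex
% Constrained form: minimize expected signal distortion at a target bitrate.
\min_{\text{codec}} \; \mathbb{E}\big[ d(x, \hat{x}) \big]
\quad \text{subject to} \quad R \le R_{\text{target}}

% Lagrangian form: \lambda \ge 0 trades rate against distortion.
J = \mathbb{E}\big[ d(x, \hat{x}) \big] + \lambda R
```

Here $x$ is the original image, $\hat{x}$ its reconstruction, $d$ a signal distortion measure such as squared error, and $R$ the bitrate. A semantic distortion term could in principle be added to $J$ with its own weight, but how the paper does so is not stated in the abstract.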

In this paper, we propose an approach for learning binary hash codes
for image retrieval. Canonical Correlation Analysis (CCA) is used
to design two loss functions for training a neural network such that
the correlation between the two views given to CCA is maximized. The
main motivation for using CCA for feature space learning is that
dimensionality reduction is possible and short binary codes can
be generated. The first loss maximizes the correlation between the
hash centers and the learned hash codes. The second loss maximizes

Automatic song writing (ASW) typically involves four tasks: lyric-to-lyric generation, melody-to-melody generation, lyric-to-melody generation, and melody-to-lyric generation.
Previous works have mainly focused on individual tasks without considering the correlation between them, and thus a unified framework to solve all four tasks has not yet been explored.

Incorporating visual information is a promising approach to improving the performance of speech separation. Many related works have reported inspiring results. However, low-quality videos are common in real scenarios and may significantly degrade the performance of conventional audio-visual speech separation systems. In this paper, we propose a new structure for fusing audio and visual features, which uses the audio feature to select relevant visual features through an attention mechanism.
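
The idea of letting an audio feature select relevant visual features can be illustrated with plain scaled dot-product attention. The sketch below is only a toy illustration under stated assumptions: the function names (`softmax`, `dot`, `attend`), the feature dimensions, and the example vectors are all hypothetical, and nothing here reproduces the architecture proposed in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(audio_query, visual_feats):
    """Use one audio feature vector as the query over visual feature vectors.

    Returns (weights, fused), where fused is the attention-weighted sum of
    the visual features. Hypothetical helper, not the paper's method.
    """
    scale = math.sqrt(len(audio_query))
    scores = [dot(audio_query, v) / scale for v in visual_feats]
    weights = softmax(scores)
    dim = len(visual_feats[0])
    fused = [
        sum(w * v[i] for w, v in zip(weights, visual_feats))
        for i in range(dim)
    ]
    return weights, fused


# Toy example: the first visual feature aligns with the audio query,
# the second is orthogonal to it, and the third is a weak, uninformative one.
audio = [1.0, 0.0]
visual = [[4.0, 0.0], [0.0, 4.0], [0.1, 0.1]]
weights, fused = attend(audio, visual)
```

In this toy setup the visual feature most aligned with the audio query receives the largest weight, so poorly matching visual features contribute less to the fused representation.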

This paper presents a novel method based on a time-lag aware multi-modal variational autoencoder for prediction of important scenes (Tl-MVAE-PIS) using baseball videos and tweets posted on Twitter. The paper makes two technical contributions. First, to effectively use heterogeneous data for the prediction of important scenes, we transform textual, visual and audio features obtained from tweets and videos into latent features. Tl-MVAE-PIS can then flexibly express the relationships between them in the constructed latent space.


