
The movement of the tongue plays an important role in pronunciation. Visualizing tongue movement can improve speech intelligibility and also helps in learning a second language. However, hardly any research has investigated this topic. In this paper, a framework to synthesize continuous ultrasound tongue movement video from speech is presented. Two different mapping methods are introduced as the most important parts of the framework.


Image inpainting consists in filling missing regions of an image by inferring from the surrounding content.
In the case of texture images, inpainting can be formulated in terms of conditional simulation of a stochastic texture model.
Many texture synthesis methods thus have been adapted to texture inpainting, but these methods do not offer theoretical guarantees since the conditional sampling is in general only approximate.
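
To make the idea of conditional simulation concrete, the following is a minimal sketch for the special case of a zero-mean Gaussian model on a short 1-D signal. The Gaussian assumption, the exponential covariance, and all function names are illustrative choices, not taken from the abstract. In this special case conditional sampling can be done exactly: draw an unconditional sample, then correct it so that it agrees with the known values (a kriging-style correction).

```python
import math
import random


def exp_cov(n, length_scale):
    """Hypothetical covariance model: C[i][j] = exp(-|i - j| / length_scale)."""
    return [[math.exp(-abs(i - j) / length_scale) for j in range(n)] for i in range(n)]


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_spd(a, b):
    """Solve a x = b for symmetric positive definite a using its Cholesky factor."""
    low = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution with the transposed factor
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def conditional_sample(cov, known, rng):
    """Sample a zero-mean Gaussian vector conditioned on known = {index: value}."""
    n = len(cov)
    idx = sorted(known)
    # Unconditional sample z = L g, with g standard normal.
    low = cholesky(cov)
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = [sum(low[i][k] * g[k] for k in range(i + 1)) for i in range(n)]
    # Correction: solve C_kk w = (x_known - z_known), then add C[:, k] w to z.
    cov_kk = [[cov[i][j] for j in idx] for i in idx]
    resid = [known[i] - z[i] for i in idx]
    w = solve_spd(cov_kk, resid)
    return [z[i] + sum(cov[i][j] * wj for j, wj in zip(idx, w)) for i in range(n)]


if __name__ == "__main__":
    cov = exp_cov(8, 3.0)
    filled = conditional_sample(cov, {0: 1.0, 7: -0.5}, random.Random(0))
    print([round(v, 3) for v in filled])
```

The returned vector reproduces the known values at the given indices and fills the remaining positions with values that follow the conditional distribution of the assumed model.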


Previous works on actor identification mainly focused on static
features based on face identification and costume detection,
without considering the abundant dynamic information contained
in videos. In this paper, we propose a novel method
to mine representative actions of each actor, and show the remarkable
power of such actions for the actor identification task.
Videos are first divided into shots and represented by bag-of-words (BoW)
based on spatial-temporal features. Then we integrate the prototype
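
As a rough illustration of the BoW representation of a shot, the sketch below assigns each local descriptor to its nearest codeword and builds a normalized histogram. The codebook, the descriptors, and the function name are hypothetical; the abstract does not specify how the vocabulary is built or which distance is used, so squared Euclidean distance is assumed here.

```python
def bow_histogram(descriptors, codebook):
    """Normalized bag-of-words histogram over a fixed codebook.

    descriptors: list of feature vectors extracted from one shot (assumed given)
    codebook: list of codeword vectors (assumed learned beforehand, e.g. by clustering)
    """
    counts = [0] * len(codebook)
    for d in descriptors:
        # Index of the codeword with the smallest squared Euclidean distance to d.
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, codebook[k])),
        )
        counts[nearest] += 1
    total = sum(counts)
    # An empty shot yields an all-zero histogram rather than a division by zero.
    return [c / total for c in counts] if total else [0.0] * len(codebook)


if __name__ == "__main__":
    codebook = [(0.0, 0.0), (1.0, 1.0)]
    shot = [(0.1, 0.0), (0.9, 1.0), (1.0, 1.2), (0.0, 0.2)]
    print(bow_histogram(shot, codebook))
```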


A latent style model describing manga styles based on the proposed manga-specific features is constructed to facilitate novel style-based applications. Two manga-specific features, i.e., screentone features showing texture and shade, and panel features showing panel arrangement, are first proposed to describe manga pages. Based on the latent Dirichlet allocation technique, we discover latent style elements embedded in manga documents, which are described by visual words derived from manga-specific features.


An automatic news story clustering system is presented to facilitate efficient news browsing and summarization. We describe news content by considering both what objects appear and how these objects move in news stories. With Fisher embedding, we respectively encode local features, semantic features, and dense trajectories as Fisher vectors, based on which similarity between news stories can be well evaluated and thus better clustering performance can be obtained.
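
For readers unfamiliar with Fisher vectors, here is a minimal sketch of the mean-gradient part of the encoding under a diagonal-covariance Gaussian mixture model. The GMM parameters are assumed to be given, the function name is hypothetical, and the variance gradients and normalization steps that are common in practice are omitted; the abstract does not say which variant the system uses.

```python
import math


def fisher_vector_means(descriptors, weights, means, sigmas):
    """Mean-gradient part of a Fisher vector for a diagonal-covariance GMM.

    descriptors: non-empty list of D-dimensional local features
    weights: K mixture weights; means, sigmas: K x D means and standard deviations
    Returns a flat list of length K * D.
    """
    num_k, dim, num_t = len(weights), len(means[0]), len(descriptors)
    acc = [[0.0] * dim for _ in range(num_k)]
    for x in descriptors:
        # log(w_k * N(x | mu_k, diag(sigma_k^2))) for each component k.
        logp = []
        for w, mu, sd in zip(weights, means, sigmas):
            ll = math.log(w)
            for xi, m, s in zip(x, mu, sd):
                ll += -0.5 * ((xi - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
            logp.append(ll)
        # Posterior (soft assignment) of each component, computed stably.
        top = max(logp)
        unnorm = [math.exp(v - top) for v in logp]
        total = sum(unnorm)
        for k in range(num_k):
            gamma = unnorm[k] / total
            for d in range(dim):
                acc[k][d] += gamma * (x[d] - means[k][d]) / sigmas[k][d]
    # Scale by 1 / (T * sqrt(w_k)) and flatten component by component.
    return [v / (num_t * math.sqrt(weights[k])) for k in range(num_k) for v in acc[k]]


if __name__ == "__main__":
    fv = fisher_vector_means(
        [[0.2, -0.1], [1.1, 0.9]],
        weights=[0.5, 0.5],
        means=[[0.0, 0.0], [1.0, 1.0]],
        sigmas=[[1.0, 1.0], [1.0, 1.0]],
    )
    print(len(fv), fv)
```

Two stories encoded this way can then be compared with any vector similarity, for example a dot product or cosine similarity, before clustering.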


An elapsed facial emotion involves changes of facial contour due to the motions (such as contraction or stretch) of facial muscles located at the eyes, nose, lips, etc. Thus, important information such as the corners of facial contours located in various regions of the face is crucial to the recognition of facial expressions, and this is even more apparent for micro-expressions. In this paper, we propose the first known notion of employing intrinsic two-dimensional (i2D) local structures to represent these features for micro-expression recognition.


It is a great challenge to perform high-level recognition tasks on videos that are poor in quality. In this paper, we propose a new spatio-temporal mid-level (STEM) feature bank for recognizing human actions in low-quality videos. The feature bank comprises a trio of local spatio-temporal features, i.e., shape, motion and texture, which respectively encode structural, dynamic and statistical information in video. These features are encoded into mid-level representations and aggregated to construct STEM.


In this paper, we present Discriminant Correlation Analysis (DCA), a feature-level fusion technique that incorporates the class associations in correlation analysis of the feature sets. DCA performs an effective feature fusion by maximizing the pair-wise correlations across the two feature sets while, at the same time, eliminating the between-class correlations and restricting the correlations to be within classes.

