Speech quality assessment is crucial for many applications, but current intrusive methods cannot be used in real-world environments. Data-driven approaches have been proposed, but they rely on simulated speech materials or estimate only objective scores. In this paper, we propose a novel multi-task non-intrusive approach capable of simultaneously estimating both subjective and objective scores of real-world speech, to facilitate learning. This approach enhances our prior work, which estimated subjective mean-opinion scores, where our


Many objective video quality assessment (VQA) algorithms include a key step of temporal pooling of frame-level quality scores. However, less attention has been paid to studying the relative effectiveness of different pooling methods for no-reference (blind) VQA. Here we conduct a large-scale comparative evaluation to assess the capabilities and limitations of multiple temporal pooling strategies on blind VQA of user-generated videos. The study yields insights and general guidance regarding the application and selection of temporal pooling models.
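To make the idea of temporal pooling concrete, the following is a minimal sketch of three pooling strategies commonly discussed in the VQA literature (simple mean, harmonic mean, and worst-percentile pooling). The function `pool_scores` and its parameters are illustrative assumptions, not the specific models evaluated in the study.

```python
import numpy as np

def pool_scores(frame_scores, method="mean", p=20):
    """Pool per-frame quality scores into one video-level score.

    Illustrative sketch of common temporal pooling strategies;
    not the exact pooling models compared in the paper.
    """
    s = np.asarray(frame_scores, dtype=float)
    if method == "mean":
        # simple average: treats all frames equally
        return s.mean()
    if method == "harmonic":
        # harmonic mean: penalizes low-quality frames more heavily
        return len(s) / np.sum(1.0 / np.maximum(s, 1e-8))
    if method == "percentile":
        # worst-percentile pooling: average only the worst p% of frames,
        # reflecting the tendency of viewers to weight bad moments heavily
        thresh = np.percentile(s, p)
        return s[s <= thresh].mean()
    raise ValueError(f"unknown pooling method: {method}")
```

The three strategies diverge most on videos with brief quality drops: mean pooling dilutes them, while harmonic and percentile pooling keep them visible in the final score.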


The banding artifact, or false contouring, is a common video compression impairment that tends to appear in large, flat regions of encoded videos. These staircase-shaped color bands can be very noticeable in high-definition videos. Here we study this artifact and propose a new distortion-specific no-reference video quality model for predicting banding artifacts, called the Blind BANding Detector (BBAND index). BBAND is inspired by models of human vision. The proposed detector can generate a pixel-wise banding visibility map and output a banding severity score at both the frame and video levels.
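As a rough intuition for what a pixel-wise banding visibility map captures, the toy sketch below flags pixels that sit in near-flat regions yet border a small luminance step, which is the signature of staircase-shaped false contours. This is NOT the BBAND index; the function name and the `flat_thresh` parameter are assumptions made purely for illustration.

```python
import numpy as np

def banding_visibility_map(luma, flat_thresh=2.0):
    """Toy banding map (not BBAND): mark weak luminance steps.

    Banding edges are faint but nonzero gradients; strong gradients
    belong to real content edges and are excluded.
    """
    gy, gx = np.gradient(luma.astype(float))
    grad = np.hypot(gx, gy)
    # keep only weak steps: gradient nonzero but below the flatness threshold
    return ((grad > 0) & (grad <= flat_thresh)).astype(float)
```

A real detector such as BBAND additionally models visual masking and aggregates the map into frame- and video-level severity scores.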


In this work, we introduce the No-reference Autoencoder VidEo (NAVE) quality metric, which is based on a deep autoencoder machine learning technique. The metric uses a set of spatial and temporal features to estimate the overall visual quality, taking advantage of the autoencoder's ability to produce a better and more compact set of features. NAVE was tested on two databases: the UnB-AVQ database and the LiveNetflix-II database.


In practical media distribution systems, visual content often undergoes multiple stages of quality degradations along the delivery chain between the source and destination. By contrast, current image quality assessment (IQA) models are typically validated on image databases with a single distortion stage. In this work, we construct two large-scale image databases that are composed of more than 2 million images undergoing multiple stages of distortions and examine how state-of-the-art IQA algorithms behave over distortion stages.
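To illustrate what a multi-stage degradation chain looks like, the following is a minimal sketch in which an image passes through two successive distortion stages (a box blur standing in for resampling, then coarse quantization standing in for compression). The helper functions and parameters are hypothetical; the paper's databases use their own distortion types and stage configurations.

```python
import numpy as np

def blur(img, k=3):
    # stage 1: simple box blur as a stand-in for one processing stage
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def quantize(img, step=16):
    # stage 2: coarse quantization as a stand-in for lossy compression
    return np.round(img / step) * step

# a two-stage chain: source -> blur -> quantize
# (IQA models validated on single-stage data may misjudge such outputs)
```

The point of chaining is that later stages compound or mask earlier artifacts, which is exactly the behavior single-distortion IQA databases cannot probe.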


High Dynamic Range (HDR) Wide Color Gamut (WCG) Ultra High Definition (4K/UHD) content has recently become increasingly popular. Due to the increased data rate, novel video compression methods have been developed to maintain the quality of the videos delivered to consumers under bandwidth constraints. This has created new challenges for the development of objective Video Quality Assessment (VQA) models, which have traditionally been designed without sufficient calibration and validation against subjective quality assessments of UHD-HDR-WCG videos.


Reliably predicting where people look in images and videos remains challenging and requires substantial eye-tracking data to be collected and analysed for various applications. In this paper, we present an eye-tracking study in which twenty-eight participants viewed forty still scenes of video advertising. First, we analyse human attentional behaviour based on the gaze data. Then, we evaluate to what extent a machine (a saliency model) can predict human behaviour. Experimental results show that a significant gap remains between human and machine performance in visual saliency prediction.


Our previous study has shown that image distortions cause saliency distraction, and that visual saliency of a distorted image differs from that of its distortion-free reference. Being able to measure such distortion-induced saliency variation (DSV) significantly benefits algorithms for automated image quality assessment. Methods of quantifying DSV, however, remain unexplored due to the lack of a benchmark. In this paper, we build a benchmark for the measurement of DSV through a subjective study.


Existing blind evaluators for screen content images (SCIs) are mainly learning-based and require a number of training images with co-registered human opinion scores. However, the size of existing databases is small, and it is labor-intensive, time-consuming, and expensive to generate human opinion scores at scale. In this study, we propose a novel blind quality evaluator that requires no training.


Facial attractiveness prediction has drawn considerable attention from the image processing community.
Despite the substantial progress achieved by existing works, various challenges remain.
One is the lack of an accurate representation of facial composition, which is essential for attractiveness evaluation. In this paper, we propose to use pixel-wise labelling masks as meta information about facial composition and feed them into a network to learn high-level semantic representations.