Sorry, you need to enable JavaScript to visit this website.

One of the main limitations in the field of audio signal processing is the lack of large public datasets with audio representations and high-quality annotations due to restrictions of copyrighted commercial music. We present Melon Playlist Dataset, a public dataset of mel-spectrograms for 649,091 tracks and 148,826 associated playlists annotated by 30,652 different tags. All the data is gathered from Melon, a popular Korean streaming service. The dataset is suitable for music information retrieval tasks, in particular, auto-tagging and automatic playlist continuation.

Categories:
2 Views

For omnidirectional videos (ODVs), the existing o-line coding approaches are designed based on the spatial or perceptual distortion in a whole ODV frame, ignoring the fact that subjects can only access viewports. To improve the subjective quality inside the viewports, this paper proposes an o-line viewport-adaptive rate control (RC) approach for ODVs in high eciency video coding (HEVC) framework. Specically, we predict the viewport candidates with importance weights and develop a viewport saliency detection model.

Categories:
36 Views

While the next generation video compression standard, Versatile Video Coding (VVC), provides a superior compression efficiency, its computational complexity dramatically increases. This paper thoroughly analyzes this complexity for both encoder and decoder of VVC Test Model 6, by quantifying the complexity break-down for each coding tool and measuring the complexity and memory requirements for VVC encoding/decoding.

Categories:
49 Views

In this work, we focus on quantifying speaker identity information encoded in the head gestures of speakers, while they narrate a story. We hypothesize that the head gestures over a long duration have speaker-specific patterns. To establish this, we consider a classification problem to identify speakers from head gestures. We represent every head orientation as a triplet of Euler angles and a sequence of head orientations as head gestures.

Categories:
34 Views

Generating accurate ground truth representations of human subjective experiences and judgements is essential for advancing our understanding of human-centered constructs such as emotions. Often, this requires the collection and fusion of annotations from several people where each one is subject to valuation disagreements, distraction artifacts, and other error sources.

Categories:
21 Views

Segmenting a document image into text-lines and words finds applications in many research areas of DIA(Document Image Analysis) such as OCR, Word Spotting, and document retrieval. However, carrying out segmentation operation directly in the compressed document images is still an unexplored and challenging research area. Since JPEG is most widely accepted compression algorithm, this research paper attempts to segment a JPEG compressed printed text document image into text-lines and words, without fully decompressing the image.

Categories:
50 Views

Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting and zooming of a video conferencing camera: users subjectively rate an expert video cinematographer’s video significantly higher than unedited video. We describe a new automated ASD and VC that performs within 0.3 MOS of an expert cinematographer based on subjective ratings with a 1-5 scale.

Categories:
23 Views

Pages