
One of the main limitations in the field of audio signal processing is the lack of large public datasets with audio representations and high-quality annotations, due to copyright restrictions on commercial music. We present the Melon Playlist Dataset, a public dataset of mel-spectrograms for 649,091 tracks and 148,826 associated playlists annotated with 30,652 different tags. All the data is gathered from Melon, a popular Korean streaming service. The dataset is suitable for music information retrieval tasks, in particular auto-tagging and automatic playlist continuation.
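The dataset ships mel-spectrograms rather than raw audio. As background, a minimal numpy-only sketch of how such a representation is derived (the frame size, hop, and mel-band count here are illustrative and are not the dataset's actual parameters):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(y, sr=16000, n_fft=512, hop=256, n_mels=48):
    """Log-mel-spectrogram of a mono signal, shape (n_mels, n_frames)."""
    # Frame the signal and take the power spectrum of each window.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft//2+1)

    # Triangular mel filterbank between 0 Hz and Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    return np.log(spec @ fb.T + 1e-10).T  # (n_mels, n_frames)

# Example: one second of a 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
S = mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr=sr)
print(S.shape)  # (48, 61)
```

In practice a library such as librosa computes the same representation; the point is that each column is a log-energy profile over perceptually spaced frequency bands.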


For omnidirectional videos (ODVs), the existing off-line coding approaches are designed based on the spatial or perceptual distortion in a whole ODV frame, ignoring the fact that subjects can only access viewports. To improve the subjective quality inside the viewports, this paper proposes an off-line viewport-adaptive rate control (RC) approach for ODVs in the high efficiency video coding (HEVC) framework. Specifically, we predict the viewport candidates with importance weights and develop a viewport saliency detection model.
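The abstract does not specify the rate-control formula, but the core idea of viewport-adaptive allocation can be sketched as splitting a frame's bit budget across regions in proportion to predicted viewport-importance weights, with a small floor so non-viewport regions are not starved (the floor fraction and weights below are hypothetical):

```python
import numpy as np

def allocate_bits(total_bits, weights, floor_frac=0.1):
    """Split a frame bit budget across regions proportionally to
    viewport-importance weights, reserving a small guaranteed minimum
    per region so areas outside the predicted viewports still get bits."""
    weights = np.asarray(weights, dtype=float)
    floor = floor_frac * total_bits / len(weights)   # guaranteed minimum
    remaining = total_bits - floor * len(weights)    # budget split by weight
    return floor + remaining * weights / weights.sum()

# Four regions: two predicted viewport candidates get most of the budget.
bits = allocate_bits(100_000, [0.45, 0.35, 0.1, 0.1])
print(bits.round())  # sums to the frame budget
```

A real HEVC rate controller would then map each region's bit target to a QP via an R-λ or R-Q model; this sketch only shows the weighting step.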


While the next-generation video compression standard, Versatile Video Coding (VVC), provides superior compression efficiency, its computational complexity increases dramatically. This paper thoroughly analyzes this complexity for both the encoder and decoder of VVC Test Model 6, by quantifying the complexity breakdown for each coding tool and measuring the complexity and memory requirements of VVC encoding/decoding.


In this work, we focus on quantifying speaker identity information encoded in the head gestures of speakers, while they narrate a story. We hypothesize that the head gestures over a long duration have speaker-specific patterns. To establish this, we consider a classification problem to identify speakers from head gestures. We represent every head orientation as a triplet of Euler angles and a sequence of head orientations as head gestures.
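The abstract does not name the classifier, so as a hedged illustration only: head gestures are sequences of Euler-angle triplets (yaw, pitch, roll), and even a toy fixed-length descriptor with a nearest-centroid classifier can separate speakers on synthetic data with speaker-specific pose statistics (the descriptor, speakers, and noise model below are all invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def gesture_features(seq):
    """Summarize a head-gesture sequence of shape (T, 3) of Euler angles
    (yaw, pitch, roll) with per-angle mean, std, and mean frame-to-frame
    change -- a simple fixed-length descriptor of gesture dynamics."""
    deltas = np.abs(np.diff(seq, axis=0))
    return np.concatenate([seq.mean(0), seq.std(0), deltas.mean(0)])

# Synthetic data: each "speaker" drifts around a characteristic head pose.
poses = {"A": [5.0, -2.0, 0.0], "B": [-8.0, 3.0, 1.0]}
def sample(speaker, T=200):
    return np.array(poses[speaker]) + rng.normal(0, 2.0, size=(T, 3))

# Nearest-centroid classifier over the descriptors.
centroids = {s: np.mean([gesture_features(sample(s)) for _ in range(20)], axis=0)
             for s in poses}
def classify(seq):
    f = gesture_features(seq)
    return min(centroids, key=lambda s: np.linalg.norm(f - centroids[s]))

print(classify(sample("A")), classify(sample("B")))
```

The paper's actual setup would use real motion-capture or vision-based head-pose tracks and a learned classifier; the sketch only demonstrates the representation (orientation triplets → gesture sequence → speaker label).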


Generating accurate ground-truth representations of human subjective experiences and judgements is essential for advancing our understanding of human-centered constructs such as emotions. Often, this requires collecting and fusing annotations from several people, each of whom is subject to valuation disagreements, distraction artifacts, and other error sources.


Segmenting a document image into text-lines and words finds applications in many research areas of Document Image Analysis (DIA), such as OCR, word spotting, and document retrieval. However, carrying out segmentation directly on compressed document images remains an unexplored and challenging research area. Since JPEG is the most widely accepted compression algorithm, this paper attempts to segment a JPEG-compressed printed text document image into text-lines and words without fully decompressing the image.
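The paper's actual compressed-domain method is not described here, so as an illustrative sketch only: the DC coefficient of each 8×8 JPEG block gives the block's average intensity without full decoding, and a projection profile over those per-block values already yields a coarse (1/8-resolution) text-line map. The threshold rule and synthetic page below are assumptions for the example:

```python
import numpy as np

def text_line_rows(dc, thresh=0.5):
    """Given per-8x8-block DC values (block rows x block cols) from a
    partially decoded JPEG, mark block rows whose mean intensity drops
    toward the foreground level -- a coarse projection-profile
    text-line detector at 1/8 resolution."""
    profile = dc.mean(axis=1)
    background = profile.max()
    cutoff = thresh * background + (1 - thresh) * profile.min()
    active = profile < cutoff
    # Group consecutive active block rows into text lines.
    lines, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            lines.append((start, i - 1)); start = None
    if start is not None:
        lines.append((start, len(active) - 1))
    return lines

# Synthetic page: white background (DC=255) with two dark text bands.
dc = np.full((20, 30), 255.0)
dc[3:5] = 80.0    # first text line
dc[10:12] = 90.0  # second text line
print(text_line_rows(dc))  # [(3, 4), (10, 11)]
```

Word segmentation would apply the same idea column-wise within each detected line; exploiting AC coefficients would recover finer-than-block boundaries.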