Sorry, you need to enable JavaScript to visit this website.

We introduce the problem of computing a parsing where each phrase is of length at most m and which minimizes the zeroth order entropy of parsing. Based on the recent theoretical results we devise a heuristic for this problem. The solution has straightforward application in succinct text representations and gives practical improvements. Moreover the proposed heuristic yields structure which size can be bounded both by |S|H_{m-1}(S) and by |S|/m(H_0(S) + ... + H_{m-1}(S)),where H_{k}(S) is the k-th order empirical entropy of S.

Categories:
38 Views

This paper introduces the de Bruijn graph edge minimization problem, which is related to the compression of de Bruijn graphs: find the order-k de Bruijn graph with minimum edge count among all orders. We describe an efficient algorithm that solves this problem. Since the edge minimization problem is connected to the BWT compression technique called "tunneling", the paper also describes a way to minimize the length of a tunneled BWT in such a way that useful properties for sequence analysis are preserved.

Categories:
139 Views

Huffman coding is known to be optimal in case the alphabet is known in advance, the set of codewords is fixed and each codeword consists of an integral number of bits. If one of these conditions is violated, optimality is not guaranteed.
In the {\it dynamic\/} variant of Huffman coding the encoder and decoder maintain identical copies of the model; at each position, the model consists of the frequencies of the elements processed so far.

Categories:
121 Views

We extend the binary Fibonacci code to $d$-ary codes, with $d\ge 2$.
This is motivated by future technological developments in which the basic unit of storage will not be just a 2-valued bit, but possibly an element that is able to distinguish between $d$ different values.
The proposed codes are prefix-free, complete and more robust than Huffman codes. Experimental results illustrate that the compression efficiency of non-binary Fibonacci codes are very close to the savings achieved by the corresponding non-binary Huffman coding of the same order.

Categories:
29 Views

We extend the binary Fibonacci code to $d$-ary codes, with $d\ge 2$.
This is motivated by future technological developments in which the basic unit of storage will not be just a 2-valued bit, but possibly an element that is able to distinguish between $d$ different values.
The proposed codes are prefix-free, complete and more robust than Huffman codes. Experimental results illustrate that the compression efficiency of non-binary Fibonacci codes are very close to the savings achieved by the corresponding non-binary Huffman coding of the same order.

Categories:
28 Views

The successor and predecessor problem consists of obtaining the closest value in a set of integers, greater or smaller than a given value. This problem has interesting applications, like the intersection of inverted lists. It can be easily modeled by using a bitvector of size n and its operations rank and select. However, there is a practical approach [1], which keeps the best theoretical bounds, and allows to solve successor and predecessor more efficiently.

Categories:
211 Views

Time series data compression is emerging as an important problem with the growth in IoT devices and sensors. Due to the presence of noise in these datasets, lossy compression can often provide significant compression gains without impacting the performance of downstream applications. In this work, we propose an error-bounded lossy compressor, LFZip, for multivariate floating-point time series data that provides guaranteed reconstruction up to user-specified maximum absolute error.

Categories:
44 Views

This work outlines a method for an application of empirical Bayes in the setting of semi-supervised learning. That is, we consider a scenario in which the training set is partially or entirely unlabeled. In addition to the missing labels, we also consider a scenario where the available training data might be shuffled (i.e., the features and labels are not matched).

Categories:
56 Views

Pages