Counting with Prediction: Rank and Select Queries with Adjusted Anchoring

Rank and select queries are the fundamental building blocks of the compressed data structures. On a given bit string of length n, counting the number of set bits up to a certain position is named as the rank, and finding the position of the kth set bit is the select query. We present a new data structure and the procedures on it to support rank/select operations. The proposed scheme introduces ( log 2m + log n ) overhead bits per each bit over d s·d the n–bits long input bit string, where d is the inner-block size in bits, s is the number of inner-blocks in a super-block, and m is a properly chosen constant modulus value. When compared to the previous two-level hierarchical data structures that generate ( log(s·d) + log n ) d s·d overhead bits per bit, the new approach reduces the space consumption significantly with proper selection of the parameters. With the new data structure, the rank queries are usually (≈ 90% of the time) executed in O(td) time, where O(td) is the time required to compute a rank in an inner-block of length d–bits, which is assumed to be constant via the wide-register instructions in modern processors. Seldom, it may require to investigate more than one block, where on average this is observed to be around two blocks, empirically. We provide probabilistic analyses on how to choose the appropriate parameters and present several trade–offs to guarantee constant-time rank. We also investigate using the same data structure to support the select queries as well. Experimental evaluation of the introduced scheme revealed that the proposed data structure consumes nearly 30%−50% less space than its alternatives by introducing less than 5% overhead, while the speed is either better or very competitive when compared with the current state–of the art implementations both in terms of rank and select.

dcc22kulekci.pdf

presentation slides DCC22 (381)

Links:

Implementation on GitHub

Thumbs Up

Comments

Using binary search

Permalink Submitted by Dominik Koeppl on 19 March 2022 - 7:07am

Thanks for your clear and detailed explanations on this work!
I would like to pose two questions.

For your linear regression $(ax+b)$, are the 32-bit values for $a$ and $b$ integers or floating points?

For the select operation, maybe you can also learn a threshold for when to prefer binary search instead of a linear scan.
Assuming that your evaluation is based on the average time for select,
I think you found out that a linear scan is faster on the average case, but preferring binary search for some bad cases could improve the performance for these outliers, improving the overall average performance.

CITE

Documents

Presentation Slides

Counting with Prediction: Rank and Select Queries with Adjusted Anchoring

dcc22kulekci.pdf

Comments

Using binary search

QUESTIONS?