Fast and Compact Set Intersection through Recursive Universe Partitioning

We present a data structure that encodes a sorted integer sequence in small space allowing, at the same time, fast intersection operations. The data layout is carefully designed to exploit word-level parallelism and SIMD instructions, hence providing good practical performance. The core algorithmic idea is that of recursive partitioning the universe of representation: a markedly different paradigm than the widespread strategy of partitioning the sequence based on its length. Extensive experimentation and comparison against several competitive techniques shows that the proposed solution embodies an improved space/time trade-off for the set intersection problem.

DCC21.pdf

Presentation slides. (581)

Thumbs Up

Comments

Static Setting

Permalink Submitted by Dominik Koeppl on 12 March 2021 - 5:52am

Thanks for the presentation!
The setting reminded me of the standard approach of dealing with rank and select support data structures on bit vectors,
where the common choice for partitioning is universe partitioning and cardinality partitioning for rank and for select, respectively.
The idea of treating dense and sparse slices separately also reminds me of the sd-array of Okanohara and Sadakane [ALENEX'07],
although I do not know whether we can compute intersection queries in this representation efficiently.

In your setting, you have a set S_1 and compare it with a set S_2. Although S_1 and S_2 are static, you support the creation of another set S_3.
If you fix the number of sets, say you have a set of sets S' = { S_1, S_2 , ...}, does such a static setting allow you to achieve better bounds
by exploiting the similarity of certain subsets of S' (for instance by creating clusters of similar sets)?

Hi Dominik and thanks for

Permalink Submitted by Giulio Ermanno ... on 22 March 2021 - 9:04am

Hi Dominik and thanks for posting. I have just noticed your comment.
Yes, the data structure is intentionally simple to make it suitable for
practical efficiency. Furthermore, the byte-granularity allows you to
you SIMD instructions and other tricky built-ins.
The partitioning-by-universe approach works best for inverted indexes
given that Successor queries are more important than Access there.
Clearly, knowing that the sets to intersect share some similarities helps
the algorithm. I suppose the clustering must be performed offline (at
indexing time) rather than at query time for efficiency reasons.

Thanks for your reply!

Permalink Submitted by Dominik Koeppl on 28 March 2021 - 3:07am

Thanks for your reply!
With "works best" you mean that while successor is harder to compute than access, you can compute successor faster in partitioning-by-universe than by cardinality-partitioning?
For the subset similarity problem: Yes, I though about performing the clustering offline while building the index. Maybe this is relevant for practical problems?

Yes, by "works best" I mean

Permalink Submitted by Giulio Ermanno ... on 28 March 2021 - 8:49am

Yes, by "works best" I mean that universe partitioning is more suitable for successor queries.
And, yes, everything done here is meant to perform well in practical situations.
I think that deriving a good theoretical bound (i.e., close to practice) is hard without
precise assumptions on the data.

CITE

Documents

Research Manuscript