Sorry, you need to enable JavaScript to visit this website.

Fast and Compact Set Intersection through Recursive Universe Partitioning

Primary tabs

Citation Author(s):
Giulio Ermanno Pibiri
Submitted by:
Giulio Ermanno ...
Last updated:
28 February 2021 - 12:40pm
Document Type:
Research Manuscript
Document Year:
2021
Event:
Presenters Name:
Giulio Ermanno Pibiri

Abstract 

Abstract: 

We present a data structure that encodes a sorted integer sequence in small space allowing, at the same time, fast intersection operations. The data layout is carefully designed to exploit word-level parallelism and SIMD instructions, hence providing good practical performance. The core algorithmic idea is that of recursive partitioning the universe of representation: a markedly different paradigm than the widespread strategy of partitioning the sequence based on its length. Extensive experimentation and comparison against several competitive techniques shows that the proposed solution embodies an improved space/time trade-off for the set intersection problem.

up
0 users have voted:

Comments

Thanks for the presentation!
The setting reminded me of the standard approach of dealing with rank and select support data structures on bit vectors,
where the common choice for partitioning is universe partitioning and cardinality partitioning for rank and for select, respectively.
The idea of treating dense and sparse slices separately also reminds me of the sd-array of Okanohara and Sadakane [ALENEX'07],
although I do not know whether we can compute intersection queries in this representation efficiently.

In your setting, you have a set S_1 and compare it with a set S_2. Although S_1 and S_2 are static, you support the creation of another set S_3.
If you fix the number of sets, say you have a set of sets S' = { S_1, S_2 , ...}, does such a static setting allow you to achieve better bounds
by exploiting the similarity of certain subsets of S' (for instance by creating clusters of similar sets)?

Hi Dominik and thanks for posting. I have just noticed your comment.
Yes, the data structure is intentionally simple to make it suitable for
practical efficiency. Furthermore, the byte-granularity allows you to
you SIMD instructions and other tricky built-ins.
The partitioning-by-universe approach works best for inverted indexes
given that Successor queries are more important than Access there.
Clearly, knowing that the sets to intersect share some similarities helps
the algorithm. I suppose the clustering must be performed offline (at
indexing time) rather than at query time for efficiency reasons.

Thanks for your reply!
With "works best" you mean that while successor is harder to compute than access, you can compute successor faster in partitioning-by-universe than by cardinality-partitioning?
For the subset similarity problem: Yes, I though about performing the clustering offline while building the index. Maybe this is relevant for practical problems?

Yes, by "works best" I mean that universe partitioning is more suitable for successor queries.
And, yes, everything done here is meant to perform well in practical situations.
I think that deriving a good theoretical bound (i.e., close to practice) is hard without
precise assumptions on the data.

Dataset Files

Presentation slides.

(106)