Thank you for sharing, Wes; an interesting paper indeed.

In Rust we currently use a different strategy.

We build an iterator over half-open ranges [a_i, b_i[ of positions to be
selected from the filter bitmap, and filter the array based on those
ranges. For a single filter, the ranges are consumed as they are built;
for multiple filters, we persist all ranges ([a_0, b_0[, [a_1, b_1[, ...)
and then apply them to each array (potentially in parallel).
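
For illustration, here is a minimal sketch of the idea in plain Rust. It
uses a &[bool] mask instead of the packed bitmap we actually operate on,
and the names (selected_ranges, filter_values) are hypothetical, not our
actual API:

use std::ops::Range;

/// Iterate the half-open ranges [a_i, b_i[ of consecutive selected values.
fn selected_ranges(mask: &[bool]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = None;
    for (i, &keep) in mask.iter().enumerate() {
        match (keep, start) {
            (true, None) => start = Some(i),  // a run of 1s opens
            (false, Some(s)) => {             // a run of 1s closes
                ranges.push(s..i);
                start = None;
            }
            _ => {}                           // no transition
        }
    }
    if let Some(s) = start {
        ranges.push(s..mask.len());           // close a trailing run
    }
    ranges
}

/// Filter a values slice by memcopying whole ranges instead of single items.
fn filter_values<T: Copy>(values: &[T], ranges: &[Range<usize>]) -> Vec<T> {
    let mut out = Vec::with_capacity(ranges.iter().map(|r| r.len()).sum());
    for r in ranges {
        out.extend_from_slice(&values[r.clone()]); // one memcpy per range
    }
    out
}

For a single filter we consume the ranges lazily rather than collecting
them into a Vec; collecting only pays off when the same ranges are
applied to several arrays.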

The main advantages are:
1. It allows memcopying value buffers in slices instead of individual
items, thereby reducing the total number of memcopies performed.
2. When the bitmap has no offset, we efficiently skip whole chunks of 8
values via a single comparison (byte == 0); see the sketch after this
list.
3. It lends itself well to clustered data, i.e. when the selected rows
form contiguous clusters.
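
To illustrate point 2, a hedged sketch of the byte-level fast path over a
packed, LSB-first bitmap (a hypothetical function again; for simplicity
it assumes the padding bits past len are zero):

use std::ops::Range;

/// Like selected_ranges above, but over a packed bitmap: whole bytes of
/// zeros or ones are handled with a single comparison; only mixed bytes
/// need per-bit work.
fn selected_ranges_packed(bitmap: &[u8], len: usize) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start: Option<usize> = None;
    let mut i = 0;
    for &byte in bitmap {
        if byte == 0x00 {
            if let Some(s) = start.take() {
                ranges.push(s..i);    // close the open range
            }
            i += 8;                   // skip 8 unselected values at once
        } else if byte == 0xFF {
            start.get_or_insert(i);   // open a range if none is open
            i += 8;                   // take 8 selected values at once
        } else {
            for bit in 0..8 {         // mixed byte: inspect each bit
                let set = (byte >> bit) & 1 == 1;
                match (set, start) {
                    (true, None) => start = Some(i),
                    (false, Some(s)) => {
                        ranges.push(s..i);
                        start = None;
                    }
                    _ => {}
                }
                i += 1;
            }
        }
        if i >= len {
            break;
        }
    }
    if let Some(s) = start {
        ranges.push(s..len.min(i));   // close a trailing range
    }
    ranges
}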

The best-case scenarios are [0, 0, 0, 0] and [1, 1, 1, 1]; the worst
case is an alternating mask like [0, 1, 0, 1, 0, 1], which is more
expensive than "take"-ing, as deriving the ranges costs extra cycles.
I do not have the numbers at hand, but a performance improvement is what
motivated us to switch strategies.

Besides the obvious memcopies, my understanding is that the cost depends
not so much on the selectivity as on the number of 0/1 transitions
throughout the bitmap.

Best,
Jorge

On Wed, Jun 23, 2021 at 2:05 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> Some on this list might be interested in a new paper out of CMU/MIT
> about the use of selection vectors and bitmaps for handling the
> intermediate results of filters:
>
> https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf
>
> The research was done in the context of NoisePage which uses Arrow as
> its memory format. I found some of the observations related to AVX512
> to be interesting.
