On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:
On Fri, Sep 18, 2020 at 6:27 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
But maybe we could still use this scheme by actually computing
h1 = hash_uint32_extended(value, seed1);
h2 = hash_uint32_extended(value, seed2);
and then use this as the independent hash functions. I think that would
meet the requirements of the paper.
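For illustration, here's roughly what I have in mind, combining the two
hashes with the usual h1 + i*h2 trick to get the k probe positions (the
seed constants and struct fields are made up for the example, of course,
not something the patch is committed to):

#include "common/hashfn.h"

#define BLOOM_SEED_1 0x71d924af     /* arbitrary example seeds */
#define BLOOM_SEED_2 0xba48b314

typedef struct BloomFilter
{
    int     nhashes;    /* number of hash functions (k) */
    uint32  nbits;      /* size of the bitmap, in bits */
    char   *bitmap;
} BloomFilter;

static void
bloom_filter_add(BloomFilter *filter, uint32 value)
{
    uint64  h1 = hash_uint32_extended(value, BLOOM_SEED_1);
    uint64  h2 = hash_uint32_extended(value, BLOOM_SEED_2);

    for (int i = 0; i < filter->nhashes; i++)
    {
        uint64  h = (h1 + i * h2) % filter->nbits;

        filter->bitmap[h / 8] |= (1 << (h % 8));
    }
}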
Yeah, that would work algorithmically. It would be trivial to add to
the patch, too, of course. There'd be a higher up-front CPU cost. Also,
I'm a bit cautious of rehashing hashes, and whether the two values
above are independent enough. I'm not sure either of these points
matters. My guess is the partition approach is more sound, but it has
some minor organizational challenges (see below).
OK. I don't think rehashing hashes is an issue as long as the original
hash has a sufficiently low collision rate (and while we know it's not
perfect, we know it works well enough for hash indexes etc.). And I doubt
the cost of the extra hash of uint32 would be noticeable.
That being said, the partitioning approach might be more sound, and it's
definitely worth giving it a try.
Why would 11k bits be more than we'd like to store? Assuming we could
use the whole 8kB page for the bloom filter, that'd be about 64k bits.
In practice there'd be a bit of overhead (page header ...) but it's
still much more than 11k bits.
Brain fade -- I guess I thought we were avoiding being toasted, but
now I see that's not possible for BRIN storage. So, we'll want to
guard against this:
ERROR: index row size 8160 exceeds maximum 8152 for index "bar_num_idx"
While playing around with the numbers I had an epiphany: at the
defaults, the filter already takes up ~4.3kB, over half the page.
There is no room for another tuple, so if we're only indexing one
column, we might as well take up the whole page.
Hmm, yeah. I may be wrong, but IIRC indexes don't support external
storage, though compression is still allowed. So even if those defaults
are a bit higher than needed, that should make the bloom filters a bit more
compressible, and thus fit multiple BRIN tuples on a single page.
Here MT = max tuples per 128 8kB pages, or 37120, so the default
ndistinct is 3712.

 n      k    m      p
 MT/10  7    35580  0.01
 MT/10  7    64000  0.0005
 MT/10  12   64000  0.00025
Assuming ndistinct isn't way off reality, we get a 20x-40x lower false
positive rate almost for free, and it'd be trivial to code! Keeping k
at 7 would be more robust, since it's equivalent to starting with n =
~6000, p = 0.006, which is still almost 2x fewer false positives than
you asked for. It also means nearly doubling the number of sorted
values before switching.
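(For the archives: the numbers in the tables in this mail just fall out
of the textbook Bloom filter formulas, nothing specific to the patch. A
quick standalone sanity check:

#include <math.h>
#include <stdio.h>

/* false positive rate for n values, k hash functions, m bits */
static double
bloom_fpr(double n, double k, double m)
{
    return pow(1.0 - exp(-k * n / m), k);
}

int
main(void)
{
    double  n = 3712;   /* default ndistinct = MT/10 */

    /* optimal m for p = 0.01: ~35580 bits, i.e. the ~4.3kB default */
    printf("m = %.0f\n", ceil(-n * log(0.01) / (log(2) * log(2))));

    printf("p = %f\n", bloom_fpr(n, 7, 35580));     /* ~0.01    */
    printf("p = %f\n", bloom_fpr(n, 7, 64000));     /* ~0.0005  */
    printf("p = %f\n", bloom_fpr(n, 12, 64000));    /* ~0.00025 */

    return 0;
}
)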
Going the other direction, if we cap nbits at 64k bits when ndistinct
gets too high, the false positive rate we can actually support starts
to drop. Here, the user requested 0.001 fpr.
 n      k   p
 4500   9   0.001
 6000   7   0.006
 7500   6   0.017
 15000  3   0.129  (probably useless by now)
 MT     1   0.440
 64000  1   0.63   (possible with > 128 pages per range)
I imagine smaller pages_per_range settings are going to be useful for
skinny tables (note to self -- test). Maybe we could provide a way for
the user to see that their combination of pages_per_range, false
positive rate, and ndistinct is supportable, like
brin_bloom_get_supported_fpr(). Or document how to check with pageinspect.
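Just to sketch what I mean, such a function could be as simple as this
(hypothetical, of course -- it only inverts the math above with the
bitmap capped at max_bits; the real thing would read the reloptions):

#include <math.h>

static double
brin_bloom_get_supported_fpr(double ndistinct, double max_bits)
{
    double  k = floor(max_bits / ndistinct * log(2));
    double  p1,
            p2;

    if (k < 1)
        k = 1;

    /* the integer k on either side of the optimum may win */
    p1 = pow(1.0 - exp(-k * ndistinct / max_bits), k);
    p2 = pow(1.0 - exp(-(k + 1) * ndistinct / max_bits), k + 1);

    return (p1 < p2) ? p1 : p2;
}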
And that's not even considering multi-column indexes, like you
mentioned.
I agree giving users visibility into this would be useful.
Not sure about how much we want to rely on these optimizations, though,
considering multi-column indexes kinda break this.
But I guess we can simply make the table of primes a bit larger,
right?
If we want to support all the above cases without falling down
entirely, it would have to go up to 32k to be safe (when k = 1 we could
degenerate to one modulus on the whole filter). That would be a table
of about 7kB, which we could binary search. [thinks for a
moment]...Actually, if we wanted to be clever, maybe we could
precalculate the primes needed for the 64k bit cases and stick them at
the end of the array. The usual algorithm will find them. That way, we
could keep the array around 2kB. However, for >8kB block size, we
couldn't adjust the 64k number, which might be okay, but not really
project policy.
We could also generate the primes via a sieve instead, which is really
fast (and more code). That would be more robust, but that would require
the filter to store the actual primes used, so 20 more bytes at max k =
10. We could hard-code space for that; otherwise, to avoid hard-coding a
maximum k (and thus the lowest possible false positive rate), we'd need
more bookkeeping.
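The sieve itself is only a few lines, something like this (a sketch,
not patch code):

#include <stdbool.h>
#include <stdlib.h>

/*
 * Sieve of Eratosthenes: fill "primes" with all primes <= limit,
 * return how many were found. The caller would then pick k
 * consecutive primes around m/k so their sum lands close to m.
 */
static int
generate_primes(int limit, int *primes)
{
    bool   *composite = calloc(limit + 1, sizeof(bool));
    int     nprimes = 0;

    for (int i = 2; i <= limit; i++)
    {
        if (composite[i])
            continue;

        primes[nprimes++] = i;

        for (int j = 2 * i; j <= limit; j += i)
            composite[j] = true;
    }

    free(composite);
    return nprimes;
}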
I don't think the efficiency of this code matters too much - it's only
used once when creating the index, so the simpler the better. Certainly
for now, while testing the partitioning approach.
So, with the partition approach, we'd likely have to set in stone
either max nbits or a min target false positive rate. The latter option
seems more principled, not only for the block size, but also since the
target fp rate is already fixed by the reloption, and as I've
demonstrated, we can often go above and beyond the reloption even
without changing k.
That seems like a rather annoying limitation, TBH.
>One wrinkle is that the sum of k primes is not going to match m
>exactly. If the sum is too small, we can trim a few bits off of the
>filter bitmap. If the sum is too large, the last partition can spill
>into the front of the first one. This shouldn't matter much in the
>common case since we need to round m to the nearest byte anyway.
>
AFAIK the paper simply says that as long as the sum of partitions is
close to the requested nbits, it's good enough. So I guess we could
just roll with that, no need to trim/wrap or anything like that.
Hmm, I'm not sure I understand you. I can see not caring to trim wasted
bits, but we can't set/read off the end of the filter. If we don't
wrap, we could just skip reading/writing that bit, so a tiny fraction
of accesses would effectively use only k - 1 probes. The paper is not
clear on what to do here, but it is explicit about minimizing the
absolute value of the difference, which could go to either side.
What I meant is that the paper says this:
    Given a planned overall length mp for a Bloom filter, we usually
    cannot get k prime numbers to make their sum mf to be exactly mp. As
    long as the difference between mp and mf is small enough, it neither
    causes any trouble for the software implementation nor noticeably
    shifts the false positive ratio.
Which I think means we can pick mp, generate k primes with sum mf close
to mp, and just use that with mf bits.
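I.e. assuming the filter stores the k partition sizes, the probing
might look something like this (same made-up struct as in my earlier
sketch, plus a primes[] array):

static void
bloom_filter_add_partitioned(BloomFilter *filter, uint32 value)
{
    uint64  h1 = hash_uint32_extended(value, BLOOM_SEED_1);
    uint64  h2 = hash_uint32_extended(value, BLOOM_SEED_2);
    uint32  offset = 0;

    for (int i = 0; i < filter->nhashes; i++)
    {
        /*
         * One probe per partition; partition i is primes[i] bits wide,
         * and the partition sizes sum to mf (the actual bitmap size),
         * so probes can never run off the end.
         */
        uint64  h = offset + (h1 + i * h2) % filter->primes[i];

        filter->bitmap[h / 8] |= (1 << (h % 8));
        offset += filter->primes[i];
    }
}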
Also I found a bug:
+    add_local_real_reloption(relopts, "false_positive_rate",
+                             "desired false-positive rate for the bloom filters",
+                             BLOOM_DEFAULT_FALSE_POSITIVE_RATE,
+                             0.001, 1.0,
+                             offsetof(BloomOptions, falsePositiveRate));
When I set fp to 1.0, the reloption code is okay with that, but then
later the assertion gets triggered.
Hmm, yeah. I wonder what to do about that, considering indexes with fp
1.0 are essentially useless.
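Maybe the simplest fix is to exclude 1.0 in the reloption itself, e.g.
something like this (the 0.25 cap is just a straw man, of course):

+    add_local_real_reloption(relopts, "false_positive_rate",
+                             "desired false-positive rate for the bloom filters",
+                             BLOOM_DEFAULT_FALSE_POSITIVE_RATE,
+                             0.001, 0.25,
+                             offsetof(BloomOptions, falsePositiveRate));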
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services