On 02/22/2018 12:44 PM, Claudio Freire wrote:
> On Wed, Feb 21, 2018 at 11:21 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
>>> ...
>>> I misunderstood. I would probably do something like double or triple
>>> the original rows estimate instead, though. The estimate must be at
>>> least slightly inaccurate when we get to this point, but I don't
>>> think that that's a good enough reason to give up on the estimate
>>> completely.
>>>
>>
>> That's a problem only for the multi-batch case, though.
>>
>> With a single batch we can walk the hash table and count non-empty
>> buckets, to get a good ndistinct estimate cheaply. And then size the
>> filter considering both memory requirements (fits into CPU cache) and
>> false positive rate. There are other things we may need to consider
>> (memory usage vs. work_mem) but that's a separate issue.
>>
>> With multiple batches I think we could use the "size the bloom filter
>> for a fraction of work_mem" which the current patch uses when switching
>> to multiple batches halfway-through. That pretty much entirely ignores
>> the estimate and essentially replaces it with a "fictional" estimate.
>>
>> I think that's a better approach than using some arbitrary multiple of
>> the estimate. When we have to start batching halfway through, the
>> estimate is proven to be rather bogus anyway, but we may treat it as a
>> lower boundary for the bloom filter size.
>
> ...
>
>>> As I said, X should not be a portion of work_mem, because that has
>>> only a weak relationship to what really matters.
>>>
>>
>> I agree a fixed fraction of work_mem may not be the right thing, but the
>> goal was to make the bloom filter part of the Hash memory budget, i.e.
>>
>>    bloom filter + hash table <= work_mem
>>
>> (which I think we agree should be the case), without increasing the
>> number of batches too much. For example, if you size the filter ignoring
>> this, and it ends up being 90% of work_mem, you may need to do the hash
>> join in 128 batches instead of just 16. Or something like that.
>>
>> Maybe that would still be a win, though. Firstly, the higher number of
>> batches may not have a huge impact - in one case we need to serialize
>> 15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
>> accurate filter allows us to discard much more data from the outer
>> relation ...
>
> Let me reiterate, you can avoid both issues with scalable bloom filters[1].
>
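(For reference, the textbook relationship between ndistinct, target
false positive rate and filter size that the sizing discussion above
relies on, written out as a small standalone C program. This is purely
illustrative - it is not code from the patch, and the names and numbers
are made up.)

#include <math.h>
#include <stdio.h>

/* bits needed for "ndistinct" entries at target false positive rate "p" */
static double
bloom_filter_nbits(double ndistinct, double p)
{
	return ceil(-(ndistinct * log(p)) / (log(2.0) * log(2.0)));
}

/* optimal number of hash functions for a filter of that size */
static int
bloom_filter_nhashes(double nbits, double ndistinct)
{
	return (int) round((nbits / ndistinct) * log(2.0));
}

int
main(void)
{
	double	ndistinct = 1000000;	/* e.g. from counting non-empty buckets */
	double	fpr = 0.01;		/* target false positive rate */

	double	nbits = bloom_filter_nbits(ndistinct, fpr);
	int	nhashes = bloom_filter_nhashes(nbits, ndistinct);

	printf("ndistinct = %.0f, fpr = %.2f => %.0f bits (%.0f kB), %d hashes\n",
		   ndistinct, fpr, nbits, nbits / 8.0 / 1024.0, nhashes);

	return 0;
}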
I'm afraid it's not as straightforward as "Use scalable bloom filters!"

This is not merely a question of unreliable estimates of the number of
entries. That alone could indeed be solved by scalable bloom filters,
which are essentially a sequence of larger and larger bloom filters,
each added when the previous one "fills up" (becomes 1/2 full).

The problem is twofold:

(a) we need to decide what false positive rate to use (i.e. what is a
    reasonable trade-off between filter size and how much work it saves)

(b) we also need to consider work_mem (which I assume we all agree we
    must respect)

So for example we can't size the first bloom filter so that it just
fits into work_mem, only to add larger bloom filters later (each 2x the
size of the previous one). Not only would that increase the memory
usage to 7x the initial estimate (1x + 2x + 4x), but it would also make
the bloom filter less efficient (we'd have to probe larger and larger
filters, likely not fitting into CPU cache).

> An HLL can be used to estimate set size, the paper makes no mention of
> it, probably assuming only distinct items are added to the set.
>

The problem with HLL is that it only estimates how many distinct
entries we have seen so far, and it can only tell us that after we have
observed the items. What we need for sizing a bloom filter is an
estimate of the number of distinct values in advance, i.e. before
observing them. In other words, HLL is entirely useless for sizing
bloom filters.

Furthermore, we can estimate the number of distinct values observed so
far from the number of 1s in the bloom filter itself - we essentially
ask "how many items must we have seen, if each item sets k random bits
and K bits are now set?" HLL does the same thing, but it throws away
the ability to answer which elements are in the set.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
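For completeness, here is the bit-counting estimate described above as
a small standalone C program. Again, this is purely illustrative - not
code from the patch, and the names and numbers are made up. With a
filter of m bits, k hash functions and X bits currently set, the
standard estimate is

    ndistinct ~ -(m / k) * ln(1 - X / m)

#include <math.h>
#include <stdio.h>

/*
 * Estimate how many distinct items were added to a bloom filter with
 * "nbits" bits and "nhashes" hash functions, given that "nbits_set"
 * bits are currently set.
 */
static double
bloom_filter_ndistinct(double nbits, int nhashes, double nbits_set)
{
	/* a completely full filter carries no information */
	if (nbits_set >= nbits)
		return INFINITY;

	return -(nbits / nhashes) * log(1.0 - nbits_set / nbits);
}

int
main(void)
{
	double	nbits = 8.0 * 1024 * 1024;	/* a 1MB filter */
	int	nhashes = 7;
	double	nbits_set = 3000000;		/* counted by scanning the filter */

	printf("estimated ndistinct = %.0f\n",
		   bloom_filter_ndistinct(nbits, nhashes, nbits_set));

	return 0;
}

Note the estimate degrades as the filter approaches saturation (X close
to m), which is one more reason the filter has to be sized sensibly up
front.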