On Wed, Feb 21, 2018 at 11:21 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
>> On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>>> I suspect that it could make sense to use a Bloom filter to
>>>> summarize the entire inner side of the join all at once, even
>>>> when there are multiple batches. I also suspect that this is
>>>> particularly beneficial with parallel hash joins, where
>>>> IPC/synchronization overhead can be a big problem.
>>>>
>>>
>>> But that's what the patch does, currently - the filter is built
>>> during the initial pass through the data, and then used for all
>>> batches.
>>
>> I misunderstood. I would probably do something like double or triple
>> the original rows estimate instead, though. The estimate must be at
>> least slightly inaccurate when we get to this point, but I don't
>> think that that's a good enough reason to give up on the estimate
>> completely.
>>
>
> That's a problem only for the multi-batch case, though.
>
> With a single batch we can walk the hash table and count non-empty
> buckets, to get a good ndistinct estimate cheaply. And then size the
> filter considering both memory requirements (fits into CPU cache) and
> false positive rate. There are other things we may need to consider
> (memory usage vs. work_mem), but that's a separate issue.
>
> With multiple batches I think we could use the "size the bloom filter
> for a fraction of work_mem" approach which the current patch uses when
> switching to multiple batches halfway through. That pretty much
> entirely ignores the estimate and essentially replaces it with a
> "fictional" estimate.
>
> I think that's a better approach than using some arbitrary multiple of
> the estimate. When we have to start batching halfway through, the
> estimate is proven to be rather bogus anyway, but we may treat it as a
> lower boundary for the bloom filter size.
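
For reference, the sizing math involved is the standard one: given an
ndistinct estimate n and a target false positive rate p, the filter needs
about m = -n * ln(p) / (ln 2)^2 bits and k = (m/n) * ln 2 hash functions.
A rough C sketch of that calculation (purely illustrative, names are made
up, not taken from the patch):

    #include <math.h>
    #include <stdint.h>

    /*
     * Classic Bloom filter sizing: from an expected number of distinct
     * keys and a target false positive rate, derive the number of bits
     * and the number of hash functions.
     */
    static void
    bloom_size(double ndistinct, double fpr, uint64_t *nbits, int *nhashes)
    {
        double  ln2 = log(2.0);

        /* m = -n * ln(p) / (ln 2)^2 */
        *nbits = (uint64_t) ceil(-ndistinct * log(fpr) / (ln2 * ln2));

        /* k = (m / n) * ln 2 */
        *nhashes = (int) round(((double) *nbits / ndistinct) * ln2);
    }

For n = 1M keys and p = 1% that comes out to roughly 9.6M bits (~1.2 MB)
and 7 hash functions, which is the kind of number you'd then check
against the CPU cache size or the work_mem fraction discussed above.
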
...
>> As I said, X should not be a portion of work_mem, because that has
>> only a weak relationship to what really matters.
>>
>
> I agree a fixed fraction of work_mem may not be the right thing, but the
> goal was to make the bloom filter part of the Hash memory budget, i.e.
>
>     bloom filter + hash table <= work_mem
>
> (which I think we agree should be the case), without increasing the
> number of batches too much. For example, if you size the filter ignoring
> this, and it ends up being 90% of work_mem, you may need to do the hash
> join in 128 batches instead of just 16. Or something like that.
>
> Maybe that would still be a win, though. Firstly, the higher number of
> batches may not have a huge impact - in one case we need to serialize
> 15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
> accurate filter allows us to discard much more data from the outer
> relation ...

Let me reiterate: you can avoid both issues with scalable bloom
filters [1]. An HLL can be used to estimate set size; the paper makes no
mention of it, probably because it assumes only distinct items are added
to the set.

[1] http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
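
To make that concrete, here is a rough sketch of the growth rule from
[1]: slices are added as the set grows, each with a larger capacity and a
tighter false-positive budget, so the compounded false positive rate
stays bounded (roughly p0 / (1 - r) for tightening ratio r). All names
and constants below are made up for illustration; this is not proposed
patch code.

    #include <math.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct SbfSlice
    {
        double      capacity;       /* keys this slice was sized for */
        double      fpr;            /* false-positive budget of this slice */
        uint64_t    nbits;
        int         nhashes;
        struct SbfSlice *prev;      /* earlier, already-full slices */
    } SbfSlice;

    /*
     * Create the next slice: capacity grows geometrically, the false
     * positive budget shrinks by the tightening ratio, and bits/hashes
     * follow the usual sizing formulas.
     */
    static SbfSlice *
    sbf_new_slice(SbfSlice *prev, double base_capacity, double base_fpr)
    {
        SbfSlice   *slice = calloc(1, sizeof(SbfSlice));
        double      growth = 2.0;   /* s_i = s_0 * growth^i */
        double      r = 0.9;        /* p_i = p_0 * r^i */
        double      ln2 = log(2.0);

        slice->capacity = prev ? prev->capacity * growth : base_capacity;
        slice->fpr = prev ? prev->fpr * r : base_fpr;
        slice->nbits = (uint64_t)
            ceil(-slice->capacity * log(slice->fpr) / (ln2 * ln2));
        slice->nhashes = (int)
            round(((double) slice->nbits / slice->capacity) * ln2);
        slice->prev = prev;
        return slice;
    }

The trigger for adding a slice would be the HLL estimate of keys added so
far exceeding the current slice's capacity; lookups simply probe every
slice. That way neither the initial ndistinct estimate nor the work_mem
fraction has to be guessed right up front.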