On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
> On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>>> I suspect that it could make sense to use a Bloom filter to
>>> summarize the entire inner side of the join all at once, even
>>> when there are multiple batches. I also suspect that this is
>>> particularly beneficial with parallel hash joins, where
>>> IPC/synchronization overhead can be a big problem.
>>>
>>
>> But that's what the patch does, currently - the filter is built
>> during the initial pass through the data, and then used for all
>> batches.
>
> I misunderstood. I would probably do something like double or triple
> the original rows estimate instead, though. The estimate must be at
> least slightly inaccurate when we get to this point, but I don't
> think that that's a good enough reason to give up on the estimate
> completely.
>
That's a problem only for the multi-batch case, though.

With a single batch we can walk the hash table and count non-empty
buckets, to get a good ndistinct estimate cheaply. And then size the
filter considering both memory requirements (so that it fits into CPU
cache) and false positive rate. There are other things we may need to
consider (memory usage vs. work_mem) but that's a separate issue.

With multiple batches I think we could use the "size the bloom filter
for a fraction of work_mem" approach the current patch uses when
switching to multiple batches halfway through. That pretty much
entirely ignores the estimate and essentially replaces it with a
"fictional" one.

I think that's a better approach than using some arbitrary multiple of
the estimate. When we have to start batching halfway through, the
estimate is proven to be rather bogus anyway, but we may treat it as a
lower boundary for the bloom filter size.

>> Actually, now that I think about it - I think the patch should
>> throw the filter away after the initial pass over the outer
>> relation, because at that point we've used all the information in
>> the filter.
>
> Makes sense.
>

Actually, the patch already does that - it stops using the filter if
(curbatch != 0). We don't throw it away, though, because it also
collects some additional instrumentation that is shown by EXPLAIN
ANALYZE.

>> I'm not sure it would make sense to then build smaller bloom
>> filters for individual batches, but maybe it would?
>
> I doubt it.
>

I think it might help if the global bloom filter ended up having a high
false positive rate. But only if the per-batch filters fit into CPU
cache (i.e. the same reasoning as for the single-batch case).

But those "per-batch" filters are rather incompatible with pushing the
filter down to scan nodes, I think.

>> Yeah, I admit those are rather crude rules.
>
> You have to start somewhere.
>
>> The trouble is that when we start with a single batch and then find
>> out the estimate was wrong and we need to start batching, all bets
>> are off. At that point it seems reasonable to just say "Here is X
>> MBs of RAM, do what you can".
>
> As I said above, I wouldn't say all bets are off when this happens
> -- not at all. Estimates are likely to often be somewhat wrong. If
> they're completely wrong, we can probably swallow the cost of giving
> up on a Bloom filter relatively late.
>
> As I said, X should not be a portion of work_mem, because that has
> only a weak relationship to what really matters.
>

I agree a fixed fraction of work_mem may not be the right thing, but
the goal was to make the bloom filter part of the Hash memory budget,
i.e. bloom filter + hash table <= work_mem (which I think we agree
should be the case), without increasing the number of batches too much.
For example, if you size the filter while ignoring this, and it ends up
being 90% of work_mem, you may need to do the hash join in 128 batches
instead of just 16. Or something like that.

Maybe that would still be a win, though. For one, the higher number of
batches may not have a huge impact - in one case we need to serialize
15/16 of the data, in the other one 127/128. That's ~94% vs. ~99%. And
if the more accurate filter allows us to discard much more data from
the outer relation ...
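FWIW the sizing I have in mind is mostly just the usual bloom filter
math, with the work_mem fraction acting as an upper limit. A rough
sketch of that idea (not the actual code from the patch; the 10%
fraction and the function name are made up):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    /* made-up constant - fraction of work_mem the filter may use */
    #define BLOOM_WORK_MEM_FRACTION 0.10

    /*
     * Size the filter from an ndistinct estimate and a target false
     * positive rate, then clamp it to a fraction of work_mem so that
     * (bloom filter + hash table) stays within the memory budget.
     *
     * Optimal number of bits:  m = -n * ln(p) / (ln 2)^2
     */
    static int64_t
    bloom_filter_size_bytes(double ndistinct, double fpr,
                            int64_t work_mem_bytes)
    {
        double  mbits = -(ndistinct * log(fpr)) / (log(2.0) * log(2.0));
        int64_t mbytes = (int64_t) ceil(mbits / 8.0);
        int64_t maxbytes = (int64_t) (work_mem_bytes * BLOOM_WORK_MEM_FRACTION);

        return (mbytes > maxbytes) ? maxbytes : mbytes;
    }

    int
    main(void)
    {
        /* 40M distinct values, 10% false positives, 1GB work_mem */
        printf("%lld bytes\n", (long long)
               bloom_filter_size_bytes(40e6, 0.10, (int64_t) 1 << 30));
        return 0;
    }

For 40 million distinct values and a 10% false positive rate this
computes roughly 24MB (22.85MB if MB means 2^20 bytes), which matches
the example quoted below, so with a 10% fraction the cap would only
kick in for work_mem below roughly 240MB.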
>>> You should try to exploit the fact that a Bloom filter can summarize a
>>> large set reasonably well with a very compact, simple representation.
>>> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
>>> for cases where Bloom probes will save a lot of work, it probably
>>> doesn't make all that much difference -- the hash join is still much
>>> faster.
>
>> But the problem is that I don't know what is the total size of the
>> hash table, because we're building the bloom filter for all the
>> batches at once. And we don't know how many batches will be there -
>> if we knew that, we could estimate the number of distinct values
>> and we could use that to size the filter instead of doing this.
>> (All of this only applies to the state where we start with a single
>> batch and then find out we need to start batching, of course.)
>
> I don't think that the number of batches should matter that much to
> the costing/sizing/estimation logic, even if it's an interesting
> thing to talk about when considering the upsides of a Bloom filter.
> My sense is that we should focus on making sure that using a Bloom
> filter is going to win, and not worry so much about whether that's
> going to be a huge win or a modest win.
>
> Suppose that you thought you were going to have a 10% false positive
> rate with a 22.85MB Bloom filter for 40 million elements (my earlier
> example). Further suppose that it turns out to be 80 million
> elements. This means that you're going to get a false positive rate
> of 30%. This could still be a big win, though, so it's not really a
> bad situation. With 120 million elements, it's about 45%, but even
> then it could still work out, especially because the hash join will
> be very slow anyway. You also have to bear in mind that the 40
> million estimate is much more likely to be too high than too low,
> because you're assuming distinct key values for the hash table.
>
> You're taking a risk, in a sense, but a lot of things have to go
> wrong for you to lose, and even then you can cut your losses before
> the extra cost is too high.
>
> Do you have a test query for this patch, that gives us some idea of
> the upside?
>

I have to admit I've been using only some simplistic ad-hoc queries so
far. There was a more detailed analysis in the old thread, but I'm not
sure how relevant it still is.

So I did some measurements on a simple join, with different work_mem
values and join selectivities:

    select count(*) from fact join dim on (dim.a = fact.a and dim.b < :sel)

Attached are results for the "small" data set (20M:1M, i.e. 20M rows in
"fact" and 1M rows in "dim"); the full results and scripts are
available here:

    https://github.com/tvondra/hashjoin-bloom-tests

The first table shows a summary of the results, with timings for

 a) master
 b) patched master (with bloom filters disabled)
 c) patched master (with bloom filters used always)

The last two tables compare b/a and c/a. The first one shows that
there's 0-3% overhead when bloom filters are not used (but it might
easily be just noise or differences in the layout of the binary).

The second one is the more interesting one. It shows a couple of
things:

a) For tiny hash tables there's about 11% overhead. 1% selectivity
   means the hash table has only 10,000 entries, which fits into
   ~483kB. This is why I think we need a rule that skips the bloom
   filter for small hash tables.

b) When 70% or more of the rows get into the hash table (i.e. the join
   is not very selective), the bloom filter is a net loss, costing up
   to ~11%. This is why we should consider the selectivity of the
   join, I think.

c) For selectivities between 10% and 70% it's a net win, with speedups
   between ~10% (single batch) and ~20% (multi-batch).
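If I had to turn those three observations into a rule right now, it
might look something like this (just an illustration with made-up
thresholds, not what the patch currently does):

    #include <stdbool.h>

    /*
     * Sketch of a heuristic based on the measurements above - the
     * thresholds are made up and would need a lot more benchmarking.
     */
    static bool
    bloom_filter_useful(double hashtable_bytes, double join_selectivity)
    {
        /* (a) tiny hash tables fit into CPU cache, probing them is cheap */
        if (hashtable_bytes < 1024.0 * 1024.0)  /* made-up 1MB threshold */
            return false;

        /*
         * (b) if most outer rows find a match, the filter can't eliminate
         * enough of them to pay for itself
         */
        if (join_selectivity > 0.70)
            return false;

        /* (c) otherwise building the filter looks like a win */
        return true;
    }

The exact thresholds would of course need much more benchmarking.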
The speedups in (c) are relatively modest improvements, but I expect
more significant gains on the large data set.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment: summary.ods (application/vnd.oasis.opendocument.spreadsheet)