On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowle...@gmail.com> wrote:On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > This reminds me our attempts to add bloom filters to hash joins, which I > think ran into mostly the same challenge of deciding when the bloom > filter can be useful and is worth the extra work. Speaking of that, it would be interesting to see how a test where you write the query as IN(VALUES(...)) instead of IN() compares. It would be interesting to know if the planner is able to make a more suitable choice and also to see how all the work over the years to improve Hash Joins compares to the bsearch with and without the bloom filter.It would be interesting. It also makes one wonder about optimizing these into to hash joins...which I'd thought about over at [1]. I think it'd be a very significant effort though.
I modified the script to also do the join version of the query. I can only run it on my laptop at the moment, so the results may be a bit different from those I shared before, but it's interesting I think. In most cases it's comparable to the binsearch/bloom approach, and in some cases it actually beats them quite significantly. It seems to depend on how expensive the comparison is - for "int" the comparison is very cheap and there's almost no difference. For "text" the comparison is much more expensive, and there are significant speedups. For example the test with 100k lookups in array of 10k elements and 10% match probability, the timings are these master: 62362 ms binsearch: 201 ms bloom: 65 ms hashjoin: 36 ms I do think the explanation is fairly simple - the bloom filter eliminates about 90% of the expensive comparisons, so it's 20ms plus some overhead to build and check the bits. The hash join probably eliminates a lot of the remaining comparisons, because the hash table is sized to have one tuple per bucket. Note: I also don't claim the PoC has the most efficient bloom filter implementation possible. I'm sure it could be made faster. Anyway, I'm not sure transforming this to a hash join is worth the effort - I agree that seems quite complex. But perhaps this suggest we should not be doing binary search and instead just build a simple hash table - that seems much simpler, and it'll probably give us about the same benefits. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
test-saop-hashjoin.sh
Description: Bourne shell script
saop-hashjoin.ods
Description: application/vnd.oasis.opendocument.spreadsheet