On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowle...@gmail.com> wrote:

On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> This reminds me our attempts to add bloom filters to hash joins, which I
> think ran into mostly the same challenge of deciding when the bloom
> filter can be useful and is worth the extra work.

Speaking of that, it would be interesting to see how a test where you
write the query as IN(VALUES(...)) instead of IN() compares. It would
be interesting to know if the planner is able to make a more suitable
choice and also to see how all the work over the years to improve Hash
Joins compares to the bsearch with and without the bloom filter.

It would be interesting.

It also makes one wonder about optimizing these into to hash
joins...which I'd thought about over at [1]. I think it'd be a very
significant effort though.


I modified the script to also do the join version of the query. I can
only run it on my laptop at the moment, so the results may be a bit
different from those I shared before, but it's interesting I think.

In most cases it's comparable to the binsearch/bloom approach, and in
some cases it actually beats them quite significantly. It seems to
depend on how expensive the comparison is - for "int" the comparison is
very cheap and there's almost no difference. For "text" the comparison
is much more expensive, and there are significant speedups.

For example the test with 100k lookups in array of 10k elements and 10%
match probability, the timings are these

  master:  62362 ms
  binsearch: 201 ms
  bloom:      65 ms
  hashjoin:   36 ms

I do think the explanation is fairly simple - the bloom filter
eliminates about 90% of the expensive comparisons, so it's 20ms plus
some overhead to build and check the bits. The hash join probably
eliminates a lot of the remaining comparisons, because the hash table
is sized to have one tuple per bucket.

Note: I also don't claim the PoC has the most efficient bloom filter
implementation possible. I'm sure it could be made faster.

Anyway, I'm not sure transforming this to a hash join is worth the
effort - I agree that seems quite complex. But perhaps this suggest we
should not be doing binary search and instead just build a simple hash
table - that seems much simpler, and it'll probably give us about the
same benefits.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: test-saop-hashjoin.sh
Description: Bourne shell script

Attachment: saop-hashjoin.ods
Description: application/vnd.oasis.opendocument.spreadsheet

Reply via email to