On Fri, Jun 7, 2019 at 10:17 AM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> Yes, they could get quite big, and I think you're right we need to
> keep that in mind, because it's on the outer (often quite large) side of
> the join. And if we're aiming to restrict memory usage, it'd be weird to
> just ignore this.
>
> But I think Thomas Munro originally proposed to treat this as a separate
> BufFile, so my assumption was each worker would simply rewrite the bitmap
> repeatedly for each hash table fragment. That means a bit more I/O, but
> those files are buffered and written in 8kB pages, with just 1 bit per
> tuple. I think that's pretty OK and way cheaper than rewriting the whole
> batch, where each tuple can be hundreds of bytes.
Yes, this is also my thought. I'm not 100% sure I understand Melanie's proposal, but I think that it involves writing every still-unmatched outer tuple for every inner batch. This proposal -- assuming we can get the tuple numbering worked out -- involves writing a bit for every outer tuple for every inner batch. So each time you do an inner batch, you write either (a) one bit for EVERY outer tuple or (b) the entirety of each unmatched tuple. It's possible for the latter to be cheaper if the number of unmatched tuples is really, really tiny, but it's not very likely.

For example, suppose that you've got 4 batches and each batch matches 99% of the tuples, which are each 50 bytes wide. After each batch, approach A writes 1 bit per tuple, so a total of 4 bits per tuple after 4 batches. Approach B writes a different amount of data after each batch. After the first batch, it writes 1% of the tuples, and for each one written it writes 50 bytes, so it writes 50 bytes * 0.01 = 0.5 bytes = ~4 bits/tuple. That's already equal to what approach A wrote after all 4 batches, and it's going to do a little more I/O over the course of the remaining batches -- although not much, because the unmatched tuples file will be very, very tiny after we eliminate 99% of the 1% that survived the first batch. However, these are extremely favorable assumptions for approach B. If the tuples are wider, or the batches match only, say, 20% of the tuples, approach B is going to be waaaay more I/O. (There's a little back-of-the-envelope program at the end of this message that runs these numbers, for anyone who wants to fiddle with the assumptions.)

Assuming I understand correctly, which I may not.

> Also, it does not require any concurrency control, which rewriting the
> batches themselves probably does (because we'd be feeding the tuples into
> some shared file, I suppose). Except for the final step when we need to
> merge the bitmaps, of course.

I suppose that rewriting the batches -- or really the unmatched tuples file -- could just use a SharedTuplestore, so we probably wouldn't need a lot of new code for this. I don't know whether contention would be a problem or not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
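
P.S. Here's the tiny back-of-the-envelope program mentioned above, for anyone who wants to check my math or fiddle with the assumptions. It's just the arithmetic from my example, not proposed PostgreSQL code: the batch count, tuple width, and per-batch match rate are the made-up numbers I used, and "approach A" / "approach B" mean what they mean above.

#include <stdio.h>

int
main(void)
{
	int		nbatches = 4;			/* number of inner batches */
	double	tuple_width = 50.0;		/* outer tuple width, in bytes */
	double	match_rate = 0.99;		/* fraction of still-unmatched outer
									 * tuples matched by each inner batch */

	/*
	 * Approach A: after every inner batch, write one bit per outer tuple
	 * into the bitmap file, no matter how many tuples matched.
	 */
	double	a_bytes = nbatches * (1.0 / 8.0);

	/*
	 * Approach B: after every inner batch, rewrite the still-unmatched
	 * outer tuples in full; the unmatched fraction shrinks geometrically.
	 */
	double	b_bytes = 0.0;
	double	unmatched = 1.0;
	int		i;

	for (i = 0; i < nbatches; i++)
	{
		unmatched *= 1.0 - match_rate;
		b_bytes += unmatched * tuple_width;
	}

	printf("approach A: %.3f bytes (%.2f bits) per outer tuple\n",
		   a_bytes, a_bytes * 8.0);
	printf("approach B: %.3f bytes (%.2f bits) per outer tuple\n",
		   b_bytes, b_bytes * 8.0);
	return 0;
}

With the 99% match rate it prints roughly 4 bits per outer tuple for both approaches; drop the match rate to 20% and approach B balloons to over a hundred bytes per tuple while approach A stays at half a byte.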