On Tue, Jun 04, 2019 at 03:08:24PM -0400, Robert Haas wrote:
> On Tue, Jun 4, 2019 at 2:47 PM Melanie Plageman
> <melanieplage...@gmail.com> wrote:
> > Oops! You are totally right.
> > I will amend the idea:
> > For each chunk on the inner side, loop through both the original batch
> > file and the unmatched outer tuples file created for the previous
> > chunk. Emit any matches and write out any unmatched tuples to a new
> > unmatched outer tuples file.
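
If I understand the idea correctly, the per-chunk loop would look
roughly like this (all types and helpers here are hypothetical, just to
show the shape of the algorithm, not actual executor code):

    for (int chunk = 0; chunk < num_inner_chunks; chunk++)
    {
        load_inner_chunk_into_hashtable(chunk);

        /* for the first chunk, the "previous" unmatched file is
         * simply the full original outer batch */
        BufFile *outer = open_original_outer_batch();
        BufFile *prev_unmatched = open_unmatched_file(chunk - 1);
        BufFile *new_unmatched = create_unmatched_file(chunk);
        Tuple   *tup;

        /* probe every outer tuple against this inner chunk */
        while ((tup = read_next_tuple(outer)) != NULL)
        {
            if (probe_hashtable(tup))
                emit_join_result(tup);
        }

        /* a tuple stays unmatched only if it was unmatched after the
         * previous chunk AND finds no match in this chunk either */
        while ((tup = read_next_tuple(prev_unmatched)) != NULL)
        {
            if (!probe_hashtable(tup))
                write_tuple(new_unmatched, tup);
        }
    }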

> > I think, in the worst case, if no tuples from the outer side have a
> > match, you end up writing out all of the outer tuples for each chunk
> > on the inner side. However, the match-bit-in-the-tuple-header
> > solution would require just as much writing.
> > Probably the bigger problem is that in this worst case you would also
> > need to read double the number of outer tuples for each inner chunk.
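
To put rough numbers on that (with N outer tuples and K inner chunks,
and assuming no matches at all): this scheme reads about 2*K*N tuples
(K*N from the original batch file plus K*N from the unmatched files)
and writes about K*N, while rewriting the outer tuples with an
in-header match bit reads and writes about K*N each. So the writes are
comparable, but the reads double.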

> > However, in the best case it seems like it would be better than the
> > match bit/write everything from the outer side out solution.

> I guess so, but the downside of needing to read twice as many outer
> tuples for each inner chunk seems pretty large.  It would be a lot
> nicer if we could find a way to store the matched-bits someplace other
> than where we are storing the tuples, what Thomas called a
> bits-in-order scheme, because then the amount of additional read and
> write I/O would be tiny -- one bit per tuple doesn't add up very fast.
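
FWIW the bits-in-order idea might look something like this (again with
hypothetical helpers): the i-th bit in a separate bitmap file
corresponds to the i-th tuple in the outer batch file, so the per-chunk
match state is N/8 bytes instead of N tuples:

    for (int chunk = 0; chunk < num_inner_chunks; chunk++)
    {
        load_inner_chunk_into_hashtable(chunk);

        BufFile *outer = open_original_outer_batch();
        uint8   *bitmap = read_match_bitmap();  /* n_outer / 8 bytes */
        Tuple   *tup;
        int64    i = 0;

        while ((tup = read_next_tuple(outer)) != NULL)
        {
            if (probe_hashtable(tup))
            {
                emit_join_result(tup);
                bitmap[i / 8] |= (1 << (i % 8));  /* remember match */
            }
            i++;
        }

        write_match_bitmap(bitmap);  /* tiny compared to the tuples */
    }

    /* final pass for the outer join: emit the i-th outer tuple iff
     * its bit never got set */

So each chunk still reads all the outer tuples once, but the extra I/O
for the match state is only N/8 bytes each way.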

> In the scheme you propose here, I think that after you read the
> original outer tuples for each chunk and the unmatched outer tuples
> for each chunk, you'll have to match up the unmatched tuples to the
> original tuples, probably by using memcmp() or something.  Otherwise,
> when a new match occurs, you won't know which tuple should now not be
> emitted into the new unmatched outer tuples file that you're going to
> produce.  So I think what's going to happen is that you'll read the
> original batch file, then read the unmatched tuples file and use that
> to set or not set a bit on each tuple in memory, then do the real work
> setting more bits, then write out a new unmatched-tuples file with the
> tuples that still don't have the bit set.  So your unmatched tuple
> file is basically a list of tuple identifiers in the least compact
> form imaginable: the tuple is identified by the entire tuple contents.
> That doesn't seem very appealing, although I expect that it would
> still win for some queries.
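
FWIW, if the unmatched file is written in the same order as the
original batch file (which it should be, since it's produced by a
sequential scan of it), the matching could at least be a lockstep scan
rather than a search. But the tuple is still identified by its entire
contents, e.g. (hypothetical helpers as before):

    Tuple *orig;
    Tuple *un = read_next_tuple(prev_unmatched);

    while ((orig = read_next_tuple(outer)) != NULL)
    {
        bool was_unmatched = false;

        /* identify the tuple by comparing its entire contents */
        if (un != NULL &&
            un->len == orig->len &&
            memcmp(un->data, orig->data, orig->len) == 0)
        {
            was_unmatched = true;
            un = read_next_tuple(prev_unmatched);
        }

        if (probe_hashtable(orig))
            emit_join_result(orig);
        else if (was_unmatched)
            write_tuple(new_unmatched, orig);  /* still unmatched */
    }

(Exact duplicate tuples would need extra care here, of course.)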


I wonder how big of an issue that actually is in practice. This is
meant for significantly skewed data sets, which may easily cause OOM
at the moment (e.g. per the recent report that restarted this
discussion). So if we still expect to use this only for such rare
cases, the extra cost might be acceptable.

But if we plan to use this more widely (say, allow hash joins even for
cases that we know won't fit into work_mem), then the extra cost would
be an issue. But even then it should be included in the cost estimate,
which could switch the plan to a merge join when appropriate.

Of course, maybe there are many data sets with enough skew to cause
explosive batch growth and consume a lot of memory, but not enough to
trigger OOM. Those cases may get slower, but I think that's OK. If
appropriate, the user can increase work_mem and get the "good" plan.

FWIW this is a challenge for all approaches discussed in this thread,
not just this particular one. We're restricting the resources available
to the query, so we're switching to something (likely) slower.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
