On Wed, Jul 31, 2019 at 6:47 AM Melanie Plageman
<melanieplage...@gmail.com> wrote:
> So, I've rewritten the patch to use a BufFile for the outer table
> batch file tuples' match statuses and write bytes to and from the file
> which start as 0 and, upon encountering a match for a tuple, I set its
> bit in the file to 1 (also rebased with current master).
>
> It, of course, only works for parallel-oblivious hashjoin -- it relies
> on deterministic order of tuples encountered in the outer side batch
> file to set the right match bit and uses a counter to decide which bit
> to set.
>
> I did the "needlessly dumb implementation" Robert mentioned, though.
> I thought about it and couldn't come up with a much smarter way to
> write match bits to a file. I think there might be an optimization
> opportunity in not writing the current_byte to the file each time that
> the outer tuple matches and only doing this once we have advanced to a
> tuple number that wouldn't have its match bit in the current_byte. I
> didn't do that to keep it simple, and, I suspect there might be a bit
> of gymnastics needed to make sure that that byte is actually written
> to the file in case we exit from some other state before we encounter
> the tuple represented in the last bit in that byte.
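To make sure I'm following the scheme described above, here is a
minimal self-contained sketch of it (plain C with an array standing in
for the BufFile; the counter/current_byte names are borrowed from the
description, everything else is invented for illustration and is not
the patch's actual code):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy stand-in for the outer batch's match-status BufFile. */
#define MATCHFILE_BYTES 8

static uint8_t matchfile[MATCHFILE_BYTES];
static size_t  tuple_counter = 0;    /* which outer tuple we're on */
static uint8_t current_byte = 0;     /* match bits accumulated in memory */
static size_t  current_byte_no = 0;  /* which file byte they belong to */

/* Write the buffered byte back to the "file". */
static void
flush_current_byte(void)
{
    matchfile[current_byte_no] = current_byte;
}

/*
 * Called once per outer tuple, in the deterministic order of the batch
 * file.  Instead of writing to the file on every match, we only flush
 * once the counter has advanced past the byte we're buffering -- the
 * optimization described above.
 */
static void
note_outer_tuple(int matched)
{
    size_t byte_no = tuple_counter / 8;

    if (byte_no != current_byte_no)
    {
        flush_current_byte();
        current_byte_no = byte_no;
        current_byte = 0;
    }
    if (matched)
        current_byte |= (uint8_t) (1 << (tuple_counter % 8));
    tuple_counter++;
}

/*
 * The "gymnastics" mentioned above: the last partially filled byte must
 * still reach the file if we leave the probing state early.
 */
static void
finish_probing(void)
{
    flush_current_byte();
}
```

The interesting part is indeed making sure finish_probing() runs on
every exit path from the probing state, not just the happy one.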
Thanks for working on this! I plan to poke at it a bit in the next few
weeks.

> I plan to work on a separate implementation for parallel hashjoin
> next--to understand what is required. I believe the logic to decide
> when to fall back should be fairly easy to slot in at the end once
> we've decided what that logic is.

Seems like a good time for me to try to summarise what I think the
main problems are here:

1. The match-bit storage problem already discussed. The tuples that
each process receives while reading from a SharedTupleStore are
non-deterministic (like other parallel scans). To use a bitmap-based
approach, I guess we'd need to invent some way to give the tuples a
stable identifier within some kind of densely packed number space that
we could use to address the bitmap, or take the IO hit and write all
the tuples back. That might involve changing the way SharedTupleStore
holds data.

2. Tricky problems relating to barriers and flow control. First, let
me explain why PHJ doesn't support full/right outer joins yet. At
first I thought it was going to be easy, because, although the shared
memory hash table is read-only after it has been built, it seems safe
to weaken that only slightly and let the match flag be set by any
process during probing: it's OK if two processes clobber each other's
writes, as the only transition is a single bit going strictly from 0
to 1, and there will certainly be a full memory barrier before anyone
tries to read those match bits. Then during the scan for unmatched,
you just have to somehow dole out hash table buckets or ranges of
buckets to processes on a first-come-first-served basis. But... then I
crashed into the following problem:

* You can't begin the scan for unmatched tuples until every process
has finished probing (ie until you have the final set of match bits).
* You can't wait for every process to finish probing, because any
process that has emitted a tuple might never come back if there is
another node that is also waiting for all processes (ie deadlock
against another PHJ doing the same thing), and probing is a phase that
emits tuples.

Generally, it's not safe to emit tuples while you are attached to a
Barrier, unless you're only going to detach from it, not wait at it,
because emitting tuples lets the program counter escape your control.
Equally, it's not safe to detach from a Barrier while accessing
resources whose lifetime it controls, such as a hash table, because
then it might go away underneath you.

The PHJ plans that are supported currently adhere to that programming
rule and so don't have a problem: after the Barrier reaches the
probing phase, processes never wait for each other again, so they're
free to begin emitting tuples. They just detach when they're done
probing, and the last to detach cleans up (frees the hash table etc).
If there is more than one batch, they detach from one batch and attach
to another when they're ready (each batch has its own Barrier), so we
can consider the batches to be entirely independent.

There is probably a way to make a scan-for-unmatched-inner phase work,
possibly involving another Barrier or something like that, but I ran
out of time trying to figure it out and wanted to ship a working PHJ
for the more common plan types.

I suppose PHLJ will face two variants of this problem: (1) you need to
synchronise the loops (you can't dump the hash table in preparation
for the next loop until all have finished probing for the current
loop), and yet you've already emitted tuples, so you're not allowed to
wait for other processes and they're not allowed to wait for you, and
(2) you can't start the scan-for-unmatched-outer until all the probe
loops belonging to one batch are done.
The first problem is sort of analogous to a problem I faced with
batches in the first place, which Robert and I found a solution to by
processing the batches in parallel, and it could perhaps be solved in
the same way: run the loops in parallel (if that sounds crazy, recall
that every worker has its own quota of work_mem and the data is
entirely prepartitioned up front, which is why we are able to run the
batches in parallel; in contrast, single-batch mode makes a hash table
with a quota of nparticipants * work_mem). The second problem is sort
of analogous to the existing scan-for-unmatched-inner problem that I
haven't solved.

I think there may be ways to make that general class of deadlock
problem go away in a future asynchronous executor model, where N
streams conceptually run concurrently in event-driven nodes so that
control never gets stuck in a node, but that seems quite far off and I
haven't worked out the details. The same problem comes up in a
hypothetical Parallel Repartition node: you're not done with your
partition until all processes have run out of input tuples, so you
have to wait for all of them to send an EOF, and so you risk deadlock
if they are waiting for you elsewhere in the tree. A stupid version of
the idea is to break the node up into a consumer part and a producer
part, and put the producer into a subprocess so that its program
counter can never escape and deadlock somewhere in the consumer part
of the plan. Obviously we don't want to have loads of extra OS
processes all over the place, but I think you can get the same effect
using a form of asynchronous execution where the program counter jumps
between nodes and streams based on readiness, and yields control
instead of blocking. Similar ideas have been proposed to deal with
asynchronous IO.

--
Thomas Munro
https://enterprisedb.com