On Mon, Jun 08, 2020 at 05:12:25PM -0700, Melanie Plageman wrote:
On Wed, May 27, 2020 at 7:25 PM Melanie Plageman <melanieplage...@gmail.com>
wrote:

I've attached a rebased patch which includes the "provisionally detach"
approach to fixing the deadlock hazard


Alas, the "provisional detach" logic proved incorrect (see last point in
the list of changes included in the patch at bottom).


Also, we kept the batch 0 spilling patch David Kimura authored [1] as a
separate patch so it could be discussed on its own, since we still had
some questions.


The serial batch 0 spilling is in the attached patch. Parallel batch 0
spilling is still in a separate patch that David Kimura is working on.

I've attached a rebased and updated patch with a few fixes:

- semi-join fallback works now
- serial batch 0 spilling in main patch
- added instrumentation for stripes to the parallel case
- SharedBits uses same SharedFileset as SharedTuplestore
- reverted the optimization to allow workers to re-attach to a batch and
 help out with stripes if they are sure they pose no deadlock risk

For the last point, I discovered a pretty glaring problem with this
optimization: the bitmap a worker created while working on its first
participating stripe was not included in the final combined bitmap. I
was only combining the last bitmap file each worker had worked on.

I had the workers make a new bitmap each time they attached to the batch
and participated, because having them keep an open file tracking
information for a batch they are no longer attached to, on the chance
that they might return and work on that batch, was a synchronization
nightmare. It was difficult to figure out when to close the file if they
never returned, and hard to make sure that the combining worker actually
combines all the files from all participants that were ever active.

I am sure I could hack around those problems, but I think we need a
better solution overall. After reverting those changes, loading and
probing of stripes after stripe 0 is serial. That is not only
sub-optimal, it also means that all the synchronization variables and
code complexity around coordinating work on fallback batches are
practically wasted. So, workers have to be able to collaborate on
stripes after the first stripe. This version of the patch produces
correct results and has no deadlock hazard; however, it lacks
parallelism on stripes after stripe 0. I am looking for ideas on how to
address the deadlock hazard more efficiently.

The next big TODOs are:
- come up with a better solution to the potential tuple-emitting/barrier-
 waiting deadlock issue
- complete parallel batch 0 spilling



Hi Melanie,

I started looking at the patch to refresh my knowledge of both this
patch and parallel hash join, but I think it needs a rebase. The
changes in 7897e3bb90 apparently touched some of the code. I assume
you're working on a patch addressing the remaining TODOs, right?

I see you've switched to "stripe" naming - I find that a bit confusing,
because when I hear "stripe" I think of RAID, where it means pieces of
data interleaved and stored on different devices. But maybe that's just
me and it's a good name. Maybe it'd be better to keep the current naming
and only tweak it at the end, so as not to disrupt reviews unnecessarily.

Now, a couple comments / questions about the code.


nodeHash.c
----------


1) MultiExecPrivateHash says this

  /*
   * Not subject to skew optimization, so either insert normally
   * or save to batch file if it belongs to another stripe
   */

I wonder what it means to "belong to another stripe". I understand what
that means for batches, which are identified by batchno computed from
the hash value. But I thought "stripes" are just work_mem-sized pieces
of a batch, so I don't quite understand this. Especially when the code
does not actually check "which stripe" the row belongs to.


2) I find the hashloop_fallback fields rather confusing. We have one in
HashJoinTable (where it's an array of BufFile items) and another in
ParallelHashJoinBatch (this time just a bool).

I think the one in HashJoinTable should be renamed to hashloopBatchFile
(to match the other BufFile arrays). Although I'm not sure why we even
need this file when we have innerBatchFile? BufFiles are not exactly
free; in fact, their cost is one of the problems for hash joins with
many batches.



3) I'm a bit puzzled about this formula in ExecHashIncreaseNumBatches

  childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;

and also about this comment

  /*
   * TODO: what to do about tuples that don't go to the child
   * batch or stay in the current batch? (this is why we are
   * counting tuples to child and curbatch with two diff
   * variables in case the tuples go to a batch that isn't the
   * child)
   */
  if (batchno == childbatch)
    childbatch_outgoing_tuples++;

I thought each old batch is split into two new ones, and the tuples
either stay in the current one or are moved to the new one - which I
presume is the childbatch, although I haven't tried to decode that
formula. So where else could a tuple go, as the comment seems to
suggest?



regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

