Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Robert Haas Wed, 17 Jan 2018 10:29:27 -0800

On Wed, Jan 17, 2018 at 12:27 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> I think that both problems (the live _bt_parallel_scan_and_sort() bug,
> as well as the general issue with needing to account for parallel
> worker fork() failure) are likely solvable by not using
> tuplesort_leader_wait(), and instead calling
> WaitForParallelWorkersToFinish(). Which you suggested already.


I'm wondering if this shouldn't instead be handled by using the new
Barrier facilities.  I think it would work like this:

- leader calls BarrierInit(..., 0)
- leader calls BarrierAttach() before starting workers.
- each worker, before reading anything from the parallel scan, calls
BarrierAttach().  If the phase returned is greater than 0, then the
worker arrived at the barrier after the sorting phase had already
ended, and should exit immediately.
- each worker, after finishing sorting, calls BarrierArriveAndWait().
- the leader, after sorting, also calls BarrierArriveAndWait().
- when BarrierArriveAndWait() returns in the leader, all workers that
actually started (and did so quickly enough) have arrived at the
barrier.  The leader can now do leader_takeover_tapes, being careful
to adopt only the tapes actually created, since some workers may have
failed to launch or launched only after sorting was already complete.
- meanwhile, the workers again call BarrierArriveAndWait().
- after it's done taking over tapes, the leader calls BarrierDetach(),
releasing the workers.
- the workers call BarrierDetach() and then exit -- or maybe they
don't even really need to detach.
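
To check the phase arithmetic in the steps above, here is a rough,
single-threaded toy model.  It is not PostgreSQL's Barrier code: the
ModelBarrier type, the model_* functions and simulate_build() are all
hypothetical names invented for this sketch, "arriving" merely bumps a
counter instead of blocking, and the rule that a detach can complete a
phase is an assumption of the model.  The model also assumes that a late
worker detaches on its way out, which the list above does not spell out.

```c
#include <assert.h>

/* Toy model of a dynamic barrier; NOT PostgreSQL's Barrier implementation. */
typedef struct ModelBarrier
{
    int phase;          /* current phase number */
    int participants;   /* processes currently attached */
    int arrived;        /* participants that have arrived in this phase */
} ModelBarrier;

static void
model_init(ModelBarrier *b)
{
    b->phase = 0;
    b->participants = 0;
    b->arrived = 0;
}

/* Advance the phase once every attached participant has arrived. */
static void
model_maybe_advance(ModelBarrier *b)
{
    if (b->participants > 0 && b->arrived == b->participants)
    {
        b->arrived = 0;
        b->phase++;
    }
}

/* Attach and report the phase observed at attach time. */
static int
model_attach(ModelBarrier *b)
{
    b->participants++;
    return b->phase;
}

/* Record an arrival; a real barrier would block here until the phase ends. */
static void
model_arrive(ModelBarrier *b)
{
    assert(b->arrived < b->participants);
    b->arrived++;
    model_maybe_advance(b);
}

/* Detach; model assumption: this may complete the current phase. */
static void
model_detach(ModelBarrier *b)
{
    assert(b->participants > 0);
    b->participants--;
    model_maybe_advance(b);
}

/*
 * Walk through the protocol.  Returns the number of worker tapes the
 * leader would adopt and stores the phase seen after the leader detaches.
 */
static int
simulate_build(int prompt_workers, int late_workers, int *final_phase)
{
    ModelBarrier b;
    int worker_tapes = 0;
    int i;

    model_init(&b);
    model_attach(&b);           /* leader attaches before starting workers */

    for (i = 0; i < prompt_workers; i++)
        if (model_attach(&b) == 0)
            worker_tapes++;     /* attached in phase 0: sorts, fills a tape */

    model_arrive(&b);           /* leader finished sorting */
    for (i = 0; i < prompt_workers; i++)
        model_arrive(&b);       /* worker finished sorting; phase becomes 1 */

    for (i = 0; i < late_workers; i++)
        if (model_attach(&b) > 0)
            model_detach(&b);   /* too late: leaves without creating a tape */

    for (i = 0; i < prompt_workers; i++)
        model_arrive(&b);       /* workers wait out the tape takeover */

    /* the leader would take over worker_tapes tapes at this point */
    model_detach(&b);           /* leader detaches, releasing the workers */
    *final_phase = b.phase;

    for (i = 0; i < prompt_workers; i++)
        model_detach(&b);       /* workers detach and exit */

    return worker_tapes;
}
```

Calling simulate_build(2, 1, &phase) models two workers that start
promptly plus one that launches only after sorting is complete; the
late worker contributes no tape, and the leader's detach is what moves
the model out of the takeover phase.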

So the barrier phase numbers would have the following meanings:

0 - sorting
1 - taking over tapes
2 - done
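
As a small illustration of the numbering above, the phases could be
given names and the worker's "did I arrive too late?" test written as
a predicate.  The enum, its member names and both helper functions are
hypothetical, chosen only for this sketch; nothing here is taken from
the patch.

```c
#include <stdbool.h>

/* Hypothetical names for the three phase numbers listed above. */
typedef enum SortBarrierPhase
{
    SORT_PHASE_SORTING = 0,     /* workers and leader are sorting */
    SORT_PHASE_TAKEOVER = 1,    /* leader is taking over tapes */
    SORT_PHASE_DONE = 2         /* nothing left to do */
} SortBarrierPhase;

/*
 * A worker may take part only if it attached while sorting was still in
 * progress; any later phase means it should exit immediately.
 */
static bool
worker_should_sort(int attach_phase)
{
    return attach_phase == SORT_PHASE_SORTING;
}

/* Human-readable phase name, e.g. for debugging output. */
static const char *
sort_phase_name(int phase)
{
    switch (phase)
    {
        case SORT_PHASE_SORTING:
            return "sorting";
        case SORT_PHASE_TAKEOVER:
            return "taking over tapes";
        case SORT_PHASE_DONE:
            return "done";
        default:
            return "unknown";
    }
}
```

A worker would pass the value it got back from BarrierAttach() to
worker_should_sort() before touching the parallel scan.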

This could be slightly more elegant if BarrierArriveAndWait() had an
additional argument indicating the phase number for which the backend
should wait, or maybe the number of phases for which it should wait.
Then, the workers could avoid having to call BarrierArriveAndWait()
twice in a row.

While I find the Barrier API slightly confusing -- and I suspect I'm
not entirely alone -- I don't think that's a good excuse for
reinventing the wheel.  The problem of needing to wait for every
process that does A (in this case, read tuples from the scan) to also
do B (in this case, finish sorting those tuples) is a very general one
that is deserving of a general solution.  Unless somebody comes up
with a better plan, Barrier seems to be the way to do that in
PostgreSQL.

I don't think using WaitForParallelWorkersToFinish() is a good idea.
That would require workers to hold onto their tuplesorts until after
losing the ability to send messages to the leader, which doesn't sound
like a very good plan.  We don't want workers to detach from their
error queues until the bitter end, lest errors go unreported.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
