On Wed, Jan 17, 2018 at 12:27 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> I think that both problems (the live _bt_parallel_scan_and_sort() bug,
> as well as the general issue with needing to account for parallel
> worker fork() failure) are likely solvable by not using
> tuplesort_leader_wait(), and instead calling
> WaitForParallelWorkersToFinish(). Which you suggested already.

I'm wondering if this shouldn't instead be handled by using the new Barrier facilities. I think it would work like this:

- leader calls BarrierInit(..., 0)
- leader calls BarrierAttach() before starting workers.
- each worker, before reading anything from the parallel scan, calls BarrierAttach(). If the phase returned is greater than 0, then the worker arrived at the barrier after all the work was done, and should exit immediately.
- each worker, after finishing sorting, calls BarrierArriveAndWait(). leader, after sorting, also calls BarrierArriveAndWait().
- when BarrierArriveAndWait() returns in the leader, all workers that actually started (and did so quickly enough) have arrived at the barrier. The leader can now do leader_takeover_tapes, being careful to adopt only the tapes actually created, since some workers may have failed to launch or launched only after sorting was already complete.
- meanwhile, the workers again call BarrierArriveAndWait().
- after it's done taking over tapes, the leader calls BarrierDetach(), releasing the workers.
- the workers call BarrierDetach() and then exit -- or maybe they don't even really need to detach

So the barrier phase numbers would have the following meanings:

0 - sorting
1 - taking over tapes
2 - done

This could be slightly more elegant if BarrierArriveAndWait() had an additional argument indicating the phase number for which the backend could wait, or maybe the number of phases for which it should wait. Then, the workers could avoid having to call BarrierArriveAndWait() twice in a row.

While I find the Barrier API slightly confusing -- and I suspect I'm not entirely alone -- I don't think that's a good excuse for reinventing the wheel. The problem of needing to wait for every process that does A (in this case, read tuples from the scan) to also do B (in this case, finish sorting those tuples) is a very general one that is deserving of a general solution.

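To make the phase protocol above concrete, here is a toy model of it in Python. It is only a sketch under stated assumptions: ToyBarrier is a hypothetical stand-in written for this illustration, not PostgreSQL's Barrier code; threads stand in for backends; a dict of sorted lists stands in for tapes; and the "attached" events exist only so the demo is deterministic (a real leader would not wait for workers to attach). Worker "w3" models a failed fork() by never being started, and "late" models a worker that launches after sorting is complete. Unlike the sketch in the text, the late worker here detaches before exiting so that the toy participant count stays balanced.

```python
import heapq
import threading


class ToyBarrier:
    """Hypothetical phase barrier with a dynamic participant count."""

    def __init__(self):
        self._cond = threading.Condition()
        self.phase = 0
        self._participants = 0
        self._arrived = 0

    def _advance(self):
        # Caller holds the lock: start the next phase and wake all waiters.
        self._arrived = 0
        self.phase += 1
        self._cond.notify_all()

    def attach(self):
        """Join the barrier and return the phase observed on joining."""
        with self._cond:
            self._participants += 1
            return self.phase

    def arrive_and_wait(self):
        """Block until every attached participant has arrived."""
        with self._cond:
            self._arrived += 1
            if self._arrived == self._participants:
                self._advance()
                return
            start = self.phase
            while self.phase == start:
                self._cond.wait()

    def detach(self):
        """Leave the barrier; release waiters if we were the last one missing."""
        with self._cond:
            self._participants -= 1
            if self._participants > 0 and self._arrived == self._participants:
                self._advance()


def worker(barrier, worker_id, rows, tapes, attached):
    phase = barrier.attach()
    attached.set()                   # demo-only hook, see lead-in
    if phase > 0:                    # arrived after sorting finished
        barrier.detach()
        return
    tapes[worker_id] = sorted(rows)  # phase 0: sorting
    barrier.arrive_and_wait()        # 0 -> 1: all sorts are done
    barrier.arrive_and_wait()        # 1 -> 2: leader has taken over tapes
    barrier.detach()


def run_demo():
    barrier = ToyBarrier()
    tapes = {}
    chunks = {"w1": [5, 1, 9], "w2": [4, 8, 2], "w3": [7, 3], "late": [6, 0]}

    def launch(worker_id):
        attached = threading.Event()
        thread = threading.Thread(
            target=worker,
            args=(barrier, worker_id, chunks[worker_id], tapes, attached))
        thread.start()
        return thread, attached

    barrier.attach()                 # leader attaches before starting workers
    started = [launch(w) for w in ("w1", "w2")]   # "w3" never launches
    for _, attached in started:
        attached.wait()              # demo-only: force "on time" attachment

    barrier.arrive_and_wait()        # returns once phase 1 has begun
    adopted = dict(tapes)            # adopt only the tapes actually created
    merged = list(heapq.merge(*adopted.values()))
    barrier.detach()                 # releases workers waiting in phase 1

    late_thread, _ = launch("late")  # attaches with phase > 0, exits at once
    for thread, _ in started:
        thread.join()
    late_thread.join()
    return sorted(adopted), merged, barrier.phase


if __name__ == "__main__":
    print(run_demo())
```

In this model the leader never needs to know how many workers were requested: it merges whatever tapes exist once its first wait returns, which is the property that makes fork() failure harmless.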
Unless somebody comes up with a better plan, Barrier seems to be the way to do that in PostgreSQL.

I don't think using WaitForParallelWorkersToFinish() is a good idea. That would require workers to hold onto their tuplesorts until after losing the ability to send messages to the leader, which doesn't sound like a very good plan. We don't want workers to detach from their error queues until the bitter end, lest errors go unreported.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company