Re: [HACKERS] parallel.c oblivion of worker-startup failures

Peter Geoghegan Tue, 23 Jan 2018 20:34:54 -0800

On Tue, Jan 23, 2018 at 8:25 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
>> Hmm, I think that case will be addressed because tuple queues can
>> detect if the leader is not attached.  It does in code path
>> shm_mq_receive->shm_mq_counterparty_gone.  In
>> shm_mq_counterparty_gone, it can detect if the worker is gone by using
>> GetBackgroundWorkerPid.  Moreover, I have manually tested this
>> particular case before saying your patch is fine.  Do you have some
>> other case in mind which I am missing?
>
> Hmm.  Yeah.  I can't seem to reach a stuck case and was probably just
> confused and managed to confuse Robert too.  If you make
> fork_process() fail randomly (see attached), I see that there are a
> couple of easily reachable failure modes (example session at bottom of
> message):
>
> 1.  HandleParallelMessages() is reached and raises a "lost connection
> to parallel worker" error because shm_mq_receive() returns
> SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked
> GetBackgroundWorkerPid() just as you said.  I guess that's happening
> because some other process is (coincidentally) sending
> PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a
> process is unexpectedly stopped.
>
> 2.  WaitForParallelWorkersToFinish() is reached and raises a "parallel
> worker failed to initialize" error.  TupleQueueReaderNext() set done
> to true, because shm_mq_receive() returned SHM_MQ_DETACHED.  Once
> again, that is because shm_mq_counterparty_gone() returned true.  This
> is the bit Robert and I missed in our off-list discussion.
>
> As long as we always get our latch set by the postmaster after a fork
> failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is
> guaranteed to return BGWH_STOPPED after that, and as long as we only
> ever use latch/CFI loops to wait, and as long as we try to read from a
> shm_mq, then I don't see a failure mode that hangs.


What about the parallel_leader_participation=off case?

-- 
Peter Geoghegan

Re: [HACKERS] parallel.c oblivion of worker-startup failures

Reply via email to