Re: [HACKERS] parallel.c oblivion of worker-startup failures

Amit Kapila Fri, 08 Dec 2017 01:56:48 -0800

On Wed, Dec 6, 2017 at 8:02 PM, Robert Haas <[email protected]> wrote:
> On Wed, Dec 6, 2017 at 1:43 AM, Amit Kapila <[email protected]> wrote:
>> Looks good to me.
>
> Committed and back-patched to 9.4 where dynamic background workers
> were introduced.
>


Thanks.

>>> Barring objections, I'll commit and
>>> back-patch this; then we can deal with the other part of this
>>> afterward.
>>
>> Sure, I will rebase on top of the commit unless you have some comments
>> on the remaining part.
>
> I'm not in love with that part of the fix; the other parts of that if
> statement are just testing variables, and a function call that takes
> and releases an LWLock is a lot more expensive.  Furthermore, that
> test will often be hit in practice, because we'll often arrive at this
> function before workers have actually finished.  On top of that, we'll
> typically arrive here having already communicated with the worker in
> some way, such as by receiving tuples, which means that in most cases
> we already know that the worker was alive at least at some point, and
> therefore the extra test isn't necessary.  We only need that test, if
> I understand correctly, to cover the failure-to-launch case, which is
> presumably very rare.
>
> All that having been said, I don't quite know how to do it better.  It
> would be easy enough to modify this function so that if it ever
> received a result other then BGWH_NOT_YET_STARTED for a worker, it
> wouldn't call this function again for that worker.  However, that's
> only mitigating the problem, not really fixing it.  We'll still end up
> frequently inquiring - admittedly only once - about the status of a
> worker with which we've previously communicated via the tuple queue.
> Maybe we just have to live with that problem, but do you (or does
> anyone else) have an idea?
>

I think the optimization you are suggesting has a merit over what I
have done in the patch.  However, one point to note is that we are
anyway going to call that function down in the path(
WaitForParallelWorkersToExit->WaitForBackgroundWorkerShutdown->GetBackgroundWorkerPid),
so we might want to take the advantage of calling it first time as
well.  We can actually cache the status of workers that have returned
BGWH_STOPPED and use it later so that we don't need to make this
function call again for workers which are already stopped.  We can
even do what Tom is suggesting to avoid calling it for workers which
are known to be launched, but I am really not sure if that function
call is costly enough that we need to maintain one extra state to
avoid that.

While looking into this code path, I wonder how much we are gaining by
having two separate calls (WaitForParallelWorkersToFinish and
WaitForParallelWorkersToExit) to check the status of workers after
finishing the parallel execution?  They are used together in rescan
code path, so apparently there is no benefit calling them separately
there.  OTOH, during shutdown of nodes, it will often be the case that
they will be called in a short duration after each other except for
the case where we need to shut down the node for the early exit like
when there is a limit clause or such.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] parallel.c oblivion of worker-startup failures

Reply via email to