Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 3:37 PM, Robert Haas wrote: > Well, I've been resisting that approach from the very beginning of > parallel query. Eventually, I hope that we're going to go in the > direction of changing our mind about how many workers parallel > operations use "on the fly". For example,

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Robert Haas
On Wed, Jan 24, 2018 at 5:52 PM, Peter Geoghegan wrote: >> If we made the Gather node wait only for workers that actually reached >> the Gather -- either by using a Barrier or by some other technique -- >> then this would be a lot less fragile. And the same kind of technique >> would work for par

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 12:05 PM, Robert Haas wrote: > In Thomas's test case, he's using 4 workers, and if even one of those > manages to start, then it'll probably do so after any fork failures > have already occurred, and the error will be noticed when that process > sends its first message to t

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Robert Haas
On Tue, Jan 23, 2018 at 9:45 PM, Amit Kapila wrote: > Hmm, I think that case will be addressed because tuple queues can > detect if the leader is not attached. It does in code path > shm_mq_receive->shm_mq_counterparty_gone. In > shm_mq_counterparty_gone, it can detect if the worker is gone by u

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 1:57 AM, Thomas Munro wrote: > On Wed, Jan 24, 2018 at 5:25 PM, Thomas Munro > wrote: >> If there were some way for the postmaster to cause reason >> PROCSIG_PARALLEL_MESSAGE to be set in the leader process instead of >> just notification via kill(SIGUSR1) when it fails to

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-24 Thread Thomas Munro
On Wed, Jan 24, 2018 at 5:25 PM, Thomas Munro wrote: > If there were some way for the postmaster to cause reason > PROCSIG_PARALLEL_MESSAGE to be set in the leader process instead of > just notification via kill(SIGUSR1) when it fails to fork a parallel > worker, we'd get (1) for free in any latch

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Peter Geoghegan
On Tue, Jan 23, 2018 at 9:38 PM, Amit Kapila wrote: > On Wed, Jan 24, 2018 at 10:38 AM, Peter Geoghegan wrote: >> The leader can go ahead and sort before calling something like a new >> WaitForParallelWorkersToAttach() function (or even >> WaitForParallelWorkersToFinish()). If we did add a >> Wai

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Amit Kapila
On Wed, Jan 24, 2018 at 10:38 AM, Peter Geoghegan wrote: > On Tue, Jan 23, 2018 at 9:02 PM, Thomas Munro > wrote: >>> Yes, this is what I am trying to explain on parallel create index >>> thread. I think there we need to either use >>> WaitForParallelWorkersToFinish or WaitForParallelWorkersToAt

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Peter Geoghegan
On Tue, Jan 23, 2018 at 9:02 PM, Thomas Munro wrote: >> Yes, this is what I am trying to explain on parallel create index >> thread. I think there we need to either use >> WaitForParallelWorkersToFinish or WaitForParallelWorkersToAttach (a >> new API as proposed in that thread) if we don't want t

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Thomas Munro
On Wed, Jan 24, 2018 at 5:43 PM, Amit Kapila wrote: >> Hmm. Yeah. I can't seem to reach a stuck case and was probably just >> confused and managed to confuse Robert too. If you make >> fork_process() fail randomly (see attached), I see that there are a >> couple of easily reachable failure mode

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Amit Kapila
On Wed, Jan 24, 2018 at 9:55 AM, Thomas Munro wrote: > On Wed, Jan 24, 2018 at 3:45 PM, Amit Kapila wrote: >> On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas wrote: >>> In the case of Gather, for example, we won't >>> WaitForParallelWorkersToFinish() until we've read all the tuples. If >>> there's

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Amit Kapila
On Wed, Jan 24, 2018 at 10:03 AM, Peter Geoghegan wrote: > On Tue, Jan 23, 2018 at 8:25 PM, Thomas Munro > wrote: >>> Hmm, I think that case will be addressed because tuple queues can >>> detect if the leader is not attached. It does in code path >>> shm_mq_receive->shm_mq_counterparty_gone. In

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Peter Geoghegan
On Tue, Jan 23, 2018 at 8:25 PM, Thomas Munro wrote: >> Hmm, I think that case will be addressed because tuple queues can >> detect if the leader is not attached. It does in code path >> shm_mq_receive->shm_mq_counterparty_gone. In >> shm_mq_counterparty_gone, it can detect if the worker is gone

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Thomas Munro
On Wed, Jan 24, 2018 at 3:45 PM, Amit Kapila wrote: > On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas wrote: >> In the case of Gather, for example, we won't >> WaitForParallelWorkersToFinish() until we've read all the tuples. If >> there's a tuple queue that does not yet have a sender, then we'll j

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Amit Kapila
On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas wrote: > On Tue, Jan 23, 2018 at 11:15 AM, Robert Haas wrote: >> On Mon, Jan 22, 2018 at 10:56 PM, Amit Kapila >> wrote: >> >> OK. I've committed this patch and back-patched it to 9.6. >> Back-patching to 9.5 didn't looks simple because nworkers_lau

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Peter Geoghegan
On Tue, Jan 23, 2018 at 1:05 PM, Robert Haas wrote: > I was just talking to Thomas, and one or the other of us realized that > this doesn't completely fix the problem. I think that if a worker > fails, even by some unorthodox mechanism like an abrupt proc_exit(1), > after the point where it attac

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Robert Haas
On Tue, Jan 23, 2018 at 11:15 AM, Robert Haas wrote: > On Mon, Jan 22, 2018 at 10:56 PM, Amit Kapila wrote: >> It does turn such cases into error but only at the end when someone >> calls WaitForParallelWorkersToFinish. If one tries to rely on it at >> any other time, it won't work as I think is

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-23 Thread Robert Haas
On Mon, Jan 22, 2018 at 10:56 PM, Amit Kapila wrote: > It does turn such cases into error but only at the end when someone > calls WaitForParallelWorkersToFinish. If one tries to rely on it at > any other time, it won't work as I think is the case for Parallel > Create Index where Peter G. is try

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-22 Thread Amit Kapila
On Mon, Jan 22, 2018 at 7:16 PM, Robert Haas wrote: > On Fri, Jan 19, 2018 at 10:59 PM, Amit Kapila wrote: >> The patch doesn't apply cleanly on the head, but after rebasing it, I >> have reviewed and tested it and it seems to be working fine. Apart >> from this specific issue, I think we should

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-22 Thread Robert Haas
On Fri, Jan 19, 2018 at 10:59 PM, Amit Kapila wrote: > The patch doesn't apply cleanly on the head, but after rebasing it, I > have reviewed and tested it and it seems to be working fine. Apart > from this specific issue, I think we should consider making > nworkers_launched reliable (at the very

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-19 Thread Amit Kapila
On Thu, Jan 18, 2018 at 8:53 PM, Robert Haas wrote: > On Thu, Dec 21, 2017 at 9:13 AM, Amit Kapila wrote: >> I am not against using the way specific to parallel context layer as >> described by you above. However, I was trying to see if there is >> some general purpose solution as the low-impac

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2018-01-18 Thread Robert Haas
On Thu, Dec 21, 2017 at 9:13 AM, Amit Kapila wrote: > I am not against using the way specific to parallel context layer as > described by you above. However, I was trying to see if there is > some general purpose solution as the low-impact way is not very > straightforward. I think you can go a

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-21 Thread Amit Kapila
On Thu, Dec 21, 2017 at 6:26 PM, Robert Haas wrote: > On Thu, Dec 21, 2017 at 6:46 AM, Amit Kapila wrote: >> What if we don't allow to reuse such slots till the backend/session >> that has registered it performs unregister? Currently, we don't seem >> to have an API corresponding to Register*Bac

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-21 Thread Robert Haas
On Thu, Dec 21, 2017 at 6:46 AM, Amit Kapila wrote: > What if we don't allow to reuse such slots till the backend/session > that has registered it performs unregister? Currently, we don't seem > to have an API corresponding to Register*BackgroundWorker() which can > be used to unregister, but may

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-21 Thread Amit Kapila
On Thu, Dec 21, 2017 at 2:39 AM, Robert Haas wrote: > On Tue, Dec 19, 2017 at 11:28 PM, Amit Kapila wrote: >> That is not the main point I am bothered about. I am concerned that >> the patch proposed by you can lead to hang if there is a crash (abrupt >> failure like proc_exit(1)) after attachin

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-20 Thread Robert Haas
On Tue, Dec 19, 2017 at 11:28 PM, Amit Kapila wrote: > That is not the main point I am bothered about. I am concerned that > the patch proposed by you can lead to hang if there is a crash (abrupt > failure like proc_exit(1)) after attaching to the error queue. This is > explained in my email up t

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-19 Thread Amit Kapila
On Tue, Dec 19, 2017 at 9:01 PM, Robert Haas wrote: > On Tue, Dec 19, 2017 at 5:01 AM, Amit Kapila wrote: >> I think it would have been much easier to fix this problem if we would >> have some way to differentiate whether the worker has stopped >> gracefully or not. Do you think it makes sense t

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-19 Thread Robert Haas
On Tue, Dec 19, 2017 at 5:01 AM, Amit Kapila wrote: > I think it would have been much easier to fix this problem if we would > have some way to differentiate whether the worker has stopped > gracefully or not. Do you think it makes sense to introduce such a > state in the background worker machin

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-19 Thread Amit Kapila
On Thu, Dec 14, 2017 at 3:05 AM, Robert Haas wrote: > On Wed, Dec 13, 2017 at 1:41 AM, Amit Kapila wrote: > >> This also doesn't appear to be completely safe. If we add >> proc_exit(1) after attaching to error queue (say after >> pq_set_parallel_master) in the worker, then it will lead to *hang*

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-13 Thread Robert Haas
On Wed, Dec 13, 2017 at 1:41 AM, Amit Kapila wrote: > I have tried to reproduce this situation by adding error case in > worker (just before worker returns success ('X' message)) and by > having breakpoint in WaitForParallelWorkersToFinish. What, I noticed > is that worker is not allowed to exit

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-12 Thread Amit Kapila
On Wed, Dec 13, 2017 at 12:31 AM, Robert Haas wrote: > On Mon, Dec 11, 2017 at 10:41 PM, Amit Kapila wrote: >> In particular, if the worker crashed (say somebody forcefully killed >> the worker), then I think it will lead to a restart of the server. So >> not sure, adding anything about that in

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-12 Thread Robert Haas
On Mon, Dec 11, 2017 at 10:41 PM, Amit Kapila wrote: > In particular, if the worker crashed (say somebody forcefully killed > the worker), then I think it will lead to a restart of the server. So > not sure, adding anything about that in the comment will make much > sense. I think it's dangerous

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-11 Thread Amit Kapila
On Tue, Dec 12, 2017 at 9:00 AM, Amit Kapila wrote: > On Mon, Dec 11, 2017 at 11:27 PM, Robert Haas wrote: >> On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila >> wrote: >>> Okay, see the attached and let me know if that suffices the need? >> >> + * Check for unexpected worker death. T

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-11 Thread Amit Kapila
On Mon, Dec 11, 2017 at 11:27 PM, Robert Haas wrote: > On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila wrote: >> Okay, see the attached and let me know if that suffices the need? > > + * Check for unexpected worker death. This will ensure that if > + * the postmaster failed

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-11 Thread Robert Haas
On Sun, Dec 10, 2017 at 11:07 PM, Amit Kapila wrote: > Okay, see the attached and let me know if that suffices the need? + * Check for unexpected worker death. This will ensure that if + * the postmaster failed to start the worker, then we don't wait + * for i

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-10 Thread Amit Kapila
On Fri, Dec 8, 2017 at 9:45 PM, Robert Haas wrote: > On Fri, Dec 8, 2017 at 4:56 AM, Amit Kapila wrote: >> I think the optimization you are suggesting has a merit over what I >> have done in the patch. However, one point to note is that we are >> anyway going to call that function down in the pa

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-08 Thread Robert Haas
On Fri, Dec 8, 2017 at 4:56 AM, Amit Kapila wrote: > I think the optimization you are suggesting has a merit over what I > have done in the patch. However, one point to note is that we are > anyway going to call that function down in the path( > WaitForParallelWorkersToExit->WaitForBackgroundWork

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-08 Thread Amit Kapila
On Wed, Dec 6, 2017 at 8:02 PM, Robert Haas wrote: > On Wed, Dec 6, 2017 at 1:43 AM, Amit Kapila wrote: >> Looks good to me. > > Committed and back-patched to 9.4 where dynamic background workers > were introduced. > Thanks. >>> Barring objections, I'll commit and >>> back-patch this; then we c

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-06 Thread Robert Haas
On Wed, Dec 6, 2017 at 9:54 AM, Tom Lane wrote: > Maybe track "worker is known to have launched" in the leader's state > someplace? That's probably one piece of it, yes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-06 Thread Tom Lane
Robert Haas writes: > I'm not in love with that part of the fix; the other parts of that if > statement are just testing variables, and a function call that takes > and releases an LWLock is a lot more expensive. Furthermore, that > test will often be hit in practice, because we'll often arrive a

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-06 Thread Robert Haas
On Wed, Dec 6, 2017 at 1:43 AM, Amit Kapila wrote: > Looks good to me. Committed and back-patched to 9.4 where dynamic background workers were introduced. >> Barring objections, I'll commit and >> back-patch this; then we can deal with the other part of this >> afterward. > > Sure, I will rebase

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-05 Thread Amit Kapila
On Tue, Dec 5, 2017 at 11:34 PM, Robert Haas wrote: > On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier > wrote: >> On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila wrote: >>> The patch still applies (with some hunks). I have added it in CF [1] >>> to avoid losing track. >>> >>> [1] - https://commi

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-05 Thread Robert Haas
On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier wrote: > On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila wrote: >> The patch still applies (with some hunks). I have added it in CF [1] >> to avoid losing track. >> >> [1] - https://commitfest.postgresql.org/15/1341/ > > This did not get reviews and

Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-11-28 Thread Michael Paquier
On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila wrote: > The patch still applies (with some hunks). I have added it in CF [1] > to avoid losing track. > > [1] - https://commitfest.postgresql.org/15/1341/ This did not get reviews and the patch still applies. I am moving it to next CF. -- Michael