Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Michael Paquier
On Wed, Jun 12, 2019 at 01:42:01PM -0400, Alvaro Herrera wrote: > I looked at the buildfarm failures for the recoveryCheck stage. It > looks like there is only one failure for branch master after this > commit, which was chipmunk saying: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=

Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Michael Paquier
On Wed, Jun 12, 2019 at 04:26:23PM -0400, Tom Lane wrote: > Poking at that, I find that a1a789eb5 back-patches reasonably painlessly > into v11 and v10, but trying to bring it back to 9.6 encounters a pile of > merge failures. Also, looking at the git logs shows that we did a hell > of a lot of su

Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Tom Lane
Alvaro Herrera writes: > On 2019-Jun-12, Tom Lane wrote: >> Should we do that now, or wait till after next week's releases? > IMO this has been hammered enough in master, and we still have a few > days in the back-branches for buildfarm, that it's okay to do it now. Poking at that, I find that a

Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Alvaro Herrera
On 2019-Jun-12, Tom Lane wrote: > Alvaro Herrera writes: > > Maybe now's a good time to get it back-patched? > > Should we do that now, or wait till after next week's releases? IMO this has been hammered enough in master, and we still have a few days in the back-branches for buildfarm, that it

Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Tom Lane
Alvaro Herrera writes: > On 2019-Apr-29, Tom Lane wrote: >> Ashwin Agrawal writes: >>> Have we decided if this will be applied to back branches? >> My feeling about it is "maybe eventually, but most definitely not >> the week before a set of minor releases". Some positive experience >> with Gre

Re: Race conditions with checkpointer and shutdown

2019-06-12 Thread Alvaro Herrera
On 2019-Apr-29, Tom Lane wrote: > Ashwin Agrawal writes: > > On Mon, Apr 29, 2019 at 10:36 AM Tom Lane wrote: > >> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46 > >> to see if it fixes the problem for you? > > > Yes, will give it a try on greenplum and report back the result. >

Re: Race conditions with checkpointer and shutdown

2019-04-29 Thread Tom Lane
Ashwin Agrawal writes: > On Mon, Apr 29, 2019 at 10:36 AM Tom Lane wrote: >> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46 >> to see if it fixes the problem for you? > Yes, will give it a try on greenplum and report back the result. > Have we decided if this will be applied to b

Re: Race conditions with checkpointer and shutdown

2019-04-29 Thread Ashwin Agrawal
On Mon, Apr 29, 2019 at 10:36 AM Tom Lane wrote: > > Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46 > to see if it fixes the problem for you? Yes, will give it a try on greenplum and report back the result. Have we decided if this will be applied to back branches?

Re: Race conditions with checkpointer and shutdown

2019-04-29 Thread Tom Lane
Ashwin Agrawal writes: > For Greenplum (based on 9.4 but current master code looks the same) we > did see deadlocks recently hit in CI many times for walreceiver which > I believe confirms above finding. > #0 __lll_lock_wait_private () at > ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95 > #

Re: Race conditions with checkpointer and shutdown

2019-04-29 Thread Ashwin Agrawal
On Sat, Apr 27, 2019 at 5:57 PM Tom Lane wrote: > > I have spent a fair amount of time trying to replicate these failures > locally, with little success. I now think that the most promising theory > is Munro's idea in [1] that the walreceiver is hanging up during its > unsafe attempt to do erepor

Re: Race conditions with checkpointer and shutdown

2019-04-29 Thread Andres Freund
Hi, On 2019-04-29 12:55:31 -0400, Tom Lane wrote: > Andres Freund writes: > > Hm, I'm not convinced that's OK. What if there's a network hickup? We'll > > wait until there's an OS tcp timeout, no? > > No. send() is only going to block if there's no room in the kernel's > buffers, and that would

Re: Race conditions with checkpointer and shutdown

2019-04-29 Thread Tom Lane
Andres Freund writes: > On 2019-04-27 20:56:51 -0400, Tom Lane wrote: >> The existing code does not use PQsetnonblocking, which means that it's >> theoretically at risk of blocking while pushing out data to the remote >> server. In practice I think that risk is negligible because (IIUC) we >> don

Re: Race conditions with checkpointer and shutdown

2019-04-29 Thread Andres Freund
Hi, On 2019-04-27 20:56:51 -0400, Tom Lane wrote: > Even if that isn't the proximate cause of the current reports, it's > clearly trouble waiting to happen, and we should get rid of it. > Accordingly, see attached proposed patch. This just flushes the > "immediate interrupt" stuff in favor of mak

Re: Race conditions with checkpointer and shutdown

2019-04-29 Thread Tom Lane
Thomas Munro writes: > On Sun, Apr 28, 2019 at 12:56 PM Tom Lane wrote: >> Even if that isn't the proximate cause of the current reports, it's >> clearly trouble waiting to happen, and we should get rid of it. > +1 > That reminds me, we should probably also clean up at least the > ereport-from-

Re: Race conditions with checkpointer and shutdown

2019-04-28 Thread Thomas Munro
On Sun, Apr 28, 2019 at 12:56 PM Tom Lane wrote: > Even if that isn't the proximate cause of the current reports, it's > clearly trouble waiting to happen, and we should get rid of it. > Accordingly, see attached proposed patch. This just flushes the > "immediate interrupt" stuff in favor of maki

Re: Race conditions with checkpointer and shutdown

2019-04-27 Thread Tom Lane
I have spent a fair amount of time trying to replicate these failures locally, with little success. I now think that the most promising theory is Munro's idea in [1] that the walreceiver is hanging up during its unsafe attempt to do ereport(FATAL) from inside a signal handler. It's extremely plau

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
>>> Maybe what we should be looking for is "why doesn't the walreceiver >>> shut down"? But the dragonet log you quote above shows the walreceiver >>> exiting, or at least starting to exit. Tis a puzzlement. huh ... take a look at this little stanza in PostmasterStateMachine: if (pmState ==

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Thomas Munro writes: > On Fri, Apr 19, 2019 at 10:22 AM Tom Lane wrote: >> Maybe what we should be looking for is "why doesn't the walreceiver >> shut down"? But the dragonet log you quote above shows the walreceiver >> exiting, or at least starting to exit. Tis a puzzlement. > ... Is there so

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Thomas Munro
On Fri, Apr 19, 2019 at 10:22 AM Tom Lane wrote: > Thomas Munro writes: > > 2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver > > process due to administrator command > Maybe what we should be looking for is "why doesn't the walreceiver > shut down"? But the dragonet log you q

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Thomas Munro writes: > Interesting, but I'm not sure how that could be though. Perhaps, a > bit like the other thing that cropped up in the build farm after that > commit, removing ~200ms of needless sleeping around an earlier online > CHECKPOINT made some other pre-existing race condition more l

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Michael Paquier writes: > On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote: >> Anyway, this is *not* new in v12. > Indeed. It seems to me that v12 makes the problem easier to appear > though, and I got to wonder if c6c9474 is helping in that as more > cases are popping up since mid-March

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Thomas Munro
On Fri, Apr 19, 2019 at 2:30 PM Michael Paquier wrote: > On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote: > > Anyway, this is *not* new in v12. > > Indeed. It seems to me that v12 makes the problem easier to appear > though, and I got to wonder if c6c9474 is helping in that as more > cas

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Michael Paquier
On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote: > It's the latter. I searched the buildfarm database for failure logs > including the string "server does not shut down" within the last three > years, and got all of the hits attached. Not all of these look like > the failure pattern Mich

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Thomas Munro writes: > This is a curious thing from dragonet's log: > 2019-04-16 08:23:24.178 CEST [8335] LOG: received fast shutdown request > 2019-04-16 08:23:24.178 CEST [8335] LOG: aborting any active transactions > 2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver > proce

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Thomas Munro
On Fri, Apr 19, 2019 at 2:53 AM Tom Lane wrote: > Question is, what other theory has anybody got? I wondered if there might be a way for PostmasterStateMachine() to be reached with without signals blocked, in the case where we fork a fresh checkpointers, and then it misses the SIGUSR2 that we imm

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Robert Haas wrote (in the other thread): > Any idea whether it's something newly-introduced or of long standing? It's the latter. I searched the buildfarm database for failure logs including the string "server does not shut down" within the last three years, and got all of the hits attached. No

Re: Race conditions with checkpointer and shutdown

2019-04-18 Thread Tom Lane
Thomas Munro writes: > On Wed, Apr 17, 2019 at 10:45 AM Tom Lane wrote: >> I think what we need to look for is reasons why (1) the postmaster >> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's >> main loop doesn't get to noticing shutdown_requested. >> >> A rather scary point

Re: Race conditions with checkpointer and shutdown

2019-04-17 Thread Thomas Munro
On Wed, Apr 17, 2019 at 10:45 AM Tom Lane wrote: > I think what we need to look for is reasons why (1) the postmaster > never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's > main loop doesn't get to noticing shutdown_requested. > > A rather scary point for (2) is that said main loop

Re: Race conditions with checkpointer and shutdown

2019-04-16 Thread Andres Freund
Hi, On 2019-04-16 17:05:36 -0700, Andres Freund wrote: > On 2019-04-16 18:59:37 -0400, Robert Haas wrote: > > On Tue, Apr 16, 2019 at 6:45 PM Tom Lane wrote: > > > Do we need to think harder about establishing rules for multiplexed > > > use of the process latch? I'm imagining some rule like "if

Re: Race conditions with checkpointer and shutdown

2019-04-16 Thread Andres Freund
Hi, On 2019-04-16 18:59:37 -0400, Robert Haas wrote: > On Tue, Apr 16, 2019 at 6:45 PM Tom Lane wrote: > > Do we need to think harder about establishing rules for multiplexed > > use of the process latch? I'm imagining some rule like "if you are > > not the outermost event loop of a process, you

Re: Race conditions with checkpointer and shutdown

2019-04-16 Thread Robert Haas
On Tue, Apr 16, 2019 at 6:45 PM Tom Lane wrote: > Do we need to think harder about establishing rules for multiplexed > use of the process latch? I'm imagining some rule like "if you are > not the outermost event loop of a process, you do not get to > summarily clear MyLatch. Make sure to leave

Re: Race conditions with checkpointer and shutdown

2019-04-16 Thread Tom Lane
Michael Paquier writes: > The buildfarm has reported two similar failures when shutting down a > node: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01 > I

Race conditions with checkpointer and shutdown

2019-04-16 Thread Michael Paquier
Hi all, This is a continuation of the following thread, but I prefer spawning a new thread for clarity: https://www.postgresql.org/message-id/20190416064512.gj2...@paquier.xyz The buildfarm has reported two similar failures when shutting down a node: https://buildfarm.postgresql.org/cgi-bin/show_