On Wed, Jun 12, 2019 at 01:42:01PM -0400, Alvaro Herrera wrote:
> I looked at the buildfarm failures for the recoveryCheck stage. It
> looks like there is only one failure for branch master after this
> commit, which was chipmunk saying:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=
On Wed, Jun 12, 2019 at 04:26:23PM -0400, Tom Lane wrote:
> Poking at that, I find that a1a789eb5 back-patches reasonably painlessly
> into v11 and v10, but trying to bring it back to 9.6 encounters a pile of
> merge failures. Also, looking at the git logs shows that we did a hell
> of a lot of su
Alvaro Herrera writes:
> On 2019-Jun-12, Tom Lane wrote:
>> Should we do that now, or wait till after next week's releases?
> IMO this has been hammered enough in master, and we still have a few
> days in the back-branches for buildfarm, that it's okay to do it now.
Poking at that, I find that a1a789eb5 back-patches reasonably painlessly into v11 and v10, but trying to bring it back to 9.6 encounters a pile of merge failures.
On 2019-Jun-12, Tom Lane wrote:
> Alvaro Herrera writes:
> > Maybe now's a good time to get it back-patched?
>
> Should we do that now, or wait till after next week's releases?
IMO this has been hammered enough in master, and we still have a few
days in the back-branches for buildfarm, that it's okay to do it now.
Alvaro Herrera writes:
> On 2019-Apr-29, Tom Lane wrote:
>> Ashwin Agrawal writes:
>>> Have we decided if this will be applied to back branches?
>> My feeling about it is "maybe eventually, but most definitely not
>> the week before a set of minor releases". Some positive experience
>> with Greenplum
On 2019-Apr-29, Tom Lane wrote:
> Ashwin Agrawal writes:
> > On Mon, Apr 29, 2019 at 10:36 AM Tom Lane wrote:
> >> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
> >> to see if it fixes the problem for you?
>
> > Yes, will give it a try on greenplum and report back the result.
>
Ashwin Agrawal writes:
> On Mon, Apr 29, 2019 at 10:36 AM Tom Lane wrote:
>> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
>> to see if it fixes the problem for you?
> Yes, will give it a try on greenplum and report back the result.
> Have we decided if this will be applied to back branches?
On Mon, Apr 29, 2019 at 10:36 AM Tom Lane wrote:
>
> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
> to see if it fixes the problem for you?
Yes, will give it a try on greenplum and report back the result.
Have we decided if this will be applied to back branches?
Ashwin Agrawal writes:
> For Greenplum (based on 9.4 but current master code looks the same) we
> did see deadlocks recently hit in CI many times for walreceiver which
> I believe confirms the above finding.
> #0 __lll_lock_wait_private () at
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
> #
On Sat, Apr 27, 2019 at 5:57 PM Tom Lane wrote:
>
> I have spent a fair amount of time trying to replicate these failures
> locally, with little success. I now think that the most promising theory
> is Munro's idea in [1] that the walreceiver is hanging up during its
> unsafe attempt to do ereport(FATAL) from inside a signal handler.
Hi,
On 2019-04-29 12:55:31 -0400, Tom Lane wrote:
> Andres Freund writes:
> > Hm, I'm not convinced that's OK. What if there's a network hiccup? We'll
> > wait until there's an OS tcp timeout, no?
>
> No. send() is only going to block if there's no room in the kernel's
> buffers, and that would
Andres Freund writes:
> On 2019-04-27 20:56:51 -0400, Tom Lane wrote:
>> The existing code does not use PQsetnonblocking, which means that it's
>> theoretically at risk of blocking while pushing out data to the remote
>> server. In practice I think that risk is negligible because (IIUC) we
>> don
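For context on the PQsetnonblocking point: below is a minimal libpq sketch of the general technique, putting the connection into non-blocking mode and draining queued output with PQflush() while waiting on the socket, so a send cannot stall indefinitely inside libpq. It illustrates only the libpq API; it is not the walreceiver code or the committed patch, and the helper name and the bare select() loop are assumptions.

/* Illustration of non-blocking libpq output; not walreceiver code. */
#include <sys/select.h>
#include <libpq-fe.h>

/* Hypothetical helper: flush queued output without blocking inside libpq. */
static int
flush_without_blocking(PGconn *conn)
{
    if (PQsetnonblocking(conn, 1) != 0)     /* returns 0 on success, -1 on failure */
        return -1;

    for (;;)
    {
        int     r = PQflush(conn);          /* 0 = done, 1 = more to send, -1 = error */

        if (r == 0)
            return 0;
        if (r < 0)
            return -1;

        /* Data still queued: wait until the socket is writable, then retry. */
        {
            int     sock = PQsocket(conn);
            fd_set  wfds;

            FD_ZERO(&wfds);
            FD_SET(sock, &wfds);
            if (select(sock + 1, NULL, &wfds, NULL, NULL) < 0)
                return -1;
        }
    }
}

With the connection in non-blocking mode, the only place the process can sleep is the wait call, which is exactly where interrupt handling can be layered in.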
Hi,
On 2019-04-27 20:56:51 -0400, Tom Lane wrote:
> Even if that isn't the proximate cause of the current reports, it's
> clearly trouble waiting to happen, and we should get rid of it.
> Accordingly, see attached proposed patch. This just flushes the
> "immediate interrupt" stuff in favor of mak
Thomas Munro writes:
> On Sun, Apr 28, 2019 at 12:56 PM Tom Lane wrote:
>> Even if that isn't the proximate cause of the current reports, it's
>> clearly trouble waiting to happen, and we should get rid of it.
> +1
> That reminds me, we should probably also clean up at least the
> ereport-from-
On Sun, Apr 28, 2019 at 12:56 PM Tom Lane wrote:
> Even if that isn't the proximate cause of the current reports, it's
> clearly trouble waiting to happen, and we should get rid of it.
> Accordingly, see attached proposed patch. This just flushes the
> "immediate interrupt" stuff in favor of maki
I have spent a fair amount of time trying to replicate these failures
locally, with little success. I now think that the most promising theory
is Munro's idea in [1] that the walreceiver is hanging up during its
unsafe attempt to do ereport(FATAL) from inside a signal handler. It's
extremely plau
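To spell out why that is unsafe: ereport() can allocate memory, take locks, and write to the client socket, none of which is async-signal-safe, so doing it from a SIGTERM handler can deadlock against whatever the interrupted code was in the middle of (which would be consistent with the __lll_lock_wait_private backtrace reported for Greenplum above). Below is a generic POSIX illustration of the safe alternative, a handler that only records the request and leaves the real exit work to the main line of control; it is deliberately not the walreceiver code.

/* Generic illustration, not PostgreSQL source: the handler sets a flag only. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static volatile sig_atomic_t terminate_requested = 0;

static void
sigterm_handler(int signo)
{
    (void) signo;
    terminate_requested = 1;    /* async-signal-safe: just record the request */
}

int
main(void)
{
    struct sigaction sa = {0};

    sa.sa_handler = sigterm_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);

    for (;;)
    {
        /* Safe point: no locks held, so real cleanup/reporting is fine here. */
        if (terminate_requested)
        {
            fprintf(stderr, "terminating on administrator command\n");
            exit(1);    /* the unsafe variant would do all this inside the handler */
        }
        sleep(1);       /* simulate a round of normal work between checks */
    }
}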
>>> Maybe what we should be looking for is "why doesn't the walreceiver
>>> shut down"? But the dragonet log you quote above shows the walreceiver
>>> exiting, or at least starting to exit. Tis a puzzlement.
huh ... take a look at this little stanza in PostmasterStateMachine:
if (pmState ==
Thomas Munro writes:
> On Fri, Apr 19, 2019 at 10:22 AM Tom Lane wrote:
>> Maybe what we should be looking for is "why doesn't the walreceiver
>> shut down"? But the dragonet log you quote above shows the walreceiver
>> exiting, or at least starting to exit. Tis a puzzlement.
> ... Is there so
On Fri, Apr 19, 2019 at 10:22 AM Tom Lane wrote:
> Thomas Munro writes:
> > 2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver
> > process due to administrator command
> Maybe what we should be looking for is "why doesn't the walreceiver
> shut down"? But the dragonet log you q
Thomas Munro writes:
> Interesting, but I'm not sure how that could be though. Perhaps, a
> bit like the other thing that cropped up in the build farm after that
> commit, removing ~200ms of needless sleeping around an earlier online
> CHECKPOINT made some other pre-existing race condition more likely
Michael Paquier writes:
> On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
>> Anyway, this is *not* new in v12.
> Indeed. It seems to me that v12 makes the problem easier to appear
> though, and I got to wonder if c6c9474 is helping in that as more
> cases are popping up since mid-March
On Fri, Apr 19, 2019 at 2:30 PM Michael Paquier wrote:
> On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
> > Anyway, this is *not* new in v12.
>
> Indeed. It seems to me that v12 makes the problem easier to appear
> though, and I got to wonder if c6c9474 is helping in that as more
> cases are popping up since mid-March
On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
> It's the latter. I searched the buildfarm database for failure logs
> including the string "server does not shut down" within the last three
> years, and got all of the hits attached. Not all of these look like
> the failure pattern Mich
Thomas Munro writes:
> This is a curious thing from dragonet's log:
> 2019-04-16 08:23:24.178 CEST [8335] LOG: received fast shutdown request
> 2019-04-16 08:23:24.178 CEST [8335] LOG: aborting any active transactions
> 2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver
> process due to administrator command
On Fri, Apr 19, 2019 at 2:53 AM Tom Lane wrote:
> Question is, what other theory has anybody got?
I wondered if there might be a way for PostmasterStateMachine() to be
reached without signals blocked, in the case where we fork a
fresh checkpointer, and then it misses the SIGUSR2 that we
imm
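If something along those lines is happening, the textbook defence is the block-signals-around-fork pattern: the parent blocks the signal before forking, the child installs its handler while the signal is still blocked, and then waits with sigsuspend(), so there is no window in which a signal sent immediately after the fork can be lost. The sketch below is generic POSIX code illustrating only that pattern, on the assumption that the race being described is of this shape; it is not the postmaster's actual logic.

/* Generic sketch: a SIGUSR2 sent right after fork() cannot be missed. */
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t shutdown_requested = 0;

static void
usr2_handler(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

int
main(void)
{
    sigset_t    block, orig;
    pid_t       pid;

    /* Parent: keep SIGUSR2 pending across the fork. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR2);
    sigprocmask(SIG_BLOCK, &block, &orig);

    pid = fork();
    if (pid == 0)
    {
        /* Child: install the handler while SIGUSR2 is still blocked. */
        struct sigaction sa = {0};

        sa.sa_handler = usr2_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR2, &sa, NULL);

        /* sigsuspend() unblocks and waits atomically: no lost-wakeup window. */
        while (!shutdown_requested)
            sigsuspend(&orig);
        _exit(0);
    }

    /* Parent: safe to signal at once; the child cannot miss it. */
    kill(pid, SIGUSR2);
    waitpid(pid, NULL, 0);
    sigprocmask(SIG_SETMASK, &orig, NULL);
    return 0;
}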
Robert Haas wrote (in the other thread):
> Any idea whether it's something newly-introduced or of long standing?
It's the latter. I searched the buildfarm database for failure logs
including the string "server does not shut down" within the last three
years, and got all of the hits attached. No
Thomas Munro writes:
> On Wed, Apr 17, 2019 at 10:45 AM Tom Lane wrote:
>> I think what we need to look for is reasons why (1) the postmaster
>> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
>> main loop doesn't get to noticing shutdown_requested.
>>
>> A rather scary point
On Wed, Apr 17, 2019 at 10:45 AM Tom Lane wrote:
> I think what we need to look for is reasons why (1) the postmaster
> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
> main loop doesn't get to noticing shutdown_requested.
>
> A rather scary point for (2) is that said main loop
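One obvious hazard in a loop like that is the window between checking the shutdown flag and going to sleep. The usual defence is a sig_atomic_t flag set from the SIGUSR2 handler plus the process latch: the handler sets both, and the loop resets the latch before re-checking the flag, so a request arriving after the check still makes the next wait return immediately. A simplified sketch of that pattern follows; it is an approximation, not checkpointer.c itself, and the timeout and wait-event values are arbitrary.

/* Simplified sketch of flag-plus-latch shutdown handling; not checkpointer.c. */
#include "postgres.h"

#include <signal.h>

#include "miscadmin.h"
#include "pgstat.h"
#include "storage/ipc.h"
#include "storage/latch.h"

static volatile sig_atomic_t shutdown_requested = false;

/* SIGUSR2: the postmaster is asking us to finish up and exit. */
static void
ReqShutdownHandler(SIGNAL_ARGS)
{
    int     save_errno = errno;

    shutdown_requested = true;
    SetLatch(MyLatch);          /* make sure the main loop wakes up */
    errno = save_errno;
}

static void
MainLoop(void)
{
    pqsignal(SIGUSR2, ReqShutdownHandler);

    for (;;)
    {
        /* Reset the latch first, then check the flag: no lost wakeups. */
        ResetLatch(MyLatch);

        if (shutdown_requested)
        {
            /* ... write the shutdown checkpoint here ... */
            proc_exit(0);
        }

        /* ... one round of normal work ... */

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         5000L,         /* ms */
                         WAIT_EVENT_CHECKPOINTER_MAIN);
    }
}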
Hi,
On 2019-04-16 17:05:36 -0700, Andres Freund wrote:
> On 2019-04-16 18:59:37 -0400, Robert Haas wrote:
> > On Tue, Apr 16, 2019 at 6:45 PM Tom Lane wrote:
> > > Do we need to think harder about establishing rules for multiplexed
> > > use of the process latch? I'm imagining some rule like "if
Hi,
On 2019-04-16 18:59:37 -0400, Robert Haas wrote:
> On Tue, Apr 16, 2019 at 6:45 PM Tom Lane wrote:
> > Do we need to think harder about establishing rules for multiplexed
> > use of the process latch? I'm imagining some rule like "if you are
> > not the outermost event loop of a process, you
On Tue, Apr 16, 2019 at 6:45 PM Tom Lane wrote:
> Do we need to think harder about establishing rules for multiplexed
> use of the process latch? I'm imagining some rule like "if you are
> not the outermost event loop of a process, you do not get to
> summarily clear MyLatch. Make sure to leave
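Concretely, that rule would mean an inner wait loop which consumes a latch wakeup not meant for it has to put the wakeup back before returning, so the outermost loop still sees it. Here is a small sketch of what such an inner helper might look like under that rule; the helper and condition_met() are hypothetical, and this is an illustration of the proposal, not committed code.

/* Sketch: an inner wait loop that does not swallow its caller's latch wakeups. */
#include "postgres.h"

#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"

extern bool condition_met(void);    /* hypothetical condition we are waiting on */

static void
wait_for_condition(void)
{
    bool    latch_was_set = false;

    while (!condition_met())
    {
        int     rc;

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       100L,        /* ms */
                       PG_WAIT_EXTENSION);

        if (rc & WL_LATCH_SET)
        {
            latch_was_set = true;
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }
    }

    /*
     * We are not the outermost event loop: if we consumed a latch wakeup,
     * set the latch again so the caller's loop still notices whatever else
     * it may have been signalling.
     */
    if (latch_was_set)
        SetLatch(MyLatch);
}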
Michael Paquier writes:
> The buildfarm has reported two similar failures when shutting down a
> node:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01
> I
Hi all,
This is a continuation of the following thread, but I prefer spawning
a new thread for clarity:
https://www.postgresql.org/message-id/20190416064512.gj2...@paquier.xyz
The buildfarm has reported two similar failures when shutting down a
node:
https://buildfarm.postgresql.org/cgi-bin/show_