At Mon, 30 May 2022 12:01:55 -0700, Andres Freund wrote in
> Hi,
>
> On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> > On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > > Andres Freund writes:
> > > I still feel like there's something off here. But that's probably not enough
> >
Hi,
On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > Andres Freund writes:
> > > I still feel like there's something off here. But that's probably not enough
> > > to keep causing failures. I'm inclined to leave the debugging in for a b
Hi,
On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> Andres Freund writes:
> > I still feel like there's something off here. But that's probably not enough
> > to keep causing failures. I'm inclined to leave the debugging in for a bit
> > longer, but not fail the test anymore?
>
> WFM.
I've done
Andres Freund writes:
> I still feel like there's something off here. But that's probably not enough
> to keep causing failures. I'm inclined to leave the debugging in for a bit
> longer, but not fail the test anymore?
WFM.
> The way the temporary slot removal hangs for a while seems just off:
Hi,
On 2022-03-27 15:28:05 -0400, Tom Lane wrote:
> Andres Freund writes:
> > How about the attached variation, which retries (for 15s, with 100ms sleeps)
> > if there are multiple walsenders, printing the debugging info each time? It'll
> > still fail the test if later iterations find just
Andres Freund writes:
> How about the attached variation, which retries (for 15s, with 100ms sleeps)
> if there are multiple walsenders, printing the debugging info each time? It'll
> still fail the test if later iterations find just one walsender, which seems
> the right behaviour for now.
We ha
Andres Freund writes:
> How about the attached variation, which retries (for 15s, with 100ms sleeps)
> if there are multiple walsenders, printing the debugging info each time? It'll
> still fail the test if later iterations find just one walsender, which seems
> the right behaviour for now.
OK by
Hi,
On 2022-03-24 15:05:40 -0400, Tom Lane wrote:
> I just noticed something very interesting: in a couple of recent
> buildfarm runs with this failure, the pg_stat_activity printout
> no longer shows the extra walsender:
Oh. That is interesting. Thanks for catching that.
> What this suggests t
I just noticed something very interesting: in a couple of recent
buildfarm runs with this failure, the pg_stat_activity printout
no longer shows the extra walsender:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-03-24%2017%3A50%3A10
https://buildfarm.postgresql.org/cgi-bi
Hi,
On 2022-02-25 12:15:58 -0800, Andres Freund wrote:
> On 2022-02-25 15:07:01 -0500, Tom Lane wrote:
> > Andres Freund writes:
> > Hmm. Maybe put a couple more debug messages into ReplicationSlotCleanup
> > and/or ReplicationSlotDropPtr? It doesn't seem very clear where in that
> > sequence i
Hi,
On 2022-02-25 15:07:01 -0500, Tom Lane wrote:
> Andres Freund writes:
> > Seems to suggest something is holding a problematic lock in a way that I do
> > not understand yet:
>
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-c
Andres Freund writes:
> Seems to suggest something is holding a problematic lock in a way that I do
> not understand yet:
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
> 2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1
Hi,
On 2022-02-22 18:06:24 -0800, Andres Freund wrote:
> On 2022-02-18 15:14:15 -0800, Andres Freund wrote:
> > I'm running out of ideas for how to try to reproduce this. I think we might
> > need some additional debugging information to get more information from the
> > buildfarm.
>
> > I'm thin
Hi,
I think I did find a bug related to the test, but afaics not the cause of the
test failures we're seeing. See
https://www.postgresql.org/message-id/20220223014855.4lsddr464i7mymk2%40alap3.anarazel.de
I don't think it's related to the problem of this thread, because the logs of
primary3 don't
Hi,
On 2022-02-18 18:49:14 -0500, Tom Lane wrote:
> Andres Freund writes:
> > We'd also need to add pq_endmessage_noblock(), because the pq_endmessage()
> > obviously tries to send (as in the backtrace upthread) if the output buffer is
> > large enough, which it often will be in walsender.
>
Andres Freund writes:
> On 2022-02-18 18:15:21 -0500, Tom Lane wrote:
>> Perhaps it'd be sensible to do this only in debugging (ie Assert)
>> builds?
> That seems not great, because it pretty clearly can lead to hangs, which is
> problematic in tests too. What about using pq_flush_if_writable()?
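To make the idea concrete, here is a minimal sketch of the kind of helper being discussed. pq_endmessage_noblock() is only a name proposed in this thread, not an existing function; the sketch assumes the existing backend APIs pq_putmessage_noblock() (which grows the output buffer rather than forcing a send) and pq_flush_if_writable() (which sends only what the socket will accept right now):

#include "postgres.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"

/* Like pq_endmessage(), but never blocks waiting on the client socket. */
static void
pq_endmessage_noblock(StringInfo buf)
{
    /* msgtype was saved in the cursor field by pq_beginmessage() */
    pq_putmessage_noblock(buf->cursor, buf->data, buf->len);
    pfree(buf->data);
    buf->data = NULL;
}

Error-path callers would then drain the buffer opportunistically with pq_flush_if_writable() instead of a blocking pq_flush(), so releasing locks and slots is not held up behind a stalled client.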
Hi,
On 2022-02-18 18:15:21 -0500, Tom Lane wrote:
> Andres Freund writes:
> > On 2022-02-17 21:55:21 -0800, Andres Freund wrote:
> >> Isn't it pretty bonkers that we allow error processing to get stuck behind
> >> network traffic, *before* we have released resources (locks etc)?
>
> It's mor
Andres Freund writes:
> On 2022-02-17 21:55:21 -0800, Andres Freund wrote:
>> Isn't it pretty bonkers that we allow error processing to get stuck behind
>> network traffic, *before* we have released resources (locks etc)?
It's more or less intentional, per elog.c:
/*
* This flush i
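The comment being cut off is the one attached to the flush at the end of send_message_to_frontend() in src/backend/utils/error/elog.c; paraphrasing it from memory (exact wording may differ across versions), it reads roughly:

    /*
     * This flush is normally not necessary, since postgres.c will flush out
     * waiting data when control returns to the main loop. But it seems best
     * to leave it here, so that the client has some clue what happened if the
     * backend dies before getting back to the main loop ... error/notice
     * messages should not be a performance-critical path anyway, so an extra
     * flush won't hurt much ...
     */
    pq_flush();

Either that flush, or the blocking send inside pq_endmessage() once the buffer is already large (as in the backtrace mentioned upthread), is where a walsender with a backlog of pending WAL sits while reporting the FATAL, before the shmem-exit hooks that release its slot get to run.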
Hi,
On 2022-02-18 14:42:48 -0800, Andres Freund wrote:
> On 2022-02-17 21:55:21 -0800, Andres Freund wrote:
> > Isn't it pretty bonkers that we allow error processing to get stuck behind
> > network traffic, *before* we have released resources (locks etc)?
>
> This is particularly likely to
Hi,
On 2022-02-17 21:55:21 -0800, Andres Freund wrote:
> Isn't it pretty bonkers that we allow error processing to get stuck behind
> network traffic, *before* we have released resources (locks etc)?
This is particularly likely to be a problem for walsenders, because they often
have a large
Hi,
On 2022-02-17 19:41:40 -0800, Andres Freund wrote:
> At this point the psql "walsender" is *not* terminated, because the fatal
> message is waiting behind all the WAL data sent. Which also means that the
> slot isn't yet released.
Isn't it pretty bonkers that we allow error processing to get
Andres Freund writes:
> There've not been any new reports in the last 18 hours, but that's probably
> just due to lower commit activity triggering fewer runs.
myna just showed the same symptom:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=myna&dt=2022-02-18%2003%3A00%3A17
# Failed
Hi,
On 2022-02-17 16:34:34 -0800, Andres Freund wrote:
> I've done this for a couple hundred iterations, under load, subsequently with
> additional assertions, without being able to reproduce.
Playing around with this further I did get into one interesting state:
I started psql with replication=
Hi,
On 2022-02-16 20:03:00 -0800, Andres Freund wrote:
> I'll run the test in a loop, perhaps I can reproduce...
I've done this for a couple hundred iterations, under load, subsequently with
additional assertions, without being able to reproduce.
There've not been any new reports in the last 18
Hi,
On 2022-02-16 22:58:19 -0500, Tom Lane wrote:
> I wonder though if an exiting walsender would always take that path.
You're right - the CFI() calls in PostgresMain() and in several replication
commands are not at all guaranteed to be in a transaction when throwing a FATAL.
I don't quite see where we co
Hi,
On 2022-02-16 19:46:26 -0800, Andres Freund wrote:
> But right now I'm not seeing what prevents us from throwing a FATAL error
> while holding an lwlock?
Nothing. But we register ShutdownPostgres() / ShutdownAuxiliaryProcess() to
take care of that.
I checked and walsenders do have ShutdownPo
Andres Freund writes:
> But right now I'm not seeing what prevents us from throwing a FATAL error
> while holding an lwlock?
If we're inside a transaction, then ShutdownPostgres ->
AbortOutOfAnyTransaction would release LWLocks. I wonder
though if an exiting walsender would always take that path.
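For reference, a rough sketch (simplified, not the exact source) of the cleanup wiring being described here: InitPostgres() registers ShutdownPostgres() as a before_shmem_exit callback, and on a FATAL exit that callback aborts any open transaction, which is what releases LWLocks; a walsender that throws FATAL while not inside a transaction gets nothing released by this callback, which is exactly the doubt being raised.

#include "postgres.h"
#include "access/xact.h"
#include "storage/ipc.h"

/* simplified stand-in for ShutdownPostgres() in postinit.c */
static void
ShutdownPostgres_sketch(int code, Datum arg)
{
    /* abort any open transaction; transaction abort releases LWLocks */
    AbortOutOfAnyTransaction();
}

/* registered during backend startup, roughly as InitPostgres() does */
void
register_shutdown_sketch(void)
{
    before_shmem_exit(ShutdownPostgres_sketch, 0);
}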
Hi,
On 2022-02-16 22:11:30 -0500, Tom Lane wrote:
> So (a) it broke around 48 hours ago, which is already a useful
> bit of info
Indeed. 2f6501fa3c54bbe4568e3bcccd9a60d26a46b5ee seems like the obvious commit
to blame.
We document before_shmem_exit hooks as
/*
* Call before_shmem
I wrote:
> So (a) it broke around 48 hours ago, which is already a useful
> bit of info, and (b) your animals seem far more susceptible than
> anyone else's. Why do you suppose that is?
Eyeballing the commit log for potential causes in that time frame,
I can't help noticing
2 days ago Andre
Andres Freund writes:
> On 2022-02-16 20:22:23 -0500, Tom Lane wrote:
>> There's no disconnection log entry for either, which I suppose means
>> that somebody didn't bother logging disconnection for walsenders ...
> The thing is, we actually *do* log disconnection for walsenders:
Ah, my mistake,
Hi,
On 2022-02-16 20:22:23 -0500, Tom Lane wrote:
> There's no disconnection log entry for either, which I suppose means
> that somebody didn't bother logging disconnection for walsenders ...
The thing is, we actually *do* log disconnection for walsenders:
2022-02-16 17:53:55.780 PST [2459806][w
Andres Freund writes:
> I think the test is telling us that something may be broken. We shouldn't
> silence that without at least some understanding of what it is.
I looked at the recent failure on komodoensis [1], and I think what is
happening is just that the walsender for the basebackup run (laun
Hi,
On 2022-02-16 18:04:14 +0900, Masahiko Sawada wrote:
> On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi
> wrote:
> >
> > At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada
> > wrote in
> > > Or it's possible that the process took some time to clean up the
> > > temporary replication slot?
>
On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi
wrote:
>
> At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada
> wrote in
> > Or it's possible that the process took some time to clean up the
> > temporary replication slot?
>
> Checkpointer may take ReplicationSlotControlLock. Deadlock between
>
At Wed, 16 Feb 2022 15:22:27 +0900 (JST), Kyotaro Horiguchi
wrote in
> At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada
> wrote in
> > Or it's possible that the process took some time to clean up the
> > temporary replication slot?
>
> Checkpointer may take ReplicationSlotControlLock. Dead l
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada
wrote in
> Or it's possible that the process took some time to clean up the
> temporary replication slot?
Checkpointer may take ReplicationSlotControlLock. Deadlock between
ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots
happened?
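For context on where those two paths meet, here is a rough paraphrase (simplified, not the verbatim source) of the temporary-slot cleanup an exiting walsender performs, based on ReplicationSlotCleanup() in src/backend/replication/slot.c. It scans the slot array under ReplicationSlotControlLock, the same lock the checkpointer's InvalidateObsoleteReplicationSlots() operates under, which is why the two can interact:

#include "postgres.h"
#include "miscadmin.h"
#include "replication/slot.h"
#include "storage/lwlock.h"

void
ReplicationSlotCleanup_sketch(void)
{
restart:
    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

        /* drop only temporary slots still owned by this (exiting) backend */
        if (!s->in_use || s->active_pid != MyProcPid)
            continue;

        /* the real code releases the LWLock before dropping, then rescans */
        LWLockRelease(ReplicationSlotControlLock);
        ReplicationSlotDropPtr(s);      /* file-local drop helper in slot.c */
        goto restart;
    }
    LWLockRelease(ReplicationSlotControlLock);
}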
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada
wrote in
> On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi
> wrote:
> >
> > At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund
> > wrote in
> > > I guess it's conceivable that the backend was still working through process
> > > shutd
At Wed, 16 Feb 2022 14:26:37 +0900 (JST), Kyotaro Horiguchi
wrote in
> Agreed. Doing this at all slot creation seems fine.
Done in the attached. The first slot is deliberately created
unreserved, so I changed the code to re-create the slot with "reserved"
before taking the backup.
> > Even though
On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi
wrote:
>
> At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund wrote
> in
> > I think what happened is that there was no WAL to receive between the start of
> > the primary and the $node_primary3->wait_for_catchup($node_standby3);
> >
> > Beca
At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund wrote in
> I think what happened is that there was no WAL to receive between the start of
> the primary and the $node_primary3->wait_for_catchup($node_standby3);
>
> Because the slot is created without reserving WAL that allows the primary to
> r
Hi,
On 2022-02-15 23:29:20 +0200, Heikki Linnakangas wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04
>
> [07:42:23] t/018_wal_optimize.pl ok 12403 ms ( 0.00 usr
> 0.00 sys + 1.40 cusr 0.63 csys = 2.03 CPU)
> # poll_qu
While looking at recent failures in the new 028_pitr_timelines.pl
recovery test, I noticed that there have been a few failures in the
buildfarm in the recoveryCheck phase even before that, in the
019_replslot_limit.pl test.
For example:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=