Re: Timeout failure in 019_replslot_limit.pl

2021-10-05 Thread Michael Paquier
On Sat, Oct 02, 2021 at 07:00:01PM -0300, Alvaro Herrera wrote:
> A patch was proposed on that thread on September 22nd, can you try with
> that and see if this problem still reproduces?

Yes, the failure still shows up, even with a timeout set at 30s which is the default of the patch.
--
Michael

Re: Timeout failure in 019_replslot_limit.pl

2021-10-02 Thread Alvaro Herrera
On 2021-Sep-27, Michael Paquier wrote:
> I got again a failure today, so I have used this occasion to check that
> when the checkpoint gets stuck the WAL sender process getting SIGCONT
> is still around, waiting for a write to happen:
> * thread #1, queue = 'com.apple.main-thread', stop reason = s

Re: Timeout failure in 019_replslot_limit.pl

2021-09-27 Thread Amit Kapila
On Mon, Sep 27, 2021 at 12:13 PM Michael Paquier wrote:
>
> On Mon, Sep 27, 2021 at 11:53:07AM +0530, Amit Kapila wrote:
> > So, it seems on your machine it has passed the following condition in
> > secure_write:
> > if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
>
> Ye

Re: Timeout failure in 019_replslot_limit.pl

2021-09-26 Thread Michael Paquier
On Mon, Sep 27, 2021 at 11:53:07AM +0530, Amit Kapila wrote:
> So, it seems on your machine it has passed the following condition in
> secure_write:
> if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))

Yep.

> If so, this indicates write failure which seems odd to me and pr

Re: Timeout failure in 019_replslot_limit.pl

2021-09-26 Thread Amit Kapila
On Mon, Sep 27, 2021 at 11:32 AM Michael Paquier wrote:
>
> On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:
> > Now, in the failed run, it appears that due to some reason WAL sender
> > has not released the slot. Is it possible to see if the WAL sender is
> > still alive when a checkp

Re: Timeout failure in 019_replslot_limit.pl

2021-09-26 Thread Michael Paquier
On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:
> Now, in the failed run, it appears that due to some reason WAL sender
> has not released the slot. Is it possible to see if the WAL sender is
> still alive when a checkpoint is stuck at ConditionVariableSleep? And
> if it is active, wha

Re: Timeout failure in 019_replslot_limit.pl

2021-09-25 Thread Amit Kapila
On Wed, Sep 22, 2021 at 12:57 PM Michael Paquier wrote:
>
> On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:
> > On 2021-Sep-20, Michael Paquier wrote:
> >> The test gets the right PIDs, as the logs showed:
> >> ok 17 - have walsender pid 12663
> >> ok 18 - have walreceiver pid 1266

Re: Timeout failure in 019_replslot_limit.pl

2021-09-22 Thread Michael Paquier
On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:
> On 2021-Sep-20, Michael Paquier wrote:
>> The test gets the right PIDs, as the logs showed:
>> ok 17 - have walsender pid 12663
>> ok 18 - have walreceiver pid 12662
>
> As I understood, Horiguchi-san's point isn't that the PIDs mig

Re: Timeout failure in 019_replslot_limit.pl

2021-09-20 Thread Michael Paquier
On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:
> On 2021-Sep-20, Michael Paquier wrote:
>>> If that doesn't work, let's try Horiguchi-san's idea of using some
>>> `ps` flags to find the process.
>>
>> Tried this one as well, to see the same failure.
>
> Hmm, do you mean that you

Re: Timeout failure in 019_replslot_limit.pl

2021-09-20 Thread Alvaro Herrera
On 2021-Sep-20, Michael Paquier wrote:
> > Can you please first test if the idea of sending the signal twice is
> > enough?
>
> This idea does not work. I got one failure after 5 tries.

OK, thanks for taking the time to test it.

> > If that doesn't work, let's try Horiguchi-san's idea of using

Re: Timeout failure in 019_replslot_limit.pl

2021-09-20 Thread Michael Paquier
On Sat, Sep 18, 2021 at 05:19:04PM -0300, Alvaro Herrera wrote:
> Hmm, sounds a possibly useful idea to explore, but I would only do so if
> the other ideas prove fruitless, because it sounds like it'd have more
> moving parts. Can you please first test if the idea of sending the signal
> twice is

Re: Timeout failure in 019_replslot_limit.pl

2021-09-18 Thread Alvaro Herrera
On 2021-Sep-18, Michael Paquier wrote:
> Could it be possible to rely on a combination of wait events set in WAL
> senders and pg_stat_replication to assume that a WAL sender is in a
> stopped state?

Hmm, sounds a possibly useful idea to explore, but I would only do so if the other ideas prove fr

Re: Timeout failure in 019_replslot_limit.pl

2021-09-17 Thread Michael Paquier
On Fri, Sep 17, 2021 at 08:41:00PM -0700, Noah Misch wrote:
> If this fixes things for the OP, I'd like it slightly better than the "ps"
> approach. It's less robust, but I like the brevity.
>
> Another alternative might be to have walreceiver reach walsender via a proxy
> Perl script. Then, mak

Re: Timeout failure in 019_replslot_limit.pl

2021-09-17 Thread Noah Misch
On Fri, Sep 17, 2021 at 06:59:24PM -0300, Alvaro Herrera wrote:
> On 2021-Sep-07, Kyotaro Horiguchi wrote:
> > It seems like the "kill 'STOP'" in the script didn't suspend the
> > processes before advancing WAL. The attached uses 'ps' command to
> > check that since I didn't come up with the way to

Re: Timeout failure in 019_replslot_limit.pl

2021-09-17 Thread Alvaro Herrera
On 2021-Sep-07, Kyotaro Horiguchi wrote:
> It seems like the "kill 'STOP'" in the script didn't suspend the
> processes before advancing WAL. The attached uses 'ps' command to
> check that since I didn't come up with the way to do the same in Perl.

Ah! so we tell the kernel to send the signal, bu
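
For illustration only, the kind of check described above (send SIGSTOP, then ask `ps` whether the process is really stopped before moving on) might be sketched as follows. This is not the patch attached to the thread, which is not reproduced here, and the test itself is written in Perl. The helper names `process_state` and `stop_and_confirm`, the timeout, and the polling interval are hypothetical. It assumes a `ps` that accepts `-o state= -p PID` and reports stopped processes with a state starting with `T`, as common Linux and macOS implementations do. A later message in the thread reports that a `ps`-based check did not make the failure go away.

```python
import os
import signal
import subprocess
import time


def process_state(pid: int) -> str:
    """Return the state column ps reports for pid, or '' if ps prints nothing."""
    result = subprocess.run(
        ["ps", "-o", "state=", "-p", str(pid)],
        capture_output=True,
        text=True,
        check=False,  # ps exits non-zero when the pid does not exist
    )
    return result.stdout.strip()


def stop_and_confirm(pid: int, timeout: float = 10.0) -> bool:
    """Send SIGSTOP to pid, then poll ps until it shows a stopped state.

    Returns True once the state starts with 'T', False if the timeout
    (a hypothetical default) expires first.
    """
    os.kill(pid, signal.SIGSTOP)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process_state(pid).startswith("T"):
            return True
        time.sleep(0.05)  # signal delivery is asynchronous, so keep polling
    return False
```

A caller would only go on to the next step (advancing WAL, in the test discussed here) once `stop_and_confirm` returns True for both PIDs.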

Re: Timeout failure in 019_replslot_limit.pl

2021-09-06 Thread Kyotaro Horiguchi
At Tue, 7 Sep 2021 09:37:10 +0900, Michael Paquier wrote in
> On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> > I scraped the buildfarm logs looking for similar failures, and didn't
> > find any. (019_replslot_limit.pl hasn't failed at all in the farm
> > since the last fix it recei

Re: Timeout failure in 019_replslot_limit.pl

2021-09-06 Thread Michael Paquier
On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> I scraped the buildfarm logs looking for similar failures, and didn't
> find any. (019_replslot_limit.pl hasn't failed at all in the farm
> since the last fix it received, in late July.)

The interesting bits are in 019_replslot_limit_pri

Re: Timeout failure in 019_replslot_limit.pl

2021-09-06 Thread Tom Lane
Alvaro Herrera writes:
> On 2021-Sep-06, Michael Paquier wrote:
>> # poll_query_until timed out executing this query:
>> # SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

> Hmm, I've never seen that, and I do run tests in parallel quite often.

I scraped the buildfarm logs lo

Re: Timeout failure in 019_replslot_limit.pl

2021-09-06 Thread Alvaro Herrera
Hello

On 2021-Sep-06, Michael Paquier wrote:
> Running the recovery tests in a parallel run, enough to bloat a
> machine in resources, sometimes leads me to the following failure:
> ok 19 - walsender termination logged
> # poll_query_until timed out executing this query:
> # SELECT wal_status FRO

Timeout failure in 019_replslot_limit.pl

2021-09-05 Thread Michael Paquier
Hi all,

Running the recovery tests in a parallel run, enough to bloat a machine in resources, sometimes leads me to the following failure:
ok 19 - walsender termination logged
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'