On Sat, Oct 02, 2021 at 07:00:01PM -0300, Alvaro Herrera wrote:
> A patch was proposed on that thread on September 22nd, can you try with
> that and see if this problem still reproduces?
Yes, the failure still shows up, even with a timeout set at 30s which
is the default of the patch.
--
Michael
On 2021-Sep-27, Michael Paquier wrote:
> I got again a failure today, so I have used this occasion to check that
> when the checkpoint gets stuck the WAL sender process getting SIGCONT
> is still around, waiting for a write to happen:
> * thread #1, queue = 'com.apple.main-thread', stop reason = s
On Mon, Sep 27, 2021 at 12:13 PM Michael Paquier wrote:
>
> On Mon, Sep 27, 2021 at 11:53:07AM +0530, Amit Kapila wrote:
> > So, it seems on your machine it has passed the following condition in
> > secure_write:
> > if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
>
> Ye
On Mon, Sep 27, 2021 at 11:53:07AM +0530, Amit Kapila wrote:
> So, it seems on your machine it has passed the following condition in
> secure_write:
> if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
Yep.
> If so, this indicates write failure which seems odd to me and pr
On Mon, Sep 27, 2021 at 11:32 AM Michael Paquier wrote:
>
> On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:
> > Now, in the failed run, it appears that due to some reason WAL sender
> > has not released the slot. Is it possible to see if the WAL sender is
> > still alive when a checkp
On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:
> Now, in the failed run, it appears that due to some reason WAL sender
> has not released the slot. Is it possible to see if the WAL sender is
> still alive when a checkpoint is stuck at ConditionVariableSleep? And
> if it is active, wha
On Wed, Sep 22, 2021 at 12:57 PM Michael Paquier wrote:
>
> On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:
> > On 2021-Sep-20, Michael Paquier wrote:
> >> The test gets the right PIDs, as the logs showed:
> >> ok 17 - have walsender pid 12663
> >> ok 18 - have walreceiver pid 1266
On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:
> On 2021-Sep-20, Michael Paquier wrote:
>> The test gets the right PIDs, as the logs showed:
>> ok 17 - have walsender pid 12663
>> ok 18 - have walreceiver pid 12662
>
> As I understood, Horiguchi-san's point isn't that the PIDs mig
On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:
> On 2021-Sep-20, Michael Paquier wrote:
>>> If that doesn't work, let's try Horiguchi-san's idea of using some
>>> `ps` flags to find the process.
>>
>> Tried this one as well, to see the same failure.
>
> Hmm, do you mean that you
On 2021-Sep-20, Michael Paquier wrote:
> > Can you please first test if the idea of sending the signal twice is
> > enough?
>
> This idea does not work. I got one failure after 5 tries.
OK, thanks for taking the time to test it.
> > If that doesn't work, let's try Horiguchi-san's idea of using
On Sat, Sep 18, 2021 at 05:19:04PM -0300, Alvaro Herrera wrote:
> Hmm, sounds a possibly useful idea to explore, but I would only do so if
> the other ideas prove fruitless, because it sounds like it'd have more
> moving parts. Can you please first test if the idea of sending the signal
> twice is
On 2021-Sep-18, Michael Paquier wrote:
> Could it be possible to rely on a combination of wait events set in WAL
> senders and pg_stat_replication to assume that a WAL sender is in a
> stopped state?
Hmm, sounds a possibly useful idea to explore, but I would only do so if
the other ideas prove fr
On Fri, Sep 17, 2021 at 08:41:00PM -0700, Noah Misch wrote:
> If this fixes things for the OP, I'd like it slightly better than the "ps"
> approach. It's less robust, but I like the brevity.
>
> Another alternative might be to have walreceiver reach walsender via a proxy
> Perl script. Then, mak
On Fri, Sep 17, 2021 at 06:59:24PM -0300, Alvaro Herrera wrote:
> On 2021-Sep-07, Kyotaro Horiguchi wrote:
> > It seems like the "kill 'STOP'" in the script didn't suspend the
> > processes before advancing WAL. The attached uses 'ps' command to
> > check that since I didn't come up with the way to
On 2021-Sep-07, Kyotaro Horiguchi wrote:
> It seems like the "kill 'STOP'" in the script didn't suspend the
> processes before advancing WAL. The attached uses 'ps' command to
> check that since I didn't come up with the way to do the same in Perl.
Ah! so we tell the kernel to send the signal, bu
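To illustrate the race described here: sending the stop signal only queues it, so a test may need to confirm that the process is really suspended before advancing WAL. The following is a minimal sketch of that check in Python, not the patch attached to the thread (which is a Perl test change). The helper names `process_state` and `stop_and_wait` are hypothetical, the 30s default only mirrors the timeout mentioned elsewhere in this thread, and the `state` output keyword of `ps` is assumed to be available (it is on Linux procps and on macOS/BSD, but it is not part of POSIX).

```python
import os
import signal
import subprocess
import time


def process_state(pid: int) -> str:
    """Return the state string reported by ps for pid, or '' if it is gone.

    Assumption: this platform's ps accepts the 'state' output keyword.
    """
    result = subprocess.run(
        ["ps", "-o", "state=", "-p", str(pid)],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def stop_and_wait(pid: int, timeout: float = 30.0, interval: float = 0.1) -> bool:
    """Send SIGSTOP to pid, then poll ps until it reports a stopped state.

    Returns True once the state starts with 'T' (stopped), False on timeout.
    """
    os.kill(pid, signal.SIGSTOP)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process_state(pid).startswith("T"):
            return True
        time.sleep(interval)
    return False
```

A caller would run `stop_and_wait(walsender_pid)` and only advance WAL when it returns True; `walsender_pid` is a placeholder for whatever PID the test has collected.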
At Tue, 7 Sep 2021 09:37:10 +0900, Michael Paquier wrote:
> On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> > I scraped the buildfarm logs looking for similar failures, and didn't
> > find any. (019_replslot_limit.pl hasn't failed at all in the farm
> > since the last fix it recei
On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> I scraped the buildfarm logs looking for similar failures, and didn't
> find any. (019_replslot_limit.pl hasn't failed at all in the farm
> since the last fix it received, in late July.)
The interesting bits are in 019_replslot_limit_pri
Alvaro Herrera writes:
> On 2021-Sep-06, Michael Paquier wrote:
>> # poll_query_until timed out executing this query:
>> # SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
> Hmm, I've never seen that, and I do run tests in parallel quite often.
I scraped the buildfarm logs lo
Hello
On 2021-Sep-06, Michael Paquier wrote:
> Running the recovery tests in a parallel run, enough to bloat a
> machine in resources, sometimes leads me to the following failure:
> ok 19 - walsender termination logged
> # poll_query_until timed out executing this query:
> # SELECT wal_status FRO
Hi all,
Running the recovery tests in a parallel run, enough to bloat a
machine in resources, sometimes leads me to the following failure:
ok 19 - walsender termination logged
# poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'