Hi,

On 5/6/23 3:28 PM, Amit Kapila wrote:
On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
<bertranddrouvot...@gmail.com> wrote:

There is 2 runs with this extra info in place:

A successful one: https://cirrus-ci.com/task/6528745436086272
A failed one: https://cirrus-ci.com/task/4558139312308224


Thanks, I think I got some clue as to why this test is failing
randomly.

Great, thanks!

Observations:
============
1. In the failed run, the KeepLogSeg(), reduced the _logSegNo to 3
which is the reason for the failure because now the standby won't be
able to remove/recycle the WAL file corresponding to segment number 3
which the test was expecting.

Agree.

2.  We didn't expect the KeepLogSeg() to reduce the _logSegNo because
all logical slots were invalidated. However, I think we forgot that
both standby and primary have physical slots which might also
influence the XLogGetReplicationSlotMinimumLSN() calculation in
KeepLogSeg().

Oh right...

Next steps:
=========
1. The first thing is we should verify this theory by adding some LOG
in KeepLogSeg() to see if the _logSegNo is reduced due to the value
returned by XLogGetReplicationSlotMinimumLSN().

Yeah, will do that early next week.

2. The reason for the required file not being removed in the primary
is also that it has a physical slot which prevents the file removal.

Yeah, agree. But this one is not an issue as we are not
checking for the WAL file removal on the primary, do you agree?

3. If the above theory is correct then I see a few possibilities to
fix this test (a) somehow ensure that restart_lsn of the physical slot
on standby is advanced up to the point where we can safely remove the
required files; (b) just create a separate test case by initializing a
fresh node for primary and standby where we only have logical slots on
standby. This will be a bit costly but probably less risky. (c) any
better ideas?


(c): Since, I think, the physical slot on the primary is not a concern for
the reason mentioned above, then instead of (b):

What about postponing the physical slot creation on the standby and the
cascading standby node initialization after this test?

That way, this test would be done without a physical slot on the standby and
we could also get rid of the "Wait for the cascading standby to catchup before
removing the WAL file(s)" part.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Reply via email to