Dear Hayato, All

On Friday, October 03, 2025 14:14 MSK, "Hayato Kuroda (Fujitsu)" 
<[email protected]> wrote:
>> I'm working on the issue. Give me, please, a couple of days to finalize my 
>> work.
> Oh, sorry. I was rude.
It is okay. I very appreciate your help.

> Per my understanding, this happened because there is a lag that restart_lsn of
> the slot is set, and it is protected by the system. Your idea is to ensure the
> restart_lsn is protected by the system before obtaining on-memory LSN, right?

Not sure what you mean by on-memory LSN, but, the issue happens because we have
a lag between restart_lsn assignment and update of
XLogCtl->replicationSlotsMinLSN which is used to protect the WAL. Yes, I propose
to ensure that the protection happens when we assign restart_lsn. It seems to be
wrong that we invalidate slots by its restart_lsn but protect the wal for
slots using XLogCtl->replicationSlotsMinLSN.

Below I tried to write some summary and propose the patch which fixes the 
problem.

The issue was originally reported at [1] and it seems to appear in 17 and
earlier versions. The issue is not reproducible in 18+ versions.

The issue may appear when we create a persistent slot during checkpoint. The WAL
reservation in slots happens in ReplicationSlotReserveWal and executed in three
steps:
    1. Assignment of slot->data.restart_lsn
    2. Update of XLogCtl->replicationSlotMinLSN
    3. Check if WAL segments at restart_lsn are removed, go to step 1 if 
removed.

When the checkpointer calculates the oldest lsn which is used as the lsn horizon
when removing old WAL segments, it takes XLogCtl->replicationSlotMinLSN.
There is a race condition may happen when slot's restart_lsn is already assigned
but XLogCtl->replicationSlotMinLSN is not updated yet. Consider the following
scenario with two processes executing in parallel (checkpointer and backend,
where a new slot is creating):

1. Assign of slot.data->restart_lsn in the backend from GetRedoRecPtr()
2. Assign a new redo LSN in the checkpointer
3. Assign slotsMinReqLSN from XLogCtl->replicationSlotMinLSN in the checkpointer
4. Update of XLogCtl->replicationSlotMinLSN in the backend.
5. Calculation of the WAL horizon for old segments cleanup (KeepLogSeg before
   the call of InvalidateObsoleteReplicationSlots) in the checkpointer.
6. Exit from ReplicationSlotReserveWal in the backend, once the reserved WAL
   segments are not removed at this moment (XLogGetLastRemovedSegno() < segno).
7. Call of InvalidateObsoleteReplicationSlots in the checkpointer will 
invalidate
   the creating slot because its restart_lsn will be less than the calculated
   WAL horizon (the min of slotsMinReqLSN and RedoRecPtr).

To fix the issue I propose to consider the following assumptions:
1. Slots do not cross WAL segment borders backward when moving.
2. Old WAL segments are removed in the checkpointer only.
3. The following LSNs are initially assigned during slot reservation:
    - GetRedoRecPtr() for physical slots
    - GetXLogInsertRecPtr() for logical slots
    - GetXLogReplayRecPtr() for logical slots in recovery

Taking into account these assumptions, I would like to propose the fix [2].

There is an idea to think that the WAL reservation happens when we assign
restart_lsn to the slot. The call of ReplicationSlotsComputeRequiredLSN() is not
required to be executed immediately in the backend where the slot is creating
concurrently. In the checkpointer we have to guarantee that we do WAL horizon
calculations based on actual values of restart_lsn of existing slots. If
we call ReplicationSlotsComputeRequiredLSN() in the checkpointer after a new
REDO assignment and before the calculation of WAL horizon, the value of
XLogCtl->replicationSlotMinLSN will correctly define the oldest LSN for existing
slots. If the WAL reservation by a new slot happens during checkpoint before
a new REDO assignment, it is guaranteed that its restart_lsn will be accounted
when we call ReplicationSlotsComputeRequiredLSN() in the checkpointer. If the
WAL reservation happens after a new redo LSN assignment, the slot's restart_lsn
will be protected by this new redo LSN, because this LSN will be lesser or equal
to initial restart_lsn (see assumption 3).

There is one subtle thing. Once, the operation of restart_lsn assignment is not
an atomic, the following scenario may happen theoretically:
1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
3. Assign the old redo LSN to restart_lsn

In this scenario, the restart_lsn will point to a previous redo LSN and it will
be not protected by the new redo LSN. This scenario is unlikely, but it can
happen theoretically. I have no ideas how to deal with it, except of assigning
restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

In case of recovery, when GetXLogReplayRecPtr() is used, the protection by
redo LSN seems to work as well, because a new redo LSN is taken from the latest
replayed checkpoint. Thus, it is guaranteed that GetXLogReplayRecPtr() will not
be less than the new redo LSN, if it is called right after assignment of redo
LSN in CreateRestartPoint().

I also think that the cycle in ReplicationSlotReserveWal which checks for the
current restart_lsn to be greater than the XLogGetLastRemovedSegno() is not
necessary because it is guaranteed that the assigned restart_lsn will be
protected. Lets keep it unchanged until this suggestion will be clarified.

The proposed solution doesn't break the fix in ca307d5cec (unexpected removal of
old WAL segments after checkpoint). Once we call
ReplicationSlotsComputeRequiredLSN() before CheckPointReplicationSlots(), the
saved to disk restart_lsn values of existing slots will be not less than
the previously computed XLogCtl->replicationSlotMinLSN. They just may be
advanced to greater values concurrently. For new slots with restart_lsn
assignment after ReplicationSlotsComputeRequiredLSN(), the current redo LSN will
protect the WAL.

The fix for REL_17_STABLE is in [2]. The regression test is in [3].

I apologize for so long summary.

[1] 
https://www.postgresql.org/message-id/flat/15922-68ca9280-4f-37de2c40%40245457797#4b7aa7fe7c57b02105a56ecff06f0b67
[2] v3-0002-Fix-invalidation-when-slot-is-created-during-checkpo.patch
[3] v3-0001-Newly-created-replication-slot-may-be-invalidated-by.patch

With best regards,
Vitaly
From bc40a1d0d470917e44d06caab52b482e907f2d78 Mon Sep 17 00:00:00 2001
From: Vitaly Davydov <[email protected]>
Date: Mon, 6 Oct 2025 16:20:45 +0300
Subject: [PATCH 1/2] Newly created replication slot may be invalidated by
 checkpoint

Commit 2090edc6f32f652a2c995ca5f7e65748ae1e4c5d introduced a change
that the minimal restart_lsn is obtained at the start of checkpoint
creation. If a replication slot is created and performs a WAL
reservation concurrently, the WAL segment contains the new slot's
restart_lsn could be removed by the ongoing checkpoint. Add a perl
test to reproduce this scenario.

Author: suyu.cmj <[email protected]>
Author: Vitaly Davydov <[email protected]>
---
 src/backend/replication/slot.c                |   3 +
 .../recovery/t/049_invalidate_new_slot.pl     | 188 ++++++++++++++++++
 2 files changed, 191 insertions(+)
 create mode 100644 src/test/recovery/t/049_invalidate_new_slot.pl

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index b9e2b115dab..ee0ab2d8c83 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -57,6 +57,7 @@
 #include "utils/builtins.h"
 #include "utils/guc_hooks.h"
 #include "utils/varlena.h"
+#include "utils/injection_point.h"
 
 /*
  * Replication slot on-disk data structure.
@@ -1443,6 +1444,8 @@ ReplicationSlotReserveWal(void)
 		slot->data.restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 
+		INJECTION_POINT("slot-reserve-wal-delay");
+
 		/* prevent WAL removal as fast as possible */
 		ReplicationSlotsComputeRequiredLSN();
 
diff --git a/src/test/recovery/t/049_invalidate_new_slot.pl b/src/test/recovery/t/049_invalidate_new_slot.pl
new file mode 100644
index 00000000000..a69df941762
--- /dev/null
+++ b/src/test/recovery/t/049_invalidate_new_slot.pl
@@ -0,0 +1,188 @@
+# This test checks that the new slot maybe invalidated by checkpoint
+#
+
+use strict;
+use warnings;
+use Config;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+my $res;
+
+# Setup primary node
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init(allows_streaming => 1);
+$node->append_conf('postgresql.conf', q{wal_level = logical});
+$node->start();
+$node->safe_psql('postgres', q{create extension injection_points;});
+
+my $createslot = $node->background_psql('postgres');
+my $checkpoint = $node->background_psql('postgres');
+
+sub physreplslot_start_create()
+{
+	$createslot->query_until(
+		qr/start-create-slot/,
+		q(\echo start-create-slot
+		select pg_create_physical_replication_slot('standby1', true);
+	));
+}
+
+sub logreplslot_start_create()
+{
+	$createslot->query_until(
+		qr/start-create-slot/,
+		q(\echo start-create-slot
+		select pg_log_standby_snapshot();
+		select pg_create_logical_replication_slot('standby1', 'pgoutput');
+	));
+}
+
+sub physreplslot_create()
+{
+	$node->safe_psql('postgres', q{
+		select pg_create_physical_replication_slot('standby1', true)
+	});
+}
+
+sub checkpoint_start()
+{
+	$checkpoint->query_until(
+		qr/start-checkpoint/,
+		q(\echo start-checkpoint
+		checkpoint;
+	));
+}
+
+sub injection_point_attach($$)
+{
+	my ($injectpname, $event) = @_;
+
+	$node->safe_psql('postgres', "select injection_points_attach('$injectpname','$event')");
+}
+
+sub injection_point_detach($)
+{
+	my $injectpname = shift;
+
+	$node->safe_psql('postgres', "select injection_points_detach('$injectpname');");
+}
+
+sub injection_point_wait($$)
+{
+	my ($subsystem, $injectpname) = @_;
+
+	note("waiting for $injectpname injection point");
+	$node->wait_for_event($subsystem, $injectpname);
+	note("injection_point $injectpname is reached");
+}
+
+sub injection_point_wakeup($)
+{
+	my $injectpname = shift;
+
+	$node->safe_psql('postgres', "select injection_points_wakeup('$injectpname');");
+}
+
+sub get_slot_invalidation_state($)
+{
+	my $slotname = shift;
+
+	return $node->safe_psql('postgres',
+		"select invalidation_reason from pg_replication_slots where slot_name='$slotname'");
+}
+
+# ------------------------------------------------------------------------------
+# Test: start physical slot creation before checkpoint (before new redo lsn assignment)
+# ------------------------------------------------------------------------------
+
+injection_point_attach('slot-reserve-wal-delay', 'wait');
+injection_point_attach('checkpoint-before-old-wal-removal', 'wait');
+
+physreplslot_start_create();
+$node->advance_wal(3);
+checkpoint_start();
+injection_point_wait('client backend', 'slot-reserve-wal-delay');
+injection_point_wait('checkpointer', 'checkpoint-before-old-wal-removal');
+injection_point_wakeup('slot-reserve-wal-delay');
+injection_point_wakeup('checkpoint-before-old-wal-removal');
+
+eval { $createslot->quit(); };
+eval { $checkpoint->quit(); };
+
+is(get_slot_invalidation_state('standby1'), "", "slot is not invalidated by checkpoint (1)");
+
+injection_point_detach('slot-reserve-wal-delay');
+injection_point_detach('checkpoint-before-old-wal-removal');
+$node->safe_psql('postgres', "select pg_drop_replication_slot('standby1')");
+
+# ------------------------------------------------------------------------------
+# Test: start physical slot creation during checkpoint (after new redo lsn assignment)
+# ------------------------------------------------------------------------------
+
+injection_point_attach('checkpoint-before-old-wal-removal', 'wait');
+
+$node->advance_wal(3);
+checkpoint_start();
+injection_point_wait('checkpointer', 'checkpoint-before-old-wal-removal');
+physreplslot_create();
+injection_point_wakeup('checkpoint-before-old-wal-removal');
+
+eval { $checkpoint->quit(); };
+
+is(get_slot_invalidation_state('standby1'), "", "slot is not invalidated by checkpoint (2)");
+
+injection_point_detach('checkpoint-before-old-wal-removal');
+$node->safe_psql('postgres', "select pg_drop_replication_slot('standby1')");
+
+# ------------------------------------------------------------------------------
+# Test: start logical slot creation before checkpoint (before new redo lsn assignment)
+# ------------------------------------------------------------------------------
+
+injection_point_attach('slot-reserve-wal-delay', 'wait');
+injection_point_attach('checkpoint-before-old-wal-removal', 'wait');
+
+logreplslot_start_create();
+$node->advance_wal(3);
+checkpoint_start();
+injection_point_wait('client backend', 'slot-reserve-wal-delay');
+injection_point_wait('checkpointer', 'checkpoint-before-old-wal-removal');
+injection_point_wakeup('slot-reserve-wal-delay');
+injection_point_wakeup('checkpoint-before-old-wal-removal');
+
+eval { $createslot->quit(); };
+eval { $checkpoint->quit(); };
+
+is(get_slot_invalidation_state('standby1'), "", "slot is not invalidated by checkpoint (3)");
+
+injection_point_detach('slot-reserve-wal-delay');
+injection_point_detach('checkpoint-before-old-wal-removal');
+$node->safe_psql('postgres', "select pg_drop_replication_slot('standby1')");
+
+# ------------------------------------------------------------------------------
+# Test: start logical slot creation during checkpoint (after new redo lsn assignment)
+# ------------------------------------------------------------------------------
+
+injection_point_attach('checkpoint-before-old-wal-removal', 'wait');
+
+$node->advance_wal(3);
+checkpoint_start();
+injection_point_wait('checkpointer', 'checkpoint-before-old-wal-removal');
+logreplslot_start_create();
+injection_point_wakeup('checkpoint-before-old-wal-removal');
+
+eval { $createslot->quit(); };
+eval { $checkpoint->quit(); };
+
+is(get_slot_invalidation_state('standby1'), "", "slot is not invalidated by checkpoint (4)");
+
+injection_point_detach('checkpoint-before-old-wal-removal');
+$node->safe_psql('postgres', "select pg_drop_replication_slot('standby1')");
+
+done_testing();
-- 
2.34.1

From 8b383fc8a655f2a45bcdeed3096a2bb88aba5687 Mon Sep 17 00:00:00 2001
From: Vitaly Davydov <[email protected]>
Date: Mon, 6 Oct 2025 16:36:29 +0300
Subject: [PATCH 2/2] Fix invalidation when slot is created during checkpoint

If the slot is creating during checkpoint, it may be invalidated by the
checkpointer because of the lag between slot's restart_lsn assignment
and update of the XLogCtl->replicationSlotMinLSN. The issue seems to
have place in past, but the commit 2090edc6f32f652a2c increased the
probability of slot invalidation.

To fix the issue we explicitly update XLogCtl->replicationSlotMinLSN in
the checkpointer after a new redo LSN assignment. It guarantees that the
slot's restart_lsn is taken into account in calculation of the oldest
LSN for WAL segments removal. For slots, which restart_lsn is assigned
after the update, the current redo LSN will protect the WAL for the
slot.

Discussion: https://www.postgresql.org/message-id/flat/5e045179-236f-4f8f-84f1-0f2566ba784c.mengjuan.cmj%40alibaba-inc.com
---
 src/backend/access/transam/xlog.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 287339b7fae..8f7496ce43f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7103,12 +7103,18 @@ CreateCheckPoint(int flags)
 	END_CRIT_SECTION();
 
 	/*
-	 * Get the current minimum LSN to be used later in the WAL segment
-	 * cleanup.  We may clean up only WAL segments, which are not needed
-	 * according to synchronized LSNs of replication slots.  The slot's LSN
-	 * might be advanced concurrently, so we call this before
+	 * Update slots' oldest reserved lsn and save it in slotsMinReqLSN to be
+	 * used later in the WAL segment cleanup. We may clean up only WAL segments,
+	 * which are not needed according to synchronized LSNs of replication slots.
+	 * The slot's LSN might be advanced concurrently, so we call this before
 	 * CheckPointReplicationSlots() synchronizes replication slots.
+	 *
+	 * The update of slots' oldest LSN is required to guarantee that the slot's
+	 * restart_lsn is taken into account in calculation of the oldest LSN for
+	 * WAL segments removal. If the slot's restart_lsn is assigned after this
+	 * update, the current redo LSN will protect the WAL for the slot.
 	 */
+	ReplicationSlotsComputeRequiredLSN();
 	slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN();
 
 	/*
@@ -7676,12 +7682,18 @@ CreateRestartPoint(int flags)
 	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
 
 	/*
-	 * Get the current minimum LSN to be used later in the WAL segment
-	 * cleanup.  We may clean up only WAL segments, which are not needed
-	 * according to synchronized LSNs of replication slots.  The slot's LSN
-	 * might be advanced concurrently, so we call this before
+	 * Update slots' oldest reserved lsn and save it in slotsMinReqLSN to be
+	 * used later in the WAL segment cleanup. We may clean up only WAL segments,
+	 * which are not needed according to synchronized LSNs of replication slots.
+	 * The slot's LSN might be advanced concurrently, so we call this before
 	 * CheckPointReplicationSlots() synchronizes replication slots.
+	 *
+	 * The update of slots' oldest LSN is required to guarantee that the slot's
+	 * restart_lsn is taken into account in calculation of the oldest LSN for
+	 * WAL segments removal. If the slot's restart_lsn is assigned after this
+	 * update, the current redo LSN will protect the WAL for the slot.
 	 */
+	ReplicationSlotsComputeRequiredLSN();
 	slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN();
 
 	if (log_checkpoints)
-- 
2.34.1

Reply via email to