Re-sending the original message due to a failure.

Hi Zhijie Hou, Amit

> > I think the main issue here lies in the possibility that the minimum
> > restart_lsn obtained during a checkpoint could be less than the WAL
> > position that is being reserved concurrently. So, instead of serializing
> > the redo assignment and WAL reservation, Amit proposes serializing the
> > CheckPointReplicationSlots() and WAL reservation. This would ensure the
> > following:
> >
> > 1) If the WAL reservation occurs first, the checkpoint must wait for the
> > restart_lsn to be updated before proceeding with WAL removal. This
> > guarantees that the most recent restart_lsn position is detected.
> >
> > 2) If the checkpoint calls CheckPointReplicationSlots() first, then any
> > subsequent WAL reservation must take a position later than the redo pointer.

Thank you for the explanation. I agree: Amit's patch solves the problem and is
the most promising solution. It is less prone to new bugs, and there is no need
to avoid locks for the sake of maximum possible performance. I tried to find
some corner cases, but I failed.
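
To make sure I understand the guarantee, I wrote a minimal standalone sketch of
how I read it (a toy model, not PostgreSQL code; the lock and all names below
are illustrative):

/*
 * Toy model of the proposed serialization: WAL reservation and the
 * checkpoint's slot scan take the same lock, so either the checkpoint
 * sees the freshly reserved restart_lsn, or the reservation sees the
 * redo pointer the checkpoint has already published.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static pthread_mutex_t reserve_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t redo_ptr = 100;         /* checkpoint redo pointer */
static uint64_t slot_restart_lsn = 0;   /* 0 = nothing reserved yet */

/* WAL reservation: publish restart_lsn, never behind the redo pointer. */
static void
reserve_wal_sim(uint64_t wanted_lsn)
{
    pthread_mutex_lock(&reserve_lock);
    slot_restart_lsn = (wanted_lsn < redo_ptr) ? redo_ptr : wanted_lsn;
    pthread_mutex_unlock(&reserve_lock);
    printf("reserved WAL from %llu\n", (unsigned long long) slot_restart_lsn);
}

/*
 * Checkpoint: publish the new redo pointer and compute the WAL keep point
 * under the same lock, so a concurrent reservation cannot slip a smaller
 * restart_lsn underneath it.
 */
static void
checkpoint_sim(uint64_t new_redo)
{
    uint64_t    keep;

    pthread_mutex_lock(&reserve_lock);
    redo_ptr = new_redo;
    keep = (slot_restart_lsn != 0 && slot_restart_lsn < redo_ptr)
        ? slot_restart_lsn : redo_ptr;
    pthread_mutex_unlock(&reserve_lock);
    printf("checkpoint keeps WAL from %llu\n", (unsigned long long) keep);
}

int
main(void)
{
    reserve_wal_sim(150);   /* case 1: reservation first ...             */
    checkpoint_sim(200);    /* ... the checkpoint keeps WAL from 150     */

    slot_restart_lsn = 0;   /* reset the toy for case 2                  */
    checkpoint_sim(300);    /* case 2: checkpoint first ...              */
    reserve_wal_sim(250);   /* ... the reservation lands at 300, not 250 */
    return 0;
}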

FYI, there is another lock-less solution for ReplicationSlotReserveWal, with a
two-phase reservation, attached as a diff file. It seems to solve the redo
record race condition, but it is incomplete and more prone to bugs. I'm just
sharing it here in the hope that such an approach may be of interest.

While investigating the solution I came up with a few questions, shared below.
I'm not asking for answers, but I think they may be worth considering when
preparing the patch.

1) Looking at ReplicationSlotReserveWal, I'm not sure why we assign restart_lsn
in a loop and exit the loop when XLogGetLastRemovedSegno() < segno is true. I
could understand it if we compared restart_lsn with
XLogCtl->replicationSlotMinLSN to handle parallel calls of
ReplicationSlotsComputeRequiredLSN, as described in the comment. The removal of
old segments happens much later, in the checkpointer. I'm not sure whether the
comment describes the case improperly or the code is wrong.
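
To make the question more concrete, here is a standalone toy (not slot.c; only
XLogGetLastRemovedSegno and replicationSlotMinLSN are borrowed as names for
orientation) contrasting the current exit check with the one I would have
expected from the comment:

/*
 * Toy comparison of the two possible exit checks for the reservation
 * retry loop.  All state here is simulated.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t last_removed_segno = 10;    /* advanced only by the checkpointer */
static uint64_t slot_min_lsn = 0;           /* XLogCtl->replicationSlotMinLSN analogue */

/* Check used today: "has my reserved segment already been removed?" */
static bool
reserved_segment_still_there(uint64_t my_segno)
{
    return last_removed_segno < my_segno;
}

/*
 * Check I would have expected from the comment: "is my restart_lsn still
 * covered by the published minimum?"  This would notice a concurrent
 * ReplicationSlotsComputeRequiredLSN() that overwrote the minimum with a
 * later position, even though no segment has been removed yet.
 */
static bool
restart_lsn_still_covered(uint64_t my_restart_lsn)
{
    return slot_min_lsn == 0 || slot_min_lsn <= my_restart_lsn;
}

int
main(void)
{
    uint64_t    my_restart_lsn = 1000;
    uint64_t    my_segno = 12;

    /* A concurrent backend publishes a later minimum before any segment
     * removal has happened. */
    slot_min_lsn = 2000;

    printf("segment check says OK: %d\n", reserved_segment_still_there(my_segno));
    printf("min-LSN check says OK: %d\n", restart_lsn_still_covered(my_restart_lsn));
    return 0;
}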

2) Why do we call XLogSetReplicationSlotMinimumLSN outside of the control lock
section? It may lead to race conditions when two or more backends create,
advance, or drop slots in parallel. I'm not sure the control and allocation
locks properly serialise the updates.
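
Here is a toy interleaving of the kind of lost update I have in mind (again not
PostgreSQL code, the names are illustrative):

/*
 * Two backends compute the minimum restart_lsn, but publish it only after
 * releasing the lock, so a stale value can overwrite a newer one.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t published_min_lsn = 0;  /* XLogCtl->replicationSlotMinLSN analogue */

static void
publish_min(const char *who, uint64_t computed_min)
{
    /* in this sketch the publish step happens outside any lock, mirroring
     * my reading of where XLogSetReplicationSlotMinimumLSN is called */
    published_min_lsn = computed_min;
    printf("%s publishes min = %llu\n", who, (unsigned long long) computed_min);
}

int
main(void)
{
    /* backend A scans the slots, computes min = 100, then is preempted */
    uint64_t    a_min = 100;

    /* backend B advances/drops a slot, computes min = 200 and publishes it */
    publish_min("backend B", 200);

    /* backend A resumes and publishes its stale result, moving the minimum
     * backwards even though no slot needs WAL before 200 */
    publish_min("backend A", a_min);

    printf("final minimum = %llu (stale)\n", (unsigned long long) published_min_lsn);
    return 0;
}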

With regards,
Vitaly
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 4b5b24d3759..0b8c6861edb 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -473,16 +473,17 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	memset(&slot->data, 0, sizeof(ReplicationSlotPersistentData));
 	namestrcpy(&slot->data.name, name);
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
 	slot->data.two_phase_at = InvalidXLogRecPtr;
 	slot->data.failover = failover;
 	slot->data.synced = synced;
+	slot->wal_reserved = false;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
 	slot->dirty = false;
 	slot->effective_xmin = InvalidTransactionId;
 	slot->effective_catalog_xmin = InvalidTransactionId;
 	slot->candidate_catalog_xmin = InvalidTransactionId;
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
@@ -1568,16 +1569,17 @@ CheckSlotPermissions(void)
  * the slot and concurrency safe.
  */
 void
 ReplicationSlotReserveWal(void)
 {
 	ReplicationSlot *slot = MyReplicationSlot;
 
 	Assert(slot != NULL);
+	Assert(slot->wal_reserved == false);
 	Assert(!XLogRecPtrIsValid(slot->data.restart_lsn));
 	Assert(!XLogRecPtrIsValid(slot->last_saved_restart_lsn));
 
 	/*
 	 * The replication slot mechanism is used to prevent removal of required
 	 * WAL. As there is no interlock between this routine and checkpoints, WAL
 	 * segments could concurrently be removed when a now stale return value of
 	 * ReplicationSlotsComputeRequiredLSN() is used. In the unlikely case that
@@ -1613,16 +1615,22 @@ ReplicationSlotReserveWal(void)
 		INJECTION_POINT("physical-slot-reserve-wal-get-redo", NULL);
 
 		SpinLockAcquire(&slot->mutex);
 		slot->data.restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 
 		INJECTION_POINT("physical-slot-reserve-wal-before-compute-required-lsn", NULL);
 
+		/* for physical slots only: retry if restart_lsn fell behind redo */
+		if (slot->data.restart_lsn < GetRedoRecPtr())
+			continue;
+
+		slot->wal_reserved = true;
+
 		/* prevent WAL removal as fast as possible */
 		ReplicationSlotsComputeRequiredLSN();
 
 		/*
 		 * If all required WAL is still there, great, otherwise retry. The
 		 * slot should prevent further removal of WAL, unless there's a
 		 * concurrent ReplicationSlotsComputeRequiredLSN() after we've written
 		 * the new restart_lsn above, so normally we should never need to loop
@@ -1753,16 +1761,19 @@ DetermineSlotInvalidationCause(uint32 possible_causes, ReplicationSlot *s,
 							   TimestampTz *inactive_since, TimestampTz now)
 {
 	Assert(possible_causes != RS_INVAL_NONE);
 
 	if (possible_causes & RS_INVAL_WAL_REMOVED)
 	{
 		XLogRecPtr	restart_lsn = s->data.restart_lsn;
 
+		if (!s->wal_reserved)
+			return RS_INVAL_NONE;
+
 		if (XLogRecPtrIsValid(restart_lsn) &&
 			restart_lsn < oldestLSN)
 			return RS_INVAL_WAL_REMOVED;
 	}
 
 	if (possible_causes & RS_INVAL_HORIZON)
 	{
 		/* invalid DB oid signals a shared relation */
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 09c69f83d57..1aaf3a275e5 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -168,16 +168,17 @@ typedef struct ReplicationSlot
 	bool		in_use;
 
 	/* Who is streaming out changes for this slot? 0 in unused slots. */
 	pid_t		active_pid;
 
 	/* any outstanding modifications? */
 	bool		just_dirtied;
 	bool		dirty;
+	bool		wal_reserved;
 
 	/*
 	 * For logical decoding, it's extremely important that we never remove any
 	 * data that's still needed for decoding purposes, even after a crash;
 	 * otherwise, decoding will produce wrong answers.  Ordinary streaming
 	 * replication also needs to prevent old row versions from being removed
 	 * too soon, but the worst consequence we might encounter there is
 	 * unwanted query cancellations on the standby.  Thus, for logical
