Hi,

Sorry for the delay, I didn't have time to come back to this until this
afternoon.
On Mon, Apr 10, 2023 at 09:18:46AM +0000, Hayato Kuroda (Fujitsu) wrote:
>
> I have analyzed the point but it seemed to be difficult. This is because
> some additional records like the following may be inserted. PSA the script
> which is used for testing. Note that the "double CHECKPOINT_SHUTDOWN" issue
> might be wrong, so I wanted to withdraw it for now. Sorry for the noise.
>
> * HEAP/HEAP2 records. These records may be inserted by the checkpointer.
>
> IIUC, if there are tuples which have not been flushed yet when shutdown is
> requested, the checkpointer writes all of them back into the heap files.
> At that time many WAL records are generated. I think we cannot predict the
> number of records beforehand.
>
> * INVALIDATION(S) records. These records may be inserted by VACUUM.
>
> There is a possibility that autovacuum runs and generates WAL records. I
> think we cannot predict the number of records beforehand because it depends
> on the number of objects.
>
> * RUNNING_XACTS record
>
> It might be a timing issue, but I found that sometimes the background
> writer generated an XLOG_RUNNING_XACTS record. According to
> BackgroundWriterMain(), it will be generated when the process has spent 15
> seconds since the last logging and there are important records. I think it
> is difficult to predict whether this will appear or not.

I don't think that your analysis is correct. Slots are guaranteed to be
stopped after all the normal backends have been stopped, exactly to avoid
such extraneous records.

What is happening here is that the slot's confirmed_flush_lsn is properly
updated in memory and ends up being the same as the current LSN before the
shutdown. But because it's a logical slot and those records aren't decoded,
the slot isn't marked as dirty and therefore isn't saved to disk.
You don't see that behavior when doing a manual checkpoint first (per your
script comment): in that case the checkpoint also tries to save the slot to
disk, finds a slot that was marked as dirty, and therefore saves it.

In your script's scenario, when you restart the server the previous slot
data is restored and confirmed_flush_lsn goes backward, which explains those
extraneous records. Throwing away that value is probably totally harmless
for now (and probably also doesn't lead to a crazy amount of work after
restart, though I really don't know much about the logical slot code), but
it clearly becomes problematic for your use case.

One easy way to fix this is to teach the checkpoint code to force saving
logical slots to disk during a shutdown checkpoint even if they're not
marked as dirty, as done in the attached v1 patch (renamed as .txt so it
doesn't interfere with the cfbot). With this patch applied, I reliably see
only a final shutdown checkpoint record in your scenario.

Now, such a change will make shutdown a bit more expensive when using
logical replication, even though in 99% of cases you will not need to save
the confirmed_flush_lsn value, so I don't know if that's acceptable or not.
>From 77c3d2d361893de857627e036d0eaaf01cfe91c1 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouh...@free.fr>
Date: Fri, 14 Apr 2023 13:49:09 +0800
Subject: [PATCH v1] Always persist to disk logical slots during a shutdown
 checkpoint.

It's entirely possible for a logical slot to have a confirmed_flush_lsn
higher than the last value saved on disk while not being marked as dirty.
It's currently not a problem to lose that value during a clean shutdown /
restart cycle, but a later patch adding support for pg_upgrade of
publications and logical slots will rely on that value being properly
persisted to disk.

Author: Julien Rouhaud
Reviewed-by: FIXME
Discussion: FIXME
---
 src/backend/access/transam/xlog.c |  2 +-
 src/backend/replication/slot.c    | 26 ++++++++++++++++----------
 src/include/replication/slot.h    |  2 +-
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b540ee293b..8100ca656e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7011,7 +7011,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
 	CheckPointRelationMap();
-	CheckPointReplicationSlots();
+	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointReplicationOrigin();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8021aaa0a8..2bbf2af770 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -109,7 +109,8 @@ static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 /* internal persistency functions */
 static void RestoreSlotFromDisk(const char *name);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
-static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel);
+static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+						   bool is_shutdown);
 
 /*
  * Report shared-memory space needed by ReplicationSlotsShmemInit.
@@ -783,7 +784,7 @@ ReplicationSlotSave(void)
 	Assert(MyReplicationSlot != NULL);
 
 	sprintf(path, "pg_replslot/%s", NameStr(MyReplicationSlot->data.name));
-	SaveSlotToPath(MyReplicationSlot, path, ERROR);
+	SaveSlotToPath(MyReplicationSlot, path, ERROR, false);
 }
 
 /*
@@ -1565,11 +1566,10 @@ restart:
 /*
  * Flush all replication slots to disk.
  *
- * This needn't actually be part of a checkpoint, but it's a convenient
- * location.
+ * is_shutdown is true in case of a shutdown checkpoint.
  */
 void
-CheckPointReplicationSlots(void)
+CheckPointReplicationSlots(bool is_shutdown)
 {
 	int			i;
 
@@ -1594,7 +1594,7 @@ CheckPointReplicationSlots(void)
 
 		/* save the slot to disk, locking is handled in SaveSlotToPath() */
 		sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
-		SaveSlotToPath(s, path, LOG);
+		SaveSlotToPath(s, path, LOG, is_shutdown);
 	}
 	LWLockRelease(ReplicationSlotAllocationLock);
 }
@@ -1700,7 +1700,7 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 
 	/* Write the actual state file. */
 	slot->dirty = true;			/* signal that we really need to write */
-	SaveSlotToPath(slot, tmppath, ERROR);
+	SaveSlotToPath(slot, tmppath, ERROR, false);
 
 	/* Rename the directory into place. */
 	if (rename(tmppath, path) != 0)
@@ -1726,7 +1726,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
  * Shared functionality between saving and creating a replication slot.
  */
 static void
-SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
+SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel,
+			   bool is_shutdown)
 {
 	char		tmppath[MAXPGPATH];
 	char		path[MAXPGPATH];
@@ -1740,8 +1741,13 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	slot->just_dirtied = false;
 	SpinLockRelease(&slot->mutex);
 
-	/* and don't do anything if there's nothing to write */
-	if (!was_dirty)
+	/*
+	 * and don't do anything if there's nothing to write, unless this is
+	 * called for a logical slot during a shutdown checkpoint, as we want to
+	 * persist the confirmed_flush_lsn in that case, even if that's the only
+	 * modification.
+	 */
+	if (!was_dirty && !(SlotIsLogical(slot) && is_shutdown))
 		return;
 
 	LWLockAcquire(&slot->io_in_progress_lock, LW_EXCLUSIVE);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index a8a89dc784..7ca37c9f70 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -241,7 +241,7 @@ extern void ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslo
 extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
-extern void CheckPointReplicationSlots(void);
+extern void CheckPointReplicationSlots(bool is_shutdown);
 
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
-- 
2.37.0