Re: .ready and .done files considered harmful

Dipesh Pandit Mon, 02 Aug 2021 06:06:33 -0700

Hi,

> I think what you are saying is true before v14, but not in v14 and master.
Yes, we can use archiver specific shared memory. Thanks.


> I don't think it's great that we're using up SIGINT for this purpose.
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()? I guess there could be a concern about the
> expensive of that, because we'd probably be taking a spinlock or an
> lwlock for every cycle, but I don't think it's probably that bad,
> because I doubt we can archive much more than a double-digit number of
> files per second even with a very fast archive_command, and contention
> on a lock generally requires a five digit number of acquisitions per
> second. It would be worth testing to see if we can see a problem here,
> but I'm fairly hopeful that it's not an issue. If we do feel that it's
> important to avoid repeatedly taking a lock, let's see if we can find
> a way to do it without dedicating a signal to this purpose.

We can maintain the current timeline ID in archiver specific shared memory.
If we switch to a new timeline then the backend process can update the new
timeline ID in shared memory. Archiver can keep a track of current timeline
ID
and if it finds that there is a timeline switch then it can perform a full
directory
scan to make sure that archiving history files takes precedence over WAL
files.
Access to the shared memory area can be protected by adding a
WALArchiverLock.
If we take this approach then it doesn't require to use a dedicated signal
to notify
a timeline switch.

> The problem with all this is that you can't understand either function
> in isolation. Unless you read them both together and look at all of
> the ways these three variables are manipulated, you can't really
> understand the logic. And there's really no reason why that needs to
> be true. The job of cleaning timeline_switch and setting dirScan could
> be done entirely within pgarch_readyXlog(), and so could the job of
> incrementing nextLogSegNo, because we're not going to again call
> pgarch_readyXlog() unless archiving succeeded.

> Also note that the TLI which is stored in curFileTLI corresponds to
> the segment number stored in nextLogSegNo, yet one of them has "cur"
> for "current" in the name and the other has "next". It would be easier
> to read the code if the names were chosen more consistently.

> My tentative idea as to how to clean this up is: declare a new struct
> with a name like readyXlogState and members lastTLI and lastSegNo.
> Have pgarch_ArchiverCopyLoop() declare a variable of this type, zero
> it, pass it as a parameter to pgarch_readyXlog(), and otherwise leave
> it alone. Then let pgarch_readyXlog() do all of the manipulation of
> the values stored therein.

Make sense, we can move the entire logic to a single function
pgarch_readyXlog()
and declare a new struct readyXLogState.

I think we cannot declare a variable of this type in
pgarch_ArchiverCopyLoop()
due to the fact that this function will be called every time the archiver
wakes up.
Initializing readyXLogState here will reset the next anticipated log
segment number
when the archiver wakes up from a wait state. We can declare and initialize
it in
pgarch_MainLoop() to avoid resetting the next anticipated log segment
number
when the archiver wakes up.

> You've moved this comment from its original location, but the trouble
> is that the comment is 100% false. In fact, the whole reason why you
> wrote this patch is *because* this comment is 100% false. In fact it
> is not difficult to create cases where each scan finds many files, and
> the purpose of the patch is precisely to optimize the code that the
> person who wrote this thought didn't need optimizing. Now it may take
> some work to figure out what we want to say here exactly, but
> preserving the comment as it's written here is certainly misleading.

Yes, I agree. We can update the comments here to list the scenarios
where we may need to perform a full directory scan.

I have incorporated these changes and updated a new patch. Please find
the attached patch v5.

Thanks,
Dipesh

From 3e0f690d1b8fd1d07fa503a807ee97cb9ed7377b Mon Sep 17 00:00:00 2001
From: Dipesh Pandit <dipesh.pan...@enterprisedb.com>
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimised by
maintaining the log segment number of current file which is being
archived and incrementing it by '1' to get the next WAL file.
Archiver can check the availability of next file and in case if the
file is not available then it should fall-back to directory scan to
get the oldest WAL file.

If there is a timeline switch then archiver performs a full directory
scan to make sure that archiving history file takes precedence over
archiving WAL files on older timeline.
---
 src/backend/access/transam/xlog.c        |  16 +++
 src/backend/postmaster/pgarch.c          | 177 ++++++++++++++++++++++++++++---
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/postmaster/pgarch.h          |   3 +
 4 files changed, 182 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3479402..3c52596 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -50,6 +50,7 @@
 #include "port/atomics.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/pgarch.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/basebackup.h"
@@ -6994,6 +6995,10 @@ StartupXLOG(void)
 	 */
 	ThisTimeLineID = checkPoint.ThisTimeLineID;
 
+	/* Initialize current timeline ID at archiver */
+	if (XLogArchivingActive())
+		PgArchSetCurrentTLI(ThisTimeLineID);
+
 	/*
 	 * Copy any missing timeline history files between 'now' and the recovery
 	 * target timeline from archive to pg_wal. While we don't need those files
@@ -7552,6 +7557,10 @@ StartupXLOG(void)
 					 */
 					if (AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Update the new timeline at archiver */
+					if (XLogArchivingActive())
+						PgArchSetCurrentTLI(ThisTimeLineID);
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7829,6 +7838,13 @@ StartupXLOG(void)
 		/* Get rid of any remaining recovered timeline-history file, too */
 		snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYHISTORY");
 		unlink(recoveryPath);	/* ignore any error */
+
+		/*
+		 * Switched to a new timeline, we need to update the current timeline
+		 * at archiver.
+		 */
+		if (XLogArchivingActive())
+			PgArchSetCurrentTLI(ThisTimeLineID);
 	}
 
 	/* Save the selected TimeLineID in shared memory, too */
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..8e93a8e 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -33,7 +33,6 @@
 #include <sys/wait.h>
 #include <unistd.h>
 
-#include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -72,12 +71,23 @@
  */
 #define NUM_ORPHAN_CLEANUP_RETRIES 3
 
-/* Shared memory area for archiver process */
+/* Shared memory area for archiver process. WALArchiverLock must be held to
+ * read/write currentTLI.
+ */
 typedef struct PgArchData
 {
+	TimeLineID	currentTLI;		/* current timeline ID */
 	int			pgprocno;		/* pgprocno of archiver process */
 } PgArchData;
 
+/*
+ * Segment number and timeline ID to identify the next file in a WAL sequence
+ */
+typedef struct readyXLogState
+{
+	XLogSegNo	lastSegNo;
+	TimeLineID	lastTLI;
+} readyXLogState;
 
 /* ----------
  * Local data
@@ -97,12 +107,13 @@ static volatile sig_atomic_t ready_to_stop = false;
  */
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
-static void pgarch_ArchiverCopyLoop(void);
+static void pgarch_ArchiverCopyLoop(readyXLogState *xlogState);
 static bool pgarch_archiveXlog(char *xlog);
-static bool pgarch_readyXlog(char *xlog);
+static bool pgarch_readyXlog(char *xlog, readyXLogState *xlogState);
 static void pgarch_archiveDone(char *xlog);
 static void pgarch_die(int code, Datum arg);
 static void HandlePgArchInterrupts(void);
+static TimeLineID PgArchGetCurrentTLI();
 
 /* Report shared memory space needed by PgArchShmemInit */
 Size
@@ -221,6 +232,42 @@ PgArchWakeup(void)
 		SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
 }
 
+/*
+ * Called by backend process to set the current timeline ID in shared memory.
+ */
+void
+PgArchSetCurrentTLI(TimeLineID tli)
+{
+	if (IsUnderPostmaster)
+	{
+		/* Needs to protect the access using WALArchiverLock */
+		LWLockAcquire(WALArchiverLock, LW_EXCLUSIVE);
+		PgArch->currentTLI = tli;
+		LWLockRelease(WALArchiverLock);
+	}
+}
+
+/*
+ * Get the current timeline ID from shared memory.
+ */
+static TimeLineID
+PgArchGetCurrentTLI()
+{
+	TimeLineID			tli;
+
+	/*
+	 * Taking a light weight lock for each cycle, contention on a light weight
+	 * lock generally requires five digit number of acquisitions per second and
+	 * the number of files that we can archive with the fastest archive command
+	 * is not more than a two digit number per second.
+	 */
+	LWLockAcquire(WALArchiverLock, LW_EXCLUSIVE);
+	tli = PgArch->currentTLI;
+	LWLockRelease(WALArchiverLock);
+
+	return tli;
+}
+
 
 /* SIGUSR2 signal handler for archiver process */
 static void
@@ -243,10 +290,18 @@ pgarch_waken_stop(SIGNAL_ARGS)
 static void
 pgarch_MainLoop(void)
 {
+	readyXLogState xlogState;
 	pg_time_t	last_copy_time = 0;
 	bool		time_to_stop;
 
 	/*
+	 * Initialize xlogState, segment number and TLI will be reset/updated in
+	 * function pgarch_readyXlog() for each cycle.
+	 */
+	xlogState.lastSegNo = 0;
+	xlogState.lastTLI = 0;
+
+	/*
 	 * There shouldn't be anything for the archiver to do except to wait for a
 	 * signal ... however, the archiver exists to protect our data, so she
 	 * wakes up occasionally to allow herself to be proactive.
@@ -280,7 +335,7 @@ pgarch_MainLoop(void)
 		}
 
 		/* Do what we're here for */
-		pgarch_ArchiverCopyLoop();
+		pgarch_ArchiverCopyLoop(&xlogState);
 		last_copy_time = time(NULL);
 
 		/*
@@ -321,7 +376,7 @@ pgarch_MainLoop(void)
  * Archives all outstanding xlogs then returns
  */
 static void
-pgarch_ArchiverCopyLoop(void)
+pgarch_ArchiverCopyLoop(readyXLogState *xlogState)
 {
 	char		xlog[MAX_XFN_CHARS + 1];
 
@@ -331,7 +386,7 @@ pgarch_ArchiverCopyLoop(void)
 	 * some backend will add files onto the list of those that need archiving
 	 * while we are still copying earlier archives
 	 */
-	while (pgarch_readyXlog(xlog))
+	while (pgarch_readyXlog(xlog, xlogState))
 	{
 		int			failures = 0;
 		int			failures_orphan = 0;
@@ -596,29 +651,92 @@ pgarch_archiveXlog(char *xlog)
  * larger ID; the net result being that past timelines are given higher
  * priority for archiving.  This seems okay, or at least not obviously worth
  * changing.
+ *
+ * WAL files are generated in a specific order of log segment number. The
+ * directory scan for each WAL file can be minimised by identifying the next
+ * WAL file in the sequence. This can be achieved by maintaining log segment
+ * number and timeline ID corresponding to WAL file currently being archived.
+ * The log segment number of current WAL file can be incremented by '1' to
+ * point to the next WAL file in a sequence. Full directory scan can be avoided
+ * by checking the availability of next WAL file. "xlogState" specifies the
+ * segment number and timeline ID corresponding to the next WAL file.
+ *
+ * However, a full directory scan is performed in case if there is a timeline
+ * switch to make sure that archiving history file takes precedence over
+ * archiving WAL files from older timeline. Archiver identifies a timeline
+ * switch if the timeline ID tracked at archiver does not match with the
+ * current timeline ID.
+ *
+ * Returns "true" if a segment is ready for archival, "xlog" represents the
+ * name of the segment.
  */
 static bool
-pgarch_readyXlog(char *xlog)
+pgarch_readyXlog(char *xlog, readyXLogState *xlogState)
 {
-	/*
-	 * open xlog status directory and read through list of xlogs that have the
-	 * .ready suffix, looking for earliest file. It is possible to optimise
-	 * this code, though only a single file is expected on the vast majority
-	 * of calls, so....
-	 */
+	char		basename[MAX_XFN_CHARS + 1];
+	char		xlogready[MAXPGPATH];
 	char		XLogArchiveStatusDir[MAXPGPATH];
 	DIR		   *rldir;
 	struct dirent *rlde;
+	struct stat	st;
+	static TimeLineID archiverTLI = 0;	/* Timeline ID tacked at archiver */
 	bool		found = false;
 	bool		historyFound = false;
 
+	/*
+	 * As long as the timeline ID tracked at archiver matches with current
+	 * timeline ID, check the next WAL file in a sequence.
+	 */
+	if (archiverTLI == PgArchGetCurrentTLI())
+	{
+		/*
+		 * We already have the next anticipated log segment and timeline, check
+		 * if this WAL is ready to be archived. If yes, skip the directory
+		 * scan.
+		 */
+		XLogFileName(basename, xlogState->lastTLI, xlogState->lastSegNo, wal_segment_size);
+		StatusFilePath(xlogready, basename, ".ready");
+
+		if (stat(xlogready, &st) == 0)
+		{
+			strcpy(xlog, basename);
+
+			/*
+			 * Increment the lastSegNo in readyXLogState to point to the next
+			 * WAL file. Although we have not yet archived the current WAL file
+			 * and readyXLogState points to the next WAL file, this is safe
+			 * becuase the next cycle will not begin until we finish archiving
+			 * current WAL file.
+			 */
+			if (!IsTLHistoryFileName(xlog))
+				xlogState->lastSegNo++;
+			return true;
+		}
+	}
+
+	/*
+	 * Perform a full directory scan to identify the next log segment. There
+	 * may be one of the following scenarios which may require us to perform a
+	 * full directory scan.
+	 *
+	 * 1. This is the first cycle since archiver has started and there is no
+	 * idea about the next anticipated log segment.
+	 *
+	 * 2. There is a timeline switch, i.e. the timeline ID tracked at archiver
+	 * does not match with current timeline ID. Archive history file as part of
+	 * this timeline switch.
+	 *
+	 * 3. The next anticipated log segment is not available.
+	 *
+	 * open xlog status directory and read through list of xlogs that have the
+	 * .ready suffix, looking for earliest file.
+	 */
 	snprintf(XLogArchiveStatusDir, MAXPGPATH, XLOGDIR "/archive_status");
 	rldir = AllocateDir(XLogArchiveStatusDir);
 
 	while ((rlde = ReadDir(rldir, XLogArchiveStatusDir)) != NULL)
 	{
 		int			basenamelen = (int) strlen(rlde->d_name) - 6;
-		char		basename[MAX_XFN_CHARS + 1];
 		bool		ishistory;
 
 		/* Ignore entries with unexpected number of characters */
@@ -661,6 +779,35 @@ pgarch_readyXlog(char *xlog)
 				strcpy(xlog, basename);
 		}
 	}
+
+	if (found)
+	{
+		/* Update the timeline ID tracked at archiver */
+		if (historyFound)
+			sscanf(xlog, "%08X.history", &archiverTLI);
+		else
+		{
+			/*
+			 * Reset segment number and timeline ID as this is the beginning of a
+			 * new sequence.
+			 */
+			XLogFromFileName(xlog, &xlogState->lastTLI, &xlogState->lastSegNo,
+					wal_segment_size);
+
+			/* If this is the first cycle, update timeline ID tracked at archiver */
+			archiverTLI = (archiverTLI == 0) ?
+							 xlogState->lastTLI : archiverTLI;
+
+			/* Increment log segment number to point to the next WAL file */
+			xlogState->lastSegNo++;
+
+		}
+
+		ereport(LOG,
+				(errmsg("directory scan to archive write-ahead log file \"%s\"",
+						xlog)));
+	}
+
 	FreeDir(rldir);
 
 	return found;
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 6c7cf6c..61a3735 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock					44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock				46
 NotifyQueueTailLock					47
+WALArchiverLock						48
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 1e47a14..8e04acc 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -13,6 +13,8 @@
 #ifndef _PGARCH_H
 #define _PGARCH_H
 
+#include "access/xlog.h"
+
 /* ----------
  * Archiver control info.
  *
@@ -31,5 +33,6 @@ extern void PgArchShmemInit(void);
 extern bool PgArchCanRestart(void);
 extern void PgArchiverMain(void) pg_attribute_noreturn();
 extern void PgArchWakeup(void);
+extern void PgArchSetCurrentTLI(TimeLineID tli);
 
 #endif							/* _PGARCH_H */
-- 
1.8.3.1

Re: .ready and .done files considered harmful

Reply via email to