This is a frustrating thread, because despite the last patch solving
most of the problems we discussed, it doesn't address the
low-level-backup procedure in a nice way.  We'd have to tell users
they have to flock that file, or add a new step "pg_controldata --raw
> pg_control", which seems weird when they already have a connection
to the server.

Maybe it just doesn't matter if eg the pg_controldata program can
spuriously fail if pointed at a running system, and I was being too
dogmatic trying to fix even that.  Maybe we should just focus on
fixing backups.  Even there, I am beginning to suspect we are solving
this problem in the wrong place when a higher level change could
simplify the problem away.

Idea for future research:  Perhaps pg_backup_stop()'s label-file
output should include the control file image (suitably encoded)?  Then
the recovery-from-label code could completely ignore the existing
control file, and overwrite it using that copy.  It's already
partially ignoring it, by using the label file's checkpoint LSN
instead of the control file's.  Perhaps the captured copy could
include the correct LSN already, simplifying that code, and the low
level backup procedure would not need any additional steps or caveats.
No more atomicity problem for low-level-backups... but probably not
something we would back-patch, for such a rare failure mode.

Here's a new minimal patch that solves only the bugs in basebackup +
the simple SQL-facing functions that read the control file, by simply
acquiring ControlFileLock in the obvious places.  This should be
simple enough for back-patching?

Perhaps we could use the control file image from server memory, but
that requires us to be certain that its CRC is always up to date.
That seems to be true, but I didn't want to require it for this, and
it doesn't seem important for non-performance-critical code.

Thoughts?

As for the other topics that came up in this thread, I kicked the
wal_sync_method thing out to its own thread[1].  (There was a logical
chain connecting these topics: "can I add file lock system calls
here?" -> "well if someone is going to complain that it's performance
critical then why are we using unnecessarily slow pg_fsync()?" ->
"well if we change that to pg_fdatasync() we have to address known
weakness/kludge on macOS first".  I don't like the flock stuff
anymore, but I do want to fix the known macOS problem independently.
Hereby disentangled.)

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BF0EL4Up6yVYbbcWse4xKaqW4wc2xpw67Pq9FjmByWVg%40mail.gmail.com
From ffc679fa4f4261cf9050571bb486575b2e4995f3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 31 Jan 2023 20:01:49 +1300
Subject: [PATCH v4 1/2] Acquire ControlFileLock in base backups.

The control file could previously be overwritten while a base backup was
reading it.  On some file systems with weak interlocking, including ext4
and ntfs, the concurrent read could see a mixture of old and new
contents.  The resulting copy would be corrupted, and the new cluster
would not be able to start up.  Other files are read without
interlocking, but those will be corrected by replaying the log.

Exclude this possibility by acquiring ControlFileLock.  Copy into a
temporary buffer first because we don't want to hold the lock while in
sendFile() code that might sleep if throttling is configured.  This
requires a minor adjustment to sendFileWithContent() to support binary
data.

This does not address the same possibility when using using external
file system copy tools (as described in the documentation under "Making
a Base Backup Using the Low Level API").

Back-patch to all supported releases.

Reviewed-by: Anton A. Melnikov <aamelni...@inbox.ru> (earlier version)
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
 src/backend/backup/basebackup.c | 36 ++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 45be21131c..f03496665f 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -24,6 +24,7 @@
 #include "backup/basebackup_target.h"
 #include "commands/defrem.h"
 #include "common/compression.h"
+#include "common/controldata_utils.h"
 #include "common/file_perm.h"
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
@@ -39,6 +40,7 @@
 #include "storage/dsm_impl.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/reinit.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
@@ -84,7 +86,7 @@ static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfile
 					 struct stat *statbuf, bool missing_ok, Oid dboid,
 					 backup_manifest_info *manifest, const char *spcoid);
 static void sendFileWithContent(bbsink *sink, const char *filename,
-								const char *content,
+								const char *content, int len,
 								backup_manifest_info *manifest);
 static int64 _tarWriteHeader(bbsink *sink, const char *filename,
 							 const char *linktarget, struct stat *statbuf,
@@ -315,7 +317,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 			if (ti->path == NULL)
 			{
-				struct stat statbuf;
+				ControlFileData *control_file;
+				bool		crc_ok;
 				bool		sendtblspclinks = true;
 				char	   *backup_label;
 
@@ -324,14 +327,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 				/* In the main tar, include the backup_label first... */
 				backup_label = build_backup_content(backup_state, false);
 				sendFileWithContent(sink, BACKUP_LABEL_FILE,
-									backup_label, &manifest);
+									backup_label, -1, &manifest);
 				pfree(backup_label);
 
 				/* Then the tablespace_map file, if required... */
 				if (opt->sendtblspcmapfile)
 				{
 					sendFileWithContent(sink, TABLESPACE_MAP,
-										tablespace_map->data, &manifest);
+										tablespace_map->data, -1, &manifest);
 					sendtblspclinks = false;
 				}
 
@@ -340,13 +343,13 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 						sendtblspclinks, &manifest, NULL);
 
 				/* ... and pg_control after everything else. */
-				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not stat file \"%s\": %m",
-									XLOG_CONTROL_FILE)));
-				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-						 false, InvalidOid, &manifest, NULL);
+				LWLockAcquire(ControlFileLock, LW_SHARED);
+				control_file = get_controlfile(DataDir, &crc_ok);
+				LWLockRelease(ControlFileLock);
+				sendFileWithContent(sink, XLOG_CONTROL_FILE,
+									(char *) control_file, sizeof(*control_file),
+									&manifest);
+				pfree(control_file);
 			}
 			else
 			{
@@ -591,7 +594,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 			 * complete segment.
 			 */
 			StatusFilePath(pathbuf, walFileName, ".done");
-			sendFileWithContent(sink, pathbuf, "", &manifest);
+			sendFileWithContent(sink, pathbuf, "", -1, &manifest);
 		}
 
 		/*
@@ -619,7 +622,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 			/* unconditionally mark file as archived */
 			StatusFilePath(pathbuf, fname, ".done");
-			sendFileWithContent(sink, pathbuf, "", &manifest);
+			sendFileWithContent(sink, pathbuf, "", -1, &manifest);
 		}
 
 		/* Properly terminate the tar file. */
@@ -1030,18 +1033,19 @@ SendBaseBackup(BaseBackupCmd *cmd)
  */
 static void
 sendFileWithContent(bbsink *sink, const char *filename, const char *content,
+					int len,
 					backup_manifest_info *manifest)
 {
 	struct stat statbuf;
-	int			bytes_done = 0,
-				len;
+	int			bytes_done = 0;
 	pg_checksum_context checksum_ctx;
 
 	if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
 		elog(ERROR, "could not initialize checksum of file \"%s\"",
 			 filename);
 
-	len = strlen(content);
+	if (len < 0)
+		len = strlen(content);
 
 	/*
 	 * Construct a stat struct for the backup_label file we're injecting in
-- 
2.41.0

From 7dd75dc8098f2981c929245ec75ff6e2931f3b71 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sat, 22 Jul 2023 11:50:27 +1200
Subject: [PATCH v4 2/2] Acquire ControlFileLock in SQL functions.

Commit dc7d70ea added functions that read the control file, but didn't
acquire ControlFileLock.  With unlucky timing and on certain file
systems, they could read corrupted data that is in the process of being
overwritten, and the checksum would fail.

Back-patch to all supported releases.

Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
 src/backend/utils/misc/pg_controldata.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index f2c1084797..a1003a464d 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -24,6 +24,7 @@
 #include "common/controldata_utils.h"
 #include "funcapi.h"
 #include "miscadmin.h"
+#include "storage/lwlock.h"
 #include "utils/builtins.h"
 #include "utils/pg_lsn.h"
 #include "utils/timestamp.h"
@@ -42,7 +43,9 @@ pg_control_system(PG_FUNCTION_ARGS)
 		elog(ERROR, "return type must be a row type");
 
 	/* read the control file */
+	LWLockAcquire(ControlFileLock, LW_SHARED);
 	ControlFile = get_controlfile(DataDir, &crc_ok);
+	LWLockRelease(ControlFileLock);
 	if (!crc_ok)
 		ereport(ERROR,
 				(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -80,7 +83,9 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 		elog(ERROR, "return type must be a row type");
 
 	/* Read the control file. */
+	LWLockAcquire(ControlFileLock, LW_SHARED);
 	ControlFile = get_controlfile(DataDir, &crc_ok);
+	LWLockRelease(ControlFileLock);
 	if (!crc_ok)
 		ereport(ERROR,
 				(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -169,7 +174,9 @@ pg_control_recovery(PG_FUNCTION_ARGS)
 		elog(ERROR, "return type must be a row type");
 
 	/* read the control file */
+	LWLockAcquire(ControlFileLock, LW_SHARED);
 	ControlFile = get_controlfile(DataDir, &crc_ok);
+	LWLockRelease(ControlFileLock);
 	if (!crc_ok)
 		ereport(ERROR,
 				(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -208,7 +215,9 @@ pg_control_init(PG_FUNCTION_ARGS)
 		elog(ERROR, "return type must be a row type");
 
 	/* read the control file */
+	LWLockAcquire(ControlFileLock, LW_SHARED);
 	ControlFile = get_controlfile(DataDir, &crc_ok);
+	LWLockRelease(ControlFileLock);
 	if (!crc_ok)
 		ereport(ERROR,
 				(errmsg("calculated CRC checksum does not match value stored in file")));
-- 
2.41.0

Reply via email to