This is a frustrating thread, because despite the last patch solving most of the problems we discussed, it doesn't address the low-level-backup procedure in a nice way. We'd have to tell users they have to flock that file, or add a new step "pg_controldata --raw > pg_control", which seems weird when they already have a connection to the server.
Maybe it just doesn't matter if eg the pg_controldata program can spuriously fail if pointed at a running system, and I was being too dogmatic trying to fix even that. Maybe we should just focus on fixing backups. Even there, I am beginning to suspect we are solving this problem in the wrong place when a higher level change could simplify the problem away. Idea for future research: Perhaps pg_backup_stop()'s label-file output should include the control file image (suitably encoded)? Then the recovery-from-label code could completely ignore the existing control file, and overwrite it using that copy. It's already partially ignoring it, by using the label file's checkpoint LSN instead of the control file's. Perhaps the captured copy could include the correct LSN already, simplifying that code, and the low level backup procedure would not need any additional steps or caveats. No more atomicity problem for low-level-backups... but probably not something we would back-patch, for such a rare failure mode. Here's a new minimal patch that solves only the bugs in basebackup + the simple SQL-facing functions that read the control file, by simply acquiring ControlFileLock in the obvious places. This should be simple enough for back-patching? Perhaps we could use the control file image from server memory, but that requires us to be certain that its CRC is always up to date. That seems to be true, but I didn't want to require it for this, and it doesn't seem important for non-performance-critical code. Thoughts? As for the other topics that came up in this thread, I kicked the wal_sync_method thing out to its own thread[1]. (There was a logical chain connecting these topics: "can I add file lock system calls here?" -> "well if someone is going to complain that it's performance critical then why are we using unnecessarily slow pg_fsync()?" -> "well if we change that to pg_fdatasync() we have to address known weakness/kludge on macOS first". I don't like the flock stuff anymore, but I do want to fix the known macOS problem independently. Hereby disentangled.) [1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BF0EL4Up6yVYbbcWse4xKaqW4wc2xpw67Pq9FjmByWVg%40mail.gmail.com
From ffc679fa4f4261cf9050571bb486575b2e4995f3 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Tue, 31 Jan 2023 20:01:49 +1300 Subject: [PATCH v4 1/2] Acquire ControlFileLock in base backups. The control file could previously be overwritten while a base backup was reading it. On some file systems with weak interlocking, including ext4 and ntfs, the concurrent read could see a mixture of old and new contents. The resulting copy would be corrupted, and the new cluster would not be able to start up. Other files are read without interlocking, but those will be corrected by replaying the log. Exclude this possibility by acquiring ControlFileLock. Copy into a temporary buffer first because we don't want to hold the lock while in sendFile() code that might sleep if throttling is configured. This requires a minor adjustment to sendFileWithContent() to support binary data. This does not address the same possibility when using using external file system copy tools (as described in the documentation under "Making a Base Backup Using the Low Level API"). Back-patch to all supported releases. Reviewed-by: Anton A. Melnikov <aamelni...@inbox.ru> (earlier version) Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de --- src/backend/backup/basebackup.c | 36 ++++++++++++++++++--------------- 1 file changed, 20 insertions(+), 16 deletions(-) diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c index 45be21131c..f03496665f 100644 --- a/src/backend/backup/basebackup.c +++ b/src/backend/backup/basebackup.c @@ -24,6 +24,7 @@ #include "backup/basebackup_target.h" #include "commands/defrem.h" #include "common/compression.h" +#include "common/controldata_utils.h" #include "common/file_perm.h" #include "lib/stringinfo.h" #include "miscadmin.h" @@ -39,6 +40,7 @@ #include "storage/dsm_impl.h" #include "storage/fd.h" #include "storage/ipc.h" +#include "storage/lwlock.h" #include "storage/reinit.h" #include "utils/builtins.h" #include "utils/guc.h" @@ -84,7 +86,7 @@ static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfile struct stat *statbuf, bool missing_ok, Oid dboid, backup_manifest_info *manifest, const char *spcoid); static void sendFileWithContent(bbsink *sink, const char *filename, - const char *content, + const char *content, int len, backup_manifest_info *manifest); static int64 _tarWriteHeader(bbsink *sink, const char *filename, const char *linktarget, struct stat *statbuf, @@ -315,7 +317,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink) if (ti->path == NULL) { - struct stat statbuf; + ControlFileData *control_file; + bool crc_ok; bool sendtblspclinks = true; char *backup_label; @@ -324,14 +327,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink) /* In the main tar, include the backup_label first... */ backup_label = build_backup_content(backup_state, false); sendFileWithContent(sink, BACKUP_LABEL_FILE, - backup_label, &manifest); + backup_label, -1, &manifest); pfree(backup_label); /* Then the tablespace_map file, if required... */ if (opt->sendtblspcmapfile) { sendFileWithContent(sink, TABLESPACE_MAP, - tablespace_map->data, &manifest); + tablespace_map->data, -1, &manifest); sendtblspclinks = false; } @@ -340,13 +343,13 @@ perform_base_backup(basebackup_options *opt, bbsink *sink) sendtblspclinks, &manifest, NULL); /* ... and pg_control after everything else. */ - if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not stat file \"%s\": %m", - XLOG_CONTROL_FILE))); - sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf, - false, InvalidOid, &manifest, NULL); + LWLockAcquire(ControlFileLock, LW_SHARED); + control_file = get_controlfile(DataDir, &crc_ok); + LWLockRelease(ControlFileLock); + sendFileWithContent(sink, XLOG_CONTROL_FILE, + (char *) control_file, sizeof(*control_file), + &manifest); + pfree(control_file); } else { @@ -591,7 +594,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink) * complete segment. */ StatusFilePath(pathbuf, walFileName, ".done"); - sendFileWithContent(sink, pathbuf, "", &manifest); + sendFileWithContent(sink, pathbuf, "", -1, &manifest); } /* @@ -619,7 +622,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink) /* unconditionally mark file as archived */ StatusFilePath(pathbuf, fname, ".done"); - sendFileWithContent(sink, pathbuf, "", &manifest); + sendFileWithContent(sink, pathbuf, "", -1, &manifest); } /* Properly terminate the tar file. */ @@ -1030,18 +1033,19 @@ SendBaseBackup(BaseBackupCmd *cmd) */ static void sendFileWithContent(bbsink *sink, const char *filename, const char *content, + int len, backup_manifest_info *manifest) { struct stat statbuf; - int bytes_done = 0, - len; + int bytes_done = 0; pg_checksum_context checksum_ctx; if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0) elog(ERROR, "could not initialize checksum of file \"%s\"", filename); - len = strlen(content); + if (len < 0) + len = strlen(content); /* * Construct a stat struct for the backup_label file we're injecting in -- 2.41.0
From 7dd75dc8098f2981c929245ec75ff6e2931f3b71 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sat, 22 Jul 2023 11:50:27 +1200 Subject: [PATCH v4 2/2] Acquire ControlFileLock in SQL functions. Commit dc7d70ea added functions that read the control file, but didn't acquire ControlFileLock. With unlucky timing and on certain file systems, they could read corrupted data that is in the process of being overwritten, and the checksum would fail. Back-patch to all supported releases. Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de --- src/backend/utils/misc/pg_controldata.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c index f2c1084797..a1003a464d 100644 --- a/src/backend/utils/misc/pg_controldata.c +++ b/src/backend/utils/misc/pg_controldata.c @@ -24,6 +24,7 @@ #include "common/controldata_utils.h" #include "funcapi.h" #include "miscadmin.h" +#include "storage/lwlock.h" #include "utils/builtins.h" #include "utils/pg_lsn.h" #include "utils/timestamp.h" @@ -42,7 +43,9 @@ pg_control_system(PG_FUNCTION_ARGS) elog(ERROR, "return type must be a row type"); /* read the control file */ + LWLockAcquire(ControlFileLock, LW_SHARED); ControlFile = get_controlfile(DataDir, &crc_ok); + LWLockRelease(ControlFileLock); if (!crc_ok) ereport(ERROR, (errmsg("calculated CRC checksum does not match value stored in file"))); @@ -80,7 +83,9 @@ pg_control_checkpoint(PG_FUNCTION_ARGS) elog(ERROR, "return type must be a row type"); /* Read the control file. */ + LWLockAcquire(ControlFileLock, LW_SHARED); ControlFile = get_controlfile(DataDir, &crc_ok); + LWLockRelease(ControlFileLock); if (!crc_ok) ereport(ERROR, (errmsg("calculated CRC checksum does not match value stored in file"))); @@ -169,7 +174,9 @@ pg_control_recovery(PG_FUNCTION_ARGS) elog(ERROR, "return type must be a row type"); /* read the control file */ + LWLockAcquire(ControlFileLock, LW_SHARED); ControlFile = get_controlfile(DataDir, &crc_ok); + LWLockRelease(ControlFileLock); if (!crc_ok) ereport(ERROR, (errmsg("calculated CRC checksum does not match value stored in file"))); @@ -208,7 +215,9 @@ pg_control_init(PG_FUNCTION_ARGS) elog(ERROR, "return type must be a row type"); /* read the control file */ + LWLockAcquire(ControlFileLock, LW_SHARED); ControlFile = get_controlfile(DataDir, &crc_ok); + LWLockRelease(ControlFileLock); if (!crc_ok) ereport(ERROR, (errmsg("calculated CRC checksum does not match value stored in file"))); -- 2.41.0