On Tue, Jan 31, 2023 at 5:09 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> Clearly there is an element of speculation or superstition here.  I
> don't know what else to do if both PostgreSQL and ext4 decided not to
> add interlocking.  Maybe we should rethink that.  How bad would it
> really be if control file access used POSIX file locking?  I mean, the
> writer is going to *fsync* the file, so it's not like one more wafer
> thin system call is going to hurt too much.

Here's an experimental patch for that alternative.  I wonder if
someone would want to be able to turn it off for some reason -- maybe
some NFS problem?  It's less back-patchable, but maybe more
principled?

I don't know if Windows suffers from this type of problem.
Unfortunately its equivalent functionality LockFile() looks non-ideal
for this purpose: if your program crashes, the documentation is very
vague on when exactly it is released by the OS, but it's not
immediately on process exit.  That seems non-ideal for a control file
you might want to access again very soon after a crash, to be able to
recover.

A thought in passing: if UpdateMinRecoveryPoint() performance is an
issue, maybe we should figure out how to use fdatasync() instead of
fsync().
From fe71350b0874adbec97c612e7b80e7179c6c48e9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 31 Jan 2023 20:01:49 +1300
Subject: [PATCH 1/2] Lock pg_control while reading or writing.

Front-end programs that read pg_control and user-facing functions run
in the backend read pg_control without acquiring ControlFileLock.  If
you're unlucky enough to read() while the backend is in write(), on at
least Linux ext4 you might see partial data.  Use a POSIX advisory lock
to avoid this problem.
---
 src/common/controldata_utils.c | 69 ++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 9723587466..284895bdd9 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -39,6 +39,42 @@
 #include "storage/fd.h"
 #endif
 
+/*
+ * Advisory-lock the control file until closed.
+ */
+static int
+lock_file(int fd, bool exclusive)
+{
+#ifdef WIN32
+	/*
+	 * LockFile() might work, but it seems dangerous because there are reports
+	 * that the lock can sometimes linger if a program crashes without
+	 * unlocking.  That would prevent recovery after a PANIC, so we'd better
+	 * not use it without more research.  Because of the lack of portability,
+	 * this function is kept private to controldata_utils.c.
+	 */
+	return 0;
+#else
+	struct flock lock;
+	int		rc;
+
+	memset(&lock, 0, sizeof(lock));
+	lock.l_type = exclusive ? F_WRLCK : F_RDLCK;
+	lock.l_start = 0;
+	lock.l_whence = SEEK_SET;
+	lock.l_len = 0;
+	lock.l_pid = -1;
+
+	do
+	{
+		rc = fcntl(fd, F_SETLKW, &lock);
+	}
+	while (rc < 0 && errno == EINTR);
+
+	return rc;
+#endif
+}
+
 /*
  * get_controlfile()
  *
@@ -74,6 +110,20 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
 				 ControlFilePath);
 #endif
 
+	/* Make sure we can read the file atomically. */
+	if (lock_file(fd, false) < 0)
+	{
+#ifndef FRONTEND
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not lock file \"%s\" for reading: %m",
+						ControlFilePath)));
+#else
+		pg_fatal("could not lock file \"%s\" for reading: %m",
+				 ControlFilePath);
+#endif
+	}
+
 	r = read(fd, ControlFile, sizeof(ControlFileData));
 	if (r != sizeof(ControlFileData))
 	{
@@ -186,6 +236,25 @@ update_controlfile(const char *DataDir,
 		pg_fatal("could not open file \"%s\": %m", ControlFilePath);
 #endif
 
+	/*
+	 * Make sure that any concurrent reader (including frontend programs) can
+	 * read the file atomically.  Note that this refers to atomicity of
+	 * concurrent reads and writes.  For our assumption of atomicity under
+	 * power failure, see PG_CONTROL_MAX_SAFE_SIZE.
+	 */
+	if (lock_file(fd, true) < 0)
+	{
+#ifndef FRONTEND
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not lock file \"%s\" for writing: %m",
+						ControlFilePath)));
+#else
+		pg_fatal("could not lock file \"%s\" for writing: %m",
+				 ControlFilePath);
+#endif
+	}
+
 	errno = 0;
 #ifndef FRONTEND
 	pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE);
-- 
2.35.1

Reply via email to