On Tue, Feb 14, 2023 at 4:38 PM Anton A. Melnikov <aamelni...@inbox.ru> wrote:
> First of all it seemed to me that is not a problem at all since msdn
> guarantees sector-by-sector atomicity.
> "Physical Sector: The unit for which read and write operations to the device
> are completed in a single operation. This is the unit of atomic write..."
> https://learn.microsoft.com/en-us/windows/win32/fileio/file-buffering
> https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile
> (Of course, only if the 512 bytes lays from the beginning of the file with a 
> zero
> offset, but this is our case. The current size of ControlFileData is
> 296 bytes at offset = 0.)

There are two kinds of atomicity that we rely on for the control file today:

* atomicity on power loss (= device property, in case of overwrite filesystems)
* atomicity of concurrent reads and writes (= VFS or kernel buffer
pool interlocking policy)

I assume that documentation is talking about the first thing (BTW, I
suspect that documentation is also now wrong in one special case: NTFS
filesystems mounted on DAX devices don't have sectors or sector-based
atomicity unless you turn on BTT and slow them down[1]; that might
eventually be something for PostgreSQL to think about, and it applies
to other OSes too).

With this patch we would stop relying on the second thing.  Supposedly
POSIX requires read/write atomicity, and many file systems offer it in
a strong form (internal range locking) or maybe a weak accidental form
(page-level locking).  Since some extremely popular implementations
just don't care, and Windows isn't even a POSIX, we just have to do it
ourselves.

BTW there are at least two other places where PostgreSQL already knows
that concurrent reads and writes are possibly non-atomic (and we also
don't even try to get the alignment right, making the question moot):
pg_basebackup, which enables full_page_writes implicitly if you didn't
have the GUC on already, and pg_rewind, which refuses to run if you
haven't enabled full_page_writes explicitly (as recently discussed on
another thread recently; that's an annoying difference, and also an
annoying behaviour if you know your storage doesn't really need it!)

> I tried to verify this fact experimentally and was very surprised.
> Unfortunately it crashed in an hour during torture test:
> 2023-02-13 15:10:52.675 MSK [5704] LOG:  starting PostgreSQL 16devel, 
> compiled by Visual C++ build 1929, 64-bit
> 2023-02-13 15:10:52.768 MSK [5704] LOG:  database system is ready to accept 
> connections
> @@@@@@ sizeof(ControlFileData) = 296
> .........
> 2023-02-13 16:10:41.997 MSK [9380] ERROR:  calculated CRC checksum does not 
> match value stored in file

Thanks.  I'd seen reports of this in discussions on the 'net, but
those had no authoritative references or supporting test results.  The
fact that fsync made it take longer (in your following email) makes
sense and matches Linux.  It inserts a massive pause in the middle of
the experiment loop, affecting the probabilities.

Therefore, we need a solution for Windows too.  I tried to write the
equivalent code, in the attached.  I learned a few things: Windows
locks are mandatory (that is, if you lock a file, reads/writes can
fail, unlike Unix).  Combined with async release, that means we
probably also need to lock the file in xlog.c, when reading it in
xlog.c:ReadControlFile() (see comments).  Before I added that, on a CI
run, I found that the read in there would sometimes fail, and adding
the locking fixed that.  I am a bit confused, though, because I
expected that to be necessary only if someone managed to crash while
holding the write lock, which the CI tests shouldn't do.  Do you have
any ideas?

While contemplating what else a mandatory file lock might break, I
remembered that basebackup.c also reads the control file.  Hrmph.  Not
addressed yet; I guess it might need to acquire/release around
sendFile(sink, XLOG_CONTROL_FILE, ...)?

> > I would
> > only want to consider this if we also stop choosing "open_datasync" as
> > a default on at least Windows.
>
> I didn't quite understand this point. Could you clarify it for me, please? If 
> the performance
> of UpdateMinRecoveryPoint() wasn't a problem we could just use fsync in all 
> platforms.

The level of durability would be weakened on Windows.  On all systems
except Linux and FreeBSD, we default to wal_sync_method=open_datasync,
so then we would start using FILE_FLAG_WRITE_THROUGH for the control
file too.  We know from reading and experimentation that
FILE_FLAG_WRITE_THROUGH doesn't flush the drive cache on consumer
Windows hardware, but its fdatasync-like thing does[2].  I have not
thought too hard about the consequences of using different durability
levels for different categories of file, but messing with write
ordering can't be good for crash recovery, so I'd rather increase WAL
durability than decrease control file durability.  If a Windows user
complains that it makes their fancy non-volatile cache slow down, they
can always adjust the settings in PostgreSQL, their OS, or their
drivers etc.  I think we should just make fdatasync the default on all
systems.

Here's a patch like that.

[1] 
https://learn.microsoft.com/en-us/windows-server/storage/storage-spaces/persistent-memory-direct-access
[2] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Ba-7r4GpADsasCnuDBiqC1c31DAQQco2FayVtB9V3sQw%40mail.gmail.com#460bfa5a6b571cc98c575d23322e0b25
From 816760e52a1233b29674869205ce9283aab28a73 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 31 Jan 2023 20:01:49 +1300
Subject: [PATCH 1/3] Lock pg_control while reading or writing.

Front-end programs that read pg_control and user-facing functions run
in the backend read pg_control without acquiring ControlFileLock.  If
you're unlucky enough to read() while the backend is in write(), on at
least ext4 or NTFS, you might see partial data.  Use a POSIX advisory
file lock or a Windows mandatory file lock, to avoid this problem.

Since the semantics of POSIX and Windows file locks are different
and error-prone, don't try to provide a general purpose reusable
function for those operations; keep them close to the control file code.

Reviewed-by: Anton A. Melnikov <aamelni...@inbox.ru>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
 src/backend/access/transam/xlog.c      |  21 +++++
 src/common/controldata_utils.c         | 126 +++++++++++++++++++++++++
 src/include/common/controldata_utils.h |   4 +
 3 files changed, 151 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..9b9c3294e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3980,6 +3980,19 @@ ReadControlFile(void)
 				 errmsg("could not open file \"%s\": %m",
 						XLOG_CONTROL_FILE)));
 
+	/*
+	 * Since file locks are released asynchronously after a program crashes on
+	 * Windows, there is a small chance that the write lock of a previously
+	 * running postmaster hasn't been released yet.  We need to wait for that
+	 * here; if we didn't, our read might fail, because Windows file locks are
+	 * mandatory (that is, not advisory).
+	 */
+	if (lock_controlfile(fd, false) < 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not lock file \"%s\": %m",
+						XLOG_CONTROL_FILE)));
+
 	pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_READ);
 	r = read(fd, ControlFile, sizeof(ControlFileData));
 	if (r != sizeof(ControlFileData))
@@ -3997,6 +4010,14 @@ ReadControlFile(void)
 	}
 	pgstat_report_wait_end();
 
+#ifdef WIN32
+	if (unlock_controlfile(fd) < 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not unlock file \"%s\": %m",
+						XLOG_CONTROL_FILE)));
+#endif
+
 	close(fd);
 
 	/*
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 9723587466..45b43ae546 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -39,6 +39,91 @@
 #include "storage/fd.h"
 #endif
 
+/*
+ * Lock the control file until closed.  This should only be used on file where
+ * only one descriptor is open in a given process (due to weird semantics of
+ * POSIX locks).  On Windows, see unlock_controlfile(), but otherwise the lock
+ * is held until the file is closed.
+ */
+int
+lock_controlfile(int fd, bool exclusive)
+{
+#ifdef WIN32
+	OVERLAPPED	overlapped = {0};
+	HANDLE		handle;
+
+	handle = (HANDLE) _get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE)
+	{
+		errno = EBADF;
+		return -1;
+	}
+
+	/* Lock first byte. */
+	if (!LockFileEx(handle,
+					exclusive ? LOCKFILE_EXCLUSIVE_LOCK : 0,
+					0,
+					1,
+					0,
+					&overlapped))
+	{
+		_dosmaperr(GetLastError());
+		return -1;
+	}
+
+	return 0;
+#else
+	struct flock lock;
+	int			rc;
+
+	memset(&lock, 0, sizeof(lock));
+	lock.l_type = exclusive ? F_WRLCK : F_RDLCK;
+	lock.l_start = 0;
+	lock.l_whence = SEEK_SET;
+	lock.l_len = 0;
+	lock.l_pid = -1;
+
+	do
+	{
+		rc = fcntl(fd, F_SETLKW, &lock);
+	}
+	while (rc < 0 && errno == EINTR);
+
+	return rc;
+#endif
+}
+
+#ifdef WIN32
+/*
+ * Release lock acquire with lock_controlfile().  On POSIX systems, we don't
+ * bother making an extra system call to release the lock, since it'll be
+ * released on close anyway.  On Windows, explicit release is recommended by
+ * the documentation to make sure it is done synchronously.
+ */
+int
+unlock_controlfile(int fd)
+{
+	OVERLAPPED	overlapped = {0};
+	HANDLE		handle;
+
+	handle = (HANDLE) _get_osfhandle(fd);
+	if (handle == INVALID_HANDLE_VALUE)
+	{
+		errno = EBADF;
+		return -1;
+	}
+
+	/* Unlock first byte. */
+	if (!UnlockFileEx(handle, 0, 1, 0, &overlapped))
+	{
+		_dosmaperr(GetLastError());
+		return -1;
+	}
+
+	return 0;
+}
+#endif
+
 /*
  * get_controlfile()
  *
@@ -74,6 +159,20 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
 				 ControlFilePath);
 #endif
 
+	/* Make sure we can read the file atomically. */
+	if (lock_controlfile(fd, false) < 0)
+	{
+#ifndef FRONTEND
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not lock file \"%s\" for reading: %m",
+						ControlFilePath)));
+#else
+		pg_fatal("could not lock file \"%s\" for reading: %m",
+				 ControlFilePath);
+#endif
+	}
+
 	r = read(fd, ControlFile, sizeof(ControlFileData));
 	if (r != sizeof(ControlFileData))
 	{
@@ -97,6 +196,10 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
 #endif
 	}
 
+#ifdef WIN32
+	unlock_controlfile(fd);
+#endif
+
 #ifndef FRONTEND
 	if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
@@ -186,6 +289,25 @@ update_controlfile(const char *DataDir,
 		pg_fatal("could not open file \"%s\": %m", ControlFilePath);
 #endif
 
+	/*
+	 * Make sure that any concurrent reader (including frontend programs) can
+	 * read the file atomically.  Note that this refers to atomicity of
+	 * concurrent reads and writes.  For our assumption of atomicity under
+	 * power failure, see PG_CONTROL_MAX_SAFE_SIZE.
+	 */
+	if (lock_controlfile(fd, true) < 0)
+	{
+#ifndef FRONTEND
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not lock file \"%s\" for writing: %m",
+						ControlFilePath)));
+#else
+		pg_fatal("could not lock file \"%s\" for writing: %m",
+				 ControlFilePath);
+#endif
+	}
+
 	errno = 0;
 #ifndef FRONTEND
 	pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE);
@@ -225,6 +347,10 @@ update_controlfile(const char *DataDir,
 #endif
 	}
 
+#ifdef WIN32
+	unlock_controlfile(fd);
+#endif
+
 	if (close(fd) != 0)
 	{
 #ifndef FRONTEND
diff --git a/src/include/common/controldata_utils.h b/src/include/common/controldata_utils.h
index 49e7c52d31..06825cad85 100644
--- a/src/include/common/controldata_utils.h
+++ b/src/include/common/controldata_utils.h
@@ -15,5 +15,9 @@
 extern ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p);
 extern void update_controlfile(const char *DataDir,
 							   ControlFileData *ControlFile, bool do_sync);
+extern int lock_controlfile(int fd, bool exclusive);
+#ifdef WIN32
+extern int unlock_controlfile(int fd);
+#endif
 
 #endif							/* COMMON_CONTROLDATA_UTILS_H */
-- 
2.35.1

From fa594006c3ef7e958d870978c631b003aa680c18 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 16 Feb 2023 20:02:22 +1300
Subject: [PATCH 2/3] Make wal_sync_method=fdatasync the default on all OSes.

Previously, fdatasync was the default level on Linux and FreeBSD by
special rules, and on OpenBSD and DragonflyBSD because they didn't have
O_DSYNC != O_SYNC or O_DSYNC at all.  For every other system, we'd
choose open_datasync.

Use fdatasync everywhere, for consistency.  This became possible after
commit 9430fb40 added support for Windows, the last known system not to
have it.  This means that we'll now flush caches on consumer drives by
default on Windows, where previously we didn't, which seems like a
better default for crash safety.  Users who want the older behavior can
still request it with wal_sync_method=open_datasync.
---
 doc/src/sgml/config.sgml              |  5 ++---
 doc/src/sgml/wal.sgml                 |  8 ++++----
 src/bin/pg_test_fsync/pg_test_fsync.c |  4 ++--
 src/include/access/xlogdefs.h         | 14 +-------------
 src/include/port/freebsd.h            |  7 -------
 src/include/port/linux.h              |  8 --------
 6 files changed, 9 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ecd9aa73ef..b78d374d7e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3107,9 +3107,8 @@ include_dir 'conf.d'
        <para>
         The <literal>open_</literal>* options also use <literal>O_DIRECT</literal> if available.
         Not all of these choices are available on all platforms.
-        The default is the first method in the above list that is supported
-        by the platform, except that <literal>fdatasync</literal> is the default on
-        Linux and FreeBSD.  The default is not necessarily ideal; it might be
+        The default is <literal>fdatasync</literal>.
+        The default is not necessarily ideal; it might be
         necessary to change this setting or other aspects of your system
         configuration in order to create a crash-safe configuration or
         achieve optimal performance.
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index ed7929cbcd..ffc22a7ded 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -106,12 +106,12 @@
     <listitem>
       <para>
         On <productname>Windows</productname>, if <varname>wal_sync_method</varname> is
-        <literal>open_datasync</literal> (the default), write caching can be disabled
+        <literal>open_datasync</literal>, write caching can be disabled
         by unchecking <literal>My Computer\Open\<replaceable>disk drive</replaceable>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</literal>.
         Alternatively, set <varname>wal_sync_method</varname> to
-        <literal>fdatasync</literal> (NTFS only), <literal>fsync</literal> or
-        <literal>fsync_writethrough</literal>, which prevent
-        write caching.
+        <literal>fdatasync</literal> (the default), <literal>fsync</literal> or
+        <literal>fsync_writethrough</literal>, which use system calls that
+        flush write caches.
       </para>
     </listitem>
 
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 3d5e8f30ab..45bd2daf2e 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -292,7 +292,7 @@ test_sync(int writes_per_op)
 		printf(_("\nCompare file sync methods using one %dkB write:\n"), XLOG_BLCKSZ_K);
 	else
 		printf(_("\nCompare file sync methods using two %dkB writes:\n"), XLOG_BLCKSZ_K);
-	printf(_("(in wal_sync_method preference order, except fdatasync is Linux's default)\n"));
+	printf(_("(fdatasync is the default)\n"));
 
 	/*
 	 * Test open_datasync if available
@@ -326,7 +326,7 @@ test_sync(int writes_per_op)
 #endif
 
 /*
- * Test fdatasync if available
+ * Test fdatasync
  */
 	printf(LABEL_FORMAT, "fdatasync");
 	fflush(stdout);
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index fe794c7740..a628976902 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -64,19 +64,7 @@ typedef uint32 TimeLineID;
  */
 typedef uint16 RepOriginId;
 
-/*
- * This chunk of hackery attempts to determine which file sync methods
- * are available on the current platform, and to choose an appropriate
- * default method.
- *
- * Note that we define our own O_DSYNC on Windows, but not O_SYNC.
- */
-#if defined(PLATFORM_DEFAULT_SYNC_METHOD)
-#define DEFAULT_SYNC_METHOD		PLATFORM_DEFAULT_SYNC_METHOD
-#elif defined(O_DSYNC) && (!defined(O_SYNC) || O_DSYNC != O_SYNC)
-#define DEFAULT_SYNC_METHOD		SYNC_METHOD_OPEN_DSYNC
-#else
+/* Default synchronization method for WAL. */
 #define DEFAULT_SYNC_METHOD		SYNC_METHOD_FDATASYNC
-#endif
 
 #endif							/* XLOG_DEFS_H */
diff --git a/src/include/port/freebsd.h b/src/include/port/freebsd.h
index 0e3fde55d6..2e36d3da4f 100644
--- a/src/include/port/freebsd.h
+++ b/src/include/port/freebsd.h
@@ -1,8 +1 @@
 /* src/include/port/freebsd.h */
-
-/*
- * Set the default wal_sync_method to fdatasync.  xlogdefs.h's normal rules
- * would prefer open_datasync on FreeBSD 13+, but that is not a good choice on
- * many systems.
- */
-#define PLATFORM_DEFAULT_SYNC_METHOD	SYNC_METHOD_FDATASYNC
diff --git a/src/include/port/linux.h b/src/include/port/linux.h
index 7a6e46cdbb..acd867606c 100644
--- a/src/include/port/linux.h
+++ b/src/include/port/linux.h
@@ -12,11 +12,3 @@
  * to have a kernel version test here.
  */
 #define HAVE_LINUX_EIDRM_BUG
-
-/*
- * Set the default wal_sync_method to fdatasync.  With recent Linux versions,
- * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
- * perform better and (b) causes outright failures on ext4 data=journal
- * filesystems, because those don't support O_DIRECT.
- */
-#define PLATFORM_DEFAULT_SYNC_METHOD	SYNC_METHOD_FDATASYNC
-- 
2.35.1

From c9dcb77263a7f86b187231a83083bfbe3a0f30b4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 31 Jan 2023 21:08:12 +1300
Subject: [PATCH 3/3] Apply wal_sync_method to pg_control file.

When writing out the control file frequently on a standby, we can go a
little faster on some systems by using fdatasync() instead of fsync(),
with no loss of safety.  We already respected the special
"wal_sync_method=fsync_writethrough" via a circuitous route, but we
might as well support the full range of methods.
---
 src/backend/access/transam/xlog.c      | 19 ++++++++-
 src/common/controldata_utils.c         | 59 ++++++++++++++++++++------
 src/include/common/controldata_utils.h |  7 ++-
 3 files changed, 70 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9b9c3294e8..27c1cb8273 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4185,7 +4185,24 @@ ReadControlFile(void)
 static void
 UpdateControlFile(void)
 {
-	update_controlfile(DataDir, ControlFile, true);
+	int			sync_op;
+
+	switch (sync_method)
+	{
+		case SYNC_METHOD_FDATASYNC:
+			sync_op = UPDATE_CONTROLFILE_FDATASYNC;
+			break;
+		case SYNC_METHOD_OPEN:
+			sync_op = UPDATE_CONTROLFILE_O_SYNC;
+			break;
+		case SYNC_METHOD_OPEN_DSYNC:
+			sync_op = UPDATE_CONTROLFILE_O_DSYNC;
+			break;
+		default:
+			sync_op = UPDATE_CONTROLFILE_FSYNC;
+	}
+
+	update_controlfile(DataDir, ControlFile, sync_op);
 }
 
 /*
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 45b43ae546..3fc2a00d44 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -239,18 +239,39 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
  * update_controlfile()
  *
  * Update controlfile values with the contents given by caller.  The
- * contents to write are included in "ControlFile". "do_sync" can be
- * optionally used to flush the updated control file.  Note that it is up
- * to the caller to properly lock ControlFileLock when calling this
- * routine in the backend.
+ * contents to write are included in "ControlFile".  "sync_op" can be set
+ * to 0 or a synchronization method UPDATE_CONTROLFILE_XXX.  In frontend code,
+ * only 0 and non-zero (fsync) are meaningful.
+ * Note that it is up to the caller to properly lock ControlFileLock when
+ * calling this routine in the backend.
  */
 void
 update_controlfile(const char *DataDir,
-				   ControlFileData *ControlFile, bool do_sync)
+				   ControlFileData *ControlFile,
+				   int sync_op)
 {
 	int			fd;
 	char		buffer[PG_CONTROL_FILE_SIZE];
 	char		ControlFilePath[MAXPGPATH];
+	int			open_flag;
+
+	switch (sync_op)
+	{
+#ifndef FRONTEND
+#ifdef O_SYNC
+		case UPDATE_CONTROLFILE_O_SYNC:
+			open_flag = O_SYNC;
+			break;
+#endif
+#ifdef O_DSYNC
+		case UPDATE_CONTROLFILE_O_DSYNC:
+			open_flag = O_DSYNC;
+			break;
+#endif
+#endif
+		default:
+			open_flag = 0;
+	}
 
 	/* Update timestamp  */
 	ControlFile->time = (pg_time_t) time(NULL);
@@ -278,13 +299,13 @@ update_controlfile(const char *DataDir,
 	 * All errors issue a PANIC, so no need to use OpenTransientFile() and to
 	 * worry about file descriptor leaks.
 	 */
-	if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY)) < 0)
+	if ((fd = BasicOpenFile(ControlFilePath, O_RDWR | PG_BINARY | open_flag)) < 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m",
 						ControlFilePath)));
 #else
-	if ((fd = open(ControlFilePath, O_WRONLY | PG_BINARY,
+	if ((fd = open(ControlFilePath, O_WRONLY | PG_BINARY | open_flag,
 				   pg_file_create_mode)) == -1)
 		pg_fatal("could not open file \"%s\": %m", ControlFilePath);
 #endif
@@ -331,17 +352,29 @@ update_controlfile(const char *DataDir,
 	pgstat_report_wait_end();
 #endif
 
-	if (do_sync)
+	if (sync_op != 0)
 	{
 #ifndef FRONTEND
 		pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE);
-		if (pg_fsync(fd) != 0)
-			ereport(PANIC,
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							ControlFilePath)));
+		if (sync_op == UPDATE_CONTROLFILE_FDATASYNC)
+		{
+			if (fdatasync(fd) != 0)
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not fdatasync file \"%s\": %m",
+								ControlFilePath)));
+		}
+		else if (sync_op == UPDATE_CONTROLFILE_FSYNC)
+		{
+			if (pg_fsync(fd) != 0)
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not fdatasync file \"%s\": %m",
+								ControlFilePath)));
+		}
 		pgstat_report_wait_end();
 #else
+		/* In frontend code, non-zero sync_op gets you plain fsync(). */
 		if (fsync(fd) != 0)
 			pg_fatal("could not fsync file \"%s\": %m", ControlFilePath);
 #endif
diff --git a/src/include/common/controldata_utils.h b/src/include/common/controldata_utils.h
index 06825cad85..de6de5571e 100644
--- a/src/include/common/controldata_utils.h
+++ b/src/include/common/controldata_utils.h
@@ -12,9 +12,14 @@
 
 #include "catalog/pg_control.h"
 
+#define UPDATE_CONTROLFILE_FSYNC 1
+#define UPDATE_CONTROLFILE_FDATASYNC 2
+#define UPDATE_CONTROLFILE_O_SYNC 3
+#define UPDATE_CONTROLFILE_O_DSYNC 4
+
 extern ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p);
 extern void update_controlfile(const char *DataDir,
-							   ControlFileData *ControlFile, bool do_sync);
+							   ControlFileData *ControlFile, int sync_op);
 extern int lock_controlfile(int fd, bool exclusive);
 #ifdef WIN32
 extern int unlock_controlfile(int fd);
-- 
2.35.1

Reply via email to