On 02/07/2024 02:24, Noah Misch wrote:
On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:
On 01/07/2024 23:52, Noah Misch wrote:
Commit 8af2565 wrote:
--- /dev/null
+++ b/src/backend/storage/smgr/bulk_write.c

+/*
+ * Finish bulk write operation.
+ *
+ * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
+ * the relation if needed.
+ */
+void
+smgr_bulk_finish(BulkWriteState *bulkstate)
+{
+       /* WAL-log and flush any remaining pages */
+       smgr_bulk_flush(bulkstate);
+
+       /*
+        * When we wrote out the pages, we passed skipFsync=true to avoid the
+        * overhead of registering all the writes with the checkpointer.  
Register
+        * the whole relation now.
+        *
+        * There is one hole in that idea: If a checkpoint occurred while we 
were
+        * writing the pages, it already missed fsyncing the pages we had 
written
+        * before the checkpoint started.  A crash later on would replay the WAL
+        * starting from the checkpoint, therefore it wouldn't replay our 
earlier
+        * WAL records.  So if a checkpoint started after the bulk write, fsync
+        * the files now.
+        */
+       if (!SmgrIsTemp(bulkstate->smgr))
+       {

Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
decision is irrelevant to the !wal case.  Either we don't need fsync at all
(TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).

The point of GetRedoRecPtr() is to detect if a checkpoint has started
concurrently. It works for that purpose whether or not the bulk load is
WAL-logged. It is not compared with the LSNs of WAL records written by the
bulk load.

I think the significance of start_RedoRecPtr is it preceding all records
needed to recreate the bulk write.  If start_RedoRecPtr==GetRedoRecPtr() and
we crash after commit, we're indifferent to whether the rel gets synced at a
checkpoint before that crash or rebuilt from WAL after that crash.  If
start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already
deleted, so only smgrimmedsync() suffices.  Overall, while it is not compared
with LSNs in WAL records, it's significant only to the extent that such a WAL
record exists.  What am I missing?

You're right. You pointed out below that we don't need to register or immediately fsync the relation if it was not WAL-logged, I missed that.

In the alternative universe that we did need to fsync() even in !use_wal case, the point of the start_RedoRecPtr==GetRedoRecPtr() was to detect whether the last checkpoint "missed" fsyncing the files that we wrote. But the point is moot now.

Unlogged tables do need to be fsync'd. The scenario is:

1. Bulk load an unlogged table.
2. Shut down Postgres cleanly
3. Pull power plug from server, and restart.

We talked about this earlier in the "Unlogged relation copy is not fsync'd"
thread [1]. I had already forgotten about that; that bug actually still
exists in back branches, and we should fix it..

[1] 
https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi

Ah, that's right.  I agree this code suffices for unlogged.  As a further
optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and
always call smgrregistersync().  (For any rel, smgrimmedsync() improves on
smgrregistersync() only if we fail to reach the shutdown checkpoint.  Without
a shutdown checkpoint, unlogged rels get reset anyway.)

I don't see any functional problem, but this likely arranges for an
unnecessary sync when a checkpoint starts between mdcreate() and
here.  (The mdcreate() sync may also be unnecessary, but that's
longstanding.)
Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It
seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the
fsync, if it knew that smgrDoPendingSyncs() will do it later. However,
smgrDoPendingSyncs() might also decide to WAL-log the relation instead of
fsyncing it, and in that case we do still need the fsync.

We do not need the fsync in the "WAL-log the relation instead" case; see
https://postgr.es/m/20230921062210.ga110...@rfd.leadboat.com

Ah, true, I missed that log_newpage_range() loads the pages to the buffer cache and dirties them. That kinds of sucks actually, I wish it didn't need to dirty the buffers.

So maybe like this:

   if (use_wal) /* includes init forks */
     current logic;
   else if (unlogged)
     smgrregistersync;
   /* else temp || (permanent && wal_level=minimal): nothing to do */

Makes sense, except that we cannot distinguish between unlogged relations and permanent relations with !use_wal here.

It would be nice to have relpersistence flag in SMgrRelation. I remember wanting to have that before, although I don't remember what the context was exactly.

Fortunately, fsync() on a file that's already flushed to disk is pretty
cheap.

Yep.  I'm more concerned about future readers wondering why the function is
using LSNs to decide what to do about data that doesn't appear in WAL.  A
comment could be another way to fix that, though.

Agreed, this is all very subtle, and deserves a good comment. What do you think of the attached?

--
Heikki Linnakangas
Neon (https://neon.tech)
From 6a7a2f34b2134b055c629789aa18a4ad0c4b50a9 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Tue, 2 Jul 2024 14:30:29 +0300
Subject: [PATCH 1/1] Relax fsyncing at end of bulk load that was not
 WAL-logged

And improve the comments.
---
 src/backend/storage/smgr/bulk_write.c | 71 ++++++++++++++++++++++-----
 1 file changed, 60 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
index 4a10ece4c39..f66d718c7be 100644
--- a/src/backend/storage/smgr/bulk_write.c
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -132,19 +132,68 @@ smgr_bulk_finish(BulkWriteState *bulkstate)
 	smgr_bulk_flush(bulkstate);
 
 	/*
-	 * When we wrote out the pages, we passed skipFsync=true to avoid the
-	 * overhead of registering all the writes with the checkpointer.  Register
-	 * the whole relation now.
-	 *
-	 * There is one hole in that idea: If a checkpoint occurred while we were
-	 * writing the pages, it already missed fsyncing the pages we had written
-	 * before the checkpoint started.  A crash later on would replay the WAL
-	 * starting from the checkpoint, therefore it wouldn't replay our earlier
-	 * WAL records.  So if a checkpoint started after the bulk write, fsync
-	 * the files now.
+	 * Fsync the relation, or ask the checkpoint to register it, if necessary.
 	 */
-	if (!SmgrIsTemp(bulkstate->smgr))
+	if (SmgrIsTemp(bulkstate->smgr))
 	{
+		/* Temporary relations don't need to be fsync'd, ever */
+	}
+	else if (!bulkstate->use_wal)
+	{
+		/*
+		 * This is either an unlogged relation, or a permanent relation but we
+		 * skipped WAL-logging because wal_level=minimal:
+		 *
+		 * A) Unlogged relation
+		 *
+		 *    Unlogged relations will go away on crash, but they need to be
+		 *    fsync'd on a clean shutdown. It's sufficient to call
+		 *    smgrregistersync(), that ensures that the checkpointer will
+		 *    flush it at the shutdown checkpoint. (It will flush it on the
+		 *    next online checkpoint too, which is not strictly necessary.)
+		 *
+		 *    Note that the init-fork of an unlogged relation is not
+		 *    considered unlogged for our purposes. It's treated like a
+		 *    regular permanent relation. The callers will pass use_wal=true
+		 *    for the init fork.
+		 *
+		 * B) Permanent relation, WAL-logging skipped because wal_level=minimal
+		 *
+		 *    This is a new relation, and we didn't WAL-log the pages as we
+		 *    wrote, but they need to be fsync'd before commit.
+		 *
+		 *    We don't need to do that here, however. The fsync() is done at
+		 *    commit, by smgrDoPendingSyncs() (*).
+		 *
+		 *    (*) smgrDoPendingSyncs() might decide to WAL-log the whole
+		 *    relation at commit instead of fsyncing it, if the relation was
+		 *    very small, but it's smgrDoPendingSyncs() responsibility in any
+		 *    case.
+		 *
+		 * We cannot distinguish the two here, so conservatively assume it's
+		 * an unlogged relation. A permanent relation with wal_level=minimal
+		 * would require no actions, see above.
+		 */
+		smgrregistersync(bulkstate->smgr, bulkstate->forknum);
+	}
+	else
+	{
+		/*
+		 * Permanent relation, WAL-logged normally.
+		 *
+		 * We already WAL-logged all the pages, so they will be replayed from
+		 * WAL on crash. However, when we wrote out the pages, we passed
+		 * skipFsync=true to avoid the overhead of registering all the writes
+		 * with the checkpointer.  Register the whole relation now.
+		 *
+		 * There is one hole in that idea: If a checkpoint occurred while we
+		 * were writing the pages, it already missed fsyncing the pages we had
+		 * written before the checkpoint started.  A crash later on would
+		 * replay the WAL starting from the checkpoint, therefore it wouldn't
+		 * replay our earlier WAL records.  So if a checkpoint started after
+		 * the bulk write, fsync the files now.
+		 */
+
 		/*
 		 * Prevent a checkpoint from starting between the GetRedoRecPtr() and
 		 * smgrregistersync() calls.
-- 
2.39.2

Reply via email to