On 2021/09/08 10:45, Kyotaro Horiguchi wrote:
Anyway there's no guarantee on the archive ordering. As discussed in
the nearby thread [1], newer segment is often archived earlier. I
agree that that happens mainly on busy servers, though. The archiver
is designed to handle such "gaps" and/or out-of-order segment
notifications.  We could impose a strict ordering on archiving but I
think we would take total performance than such strictness.

Yes, there are other cases causing newer WAL file to be archived eariler.
The issue can happen if XLogArchiveNotify() fails to create .ready file,
for example. Fixing only the case that we're discussing here is not enough.
If *general* fix is discussed at the thread you told, it's better to
do nothing here for the issue and to just make the startup process call
XLogArchiveCheckDone() if it finds the WAL file including XLOG_SWITCH record.


At least currently, recovery fetches segments by a single process and
every file is marked immediately after being filled-up, so all files
other than the latest one in pg_wal including history files should
have been marked for sure unless file system gets into a trouble.

You can reproduce that situation easily by starting the server with
archive_mode=off, generating WAL files, sometimes running pg_switch_wal(),
causing the server to crash, and then restarting the server with
archive_mode=on. At the beginning of recovery, all the WAL files in pg_wal
don't have their archive notification files at all. Then, with the patch,
only WAL files including XLOG_SWITCH are notified for WAL archiving
during recovery. The other WAL files will be notified at the subsequent
checkpoint.


I'm not sure I like that XLogWalRcvClose hides the segment-switch
condition.  If we do that check in the function, I'd like to name the
function XLogWalRcvCloseIfSwitched or something indicates the
condition.  I'd like to invert the condition to reduce indentation,
too.

We can move the condition-check out of the function like the attached patch.


Why don't we call it just after writing data, as my first proposal
did? There's no difference in functionality between doing that and the
patch.  If we do so, recvFile>=0 is always true and that condition can
be removed but that would be optional.  Anyway, by doing that, no
longer need to call the function twice or we can even live without the
new function.

I think that it's better and more robust to confirm that the currently-opened
WAL file is valid target one to write WAL *before* actually writing any data
into it.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/walreceiver.c 
b/src/backend/replication/walreceiver.c
index 9a2bc37fd7..8d7f52352d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -125,6 +125,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvClose(XLogRecPtr recptr);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -883,42 +884,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
        {
                int                     segbytes;
 
-               if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, 
wal_segment_size))
+               /* Close the current segment if it's completed */
+               if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, 
wal_segment_size))
+                       XLogWalRcvClose(recptr);
+
+               if (recvFile < 0)
                {
-                       /*
-                        * fsync() and close current file before we switch to 
next one. We
-                        * would otherwise have to reopen this file to fsync it 
later
-                        */
-                       if (recvFile >= 0)
-                       {
-                               char            xlogfname[MAXFNAMELEN];
-
-                               XLogWalRcvFlush(false);
-
-                               XLogFileName(xlogfname, recvFileTLI, recvSegNo, 
wal_segment_size);
-
-                               /*
-                                * XLOG segment files will be re-read by 
recovery in startup
-                                * process soon, so we don't advise the OS to 
release cache
-                                * pages associated with the file like 
XLogFileClose() does.
-                                */
-                               if (close(recvFile) != 0)
-                                       ereport(PANIC,
-                                                       
(errcode_for_file_access(),
-                                                        errmsg("could not 
close log segment %s: %m",
-                                                                       
xlogfname)));
-
-                               /*
-                                * Create .done file forcibly to prevent the 
streamed segment
-                                * from being archived later.
-                                */
-                               if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-                                       XLogArchiveForceDone(xlogfname);
-                               else
-                                       XLogArchiveNotify(xlogfname);
-                       }
-                       recvFile = -1;
-
                        /* Create/use new log file */
                        XLByteToSeg(recptr, recvSegNo, wal_segment_size);
                        recvFile = XLogFileInit(recvSegNo);
@@ -967,6 +938,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
        /* Update shared-memory status */
        pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+
+       /*
+        * Close the current segment if it's fully written up in the last cycle 
of
+        * the loop, to create its archive notification file soon. Otherwise WAL
+        * archiving of the segment will be delayed until any data in the next
+        * segment is received and written.
+        */
+       if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+               XLogWalRcvClose(recptr);
 }
 
 /*
@@ -1020,6 +1000,52 @@ XLogWalRcvFlush(bool dying)
        }
 }
 
+/*
+ * Close the current segment.
+ *
+ * Flush the segment to disk before closing it. Otherwise we have to
+ * reopen and fsync it later.
+ *
+ * Create an archive notification file since the segment is known completed.
+ */
+static void
+XLogWalRcvClose(XLogRecPtr recptr)
+{
+       char            xlogfname[MAXFNAMELEN];
+
+       Assert(recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, 
wal_segment_size));
+
+       /*
+        * fsync() and close current file before we switch to next one. We
+        * would otherwise have to reopen this file to fsync it later
+        */
+       XLogWalRcvFlush(false);
+
+       XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+       /*
+        * XLOG segment files will be re-read by recovery in startup process
+        * soon, so we don't advise the OS to release cache pages associated
+        * with the file like XLogFileClose() does.
+        */
+       if (close(recvFile) != 0)
+               ereport(PANIC,
+                               (errcode_for_file_access(),
+                                errmsg("could not close log segment %s: %m",
+                                               xlogfname)));
+
+       /*
+        * Create .done file forcibly to prevent the streamed segment from
+        * being archived later.
+        */
+       if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+               XLogArchiveForceDone(xlogfname);
+       else
+               XLogArchiveNotify(xlogfname);
+
+       recvFile = -1;
+}
+
 /*
  * Send reply message to primary, indicating our current WAL locations, oldest
  * xmin and the current time.

Reply via email to