On 2021/09/08 10:45, Kyotaro Horiguchi wrote:
Anyway there's no guarantee on the archive ordering. As discussed in the nearby thread [1], newer segment is often archived earlier. I agree that that happens mainly on busy servers, though. The archiver is designed to handle such "gaps" and/or out-of-order segment notifications. We could impose a strict ordering on archiving but I think we would take total performance than such strictness.
Yes, there are other cases causing newer WAL file to be archived eariler. The issue can happen if XLogArchiveNotify() fails to create .ready file, for example. Fixing only the case that we're discussing here is not enough. If *general* fix is discussed at the thread you told, it's better to do nothing here for the issue and to just make the startup process call XLogArchiveCheckDone() if it finds the WAL file including XLOG_SWITCH record.
At least currently, recovery fetches segments by a single process and every file is marked immediately after being filled-up, so all files other than the latest one in pg_wal including history files should have been marked for sure unless file system gets into a trouble.
You can reproduce that situation easily by starting the server with archive_mode=off, generating WAL files, sometimes running pg_switch_wal(), causing the server to crash, and then restarting the server with archive_mode=on. At the beginning of recovery, all the WAL files in pg_wal don't have their archive notification files at all. Then, with the patch, only WAL files including XLOG_SWITCH are notified for WAL archiving during recovery. The other WAL files will be notified at the subsequent checkpoint.
I'm not sure I like that XLogWalRcvClose hides the segment-switch condition. If we do that check in the function, I'd like to name the function XLogWalRcvCloseIfSwitched or something indicates the condition. I'd like to invert the condition to reduce indentation, too.
We can move the condition-check out of the function like the attached patch.
Why don't we call it just after writing data, as my first proposal did? There's no difference in functionality between doing that and the patch. If we do so, recvFile>=0 is always true and that condition can be removed but that would be optional. Anyway, by doing that, no longer need to call the function twice or we can even live without the new function.
I think that it's better and more robust to confirm that the currently-opened WAL file is valid target one to write WAL *before* actually writing any data into it. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 9a2bc37fd7..8d7f52352d 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -125,6 +125,7 @@ static void WalRcvDie(int code, Datum arg); static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len); static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr); static void XLogWalRcvFlush(bool dying); +static void XLogWalRcvClose(XLogRecPtr recptr); static void XLogWalRcvSendReply(bool force, bool requestReply); static void XLogWalRcvSendHSFeedback(bool immed); static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime); @@ -883,42 +884,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) { int segbytes; - if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size)) + /* Close the current segment if it's completed */ + if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size)) + XLogWalRcvClose(recptr); + + if (recvFile < 0) { - /* - * fsync() and close current file before we switch to next one. We - * would otherwise have to reopen this file to fsync it later - */ - if (recvFile >= 0) - { - char xlogfname[MAXFNAMELEN]; - - XLogWalRcvFlush(false); - - XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size); - - /* - * XLOG segment files will be re-read by recovery in startup - * process soon, so we don't advise the OS to release cache - * pages associated with the file like XLogFileClose() does. - */ - if (close(recvFile) != 0) - ereport(PANIC, - (errcode_for_file_access(), - errmsg("could not close log segment %s: %m", - xlogfname))); - - /* - * Create .done file forcibly to prevent the streamed segment - * from being archived later. - */ - if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS) - XLogArchiveForceDone(xlogfname); - else - XLogArchiveNotify(xlogfname); - } - recvFile = -1; - /* Create/use new log file */ XLByteToSeg(recptr, recvSegNo, wal_segment_size); recvFile = XLogFileInit(recvSegNo); @@ -967,6 +938,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) /* Update shared-memory status */ pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write); + + /* + * Close the current segment if it's fully written up in the last cycle of + * the loop, to create its archive notification file soon. Otherwise WAL + * archiving of the segment will be delayed until any data in the next + * segment is received and written. + */ + if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size)) + XLogWalRcvClose(recptr); } /* @@ -1020,6 +1000,52 @@ XLogWalRcvFlush(bool dying) } } +/* + * Close the current segment. + * + * Flush the segment to disk before closing it. Otherwise we have to + * reopen and fsync it later. + * + * Create an archive notification file since the segment is known completed. + */ +static void +XLogWalRcvClose(XLogRecPtr recptr) +{ + char xlogfname[MAXFNAMELEN]; + + Assert(recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size)); + + /* + * fsync() and close current file before we switch to next one. We + * would otherwise have to reopen this file to fsync it later + */ + XLogWalRcvFlush(false); + + XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size); + + /* + * XLOG segment files will be re-read by recovery in startup process + * soon, so we don't advise the OS to release cache pages associated + * with the file like XLogFileClose() does. + */ + if (close(recvFile) != 0) + ereport(PANIC, + (errcode_for_file_access(), + errmsg("could not close log segment %s: %m", + xlogfname))); + + /* + * Create .done file forcibly to prevent the streamed segment from + * being archived later. + */ + if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS) + XLogArchiveForceDone(xlogfname); + else + XLogArchiveNotify(xlogfname); + + recvFile = -1; +} + /* * Send reply message to primary, indicating our current WAL locations, oldest * xmin and the current time.