At Wed, 1 Sep 2021 15:01:43 +0900, Fujii Masao <masao.fu...@oss.nttdata.com> wrote in > > > On 2021/09/01 12:15, Andres Freund wrote: > > Hi, > > On 2021-09-01 11:34:34 +0900, Fujii Masao wrote: > >> On 2021/09/01 0:53, Andres Freund wrote: > >>> Of course, we need to be careful to not weaken WAL validity checking > >>> too > >>> much. How about the following: > >>> > >>> If we're "aborting" a continued record, we set > >>> XLP_FIRST_IS_ABORTED_PARTIAL on > >>> the page at which we do so (i.e. the page after the valid end of the > >>> WAL). > >> > >> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set? It's set > >> when recovery finds a a partially-flushed segment-spanning record? > >> But maybe we cannot do that (i.e., cannot overwrite the page) because > >> the page that the flag is set in might have already been archived. No? > > I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the > > "tail > > end" of a partial record. I.e. if there's a partial record starting in > > the > > successfully archived segment A, but the end of the record, in B, has > > not been > > written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL > > at the > > end of the valid data in B. Which could not have been archived yet, or > > we'd > > not have a partial record. So we should never need to set the flag on > > an > > already archived page. > > Thanks for clarifying that! Unless I misunderstand that, when recovery > ends > at a partially-flushed segment-spanning record, it sets > XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting > writing > new WAL data there. Therefore XLP_FIRST_IS_CONTRECORD or > XLP_FIRST_IS_ABORTED_PARTIAL must be set in the next segment following > a partially-flushed segment-spanning record. If none of them is set, > the validation code in recovery should report an error. > > Yes, this design looks good to me.
So, this is a crude PoC of that. At the end of recovery: - When XLogReadRecord misses a page where the next part of the current continuation record should be seen, xlogreader->ContRecAbortPtr is set to the beginning of the missing page. - When StartupXLOG receives a valid ContRecAbortPtr, the value is used as the next WAL insertion location then sets the same to XLogCtl->contAbortedRecPtr. - When XLogCtl->contAbortedRecPtr is set, AdvanceXLInsertBuffer() (called under XLogInsertRecord()) sets XLP_FIRST_IS_ABORTED_PARTIAL flag to the page. While recovery: - When XLogReadRecord meets a XLP_FIRST_IS_ABORT_PARTIAL page, it rereads a record from there. In this PoC, 1. This patch is written on the current master, but it doesn't interfare with the seg-boundary-memorize patch since it removes the calls to RegisterSegmentBoundary. 2. Since xlogreader cannot emit a log-message immediately, we don't have a means to leave a log message to inform recovery met an aborted partial continuation record. (In this PoC, it is done by fprintf:p) 3. Myebe we need to pg_waldump to show partial continuation records, but I'm not sure how to realize that. 4. By the way, is this (method) applicable in this stage? The attached first is the PoC including debug-crash aid. The second is the repro script. It failes to reproduce the situation once per several trials. The following log messages are shown by a run of the script. > ... > TRAP: FailedAssertion("c++ < 1", File: "xlog.c", Line: 2675, PID: 254644) > ... > LOG: database system is shut down > ... > > LOG: redo starts at 0/2000028 > LOG: redo done at 0/6FFFFA8 system usage:... > LOG: #### Recovery finished: ContRecAbort: 0/7000000 (EndRecPtr: 0/6FFFFE8) The record from 6FFFFE8 is missing the trailing part after 7000000. > LOG: #### EndOfLog=0/7000000 > LOG: #### set XLP_FIRST_IS_ABORT_PARTIAL@0/7000000 So, WAL insertion starts from 7000000 and the first page is set the flag. > LOG: database system is ready to accept connections > ... > LOG: database system is shut down > ... > ######################### > ... > LOG: redo starts at 0/2000028 > LOG: consistent recovery state reached at 0/2000100 > ... > LOG: restored log file "000000010000000000000007" from archive > #### aborted partial continuation record found at 0/6FFFFE8, continue from > 0/7000000 The record from 6FFFFE8 is immature so skip it and continue reading from 7000000. > LOG: last completed transaction was at log time 2021-09-01 20:40:21.775295+09 > LOG: #### Recovery finished: ContRecAbort: 0/0 (EndRecPtr: 0/8000000) > LOG: restored log file "000000010000000000000007" from archive > LOG: selected new timeline ID: 2 > LOG: archive recovery complete > LOG: #### EndOfLog=0/8000000 Recovery ends. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 24165ab03e..b0f18e4e5e 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -111,6 +111,7 @@ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; int max_slot_wal_keep_size_mb = -1; bool track_wal_io_timing = false; +bool contrec_aborted = false; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -586,6 +587,7 @@ typedef struct XLogCtlData XLogRecPtr replicationSlotMinLSN; /* oldest LSN needed by any slot */ XLogSegNo lastRemovedSegNo; /* latest removed/recycled XLOG segment */ + XLogRecPtr contAbortedRecPtr; /* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */ XLogRecPtr unloggedLSN; @@ -735,6 +737,10 @@ typedef struct XLogCtlData XLogSegNo latestSegBoundary; XLogRecPtr latestSegBoundaryEndPtr; + /* BEGIN: FOR DEBUGGING-CRASH USE*/ + bool crossseg; + /* END: DEBUGGING-CRASH USE*/ + slock_t segtrack_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -860,6 +866,7 @@ static XLogSource XLogReceiptSource = XLOG_FROM_ANY; /* State information for XLOG reading */ static XLogRecPtr ReadRecPtr; /* start of last record read */ static XLogRecPtr EndRecPtr; /* end+1 of last record read */ +static XLogRecPtr ContRecAbortPtr; /* end+1 of last aborted contrec */ /* * Local copies of equivalent fields in the control file. When running @@ -1178,16 +1185,10 @@ XLogInsertRecord(XLogRecData *rdata, XLByteToSeg(StartPos, StartSeg, wal_segment_size); XLByteToSeg(EndPos, EndSeg, wal_segment_size); - /* - * Register our crossing the segment boundary if that occurred. - * - * Note that we did not use XLByteToPrevSeg() for determining the - * ending segment. This is so that a record that fits perfectly into - * the end of the segment causes the latter to get marked ready for - * archival immediately. - */ - if (StartSeg != EndSeg && XLogArchivingActive()) - RegisterSegmentBoundary(EndSeg, EndPos); + /* BEGIN: FOR DEBUGGING-CRASH USE */ + if (StartSeg != EndSeg) + XLogCtl->crossseg = true; + /* END: FOR DEBUGGING-CRASH USE */ /* * Advance LogwrtRqst.Write so that it includes new block(s). @@ -2292,6 +2293,27 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic) if (!Insert->forcePageWrites) NewPage->xlp_info |= XLP_BKP_REMOVABLE; + /* + * If the last page ended with an aborted partial continuation record, + * mark it to tell the parital record is omittable. Snice this happens + * only at the end of crash recovery, no rece condition here. + */ + if (XLogCtl->contAbortedRecPtr >= NewPageBeginPtr) + { + if (XLogCtl->contAbortedRecPtr == NewPageBeginPtr) + { + NewPage->xlp_info |= XLP_FIRST_IS_ABORT_PARTIAL; + elog(LOG, "#### set XLP_FIRST_IS_ABORT_PARTIAL@%X/%X", + LSN_FORMAT_ARGS(NewPageBeginPtr)); + } + else + elog(LOG, "### incosistent abort location %X/%X, expected %X/%X", + LSN_FORMAT_ARGS(XLogCtl->contAbortedRecPtr), + LSN_FORMAT_ARGS(NewPageBeginPtr)); + + XLogCtl->contAbortedRecPtr = InvalidXLogRecPtr; + } + /* * If first page of an XLOG segment file, make it a long header. */ @@ -2644,6 +2666,17 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) { issue_xlog_fsync(openLogFile, openLogSegNo); + /* BEGIN: FOR DEBUGGING-CRASH USE */ + if (XLogCtl->crossseg) + { + static int c = 0; + struct stat b; + + if (stat("/tmp/hoge", &b) == 0) + Assert (c++ < 1); + } + /* END: FOR DEBUGGING-CRASH USE */ + /* signal that we need to wakeup walsenders later */ WalSndWakeupRequest(); @@ -4568,6 +4601,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, record = XLogReadRecord(xlogreader, &errormsg); ReadRecPtr = xlogreader->ReadRecPtr; EndRecPtr = xlogreader->EndRecPtr; + ContRecAbortPtr = xlogreader->ContRecAbortPtr; if (record == NULL) { if (readFile >= 0) @@ -7873,12 +7907,26 @@ StartupXLOG(void) StandbyMode = false; /* - * Re-fetch the last valid or last applied record, so we can identify the - * exact endpoint of what we consider the valid portion of WAL. + * The last record may be an immature continuation record at the end of a + * page. We continue writing from ContRecAbortPtr instead of EndRecPtr that + * case. */ - XLogBeginRead(xlogreader, LastRec); - record = ReadRecord(xlogreader, PANIC, false); - EndOfLog = EndRecPtr; + elog(LOG, "#### Recovery finished: ContRecAbort: %X/%X (EndRecPtr: %X/%X)", LSN_FORMAT_ARGS(ContRecAbortPtr), LSN_FORMAT_ARGS(EndRecPtr)); + if (XLogRecPtrIsInvalid(ContRecAbortPtr)) + { + /* + * Re-fetch the last valid or last applied record, so we can identify + * the exact endpoint of what we consider the valid portion of WAL. + */ + XLogBeginRead(xlogreader, LastRec); + record = ReadRecord(xlogreader, PANIC, false); + EndOfLog = EndRecPtr; + } + else + { + EndOfLog = ContRecAbortPtr; + XLogCtl->contAbortedRecPtr = ContRecAbortPtr; + } /* * EndOfLogTLI is the TLI in the filename of the XLOG segment containing @@ -8013,7 +8061,8 @@ StartupXLOG(void) Insert = &XLogCtl->Insert; Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec); Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog); - + elog(LOG, "#### EndOfLog=%X/%X", LSN_FORMAT_ARGS(EndOfLog)); + /* * Tricky point here: readBuf contains the *last* block that the LastRec * record spans, not the one it starts in. The last block is indeed the diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 5cf74e181a..404db7ce4d 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -294,6 +294,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) ResetDecoder(state); + state->ContRecAbortPtr = InvalidXLogRecPtr; RecPtr = state->EndRecPtr; if (state->ReadRecPtr != InvalidXLogRecPtr) @@ -319,6 +320,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) randAccess = true; } +retry: state->currRecPtr = RecPtr; targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ); @@ -444,12 +446,27 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) XLOG_BLCKSZ)); if (readOff < 0) - goto err; + goto err_partial_contrec; Assert(SizeOfXLogShortPHD <= readOff); /* Check that the continuation on next page looks valid */ pageHeader = (XLogPageHeader) state->readBuf; + if (pageHeader->xlp_info & XLP_FIRST_IS_ABORT_PARTIAL) + { + if (pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD) + { + report_invalid_record(state, + "both XLP_FIRST_IS_CONTRECORD and XLP_FIRST_IS_ABORT_PARTIAL are set at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + goto err; + } + + fprintf(stderr, "#### aborted partial continuation record found at %X/%X, continue from %X/%X\n", LSN_FORMAT_ARGS(RecPtr), LSN_FORMAT_ARGS(targetPagePtr)); + ResetDecoder(state); + RecPtr = targetPagePtr; + goto retry; + } if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD)) { report_invalid_record(state, @@ -550,6 +567,10 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) else return NULL; +err_partial_contrec: + state->ContRecAbortPtr = targetPagePtr; + fprintf(stderr, "contrec aborted@%X/%X\n", LSN_FORMAT_ARGS(state->ContRecAbortPtr)); + err: /* diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h index 3b5eceff65..6390812a5a 100644 --- a/src/include/access/xlog_internal.h +++ b/src/include/access/xlog_internal.h @@ -76,8 +76,10 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader; #define XLP_LONG_HEADER 0x0002 /* This flag indicates backup blocks starting in this page are optional */ #define XLP_BKP_REMOVABLE 0x0004 +/* This flag indicates the first record in this page breaks a contrecord */ +#define XLP_FIRST_IS_ABORT_PARTIAL 0x0008 /* All defined flag bits in xlp_info (used for validity checking of header) */ -#define XLP_ALL_FLAGS 0x0007 +#define XLP_ALL_FLAGS 0x000F #define XLogPageHeaderSize(hdr) \ (((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD) diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 21d200d3df..00a03a628c 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -175,6 +175,8 @@ struct XLogReaderState XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + XLogRecPtr ContRecAbortPtr; /* end+1 of aborted partial contrecord if + * any */ /* ---------------------------------------- * Decoded representation of current record diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h index e3f48158ce..26fc123cdb 100644 --- a/src/include/catalog/pg_control.h +++ b/src/include/catalog/pg_control.h @@ -76,6 +76,7 @@ typedef struct CheckPoint #define XLOG_END_OF_RECOVERY 0x90 #define XLOG_FPI_FOR_HINT 0xA0 #define XLOG_FPI 0xB0 +#define XLOG_ABORT_CONTRECORD x0C0 /*
PWD=`pwd` DATA=data BKUP=bkup ARCH=$PWD/arch rm -rf arch mkdir arch rm -rf $DATA initdb -D $DATA echo "restart_after_crash = off" >> $DATA/postgresql.conf echo "archive_mode=on" >> $DATA/postgresql.conf echo "archive_command='cp %p ${ARCH}/%f'" >> $DATA/postgresql.conf echo "restart_after_crash = off" >> $DATA/postgresql.conf rm /tmp/hoge pg_ctl -D $DATA start rm -rf $BKUP pg_basebackup -D $BKUP -h /tmp echo "archive_mode=off" >> $BKUP/postgresql.conf echo "restore_command='cp ${ARCH}/%f %p'" >> $BKUP/postgresql.conf touch bkup/recovery.signal psql -c 'create table t(a int); insert into t (select a from generate_series(0, 600000) a)' touch /tmp/hoge psql -c 'insert into t (select a from generate_series(0, 600000) a)' rm /tmp/hoge sleep 5 pg_ctl -D $DATA -w start psql -c 'checkpoint' pg_ctl -D $DATA -w stop echo "#########################" pg_ctl -D $BKUP -w start sleep 10 pg_ctl -D $BKUP -w stop