At Wed, 1 Sep 2021 15:01:43 +0900, Fujii Masao <masao.fu...@oss.nttdata.com> 
wrote in 
> 
> 
> On 2021/09/01 12:15, Andres Freund wrote:
> > Hi,
> > On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> >> On 2021/09/01 0:53, Andres Freund wrote:
> >>> Of course, we need to be careful to not weaken WAL validity checking
> >>> too
> >>> much. How about the following:
> >>>
> >>> If we're "aborting" a continued record, we set
> >>> XLP_FIRST_IS_ABORTED_PARTIAL on
> >>> the page at which we do so (i.e. the page after the valid end of the
> >>> WAL).
> >>
> >> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set? It's set
> >> when recovery finds a a partially-flushed segment-spanning record?
> >> But maybe we cannot do that (i.e., cannot overwrite the page) because
> >> the page that the flag is set in might have already been archived. No?
> > I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the
> > "tail
> > end" of a partial record. I.e. if there's a partial record starting in
> > the
> > successfully archived segment A, but the end of the record, in B, has
> > not been
> > written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL
> > at the
> > end of the valid data in B. Which could not have been archived yet, or
> > we'd
> > not have a partial record.  So we should never need to set the flag on
> > an
> > already archived page.
> 
> Thanks for clarifying that! Unless I misunderstand that, when recovery
> ends
> at a partially-flushed segment-spanning record, it sets
> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting
> writing
> new WAL data there. Therefore XLP_FIRST_IS_CONTRECORD or
> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the next segment following
> a partially-flushed segment-spanning record. If none of them is set,
> the validation code in recovery should report an error.
> 
> Yes, this design looks good to me.

So, this is a crude PoC of that.

At the end of recovery:

- When XLogReadRecord misses a page where the next part of the current
  continuation record should be seen, xlogreader->ContRecAbortPtr is
  set to the beginning of the missing page.

- When StartupXLOG receives a valid ContRecAbortPtr, the value is used
  as the next WAL insertion location then sets the same to
  XLogCtl->contAbortedRecPtr.

- When XLogCtl->contAbortedRecPtr is set, AdvanceXLInsertBuffer()
  (called under XLogInsertRecord()) sets XLP_FIRST_IS_ABORTED_PARTIAL
  flag to the page.

While recovery:
- When XLogReadRecord meets a XLP_FIRST_IS_ABORT_PARTIAL page, it
  rereads a record from there.

In this PoC,

1. This patch is written on the current master, but it doesn't
  interfare with the seg-boundary-memorize patch since it removes the
  calls to RegisterSegmentBoundary.

2. Since xlogreader cannot emit a log-message immediately, we don't
  have a means to leave a log message to inform recovery met an
  aborted partial continuation record. (In this PoC, it is done by
  fprintf:p)

3. Myebe we need to pg_waldump to show partial continuation records,
  but I'm not sure how to realize that.

4. By the way, is this (method) applicable in this stage?


The attached first is the PoC including debug-crash aid.  The second
is the repro script.  It failes to reproduce the situation once per
several trials.

The following log messages are shown by a run of the script.

> ...
> TRAP: FailedAssertion("c++ < 1", File: "xlog.c", Line: 2675, PID: 254644)
> ...
> LOG:  database system is shut down
> ...
> 
> LOG:  redo starts at 0/2000028
> LOG:  redo done at 0/6FFFFA8 system usage:...
> LOG:  #### Recovery finished: ContRecAbort: 0/7000000 (EndRecPtr: 0/6FFFFE8)

The record from 6FFFFE8 is missing the trailing part after 7000000.

> LOG:  #### EndOfLog=0/7000000
> LOG:  #### set XLP_FIRST_IS_ABORT_PARTIAL@0/7000000

So, WAL insertion starts from 7000000 and the first page is set the flag.

> LOG:  database system is ready to accept connections
> ...
> LOG:  database system is shut down
> ...
> #########################
> ...
> LOG:  redo starts at 0/2000028
> LOG:  consistent recovery state reached at 0/2000100
> ...
> LOG:  restored log file "000000010000000000000007" from archive
> #### aborted partial continuation record found at 0/6FFFFE8, continue from 
> 0/7000000

The record from 6FFFFE8 is immature so skip it and continue reading
from 7000000.

> LOG:  last completed transaction was at log time 2021-09-01 20:40:21.775295+09
> LOG:  #### Recovery finished: ContRecAbort: 0/0 (EndRecPtr: 0/8000000)
> LOG:  restored log file "000000010000000000000007" from archive
> LOG:  selected new timeline ID: 2
> LOG:  archive recovery complete
> LOG:  #### EndOfLog=0/8000000

Recovery ends.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 24165ab03e..b0f18e4e5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -111,6 +111,7 @@ int                 CommitSiblings = 5; /* # concurrent 
xacts needed to sleep */
 int                    wal_retrieve_retry_interval = 5000;
 int                    max_slot_wal_keep_size_mb = -1;
 bool           track_wal_io_timing = false;
+bool           contrec_aborted = false;
 
 #ifdef WAL_DEBUG
 bool           XLOG_DEBUG = false;
@@ -586,6 +587,7 @@ typedef struct XLogCtlData
        XLogRecPtr      replicationSlotMinLSN;  /* oldest LSN needed by any 
slot */
 
        XLogSegNo       lastRemovedSegNo;       /* latest removed/recycled XLOG 
segment */
+       XLogRecPtr      contAbortedRecPtr;
 
        /* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */
        XLogRecPtr      unloggedLSN;
@@ -735,6 +737,10 @@ typedef struct XLogCtlData
        XLogSegNo       latestSegBoundary;
        XLogRecPtr      latestSegBoundaryEndPtr;
 
+       /* BEGIN: FOR DEBUGGING-CRASH USE*/
+       bool            crossseg;
+       /* END: DEBUGGING-CRASH USE*/
+
        slock_t         segtrack_lck;   /* locks shared variables shown above */
 } XLogCtlData;
 
@@ -860,6 +866,7 @@ static XLogSource XLogReceiptSource = XLOG_FROM_ANY;
 /* State information for XLOG reading */
 static XLogRecPtr ReadRecPtr;  /* start of last record read */
 static XLogRecPtr EndRecPtr;   /* end+1 of last record read */
+static XLogRecPtr ContRecAbortPtr;     /* end+1 of last aborted contrec */
 
 /*
  * Local copies of equivalent fields in the control file.  When running
@@ -1178,16 +1185,10 @@ XLogInsertRecord(XLogRecData *rdata,
                XLByteToSeg(StartPos, StartSeg, wal_segment_size);
                XLByteToSeg(EndPos, EndSeg, wal_segment_size);
 
-               /*
-                * Register our crossing the segment boundary if that occurred.
-                *
-                * Note that we did not use XLByteToPrevSeg() for determining 
the
-                * ending segment.  This is so that a record that fits 
perfectly into
-                * the end of the segment causes the latter to get marked ready 
for
-                * archival immediately.
-                */
-               if (StartSeg != EndSeg && XLogArchivingActive())
-                       RegisterSegmentBoundary(EndSeg, EndPos);
+               /* BEGIN: FOR DEBUGGING-CRASH USE */
+               if (StartSeg != EndSeg)
+                       XLogCtl->crossseg = true;
+               /* END: FOR DEBUGGING-CRASH USE */
 
                /*
                 * Advance LogwrtRqst.Write so that it includes new block(s).
@@ -2292,6 +2293,27 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool 
opportunistic)
                if (!Insert->forcePageWrites)
                        NewPage->xlp_info |= XLP_BKP_REMOVABLE;
 
+               /*
+                * If the last page ended with an aborted partial continuation 
record,
+                * mark it to tell the parital record is omittable. Snice this 
happens
+                * only at the end of crash recovery, no rece condition here.
+                */
+               if (XLogCtl->contAbortedRecPtr >= NewPageBeginPtr)
+               {
+                       if (XLogCtl->contAbortedRecPtr == NewPageBeginPtr)
+                       {
+                               NewPage->xlp_info |= XLP_FIRST_IS_ABORT_PARTIAL;
+                               elog(LOG, "#### set 
XLP_FIRST_IS_ABORT_PARTIAL@%X/%X",
+                                        LSN_FORMAT_ARGS(NewPageBeginPtr));
+                       }
+                       else
+                               elog(LOG, "### incosistent abort location 
%X/%X, expected %X/%X",
+                                        
LSN_FORMAT_ARGS(XLogCtl->contAbortedRecPtr),
+                                        LSN_FORMAT_ARGS(NewPageBeginPtr));
+                                        
+                       XLogCtl->contAbortedRecPtr = InvalidXLogRecPtr;
+               }
+                       
                /*
                 * If first page of an XLOG segment file, make it a long header.
                 */
@@ -2644,6 +2666,17 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
                        {
                                issue_xlog_fsync(openLogFile, openLogSegNo);
 
+                               /* BEGIN: FOR DEBUGGING-CRASH USE */
+                               if (XLogCtl->crossseg)
+                               {
+                                       static int c = 0;
+                                       struct stat b;
+
+                                       if (stat("/tmp/hoge", &b) == 0)
+                                               Assert (c++ < 1);
+                               }
+                               /* END: FOR DEBUGGING-CRASH USE */
+
                                /* signal that we need to wakeup walsenders 
later */
                                WalSndWakeupRequest();
 
@@ -4568,6 +4601,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                record = XLogReadRecord(xlogreader, &errormsg);
                ReadRecPtr = xlogreader->ReadRecPtr;
                EndRecPtr = xlogreader->EndRecPtr;
+               ContRecAbortPtr = xlogreader->ContRecAbortPtr;
                if (record == NULL)
                {
                        if (readFile >= 0)
@@ -7873,12 +7907,26 @@ StartupXLOG(void)
        StandbyMode = false;
 
        /*
-        * Re-fetch the last valid or last applied record, so we can identify 
the
-        * exact endpoint of what we consider the valid portion of WAL.
+        * The last record may be an immature continuation record at the end of 
a
+        * page. We continue writing from ContRecAbortPtr instead of EndRecPtr 
that
+        * case.
         */
-       XLogBeginRead(xlogreader, LastRec);
-       record = ReadRecord(xlogreader, PANIC, false);
-       EndOfLog = EndRecPtr;
+       elog(LOG, "#### Recovery finished: ContRecAbort: %X/%X (EndRecPtr: 
%X/%X)", LSN_FORMAT_ARGS(ContRecAbortPtr), LSN_FORMAT_ARGS(EndRecPtr));
+       if (XLogRecPtrIsInvalid(ContRecAbortPtr))
+       {
+               /*
+                * Re-fetch the last valid or last applied record, so we can 
identify
+                * the exact endpoint of what we consider the valid portion of 
WAL.
+                */
+               XLogBeginRead(xlogreader, LastRec);
+               record = ReadRecord(xlogreader, PANIC, false);
+               EndOfLog = EndRecPtr;
+       }
+       else
+       {
+               EndOfLog = ContRecAbortPtr;
+               XLogCtl->contAbortedRecPtr = ContRecAbortPtr;
+       }
 
        /*
         * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
@@ -8013,7 +8061,8 @@ StartupXLOG(void)
        Insert = &XLogCtl->Insert;
        Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
        Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
-
+       elog(LOG, "#### EndOfLog=%X/%X", LSN_FORMAT_ARGS(EndOfLog));
+       
        /*
         * Tricky point here: readBuf contains the *last* block that the LastRec
         * record spans, not the one it starts in.  The last block is indeed the
diff --git a/src/backend/access/transam/xlogreader.c 
b/src/backend/access/transam/xlogreader.c
index 5cf74e181a..404db7ce4d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -294,6 +294,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
        ResetDecoder(state);
 
+       state->ContRecAbortPtr = InvalidXLogRecPtr;
        RecPtr = state->EndRecPtr;
 
        if (state->ReadRecPtr != InvalidXLogRecPtr)
@@ -319,6 +320,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
                randAccess = true;
        }
 
+retry:
        state->currRecPtr = RecPtr;
 
        targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
@@ -444,12 +446,27 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
                                                                                
   XLOG_BLCKSZ));
 
                        if (readOff < 0)
-                               goto err;
+                               goto err_partial_contrec;
 
                        Assert(SizeOfXLogShortPHD <= readOff);
 
                        /* Check that the continuation on next page looks valid 
*/
                        pageHeader = (XLogPageHeader) state->readBuf;
+                       if (pageHeader->xlp_info & XLP_FIRST_IS_ABORT_PARTIAL)
+                       {
+                               if (pageHeader->xlp_info & 
XLP_FIRST_IS_CONTRECORD)
+                               {
+                                       report_invalid_record(state,
+                                                                               
  "both XLP_FIRST_IS_CONTRECORD and XLP_FIRST_IS_ABORT_PARTIAL are set at 
%X/%X",
+                                                                               
  LSN_FORMAT_ARGS(RecPtr));
+                                       goto err;
+                               }
+                                       
+                               fprintf(stderr, "#### aborted partial 
continuation record found at %X/%X, continue from %X/%X\n", 
LSN_FORMAT_ARGS(RecPtr), LSN_FORMAT_ARGS(targetPagePtr));
+                               ResetDecoder(state);
+                               RecPtr = targetPagePtr;
+                               goto retry;
+                       }                               
                        if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
                        {
                                report_invalid_record(state,
@@ -550,6 +567,10 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
        else
                return NULL;
 
+err_partial_contrec:
+       state->ContRecAbortPtr = targetPagePtr;
+               fprintf(stderr, "contrec aborted@%X/%X\n", 
LSN_FORMAT_ARGS(state->ContRecAbortPtr));
+       
 err:
 
        /*
diff --git a/src/include/access/xlog_internal.h 
b/src/include/access/xlog_internal.h
index 3b5eceff65..6390812a5a 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -76,8 +76,10 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define XLP_LONG_HEADER                                0x0002
 /* This flag indicates backup blocks starting in this page are optional */
 #define XLP_BKP_REMOVABLE                      0x0004
+/* This flag indicates the first record in this page breaks a contrecord */
+#define XLP_FIRST_IS_ABORT_PARTIAL     0x0008
 /* All defined flag bits in xlp_info (used for validity checking of header) */
-#define XLP_ALL_FLAGS                          0x0007
+#define XLP_ALL_FLAGS                          0x000F
 
 #define XLogPageHeaderSize(hdr)                \
        (((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : 
SizeOfXLogShortPHD)
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 21d200d3df..00a03a628c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -175,6 +175,8 @@ struct XLogReaderState
        XLogRecPtr      ReadRecPtr;             /* start of last record read */
        XLogRecPtr      EndRecPtr;              /* end+1 of last record read */
 
+       XLogRecPtr      ContRecAbortPtr; /* end+1 of aborted partial contrecord 
if
+                                                                 * any */
 
        /* ----------------------------------------
         * Decoded representation of current record
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce..26fc123cdb 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -76,6 +76,7 @@ typedef struct CheckPoint
 #define XLOG_END_OF_RECOVERY                   0x90
 #define XLOG_FPI_FOR_HINT                              0xA0
 #define XLOG_FPI                                               0xB0
+#define XLOG_ABORT_CONTRECORD                  x0C0
 
 
 /*
PWD=`pwd`
DATA=data
BKUP=bkup
ARCH=$PWD/arch
rm -rf arch
mkdir arch
rm -rf $DATA
initdb -D $DATA
echo "restart_after_crash = off" >> $DATA/postgresql.conf
echo "archive_mode=on" >> $DATA/postgresql.conf
echo "archive_command='cp %p ${ARCH}/%f'" >> $DATA/postgresql.conf
echo "restart_after_crash = off" >> $DATA/postgresql.conf
rm /tmp/hoge
pg_ctl -D $DATA start
rm -rf $BKUP
pg_basebackup -D $BKUP -h /tmp
echo "archive_mode=off" >> $BKUP/postgresql.conf
echo "restore_command='cp ${ARCH}/%f %p'" >> $BKUP/postgresql.conf
touch bkup/recovery.signal

psql -c 'create table t(a int); insert into t (select a from generate_series(0, 
600000) a)'
touch /tmp/hoge
psql -c 'insert into t (select a from generate_series(0, 600000) a)'
rm /tmp/hoge
sleep 5
pg_ctl -D $DATA -w start
psql -c 'checkpoint'
pg_ctl -D $DATA -w stop
echo "#########################"
pg_ctl -D $BKUP -w start
sleep 10
pg_ctl -D $BKUP -w stop




Reply via email to