Thank you for the comments! (Sorry for the late response.)

At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas <robertmh...@gmail.com> wrote in
> On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
> <horikyota....@gmail.com> wrote:
> > The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
> > used each other, but MarkBufferDirtyHint which delays checkpoint start
> > is called in RelationTruncate while delaying checkpoint completion.
> > That is not a strange nor harmful behavior. I changed delayChkpt to a
> > bitmap integer from an enum so that both barrier are separately
> > triggered.
> >
> > I'm not sure this is the way to go here, though. This fixes the issue
> > of a crash during RelationTruncate, but the issue of smgrtruncate
> > failure during RelationTruncate still remains (unless we treat that
> > failure as PANIC?).
>
> I like this patch. As I understand it, we're currently cheating by
> allowing checkpoints to complete without necessarily flushing all of
> the pages that were dirty at the time we fixed the redo pointer out to
> disk. We think this is OK because we know that those pages are going
> to get truncated away, but it's not really OK because when the system
> starts up, it has to replay WAL starting from the checkpoint's redo
> pointer, but the state of the page is not the same as it was at the
> time when the redo pointer was the end of WAL, so redo fails. In the
> case described in
> http://postgr.es/m/byapr06mb63739b2692dc6dbb3c5f186cab...@byapr06mb6373.namprd06.prod.outlook.com
> modifications are made to the page before the redo pointer is fixed
> and those changes never make it to disk, but the truncation also never
> makes it to the disk either. With this patch, that can't happen,
> because no checkpoint can intervene between when we (1) decide we're
> not going to bother writing those dirty pages and (2) actually
> truncate them away. So either the pages will get written as part of
> the checkpoint, or else they'll be gone before the checkpoint
> completes. In the latter case, I suppose redo that would have modified
> those pages will just be skipped, thus dodging the problem.

I think your understanding is right.

> In RelationTruncate, I suggest that we ought to clear the
> delay-checkpoint flag before rather than after calling
> FreeSpaceMapVacuumRange. Since the free space map is not fully
> WAL-logged, anything we're doing there should be non-critical. Also, I

Agreed and fixed.

> think it might be better if MarkBufferDirtyHint stays closer to the
> existing coding and just uses a Boolean and an if-test to decide
> whether to clear the bit, instead of inventing a new mechanism. I
> don't really see anything wrong with the new mechanism, but I think
> it's better to keep the patch minimal.

Yeah, that was kind of silly. Fixed.

> As you say, this doesn't fix the problem that truncation might fail.
> But as Andres and Sawada-san said, the solution to that is to get rid
> of the comments saying that it's OK for truncation to fail and make it
> a PANIC. However, I don't think that change needs to be part of this
> patch. Even if we do that, we still need to do this. And even if we do
> this, we still need to do that.

OK.

In addition to the above, I rewrote the comment in RelationTruncate.

+	 * Delay the concurrent checkpoint's completion until this truncation
+	 * successfully completes, so that we don't establish a redo-point between
+	 * buffer deletion and file-truncate. Otherwise we can leave inconsistent
+	 * file content against the WAL records after the REDO position and future
+	 * recovery fails.

However, one problem for now is that I have not been able to reproduce
the issue.

To avoid further confusion, the attached file is named *.patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

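
For illustration only (this is not part of the attached patch): a minimal
standalone sketch of why delayChkpt works better as a bitmap than as a
boolean when a start-delay is nested inside a completion-delay, as
happens when MarkBufferDirtyHint is called during RelationTruncate. The
two DELAY_CHKPT_* values are the ones the patch defines; ToyProc,
toy_begin_delay, toy_end_delay and toy_any_delaying are hypothetical
stand-ins for PGPROC and the procarray scan, and nothing here touches
shared memory or locks.

```c
#include <assert.h>
#include <stdbool.h>

/* Same bit values as in the patch's proc.h hunk. */
#define DELAY_CHKPT_START		(1 << 0)
#define DELAY_CHKPT_COMPLETE	(1 << 1)

/* Hypothetical stand-in for PGPROC: only the field of interest. */
typedef struct ToyProc
{
	int			delayChkpt;		/* bitmap of DELAY_CHKPT_* */
} ToyProc;

/* Hypothetical helper: set one kind of delay; it must not be set yet. */
static void
toy_begin_delay(ToyProc *proc, int type)
{
	assert((proc->delayChkpt & type) == 0);
	proc->delayChkpt |= type;
}

/* Hypothetical helper: clear one kind of delay, leaving the other alone. */
static void
toy_end_delay(ToyProc *proc, int type)
{
	proc->delayChkpt &= ~type;
}

/*
 * Hypothetical analogue of the filter added to
 * GetVirtualXIDsDelayingChkpt(): is any proc holding the given delay?
 */
static bool
toy_any_delaying(const ToyProc *procs, int nprocs, int type)
{
	assert(type != 0);
	for (int i = 0; i < nprocs; i++)
	{
		if ((procs[i].delayChkpt & type) != 0)
			return true;
	}
	return false;
}

/*
 * Walk through the nested case: a truncation holds COMPLETE while a
 * hint-bit update briefly takes and releases START.  Returns true if
 * the COMPLETE delay survives the inner release.
 */
static bool
toy_nested_delay_survives(void)
{
	ToyProc		procs[1] = {{0}};
	bool		survived;

	toy_begin_delay(&procs[0], DELAY_CHKPT_COMPLETE);	/* truncation */
	toy_begin_delay(&procs[0], DELAY_CHKPT_START);		/* hint-bit update */
	toy_end_delay(&procs[0], DELAY_CHKPT_START);		/* hint-bit done */

	survived = toy_any_delaying(procs, 1, DELAY_CHKPT_COMPLETE) &&
		!toy_any_delaying(procs, 1, DELAY_CHKPT_START);

	toy_end_delay(&procs[0], DELAY_CHKPT_COMPLETE);		/* truncation done */
	return survived;
}
```

With a single boolean, the inner release would also have dropped the
outer delay, so a checkpoint could complete in the middle of the
truncation. With separate bits the checkpointer can wait on each kind
independently.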
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e6c70ed0bc..17357179e3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3075,8 +3075,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	 * crash/basebackup, even though the state of the data directory would
 	 * require it.
 	 */
-	Assert(!MyProc->delayChkpt);
-	MyProc->delayChkpt = true;
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 	/* WAL log truncation */
 	WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3102,7 +3102,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	/* Then offsets */
 	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
 	END_CRIT_SECTION();
 	LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 2156de187c..b7dc84d6e3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -463,7 +463,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->lxid = (LocalTransactionId) xid;
 	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
-	proc->delayChkpt = false;
+	proc->delayChkpt = 0;
 	proc->statusFlags = 0;
 	proc->pid = 0;
 	proc->backendId = InvalidBackendId;
@@ -1109,7 +1109,8 @@ EndPrepare(GlobalTransaction gxact)
 
 	START_CRIT_SECTION();
 
-	MyProc->delayChkpt = true;
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 	XLogBeginInsert();
 	for (record = records.head; record != NULL; record = record->next)
@@ -1152,7 +1153,7 @@ EndPrepare(GlobalTransaction gxact)
 	 * checkpoint starting after this will certainly see the gxact as a
 	 * candidate for fsyncing.
 	 */
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
 	/*
 	 * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2215,7 +2216,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
-	MyProc->delayChkpt = true;
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 	/*
 	 * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2263,7 +2265,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	TransactionIdCommitTree(xid, nchildren, children);
 
 	/* Checkpoint can proceed now */
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6597ec45a9..4a1a0c3c1f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1334,8 +1334,9 @@ RecordTransactionCommit(void)
 		 * This makes checkpoint's determination of which xacts are delayChkpt
 		 * a bit fuzzy, but it doesn't matter.
 		 */
+		Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
 		START_CRIT_SECTION();
-		MyProc->delayChkpt = true;
+		MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 		SetCurrentTransactionStopTimestamp();
 
@@ -1436,7 +1437,7 @@ RecordTransactionCommit(void)
 	 */
 	if (markXidCommitted)
 	{
-		MyProc->delayChkpt = false;
+		MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 		END_CRIT_SECTION();
 	}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..a4d564323a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9153,18 +9153,30 @@ CreateCheckPoint(int flags)
 	 * and we will correctly flush the update below.  So we cannot miss any
 	 * xacts we need to wait for.
 	 */
-	vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+	vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
 	if (nvxids > 0)
 	{
 		do
 		{
 			pg_usleep(10000L);	/* wait for 10 msec */
-		} while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+		} while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+											  DELAY_CHKPT_START));
 	}
 	pfree(vxids);
 
 	CheckPointGuts(checkPoint.redo, flags);
 
+	vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+	if (0 && nvxids > 0)
+	{
+		do
+		{
+			pg_usleep(10000L);	/* wait for 10 msec */
+		} while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+											  DELAY_CHKPT_COMPLETE));
+	}
+	pfree(vxids);
+
 	/*
 	 * Take a snapshot of running transactions and write this to WAL. This
 	 * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b492c656d7..f7a1f981d5 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -978,7 +978,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
 	/*
 	 * Ensure no checkpoint can change our view of RedoRecPtr.
 	 */
-	Assert(MyProc->delayChkpt);
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) != 0);
 
 	/*
 	 * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..be9c0e107f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -325,6 +325,16 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 
 	RelationPreTruncate(rel);
 
+	/*
+	 * Delay the concurrent checkpoint's completion until this truncation
+	 * successfully completes, so that we don't establish a redo-point between
+	 * buffer deletion and file-truncate. Otherwise we can leave inconsistent
+	 * file content against the WAL records after the REDO position and future
+	 * recovery fails.
+	 */
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
 	/*
 	 * We WAL-log the truncation before actually truncating, which means
 	 * trouble if the truncation fails. If we then crash, the WAL replay
@@ -366,6 +376,10 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 
 	/* Do the real work to truncate relation forks */
 	smgrtruncate(RelationGetSmgr(rel), forks, nforks, blocks);
+
+	/* FSM is not WAL-logged, finish the critical section here. */
+	MyProc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
+
 	/*
 	 * Update upper-level FSM pages to account for the truncation.  This is
 	 * important because the just-truncated pages were likely marked as
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b..c277dc3e1e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3921,7 +3921,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 			 * essential that CreateCheckpoint waits for virtual transactions
 			 * rather than full transactionids.
 			 */
-			MyProc->delayChkpt = delayChkpt = true;
+			Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+			MyProc->delayChkpt |= DELAY_CHKPT_START;
+			delayChkpt = true;
 			lsn = XLogSaveBufferForHint(buffer, buffer_std);
 		}
 
@@ -3954,7 +3956,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		UnlockBufHdr(bufHdr, buf_state);
 
 		if (delayChkpt)
-			MyProc->delayChkpt = false;
+			MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
 		if (dirtied)
 		{
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bd3c7a47fe..1bc4ea15e9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -689,7 +689,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 
 		proc->lxid = InvalidLocalTransactionId;
 		proc->xmin = InvalidTransactionId;
-		proc->delayChkpt = false;	/* be sure this is cleared in abort */
+
+		/* be sure this is cleared in abort */
+		proc->delayChkpt = 0;
+
 		proc->recoveryConflictPending = false;
 
 		/* must be cleared with xid/xmin: */
@@ -728,7 +731,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 	proc->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
 	proc->xmin = InvalidTransactionId;
-	proc->delayChkpt = false;	/* be sure this is cleared in abort */
+
+	/* be sure this is cleared in abort */
+	proc->delayChkpt = 0;
+
 	proc->recoveryConflictPending = false;
 
 	/* must be cleared with xid/xmin: */
@@ -3026,7 +3032,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGPROC.
+ * critical sections, as shown by having delayChkpt set to the specified value
+ * in their PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -3040,13 +3047,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
 	VirtualTransactionId *vxids;
 	ProcArrayStruct *arrayP = procArray;
 	int			count = 0;
 	int			index;
 
+	Assert(type != 0);
+
 	/* allocate what's certainly enough result space */
 	vxids = (VirtualTransactionId *)
 		palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -3058,7 +3067,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
 
-		if (proc->delayChkpt)
+		if ((proc->delayChkpt & type) != 0)
 		{
 			VirtualTransactionId vxid;
 
@@ -3084,12 +3093,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
 	bool		result = false;
 	ProcArrayStruct *arrayP = procArray;
 	int			index;
 
+	Assert(type != 0);
+
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	for (index = 0; index < arrayP->numProcs; index++)
@@ -3100,7 +3111,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
 
 		GET_VXID_FROM_PGPROC(vxid, *proc);
 
-		if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
+		if ((proc->delayChkpt & type) != 0 &&
+			VirtualTransactionIdIsValid(vxid))
 		{
 			int			i;
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b7d9da0aa9..95fdf990e7 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -394,7 +394,7 @@ InitProcess(void)
 	MyProc->roleId = InvalidOid;
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt = 0;
 	MyProc->statusFlags = 0;
 	/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
 	if (IsAutoVacuumWorkerProcess())
@@ -579,7 +579,7 @@ InitAuxiliaryProcess(void)
 	MyProc->roleId = InvalidOid;
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt = 0;
 	MyProc->statusFlags = 0;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index be67d8a861..b9be2454c5 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -79,6 +79,10 @@ struct XidCache
  */
 #define INVALID_PGPROCNO		PG_INT32_MAX
 
+/* symbols for PGPROC.delayChkpt */
+#define DELAY_CHKPT_START		(1<<0)
+#define DELAY_CHKPT_COMPLETE	(1<<1)
+
 typedef enum
 {
 	PROC_WAIT_STATUS_OK,
@@ -184,7 +188,8 @@ struct PGPROC
 	pg_atomic_uint64 waitStart; /* time at which wait for lock acquisition
 								 * started */
 
-	bool		delayChkpt;		/* true if this proc delays checkpoint start */
+	int			delayChkpt;		/* if this proc delays checkpoint start and/or
+								 * completion. */
 
 	uint8		statusFlags;	/* this backend's status flags, see PROC_*
 								 * above. mirrored in
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index b01fa52139..ec40130466 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -15,11 +15,11 @@
 #define PROCARRAY_H
 
 #include "storage/lock.h"
+#include "storage/proc.h"
 #include "storage/standby.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
 
-
 extern Size ProcArrayShmemSize(void);
 extern void CreateSharedProcArray(void);
 extern void ProcArrayAdd(PGPROC *proc);
@@ -59,8 +59,9 @@ extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);
 
-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+										 int nvxids, int type);
 
 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);