Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

Michael Paquier Wed, 03 Feb 2016 06:28:57 -0800

On Tue, Feb 2, 2016 at 1:42 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Sat, Jan 30, 2016 at 7:38 PM, Michael Paquier <michael.paqu...@gmail.com>
> wrote:
>>
>> On Fri, Jan 29, 2016 at 9:13 PM, Michael Paquier wrote:
>> > Well, to put it short, I am just trying to find a way to make the
>> > backend skip unnecessary checkpoints on an idle system, which results
>> > in the following WAL pattern if system is completely idle:
>> > CHECKPOINT_ONLINE
>> > RUNNING_XACTS
>> > RUNNING_XACTS
>> > [etc..]
>> >
>> > The thing is that I am lost with the meaning of this condition to
>> > decide if a checkpoint should be skipped or not:
>> >         if (prevPtr == ControlFile->checkPointCopy.redo &&
>> >             prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
>> >         {
>> >             WALInsertLockRelease();
>> >             LWLockRelease(CheckpointLock);
>> >             return;
>> >         }
>> > As at least one standby snapshot is logged before the checkpoint
>> > record, the redo position is never going to match the previous insert
>> > LSN, so checkpoints will never be skipped if wal_level >= hot_standby.
>> > Skipping such unnecessary checkpoints is what you would like to
>> > address, no? Because that's what I would like to do here first. And
>> > once we got that right, we could think about addressing the case where
>> > WAL segments are forcibly archived for idle systems.
>>
>> I have put a bit more of brain power into that, and finished with the
>> patch attached. A new field called chkpProgressLSN is attached to
>> XLogCtl, which is to the current insert position of the checkpoint
>> when wal_level <= archive, or the LSN position of the standby snapshot
>> taken after a checkpoint. The bgwriter code is modified as well so as
>> it uses this progress LSN and compares it with the current insert LSN
>> to determine if a standby snapshot should be logged or not. On an
>> instance of Postgres completely idle, no useless checkpoints, and no
>> useless standby snapshots are generated anymore.
>>
>
>
> @@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags)
>   if ((flags & (CHECKPOINT_IS_SHUTDOWN |
> CHECKPOINT_END_OF_RECOVERY |
>    CHECKPOINT_FORCE)) == 0)
>   {
> - if
> (prevPtr == ControlFile->checkPointCopy.redo &&
> + if (GetProgressRecPtr() == prevPtr &&
>
> prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
>   {
>
> I think such a check won't consider all the WAL-activity happened
> during checkpoint (after checkpoint start, till it finishes) which was
> not the intention of this code without your patch.


Yes, that's true, in v2 this progress LSN can be updated in two ways:
- At the position where checkpoint has begun
- At the position where standby snapshot was taken after a checkpoint has run.
Honestly, even if we log the snapshot for example before
CheckpointGuts(), that's not going to make it... The main issue here
is that there will be a snapshot after this checkpoint, so the
existing condition is a rather broken concept already when wal_level
>= hot_standby because the redo pointer is never going to match the
previous LSN pointer. Another idea would be to ensure that the
snapshot is logged just after the redo pointer, this would be
reliable.

> I think both this and previous patch (hs-checkpoints-v1) approach
> won't fix the issue in all kind of scenario's.
>
> Let me try to explain what I think should fix this issue based on
> discussion above, suggestions by Andres and some of my own
> thoughts:
>
> Have a new variable in WALInsertLock that would indicate the last
> insertion position (last_insert_pos) we write to after holding that lock.
> Ensure that we don't update last_insert_pos while logging standby
> snapshot (simple way is to pass a flag in XLogInsert or distinguish
> it based on type of record (XLOG_RUNNING_XACTS) or if you can
> think of a more smarter way).

Simplest way is in XLogInsertRecord, the record data is available
there, there is for example some logic related to segment switches.

> Now both during checkpoint and
> in bgwriter, to find out whether there is any activity since last
> time, we need to check all the WALInsertLocks for latest insert position
> (by referring last_insert_pos) and compare it with the latest position
> we have noted during last such check and decide whether to proceed
> or not based on comparison result.  If you think it is not adequate to
> store last_insert_pos in WALInsertLock, then we can think of having
> it in PGPROC.

So the progress check is used when deciding if a checkpoint should be
skipped or not, and when deciding if a standby snapshot should be
taken by the bgwriter? When bgwriter logs a snapshot, it will update
as well the last LSN found (which would be a single variable in for
example XLogCtlData, updated from the data taken from the WAL insert
slots). But there is a problem here: if there is no activity since the
last bgwriter snapshot, the next checkpoint would be skipped as well.
It seems to me that this is not acceptable, a checkpoint generation
would be decided based on if there has been some data activity since
the last snapshot taken, or to put it in other words, no checkpoints
would happen if there has been no activity within 15s after the last
standby snapshot has been logged by the bgwriter.

I have hacked something according to those lines, please see attached
(logs are just here for testing purposes). I may be missing something
obvious regarding the tracking of the last progress position. Code
speaks better than words here I think, so feel free to look at it.

> Yet another idea that occurs to me this morning is that why not
> have a variable in shared memory in XLogCtlInsert or XLogCtl
> similar to CurrBytePos/PrevBytePos which will be updated on
> each XLOGInsert apart from the XLOGInsert for standby snapshots
> and use that in a patch somewhat close to what you have in
> hs-checkpoints-v1.

I have considered that as well, but I do not think it is a good idea
to add more contention in ReserveXLogInsertLocation() while taking
insertpos_lck lock. That's already really hot, and we surely don't
want to make it hotter just to do checks on idle systems. If that's
what you had in mind. Btw, this is basically what I did in the
attached, no? Except that the current insert position is tracked
differently.

>From v2 I sent upthread:
-               if (prevPtr == ControlFile->checkPointCopy.redo &&
+               if (GetProgressRecPtr() == prevPtr &&
                        prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
This is wrong actually. prevPtr should be replaced by the redo LSN.
-- 
Michael

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a2846c4..fe4ef8d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -507,6 +507,13 @@ typedef struct XLogCtlInsert
 	XLogRecPtr	lastBackupStart;
 
 	/*
+	 * WAL insertion progress positions. Each entry is protected by a single
+	 * WAL insert lock, and should be updated only while holding its
+	 * respective lock.
+	 */
+	XLogRecPtr	progressLSN[NUM_XLOGINSERT_LOCKS];
+
+	/*
 	 * WAL insertion locks.
 	 */
 	WALInsertLockPadded *WALInsertLocks;
@@ -527,6 +534,8 @@ typedef struct XLogCtlData
 	TransactionId ckptXid;
 	XLogRecPtr	asyncXactLSN;	/* LSN of newest async commit/abort */
 	XLogRecPtr	replicationSlotMinLSN;	/* oldest LSN needed by any slot */
+	XLogRecPtr	lastProgressLSN;	/* LSN tracking progress of checkpoint
+									 * and standby snapshots */
 
 	XLogSegNo	lastRemovedSegNo;		/* latest removed/recycled XLOG
 										 * segment */
@@ -897,6 +906,7 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
 	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
 							   rechdr->xl_info == XLOG_SWITCH);
+	bool		isStandbySnapshot = (rechdr->xl_rmid == RM_STANDBY_ID);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 
@@ -988,6 +998,27 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the progress LSN positions, WAL insertion locks are already
+	 * taken appropriately before doing that.
+	 */
+	if (!isStandbySnapshot)
+	{
+		if (holdingAllLocks)
+		{
+			int i;
+
+			for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+				if (Insert->progressLSN[i] < EndPos)
+					Insert->progressLSN[i] = EndPos;
+		}
+		else
+		{
+			if (Insert->progressLSN[MyLockNo] < EndPos)
+				Insert->progressLSN[MyLockNo] = EndPos;
+		}
+	}
+
 	if (inserted)
 	{
 		/*
@@ -4684,6 +4715,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		XLogCtl->Insert.progressLSN[i] = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -6314,6 +6346,7 @@ StartupXLOG(void)
 					 checkPoint.newestCommitTsXid);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
+	XLogCtl->lastProgressLSN = checkPoint.redo;
 
 	/*
 	 * Initialize replication slots, before there's a chance to remove
@@ -7869,6 +7902,49 @@ GetFlushRecPtr(void)
 }
 
 /*
+ * XLOGHasActivity -- Determine if any WAL activity has occurred since
+ * the last time this routine has been used by looking at the progress
+ * of WAL. This routine should be used to decide if checkpoints should
+ * happen or if standby activity should be logged or not.
+ */
+bool
+XLOGHasActivity(bool islock)
+{
+	XLogRecPtr	progress_lsn = InvalidXLogRecPtr;
+	int			i;
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+	bool		res = false;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. First fetch all the LSN position that are tracked
+	 * by each WAL insertion slot and compare it to the last progress
+	 * position found to decide if any activity has run on the server or
+	 * not.
+	 */
+	if (islock)
+		WALInsertLockAcquireExclusive();
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		if (progress_lsn < Insert->progressLSN[i])
+			progress_lsn = Insert->progressLSN[i];
+	}
+	if (islock)
+		WALInsertLockRelease();
+
+	/* Update the progress_lsn if some activity has occured */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (XLogCtl->lastProgressLSN < progress_lsn)
+	{
+		XLogCtl->lastProgressLSN = progress_lsn;
+		res = true;
+	}
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return res;
+}
+
+/*
  * Get the time of the last xlog segment switch
  */
 pg_time_t
@@ -8132,7 +8208,6 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8218,17 +8293,15 @@ CreateCheckPoint(int flags)
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
+	 * If this isn't a shutdown or forced checkpoint, and if there has been no
+	 * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+	 * duplicate checkpoints when the system is idle. That wastes log space,
+	 * and more importantly it exposes us to possible loss of both current and
+	 * previous checkpoint records if the machine crashes just as we're writing
+	 * the update.  (Perhaps it'd make even more sense to checkpoint only when
+	 * the previous checkpoint record is in a different xlog page?)
 	 *
 	 * If the previous checkpoint crossed a WAL segment, however, we create
 	 * the checkpoint anyway, to have the latest checkpoint fully contained in
@@ -8239,9 +8312,12 @@ CreateCheckPoint(int flags)
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		elog(LOG, "Not a forced or shutdown checkpoint");
+		if (!XLOGHasActivity(false) &&
+			ControlFile->checkPointCopy.redo / XLOG_SEG_SIZE ==
+			curInsert / XLOG_SEG_SIZE)
 		{
+			elog(LOG, "Checkpoint is skipped");
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -8408,7 +8484,11 @@ CreateCheckPoint(int flags)
 	 * recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		LogStandbySnapshot();
+	{
+		XLogRecPtr lsn = LogStandbySnapshot();
+		elog(LOG, "snapshot taken by checkpoint %X/%X",
+			 (uint32) (lsn >> 32), (uint32) lsn);
+	}
 
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4ff4caf..0388fc7 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,10 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Timestamp at which we last issued a LogStandbySnapshot(), to avoid doing
+ * so too often.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -301,7 +299,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -315,13 +313,13 @@ BackgroundWriterMain(void)
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * only log if enough time has passed.
 			 */
-			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+			if (now >= timeout && XLOGHasActivity(true))
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				XLogRecPtr lsn = LogStandbySnapshot();
+				elog(LOG, "snapshot taken by bgwriter %X/%X",
+					 (uint32) (lsn >> 32), (uint32) lsn);
 				last_snapshot_ts = now;
 			}
 		}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ecd30ce..43fe026 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern bool XLOGHasActivity(bool islock);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

Reply via email to