On Mon, Feb 28, 2022 at 2:36 PM Andres Freund <and...@anarazel.de> wrote:
> On February 27, 2022 4:19:21 PM PST, Thomas Munro <thomas.mu...@gmail.com> wrote:
> >It seems a little strange to introduce a new wait event that will very
> >often appear into a stable branch, but ... it is actually telling the
> >truth, so there is that.
>
> In the back branches it needs to be at the end of the enum - I assume you
> intended that just to be for HEAD.

Yeah.

> I wonder whether in HEAD we shouldn't make that sleep duration be computed
> from the calculation in IsOnSchedule...

I might look into this.

> >The sleep/poll loop in RegisterSyncRequest() may also have another
> >problem.  The comment explains that it was a deliberate choice not to
> >do CHECK_FOR_INTERRUPTS() here, which may be debatable, but I don't
> >think there's an excuse to ignore postmaster death in a loop that
> >presumably becomes infinite if the checkpointer exits.  I guess we
> >could do:
> >
> >- pg_usleep(10000L);
> >+ WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
> >  WAIT_EVENT_SYNC_REQUEST);
> >
> >But... really, this should be waiting on a condition variable that the
> >checkpointer broadcasts on when the queue goes from full to not full,
> >no?  Perhaps for master only?
>
> Looks worth improving, but yes, I'd not do it in the back branches.

0003 is a first attempt at that, for master only (on top of 0002, which
is the minimal fix).  This shaves another second off
027_stream_regress.pl on my workstation.  The main thing I realised is
that I needed to hold interrupts while waiting, which seems like it
should go away with 'tombstone' files as discussed in other threads.
That's not a new problem in this patch; it just looks more offensive to
the eye when you spell it out, instead of hiding it with an unreported
sleep/poll loop...

> I do think it's worth giving that sleep a proper wait event though, even in
> the back branches.

I'm thinking that 0002 should be back-patched all the way, but 0001
could be limited to 14.
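For readers following the 0003 design at a distance: "broadcast on the
queue's full-to-not-full transition" is a general monitor-style pattern,
not something specific to PostgreSQL.  Here is a stand-alone pthreads
sketch of it (all names such as QUEUE_CAPACITY, producer, consumer and
run_demo are invented for this demo; the patch itself uses PostgreSQL's
ConditionVariable API under CheckpointerCommLock, not pthreads):

```c
/*
 * Toy analogue of 0003: a "backend" (producer) waits on a condition
 * variable while a bounded request queue is full; the "checkpointer"
 * (consumer) drains the whole queue and broadcasts only when the queue
 * it drained was full, mirroring ConditionVariableBroadcast() in
 * AbsorbSyncRequests().  Hypothetical demo code, not PostgreSQL code.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_CAPACITY 2
#define TOTAL_REQUESTS 5

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full_cv = PTHREAD_COND_INITIALIZER;
static int num_requests = 0;	/* current queue depth */
static int absorbed = 0;		/* total requests drained */
static bool done = false;		/* producer has enqueued everything */

/* Backend side: enqueue one request at a time, waiting while full. */
static void *
producer(void *arg)
{
	(void) arg;
	for (int i = 0; i < TOTAL_REQUESTS; i++)
	{
		pthread_mutex_lock(&lock);
		while (num_requests >= QUEUE_CAPACITY)
			pthread_cond_wait(&not_full_cv, &lock);
		num_requests++;
		pthread_mutex_unlock(&lock);
	}
	pthread_mutex_lock(&lock);
	done = true;
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Checkpointer side: drain everything; wake waiters only if it was full. */
static void *
consumer(void *arg)
{
	(void) arg;
	for (;;)
	{
		pthread_mutex_lock(&lock);
		bool		was_full = (num_requests >= QUEUE_CAPACITY);
		absorbed += num_requests;
		num_requests = 0;
		bool		finished = done && absorbed >= TOTAL_REQUESTS;
		pthread_mutex_unlock(&lock);

		if (was_full)
			pthread_cond_broadcast(&not_full_cv);
		if (finished)
			return NULL;
	}
}

/* Run one producer against one consumer; returns total requests absorbed. */
int
run_demo(void)
{
	pthread_t	p, c;

	pthread_create(&p, NULL, producer, NULL);
	pthread_create(&c, NULL, consumer, NULL);
	pthread_join(p, NULL);
	pthread_join(c, NULL);
	return absorbed;
}
```

The "broadcast only on the transition" detail is what keeps the
checkpointer from signalling on every absorb cycle when nobody can
possibly be waiting.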
From a9344bb2fb2a363bec4be526f87560cb212ca10b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Mon, 28 Feb 2022 11:27:05 +1300
Subject: [PATCH v2 1/3] Wake up for latches in CheckpointWriteDelay().

The checkpointer shouldn't ignore its latch.  Other backends may be
waiting for it to drain the request queue.  Hopefully real systems don't
have a full queue often, but the condition is reached easily when
shared_buffers is very small.

This involves defining a new wait event, which will appear in the
pg_stat_activity view often due to spread checkpoints.

Back-patch only to 14.  Even though the problem exists in earlier
branches too, it's hard to hit there.  In 14 we stopped using signal
handlers for latches on Linux, *BSD and macOS, which were previously
hiding this problem by interrupting the sleep (though not reliably, as
the signal could arrive before the sleep begins; precisely the problem
latches address).

Reported-by: Andres Freund <and...@anarazel.de>
Discussion: https://postgr.es/m/20220226213942.nb7uvb2pamyu26dj%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml            | 4 ++++
 src/backend/postmaster/checkpointer.c   | 8 +++++++-
 src/backend/utils/activity/wait_event.c | 3 +++
 src/include/utils/wait_event.h          | 1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9fb62fec8e..8620aaddc7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2235,6 +2235,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
        <entry><literal>BaseBackupThrottle</literal></entry>
        <entry>Waiting during base backup when throttling activity.</entry>
       </row>
+      <row>
+       <entry><literal>CheckpointWriteDelay</literal></entry>
+       <entry>Waiting between writes while performing a checkpoint.</entry>
+      </row>
       <row>
        <entry><literal>PgSleep</literal></entry>
        <entry>Waiting due to a call to <function>pg_sleep</function> or
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 4488e3a443..a59c3cf020 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -484,6 +484,9 @@ CheckpointerMain(void)
 			}
 
 			ckpt_active = false;
+
+			/* We may have received an interrupt during the checkpoint. */
+			HandleCheckpointerInterrupts();
 		}
 
 		/* Check for archive_timeout and switch xlog files if necessary. */
@@ -726,7 +729,10 @@ CheckpointWriteDelay(int flags, double progress)
 		 * Checkpointer and bgwriter are no longer related so take the Big
 		 * Sleep.
 		 */
-		pg_usleep(100000L);
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
+				  100,
+				  WAIT_EVENT_CHECKPOINT_WRITE_DELAY);
+		ResetLatch(MyLatch);
 	}
 	else if (--absorb_counter <= 0)
 	{
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 60972c3a75..0706e922b5 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -485,6 +485,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_BASE_BACKUP_THROTTLE:
 			event_name = "BaseBackupThrottle";
 			break;
+		case WAIT_EVENT_CHECKPOINT_WRITE_DELAY:
+			event_name = "CheckpointWriteDelay";
+			break;
 		case WAIT_EVENT_PG_SLEEP:
 			event_name = "PgSleep";
 			break;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 395d325c5f..d0345c6b49 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -141,6 +141,7 @@ typedef enum
 typedef enum
 {
 	WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+	WAIT_EVENT_CHECKPOINT_WRITE_DELAY,
 	WAIT_EVENT_PG_SLEEP,
 	WAIT_EVENT_RECOVERY_APPLY_DELAY,
 	WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
-- 
2.30.2
From 1eb0266fed7ccb63a2430e4fbbaef2300f2aa0d0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 1 Mar 2022 11:38:27 +1300
Subject: [PATCH v2 2/3] Fix waiting in RegisterSyncRequest().

If we run out of space in the checkpointer sync request queue (which is
hopefully rare on real systems, but common with a very small buffer
pool), we wait for it to drain.  While waiting, we should report that as
a wait event so that users know what is going on, and also handle
postmaster death, since otherwise the loop might never terminate if the
checkpointer has exited.

Discussion: https://postgr.es/m/20220226213942.nb7uvb2pamyu26dj%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml            | 5 +++++
 src/backend/storage/sync/sync.c         | 4 +++-
 src/backend/utils/activity/wait_event.c | 3 +++
 src/include/utils/wait_event.h          | 1 +
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8620aaddc7..71559442f0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2254,6 +2254,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
        <entry>Waiting during recovery when WAL data is not available from any
         source (<filename>pg_wal</filename>, archive or stream).</entry>
       </row>
+      <row>
+       <entry><literal>RegisterSyncRequest</literal></entry>
+       <entry>Waiting while sending synchronization requests to the
+        checkpointer, because the request queue is full.</entry>
+      </row>
       <row>
        <entry><literal>VacuumDelay</literal></entry>
        <entry>Waiting in a cost-based vacuum delay point.</entry>
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e161d57761..0c4d9ce687 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -31,6 +31,7 @@
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/latch.h"
 #include "storage/md.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
@@ -606,7 +607,8 @@ RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
 		if (ret || (!ret && !retryOnError))
 			break;
 
-		pg_usleep(10000L);
+		WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
+				  WAIT_EVENT_REGISTER_SYNC_REQUEST);
 	}
 
 	return ret;
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 0706e922b5..ff46a0e3c7 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -497,6 +497,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL:
 			event_name = "RecoveryRetrieveRetryInterval";
 			break;
+		case WAIT_EVENT_REGISTER_SYNC_REQUEST:
+			event_name = "RegisterSyncRequest";
+			break;
 		case WAIT_EVENT_VACUUM_DELAY:
 			event_name = "VacuumDelay";
 			break;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index d0345c6b49..1c39ce031a 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -145,6 +145,7 @@ typedef enum
 	WAIT_EVENT_PG_SLEEP,
 	WAIT_EVENT_RECOVERY_APPLY_DELAY,
 	WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
+	WAIT_EVENT_REGISTER_SYNC_REQUEST,
 	WAIT_EVENT_VACUUM_DELAY,
 	WAIT_EVENT_VACUUM_TRUNCATE
 } WaitEventTimeout;
-- 
2.30.2
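The essential point of 0002's WaitLatch(NULL, WL_EXIT_ON_PM_DEATH |
WL_TIMEOUT, ...) call is that the retry loop must not spin forever once
the checkpointer and postmaster are gone.  The shape of such a loop,
with liveness rechecked between bounded sleeps, can be sketched in plain
C (the toy_* names and callback indirection are invented for this demo;
the real code detects postmaster death inside the latch wait itself
rather than via a callback):

```c
/* Toy analogue of 0002's bounded retry loop.  Hypothetical demo code. */
#include <assert.h>
#include <stdbool.h>
#include <time.h>

typedef enum
{
	SYNC_REQUEST_QUEUED,		/* request accepted by the "checkpointer" */
	SYNC_REQUEST_GAVE_UP		/* stopped retrying: "postmaster" is gone */
} ToySyncResult;

/*
 * Retry until the request is queued, but re-check supervisor liveness
 * between attempts instead of sleeping blindly, so the loop cannot spin
 * forever after the supervising process has exited.
 */
ToySyncResult
toy_register_sync_request(bool (*try_enqueue) (void),
						  bool (*supervisor_alive) (void))
{
	for (;;)
	{
		if (try_enqueue())
			return SYNC_REQUEST_QUEUED;
		if (!supervisor_alive())
			return SYNC_REQUEST_GAVE_UP;	/* a real backend would exit */

		/* Sleep 10 ms between attempts, like the patch's WaitLatch timeout. */
		struct timespec ts = {0, 10 * 1000000L};

		nanosleep(&ts, NULL);
	}
}

/* Fake callbacks for demonstration only. */
int			toy_attempts;

bool
toy_succeed_on_third(void)
{
	return ++toy_attempts >= 3;
}

bool
toy_always_alive(void)
{
	return true;
}

bool
toy_never_enqueue(void)
{
	return false;
}

bool
toy_supervisor_dead(void)
{
	return false;
}
```

Contrast with the pre-patch pg_usleep() loop: there is no point in this
toy at which a dead supervisor leaves the caller sleeping indefinitely,
which is the bug the patch fixes.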
From 50060e5a0ed66762680ddee9e30acbad905c6e98 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 1 Mar 2022 17:34:43 +1300
Subject: [PATCH v2 3/3] Use condition variable to wait when sync request queue
 is full.

Previously, in the (hopefully) rare case that we need to wait for the
checkpointer to create space in the sync request queue, we'd enter a
sleep/retry loop.  Instead, create a condition variable so the
checkpointer can wake us up whenever there is a transition from 'full'
to 'not full'.

Discussion: https://postgr.es/m/20220226213942.nb7uvb2pamyu26dj%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml            | 10 +++---
 src/backend/postmaster/checkpointer.c   | 38 +++++++++++++++++++---
 src/backend/storage/sync/sync.c         | 42 ++++++++++---------------
 src/backend/utils/activity/wait_event.c |  6 ++--
 src/include/postmaster/bgwriter.h       |  4 ++-
 src/include/utils/wait_event.h          |  2 +-
 6 files changed, 62 insertions(+), 40 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 71559442f0..5e11548075 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1626,6 +1626,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
        <entry>Waiting for activity from a child process while executing a
         <literal>Gather</literal> plan node.</entry>
       </row>
+      <row>
+       <entry><literal>ForwardSyncRequest</literal></entry>
+       <entry>Waiting while sending synchronization requests to the
+        checkpointer, because the request queue is full.</entry>
+      </row>
       <row>
        <entry><literal>HashBatchAllocate</literal></entry>
        <entry>Waiting for an elected Parallel Hash participant to allocate a hash
@@ -2254,11 +2259,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
        <entry>Waiting during recovery when WAL data is not available from any
         source (<filename>pg_wal</filename>, archive or stream).</entry>
       </row>
-      <row>
-       <entry><literal>RegisterSyncRequest</literal></entry>
-       <entry>Waiting while sending synchronization requests to the
-        checkpointer, because the request queue is full.</entry>
-      </row>
       <row>
        <entry><literal>VacuumDelay</literal></entry>
        <entry>Waiting in a cost-based vacuum delay point.</entry>
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a59c3cf020..16cf472f90 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -128,6 +128,7 @@ typedef struct
 	uint32		num_backend_writes; /* counts user backend buffer writes */
 	uint32		num_backend_fsync;	/* counts user backend fsync calls */
 
+	ConditionVariable requests_not_full_cv; /* signaled when space available */
 	int			num_requests;	/* current # of requests */
 	int			max_requests;	/* allocated array size */
 	CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
@@ -903,6 +904,7 @@ CheckpointerShmemInit(void)
 		CheckpointerShmem->max_requests = NBuffers;
 		ConditionVariableInit(&CheckpointerShmem->start_cv);
 		ConditionVariableInit(&CheckpointerShmem->done_cv);
+		ConditionVariableInit(&CheckpointerShmem->requests_not_full_cv);
 	}
 }
 
@@ -1076,10 +1078,11 @@ RequestCheckpoint(int flags)
  * to perform its own fsync, which is far more expensive in practice.  It
  * is theoretically possible a backend fsync might still be necessary, if
  * the queue is full and contains no duplicate entries.  In that case, we
- * let the backend know by returning false.
+ * let the backend know by returning false, or if 'wait' is true, then we
+ * wait for space to become available.
  */
 bool
-ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
+ForwardSyncRequest(const FileTag *ftag, SyncRequestType type, bool wait)
 {
 	CheckpointerRequest *request;
 	bool		too_full;
@@ -1101,9 +1104,9 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 	 * backend will have to perform its own fsync request.  But before forcing
 	 * that to happen, we can try to compact the request queue.
 	 */
-	if (CheckpointerShmem->checkpointer_pid == 0 ||
-		(CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
-		 !CompactCheckpointerRequestQueue()))
+	if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
+		!CompactCheckpointerRequestQueue() &&
+		!wait)
 	{
 		/*
 		 * Count the subset of writes where backends have to do their own
@@ -1115,6 +1118,24 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 		return false;
 	}
 
+	/*
+	 * If we still don't have enough space and the caller asked us to wait,
+	 * wait for the checkpointer to advertise that there is space.
+	 */
+	if (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests)
+	{
+		ConditionVariablePrepareToSleep(&CheckpointerShmem->requests_not_full_cv);
+		while (CheckpointerShmem->num_requests >=
+			   CheckpointerShmem->max_requests)
+		{
+			LWLockRelease(CheckpointerCommLock);
+			ConditionVariableSleep(&CheckpointerShmem->requests_not_full_cv,
+								   WAIT_EVENT_FORWARD_SYNC_REQUEST);
+			LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+		}
+		ConditionVariableCancelSleep();
+	}
+
 	/* OK, insert request */
 	request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
 	request->ftag = *ftag;
@@ -1261,6 +1282,7 @@ AbsorbSyncRequests(void)
 	CheckpointerRequest *requests = NULL;
 	CheckpointerRequest *request;
 	int			n;
+	bool		was_full;
 
 	if (!AmCheckpointerProcess())
 		return;
@@ -1293,6 +1315,8 @@ AbsorbSyncRequests(void)
 		memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
 	}
 
+	was_full = n >= CheckpointerShmem->max_requests;
+
 	START_CRIT_SECTION();
 
 	CheckpointerShmem->num_requests = 0;
@@ -1304,6 +1328,10 @@ AbsorbSyncRequests(void)
 
 	END_CRIT_SECTION();
 
+	/* Wake anyone waiting for space in ForwardSyncRequest(). */
+	if (was_full)
+		ConditionVariableBroadcast(&CheckpointerShmem->requests_not_full_cv);
+
 	if (requests)
 		pfree(requests);
 }
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 0c4d9ce687..d0fca02ee6 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -585,31 +585,23 @@ RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
 		return true;
 	}
 
-	for (;;)
-	{
-		/*
-		 * Notify the checkpointer about it.  If we fail to queue a message in
-		 * retryOnError mode, we have to sleep and try again ... ugly, but
-		 * hopefully won't happen often.
-		 *
-		 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
-		 * error in the case of SYNC_UNLINK_REQUEST would leave the
-		 * no-longer-used file still present on disk, which would be bad, so
-		 * I'm inclined to assume that the checkpointer will always empty the
-		 * queue soon.
-		 */
-		ret = ForwardSyncRequest(ftag, type);
-
-		/*
-		 * If we are successful in queueing the request, or we failed and were
-		 * instructed not to retry on error, break.
-		 */
-		if (ret || (!ret && !retryOnError))
-			break;
-
-		WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
-				  WAIT_EVENT_REGISTER_SYNC_REQUEST);
-	}
+	/*
+	 * Notify the checkpointer about it.  If we fail to queue a message in
+	 * retryOnError mode, we wait until space is available ... ugly, but
+	 * hopefully won't happen often.
+	 *
+	 * Don't allow interrupts while waiting.  Escaping with an error in the
+	 * case of SYNC_UNLINK_REQUEST would leave the no-longer-used file still
+	 * present on disk, which would be bad, so I'm inclined to assume that the
+	 * checkpointer will always empty the queue soon.
+	 *
+	 * XXX This concern would go away along with SYNC_UNLINK_REQUEST if we
+	 * figure out how to get rid of 'tombstone files'.  That would be a good
+	 * idea because holding interrupts is bad for ProcSignalBarrier.
+	 */
+	HOLD_INTERRUPTS();
+	ret = ForwardSyncRequest(ftag, type, retryOnError /* wait */ );
+	RESUME_INTERRUPTS();
 
 	return ret;
 }
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index ff46a0e3c7..15c35691b6 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -346,6 +346,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_EXECUTE_GATHER:
 			event_name = "ExecuteGather";
 			break;
+		case WAIT_EVENT_FORWARD_SYNC_REQUEST:
+			event_name = "ForwardSyncRequest";
+			break;
 		case WAIT_EVENT_HASH_BATCH_ALLOCATE:
 			event_name = "HashBatchAllocate";
 			break;
@@ -497,9 +500,6 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL:
 			event_name = "RecoveryRetrieveRetryInterval";
 			break;
-		case WAIT_EVENT_REGISTER_SYNC_REQUEST:
-			event_name = "RegisterSyncRequest";
-			break;
 		case WAIT_EVENT_VACUUM_DELAY:
 			event_name = "VacuumDelay";
 			break;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 2882efd67b..fa82007fc0 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -33,7 +33,9 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
-extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
+extern bool ForwardSyncRequest(const FileTag *ftag,
+							   SyncRequestType type,
+							   bool wait);
 
 extern void AbsorbSyncRequests(void);
 
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 1c39ce031a..3b07c3b878 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -91,6 +91,7 @@ typedef enum
 	WAIT_EVENT_CHECKPOINT_DONE,
 	WAIT_EVENT_CHECKPOINT_START,
 	WAIT_EVENT_EXECUTE_GATHER,
+	WAIT_EVENT_FORWARD_SYNC_REQUEST,
 	WAIT_EVENT_HASH_BATCH_ALLOCATE,
 	WAIT_EVENT_HASH_BATCH_ELECT,
 	WAIT_EVENT_HASH_BATCH_LOAD,
@@ -145,7 +146,6 @@ typedef enum
 	WAIT_EVENT_PG_SLEEP,
 	WAIT_EVENT_RECOVERY_APPLY_DELAY,
 	WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
-	WAIT_EVENT_REGISTER_SYNC_REQUEST,
 	WAIT_EVENT_VACUUM_DELAY,
 	WAIT_EVENT_VACUUM_TRUNCATE
 } WaitEventTimeout;
-- 
2.30.2