The final patch in this series for today spreads out the individual checkpoint fsync calls over time; it was written by Simon Riggs and me. The patch is based on a tree that already has the two patches I sent over earlier today applied, rather than on HEAD, because both are useful for measuring how well this one works. You can grab a tree with all three from my Github repo, via the "checkpoint" branch: https://github.com/greg2ndQuadrant/postgres/tree/checkpoint

This is a work in progress. While I've seen this reduce checkpoint spike latency significantly on a large system, I don't have any performance numbers I can reference publicly yet. There are also a couple of problems I know about, and I'm sure others I haven't thought of yet. The first known issue is that it delays manual or other "forced" checkpoints. That's not necessarily wrong if you really are serious about spreading syncs out, but it is certainly surprising when you run into it; I notice it most when running createdb on a busy system. There's no real reason for this to happen: the code passes down the fact that it's a forced checkpoint, it just doesn't act on that yet.
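
Just to sketch the obvious fix--none of this is in the patch yet, and the signature change shown here is only illustrative of what I have in mind--the sync delay routine would simply refuse to pause at all when handed a forced checkpoint:

void
CheckpointSyncDelay(int flags)
{
	/* Do nothing if checkpoint is being executed by non-bgwriter process */
	if (!am_bg_writer)
		return;

	/*
	 * A forced checkpoint (createdb, an explicit CHECKPOINT, shutdown)
	 * shouldn't be stretched out at all.
	 */
	if (flags & CHECKPOINT_IMMEDIATE)
		return;

	/* ...otherwise pause and do normal bgwriter duties, as in the patch... */
}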

The second issue is that the delay between sync calls is currently hard-coded at 3 seconds. I believe the right path here is to consider the current checkpoint_completion_target to still be valid, then work back from there. That raises the question of what fraction of that target the writes should now be compressed into, so that some time is left over for spreading out the sync calls. If we're willing to say "writes finish in the first 1/2 of the target, syncs execute in the second 1/2", I could implement that here. Maybe that ratio needs to be another tunable. I'm still thinking about that part, and it's certainly open to community debate. The thing that complicates the design is that executing an individual sync may take a considerable period of time. That's much more likely than for an individual write, as the current spread checkpoint issues, because writes are usually just absorbed by the OS cache. In the spread sync case, it's easy for one slow sync to turn the rest into ones that fire in quick succession, to make up for lost time.
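
To show the sort of scheduling I have in mind, here's a standalone sketch; none of this code is in the patch, and the function and parameter names are invented for illustration. The idea is to give each remaining sync an equal slice of whatever is left of the sync portion of the target. A sync that runs long eats its own slice, which produces exactly the "fire in quick succession" catch-up behavior described above:

/*
 * Sketch only:  how long to pause after the current sync, given how many
 * syncs have finished, how many there are in total, how many seconds the
 * sync phase was budgeted (derived from checkpoint_completion_target),
 * and how many seconds of that budget have already elapsed.
 */
static int
next_sync_pause_secs(int syncs_done, int syncs_total,
					 int sync_budget_secs, int elapsed_secs)
{
	int		remaining_syncs = syncs_total - syncs_done;
	int		remaining_secs = sync_budget_secs - elapsed_secs;

	/* Behind schedule, or nothing left to do:  don't pause at all */
	if (remaining_syncs <= 0 || remaining_secs <= 0)
		return 0;

	/* Spread what's left of the budget evenly over the remaining syncs */
	return remaining_secs / remaining_syncs;
}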

There's some history behind this design that impacts review. Circa 8.3 development in 2007, I experimented with putting some delay between each of the fsync calls the background writer executes during a checkpoint. It didn't help smooth things out at all at the time. It turns out that's mainly because all my tests were on Linux using ext3. On that filesystem, fsync is not very granular: it's quite likely to push out data you haven't asked to sync yet, which means one giant sync is almost impossible to avoid no matter how you space the fsync calls. If you try to review this on ext3, I expect you'll find a big spike early in each checkpoint (where it flushes just about everything out) and then quick response for the later files involved.

The system this patch was originally written to help was running XFS. There, I've confirmed that problem doesn't exist: individual syncs only seem to push out the data related to that one file. The same should be true on ext4, but I haven't tested that myself. I'm not sure yet how granular the fsync calls are on Solaris, FreeBSD, Darwin, etc. Note that it's still possible to get hung up on one sync call for a while, even on XFS. The worst case seems to be if you've created a new 1GB database table chunk and fully populated it since the last checkpoint, on a system that has just cached the whole thing so far.

One change that turned out to be necessary rather than optional--in order to get good performance from the system I was tuning--was to make regular background writer activity, including fsync absorb checks, happen during these sync pauses. The existing code ran the checkpoint sync work in a pretty tight loop, which, as I alluded to in an earlier patch today, can lead to the backends competing with the background writer to get their sync calls executed. This squashes that problem, provided the background writer is set up properly.

What does properly mean? Well, it can't do that cleanup while it's sleeping. This whole area was refactored. The current sync absorb code uses the constant WRITES_PER_ABSORB to decide when to act. This new version replaces that hard-coded value with something that scales to the system size. It now puts off doing that work until the number of pending absorb requests has reached 10% of the number it's possible to store (BgWriterShmem->max_requests, which is set to the size of shared_buffers in 8K pages, AKA NBuffers). This may actually postpone the work for too long on systems with large shared_buffers settings; that's one area I'm still investigating.
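
To put rough numbers on that threshold (mine, not anything computed by the patch): with shared_buffers=256MB, NBuffers = 256MB / 8KB = 32768, so max_requests is 32768 and absorption kicks in once 32768 / 10 = 3276 requests are queued. At shared_buffers=8GB the queue holds 1048576 entries and the threshold becomes 104857 pending requests, which is the sort of figure that makes me wonder whether the work gets postponed too long on big systems.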

As for concerns about this 10% setting not doing enough work--which is something I do see--you can now increase how often absorbing happens by decreasing bgwriter_delay, which brings other benefits too. For example, if you run the fsync-stress-v2.sh script I included with the last patch I sent, you'll discover that the spread sync version of the server leaves just as many unabsorbed writes behind as the old code did. Those happen during the periods when the background writer is sleeping. They drop as you decrease the delay; here's a table showing some values I tested here, with all three patches installed:

bgwriter_delay    buffers_backend_sync
200 ms            90
50 ms             28
25 ms             3

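If you want to try that on your own system, it's just the usual postgresql.conf change followed by a reload; the 25 ms figure above is close to the GUC's lower limit of 10 ms:

bgwriter_delay = 25ms		# default is 200ms
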
There's a bunch of performance-related review work that needs to be done here, in addition to the usual code review for the patch. My hope is that I can get enough of that done on public hardware to validate this does what it's supposed to, so that a later version of this patch can be considered for the next CommitFest. It's still a little more raw than I'd like, but the idea has been tested enough here that I believe it's fundamentally sound and valuable.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us


diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 43a149e..0ce8e2b 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct
 
 static BgWriterShmemStruct *BgWriterShmem;
 
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB		1000
+/* Fraction of fsync absorb queue that needs to be filled before acting */
+#define ABSORB_ACTION_DIVISOR	10
 
 /*
  * GUC parameters
@@ -382,7 +382,7 @@ BackgroundWriterMain(void)
 		/*
 		 * Process any requests or signals received recently.
 		 */
-		AbsorbFsyncRequests();
+		AbsorbFsyncRequests(false);
 
 		if (got_SIGHUP)
 		{
@@ -636,7 +636,7 @@ BgWriterNap(void)
 		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
 			break;
 		pg_usleep(1000000L);
-		AbsorbFsyncRequests();
+		AbsorbFsyncRequests(true);
 		udelay -= 1000000L;
 	}
 
@@ -684,8 +684,6 @@ ImmediateCheckpointRequested(void)
 void
 CheckpointWriteDelay(int flags, double progress)
 {
-	static int	absorb_counter = WRITES_PER_ABSORB;
-
 	/* Do nothing if checkpoint is being executed by non-bgwriter process */
 	if (!am_bg_writer)
 		return;
@@ -705,22 +703,65 @@ CheckpointWriteDelay(int flags, double progress)
 			ProcessConfigFile(PGC_SIGHUP);
 		}
 
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
+		AbsorbFsyncRequests(false);
 
 		BgBufferSync();
 		CheckArchiveTimeout();
 		BgWriterNap();
 	}
-	else if (--absorb_counter <= 0)
+	else
 	{
 		/*
-		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-		 * operations even when we don't sleep, to prevent overflow of the
-		 * fsync request queue.
+		 * Check for overflow of the fsync request queue.
 		 */
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
+		AbsorbFsyncRequests(false);
+	}
+}
+
+/*
+ * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+ *
+ * This function is called after each file sync performed by mdsync().
+ * It is responsible for keeping the bgwriter's normal activities in
+ * progress during a long checkpoint.
+ */
+void
+CheckpointSyncDelay(void)
+{
+	pg_time_t	now;
+	pg_time_t	sync_start_time;
+	int			sync_delay_secs;
+
+	/*
+	 * Delay after each sync, in seconds.  This could be a parameter.  But
+	 * since ideally this will be auto-tuning in the near future, not
+	 * assigning it a GUC setting yet.
+	 */
+#define EXTRA_SYNC_DELAY	3
+
+	/* Do nothing if checkpoint is being executed by non-bgwriter process */
+	if (!am_bg_writer)
+		return;
+
+	sync_start_time = (pg_time_t) time(NULL);
+
+	/*
+	 * Perform the usual bgwriter duties.
+	 */
+	for (;;)
+	{
+		AbsorbFsyncRequests(false);
+		BgBufferSync();
+		CheckArchiveTimeout();
+		BgWriterNap();
+
+		/*
+		 * Are we there yet?
+		 */
+		now = (pg_time_t) time(NULL);
+		sync_delay_secs = now - sync_start_time;
+		if (sync_delay_secs >= EXTRA_SYNC_DELAY)
+			break;
 	}
 }
 
@@ -1116,16 +1157,41 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  * non-bgwriter processes, do nothing if not bgwriter.
  */
 void
-AbsorbFsyncRequests(void)
+AbsorbFsyncRequests(bool force)
 {
 	BgWriterRequest *requests = NULL;
 	BgWriterRequest *request;
 	int			n;
 
+	/*
+	 * ABSORB_ACTION_DIVISOR determines when absorption action needs to be
+	 * taken here:  the size of the request queue is divided by it, and
+	 * nothing is done until that many requests are pending.  The default
+	 * aims to empty the queue whenever 1 / 10 = 10% of it is full.  If
+	 * that isn't good enough, you probably need to lower bgwriter_delay,
+	 * rather than presume this needs to be a tunable you can decrease.
+	 */
+
 	if (!am_bg_writer)
 		return;
 
 	/*
+	 * If the queue isn't very large, don't worry about absorbing yet.
+	 * Access integer counter without lock, to avoid queuing.
+	 */
+	if (!force && BgWriterShmem->num_requests <
+			(BgWriterShmem->max_requests / ABSORB_ACTION_DIVISOR))
+	{
+		if (BgWriterShmem->num_requests > 0)
+			elog(DEBUG1, "Absorb queue: %d fsync requests, not processing",
+				BgWriterShmem->num_requests);
+		return;
+	}
+
+	elog(DEBUG1, "Absorb queue: %d fsync requests, processing",
+		BgWriterShmem->num_requests);
+
+	/*
 	 * We have to PANIC if we fail to absorb all the pending requests (eg,
 	 * because our hashtable runs out of memory).  This is because the system
 	 * cannot run safely if we are unable to fsync what we have been told to
@@ -1167,4 +1233,9 @@ AbsorbFsyncRequests(void)
 		pfree(requests);
 
 	END_CRIT_SECTION();
+
+	/*
+	 * Send off activity statistics to the stats collector
+	 */
+	pgstat_send_bgwriter();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 7140b94..57066c4 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -36,9 +36,6 @@
  */
 #define DEBUG_FSYNC	1
 
-/* interval for calling AbsorbFsyncRequests in mdsync */
-#define FSYNCS_PER_ABSORB		10
-
 /* special values for the segno arg to RememberFsyncRequest */
 #define FORGET_RELATION_FSYNC	(InvalidBlockNumber)
 #define FORGET_DATABASE_FSYNC	(InvalidBlockNumber-1)
@@ -931,7 +928,6 @@ mdsync(void)
 
 	HASH_SEQ_STATUS hstat;
 	PendingOperationEntry *entry;
-	int			absorb_counter;
 
 #ifdef DEBUG_FSYNC
 	/* Statistics on sync times */
@@ -958,7 +954,7 @@ mdsync(void)
 	 * queued an fsync request before clearing the buffer's dirtybit, so we
 	 * are safe as long as we do an Absorb after completing BufferSync().
 	 */
-	AbsorbFsyncRequests();
+	AbsorbFsyncRequests(true);
 
 	/*
 	 * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
@@ -1001,7 +997,6 @@ mdsync(void)
 	mdsync_in_progress = true;
 
 	/* Now scan the hashtable for fsync requests to process */
-	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
@@ -1026,17 +1021,9 @@ mdsync(void)
 			int			failures;
 
 			/*
-			 * If in bgwriter, we want to absorb pending requests every so
-			 * often to prevent overflow of the fsync request queue.  It is
-			 * unspecified whether newly-added entries will be visited by
-			 * hash_seq_search, but we don't care since we don't need to
-			 * process them anyway.
+			 * If in bgwriter, perform normal duties.
 			 */
-			if (--absorb_counter <= 0)
-			{
-				AbsorbFsyncRequests();
-				absorb_counter = FSYNCS_PER_ABSORB;
-			}
+			CheckpointSyncDelay();
 
 			/*
 			 * The fsync table could contain requests to fsync segments that
@@ -1131,10 +1118,9 @@ mdsync(void)
 				pfree(path);
 
 				/*
-				 * Absorb incoming requests and check to see if canceled.
+				 * If in bgwriter, perform normal duties.
 				 */
-				AbsorbFsyncRequests();
-				absorb_counter = FSYNCS_PER_ABSORB;		/* might as well... */
+				CheckpointSyncDelay();
 
 				if (entry->canceled)
 					break;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index e251da6..4939604 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -26,10 +26,11 @@ extern void BackgroundWriterMain(void);
 
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointSyncDelay(void);
 
 extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
 					BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern void AbsorbFsyncRequests(bool force);
 
 extern Size BgWriterShmemSize(void);
 extern void BgWriterShmemInit(void);
-- 