The final patch in this series for today spreads out the individual checkpoint fsync calls over time; it was written by Simon Riggs and me. The patch is based on a tree that already has the two patches I sent over earlier today applied, rather than on HEAD, because both are useful for measuring how well this one works. You can grab a tree with all three from my Github repo, via the "checkpoint" branch: https://github.com/greg2ndQuadrant/postgres/tree/checkpoint

This is a work in progress. While I've seen this reduce checkpoint spike latency significantly on a large system, I don't have any performance numbers I can reference publicly yet. There are also a couple of problems I know about, and I'm sure others I haven't thought of yet. The first known issue is that it delays manual or other "forced" checkpoints. That's not necessarily wrong if you really are serious about spreading syncs out, but it is certainly surprising when you run into it; I notice it most when running createdb on a busy system. There's no real reason for this to happen: the code passes down the fact that it's a forced checkpoint, it just doesn't act on that yet.
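
Just to sketch the obvious fix--none of this is in the patch yet, and the signature change shown here is only illustrative of what I have in mind--the sync delay routine would simply refuse to pause at all when handed a forced checkpoint:

void
CheckpointSyncDelay(int flags)
{
	/* Do nothing if checkpoint is being executed by non-bgwriter process */
	if (!am_bg_writer)
		return;

	/*
	 * A forced checkpoint (createdb, an explicit CHECKPOINT, shutdown)
	 * shouldn't be stretched out at all.
	 */
	if (flags & CHECKPOINT_IMMEDIATE)
		return;

	/* ...otherwise pause and do normal bgwriter duties, as in the patch... */
}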

The second issue is that the delay between sync calls is currently hard-coded at 3 seconds. I believe the right path here is to consider the current checkpoint_completion_target to still be valid, then work back from there. That raises the question of what fraction of that target the writes should now be compressed into, so that some time is left over for spreading out the sync calls. If we're willing to say "writes finish in the first 1/2 of the target, syncs execute in the second 1/2", I could implement that here. Maybe that ratio needs to be another tunable. I'm still thinking about that part, and it's certainly open to community debate. The thing that complicates the design is that executing an individual sync may take a considerable period of time. That's much more likely than for an individual write, as the current spread checkpoint issues, because writes are usually just absorbed by the OS cache. In the spread sync case, it's easy for one slow sync to turn the rest into ones that fire in quick succession, to make up for lost time.
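
To show the sort of scheduling I have in mind, here's a standalone sketch; none of this code is in the patch, and the function and parameter names are invented for illustration. The idea is to give each remaining sync an equal slice of whatever is left of the sync portion of the target. A sync that runs long eats its own slice, which produces exactly the "fire in quick succession" catch-up behavior described above:

/*
 * Sketch only:  how long to pause after the current sync, given how many
 * syncs have finished, how many there are in total, how many seconds the
 * sync phase was budgeted (derived from checkpoint_completion_target),
 * and how many seconds of that budget have already elapsed.
 */
static int
next_sync_pause_secs(int syncs_done, int syncs_total,
					 int sync_budget_secs, int elapsed_secs)
{
	int		remaining_syncs = syncs_total - syncs_done;
	int		remaining_secs = sync_budget_secs - elapsed_secs;

	/* Behind schedule, or nothing left to do:  don't pause at all */
	if (remaining_syncs <= 0 || remaining_secs <= 0)
		return 0;

	/* Spread what's left of the budget evenly over the remaining syncs */
	return remaining_secs / remaining_syncs;
}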

There's some history behind this design that impacts review. Circa 8.3 development in 2007, I experimented with putting some delay between each of the fsync calls the background writer executes during a checkpoint. It didn't help smooth things out at all at the time. It turns out that's mainly because all my tests were on Linux using ext3. On that filesystem, fsync is not very granular: it's quite likely to push out data you haven't asked to sync yet, which means one giant sync is almost impossible to avoid no matter how you space the fsync calls. If you try to review this on ext3, I expect you'll find a big spike early in each checkpoint (where it flushes just about everything out) and then quick response for the later files involved.

The system this patch was originally written to help was running XFS. There, I've confirmed that problem doesn't exist: individual syncs only seem to push out the data related to that one file. The same should be true on ext4, but I haven't tested that myself. I'm not sure yet how granular the fsync calls are on Solaris, FreeBSD, Darwin, etc. Note that it's still possible to get hung up on one sync call for a while, even on XFS. The worst case seems to be if you've created a new 1GB database table chunk and fully populated it since the last checkpoint, on a system that has just cached the whole thing so far.

One change that turned out to be necessary rather than optional--in order to get good performance from the system I was tuning--was to make regular background writer activity, including fsync absorb checks, happen during these sync pauses. The existing code ran the checkpoint sync work in a pretty tight loop, which, as I alluded to in an earlier patch today, can lead to the backends competing with the background writer to get their sync calls executed. This squashes that problem, provided the background writer is set up properly.

What does properly mean? Well, it can't do that cleanup while it's sleeping. This whole area was refactored. The current sync absorb code uses the constant WRITES_PER_ABSORB to decide when to act. This new version replaces that hard-coded value with something that scales to the system size. It now puts off doing that work until the number of pending absorb requests has reached 10% of the number it's possible to store (BgWriterShmem->max_requests, which is set to the size of shared_buffers in 8K pages, AKA NBuffers). This may actually postpone the work for too long on systems with large shared_buffers settings; that's one area I'm still investigating.
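
To put rough numbers on that threshold (mine, not anything computed by the patch): with shared_buffers=256MB, NBuffers = 256MB / 8KB = 32768, so max_requests is 32768 and absorption kicks in once 32768 / 10 = 3276 requests are queued. At shared_buffers=8GB the queue holds 1048576 entries and the threshold becomes 104857 pending requests, which is the sort of figure that makes me wonder whether the work gets postponed too long on big systems.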

As for concerns about this 10% setting not doing enough work--which is something I do see--you can now increase how often absorbing happens by decreasing bgwriter_delay, which brings other benefits too. For example, if you run the fsync-stress-v2.sh script I included with the last patch I sent, you'll discover that the spread sync version of the server leaves just as many unabsorbed writes behind as the old code did. Those happen during the periods when the background writer is sleeping. They drop as you decrease the delay; here's a table showing some values I tested here, with all three patches installed:

bgwriter_delay    buffers_backend_sync
200 ms            90
50 ms             28
25 ms             3

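If you want to try that on your own system, it's just the usual postgresql.conf change followed by a reload; the 25 ms figure above is close to the GUC's lower limit of 10 ms:

bgwriter_delay = 25ms		# default is 200ms
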
There's a bunch of performance-related review work that needs to be done here, in addition to the usual code review for the patch. My hope is that I can get enough of that done on public hardware to validate this does what it's supposed to, so that a later version of this patch can be considered for the next CommitFest. It's still a little more raw than I'd like, but the idea has been tested enough here that I believe it's fundamentally sound and valuable.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us


diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 43a149e..0ce8e2b 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct
 
 static BgWriterShmemStruct *BgWriterShmem;
 
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB		1000
+/* Fraction of fsync absorb queue that needs to be filled before acting */
+#define ABSORB_ACTION_DIVISOR	10
 
 /*
  * GUC parameters
@@ -382,7 +382,7 @@ BackgroundWriterMain(void)
 		/*
 		 * Process any requests or signals received recently.
 		 */
-		AbsorbFsyncRequests();
+		AbsorbFsyncRequests(false);
 
 		if (got_SIGHUP)
 		{
@@ -636,7 +636,7 @@ BgWriterNap(void)
 		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
 			break;
 		pg_usleep(1000000L);
-		AbsorbFsyncRequests();
+		AbsorbFsyncRequests(true);
 		udelay -= 1000000L;
 	}
 
@@ -684,8 +684,6 @@ ImmediateCheckpointRequested(void)
 void
 CheckpointWriteDelay(int flags, double progress)
 {
-	static int	absorb_counter = WRITES_PER_ABSORB;
-
 	/* Do nothing if checkpoint is being executed by non-bgwriter process */
 	if (!am_bg_writer)
 		return;
@@ -705,22 +703,65 @@ CheckpointWriteDelay(int flags, double progress)
 			ProcessConfigFile(PGC_SIGHUP);
 		}
 
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
+		AbsorbFsyncRequests(false);
 
 		BgBufferSync();
 		CheckArchiveTimeout();
 		BgWriterNap();
 	}
-	else if (--absorb_counter <= 0)
+	else
 	{
 		/*
-		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-		 * operations even when we don't sleep, to prevent overflow of the
-		 * fsync request queue.
+		 * Check for overflow of the fsync request queue.
 		 */
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
+		AbsorbFsyncRequests(false);
+	}
+}
+
+/*
+ * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+ *
+ * This function is called after each file sync performed by mdsync().
+ * It is responsible for keeping the bgwriter's normal activities in
+ * progress during a long checkpoint.
+ */
+void
+CheckpointSyncDelay(void)
+{
+	pg_time_t	now;
+	pg_time_t	sync_start_time;
+	int			sync_delay_secs;
+
+	/*
+	 * Delay after each sync, in seconds.  This could be a parameter.  But
+	 * since ideally this will be auto-tuning in the near future, not
+	 * assigning it a GUC setting yet.
+	 */
+#define EXTRA_SYNC_DELAY	3
+
+	/* Do nothing if checkpoint is being executed by non-bgwriter process */
+	if (!am_bg_writer)
+		return;
+
+	sync_start_time = (pg_time_t) time(NULL);
+
+	/*
+	 * Perform the usual bgwriter duties.
+	 */
+	for (;;)
+	{
+		AbsorbFsyncRequests(false);
+		BgBufferSync();
+		CheckArchiveTimeout();
+		BgWriterNap();
+
+		/*
+		 * Are we there yet?
+		 */
+		now = (pg_time_t) time(NULL);
+		sync_delay_secs = now - sync_start_time;
+		if (sync_delay_secs >= EXTRA_SYNC_DELAY)
+			break;
 	}
 }
 
@@ -1116,16 +1157,41 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  * non-bgwriter processes, do nothing if not bgwriter.
  */
 void
-AbsorbFsyncRequests(void)
+AbsorbFsyncRequests(bool force)
 {
 	BgWriterRequest *requests = NULL;
 	BgWriterRequest *request;
 	int			n;
 
+	/*
+	 * ABSORB_ACTION_DIVISOR determines when absorption action needs to be
+	 * taken here:  the size of the request queue is divided by it, and
+	 * nothing is done until that many requests are pending.  The default
+	 * aims to empty the queue whenever 1 / 10 = 10% of it is full.  If
+	 * that isn't good enough, you probably need to lower bgwriter_delay,
+	 * rather than presume this needs to be a tunable you can decrease.
+	 */
+
 	if (!am_bg_writer)
 		return;
 
 	/*
+	 * If the queue isn't very large, don't worry about absorbing yet.
+	 * Access integer counter without lock, to avoid queuing.
+	 */
+	if (!force && BgWriterShmem->num_requests <
+			(BgWriterShmem->max_requests / ABSORB_ACTION_DIVISOR))
+	{
+		if (BgWriterShmem->num_requests > 0)
+			elog(DEBUG1, "Absorb queue: %d fsync requests, not processing",
+				BgWriterShmem->num_requests);
+		return;
+	}
+
+	elog(DEBUG1, "Absorb queue: %d fsync requests, processing",
+		BgWriterShmem->num_requests);
+
+	/*
 	 * We have to PANIC if we fail to absorb all the pending requests (eg,
 	 * because our hashtable runs out of memory).  This is because the system
 	 * cannot run safely if we are unable to fsync what we have been told to
@@ -1167,4 +1233,9 @@ AbsorbFsyncRequests(void)
 		pfree(requests);
 
 	END_CRIT_SECTION();
+
+	/*
+	 * Send off activity statistics to the stats collector
+	 */
+	pgstat_send_bgwriter();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 7140b94..57066c4 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -36,9 +36,6 @@
  */
 #define DEBUG_FSYNC	1
 
-/* interval for calling AbsorbFsyncRequests in mdsync */
-#define FSYNCS_PER_ABSORB		10
-
 /* special values for the segno arg to RememberFsyncRequest */
 #define FORGET_RELATION_FSYNC	(InvalidBlockNumber)
 #define FORGET_DATABASE_FSYNC	(InvalidBlockNumber-1)
@@ -931,7 +928,6 @@ mdsync(void)
 
 	HASH_SEQ_STATUS hstat;
 	PendingOperationEntry *entry;
-	int			absorb_counter;
 
 #ifdef DEBUG_FSYNC
 	/* Statistics on sync times */
@@ -958,7 +954,7 @@ mdsync(void)
 	 * queued an fsync request before clearing the buffer's dirtybit, so we
 	 * are safe as long as we do an Absorb after completing BufferSync().
 	 */
-	AbsorbFsyncRequests();
+	AbsorbFsyncRequests(true);
 
 	/*
 	 * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
@@ -1001,7 +997,6 @@ mdsync(void)
 	mdsync_in_progress = true;
 
 	/* Now scan the hashtable for fsync requests to process */
-	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
@@ -1026,17 +1021,9 @@ mdsync(void)
 			int			failures;
 
 			/*
-			 * If in bgwriter, we want to absorb pending requests every so
-			 * often to prevent overflow of the fsync request queue.  It is
-			 * unspecified whether newly-added entries will be visited by
-			 * hash_seq_search, but we don't care since we don't need to
-			 * process them anyway.
+			 * If in bgwriter, perform normal duties.
 			 */
-			if (--absorb_counter <= 0)
-			{
-				AbsorbFsyncRequests();
-				absorb_counter = FSYNCS_PER_ABSORB;
-			}
+			CheckpointSyncDelay();
 
 			/*
 			 * The fsync table could contain requests to fsync segments that
@@ -1131,10 +1118,9 @@ mdsync(void)
 				pfree(path);
 
 				/*
-				 * Absorb incoming requests and check to see if canceled.
+				 * If in bgwriter, perform normal duties.
 				 */
-				AbsorbFsyncRequests();
-				absorb_counter = FSYNCS_PER_ABSORB;		/* might as well... */
+				CheckpointSyncDelay();
 
 				if (entry->canceled)
 					break;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index e251da6..4939604 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -26,10 +26,11 @@ extern void BackgroundWriterMain(void);
 
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointSyncDelay(void);
 
 extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
 					BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern void AbsorbFsyncRequests(bool force);
 
 extern Size BgWriterShmemSize(void);
 extern void BgWriterShmemInit(void);
-- 