Last year at this point, I submitted an increasingly complicated checkpoint sync spreading feature. I wasn't able to prove any repeatable drop in sync latency from those patches. While that was going on, and continuing until recently, the production server that started all this never stopped having its sync latency problem. Data collection continued, and new patches were tried.

There was a really simple triage step Simon and I tried before getting into the complicated ones: just delay for a few seconds between every single sync call made during a checkpoint. That approach is still part of that server's patched PostgreSQL package set, and it still works better than anything more complicated we've tried so far. The recent split of background writer and checkpointer makes the whole thing even easier to do without it rippling out into unexpected consequences.

To tune this usefully, you need to know how many files a typical checkpoint syncs. That could be available without log scraping, using the "Publish checkpoint timing and sync files summary data to pg_stat_bgwriter" addition I just submitted. People who set this new checkpoint_sync_pause value too high can face checkpoints running over schedule, but you can measure how bad your exposure is with the new view information.
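As a rough illustration of that exposure, here is a minimal sketch of the arithmetic, assuming roughly one pause per synced file. The helper names are hypothetical and not part of the patch. The tick length is left as a parameter because the loop in the attached patch sleeps with pg_usleep(100000L), that is 100 ms, per unit of the setting, while the sample config comment describes the setting as seconds.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of the patch: estimate how many seconds
 * the pauses add to the sync phase of one checkpoint, assuming roughly
 * one pause per synced file.
 *
 *   sync_files   files synced by a typical checkpoint
 *   pause_ticks  value of checkpoint_sync_pause
 *   tick_secs    length of one tick in seconds (assumption: 0.1 to match
 *                the pg_usleep(100000L) call, 1.0 if read as seconds)
 */
double
sync_pause_overhead_secs(int sync_files, int pause_ticks, double tick_secs)
{
	assert(sync_files >= 0 && pause_ticks >= 0 && tick_secs >= 0.0);
	return (double) sync_files * pause_ticks * tick_secs;
}

/*
 * Hypothetical helper: the same overhead expressed as a fraction of
 * checkpoint_timeout.  Values approaching 1.0 mean the pauses alone
 * would consume the whole checkpoint interval.
 */
double
sync_pause_share_of_timeout(int sync_files, int pause_ticks,
							double tick_secs, int checkpoint_timeout_secs)
{
	assert(checkpoint_timeout_secs > 0);
	return sync_pause_overhead_secs(sync_files, pause_ticks, tick_secs) /
		checkpoint_timeout_secs;
}
```

With 36 synced files and a setting of 3, this gives about 10.8 s of added pause at 100 ms per tick, or 108 s (36% of a 300 s checkpoint_timeout) if each tick were a full second.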

I owe the community a lot of data to prove this is useful before I'd expect it to be taken seriously. I was planning to leave this whole area alone until 9.3. But since recent submissions may pull me back into trying various ways of rearranging the write path for 9.2, I wanted to have my own miniature horse in that race. It works simply:

...
2012-01-16 02:39:01.184 EST [25052]: DEBUG: checkpoint sync: number=34 file=base/16385/11766 time=0.006 msec
2012-01-16 02:39:01.184 EST [25052]: DEBUG: checkpoint sync delay: seconds left=3
2012-01-16 02:39:01.284 EST [25052]: DEBUG: checkpoint sync delay: seconds left=2
2012-01-16 02:39:01.385 EST [25052]: DEBUG: checkpoint sync delay: seconds left=1
2012-01-16 02:39:01.860 EST [25052]: DEBUG: checkpoint sync: number=35 file=global/12007 time=375.710 msec
2012-01-16 02:39:01.860 EST [25052]: DEBUG: checkpoint sync delay: seconds left=3
2012-01-16 02:39:01.961 EST [25052]: DEBUG: checkpoint sync delay: seconds left=2
2012-01-16 02:39:02.061 EST [25052]: DEBUG: checkpoint sync delay: seconds left=1
2012-01-16 02:39:02.161 EST [25052]: DEBUG: checkpoint sync: number=36 file=base/16385/11754 time=0.008 msec
2012-01-16 02:39:02.555 EST [25052]: LOG: checkpoint complete: wrote 2586 buffers (63.1%); 1 transaction log file(s) added, 0 removed, 0 recycled; write=2.422 s, sync=13.282 s, total=16.123 s; sync files=36, longest=1.085 s, average=0.040 s
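For anyone still collecting these numbers by log scraping, a minimal sketch of pulling the sync figures out of a "checkpoint complete" line might look like the following. The struct and function names are hypothetical, and the parsing assumes the message layout shown in the log excerpt above; other versions may format the line differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of interest. */
typedef struct
{
	double		sync_seconds;		/* "sync=13.282 s" */
	int			sync_files;			/* "sync files=36" */
	double		longest_seconds;	/* "longest=1.085 s" */
} CheckpointSyncSummary;

/*
 * Hypothetical helper: returns 1 and fills *out if the line is a
 * "checkpoint complete" message with the expected fields, else 0.
 */
int
parse_checkpoint_complete(const char *line, CheckpointSyncSummary *out)
{
	const char *p;

	assert(line != NULL && out != NULL);

	if (strstr(line, "checkpoint complete:") == NULL)
		return 0;

	/* Total time spent in the sync phase */
	p = strstr(line, " sync=");
	if (p == NULL || sscanf(p, " sync=%lf", &out->sync_seconds) != 1)
		return 0;

	/* Number of files synced and the slowest single sync */
	p = strstr(line, "sync files=");
	if (p == NULL ||
		sscanf(p, "sync files=%d, longest=%lf",
			   &out->sync_files, &out->longest_seconds) != 2)
		return 0;

	return 1;
}
```

Feeding each log line through this and keeping the sync_files values gives the per-checkpoint file counts needed to judge a checkpoint_sync_pause setting.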

No docs yet; we really need a better guide to tuning checkpoints as they exist now before there's a place to attach a discussion of this.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0b792d2..54da69a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -142,6 +142,7 @@ static BgWriterShmemStruct *BgWriterShmem;
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
 double		CheckPointCompletionTarget = 0.5;
+int			CheckPointSyncPause = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -157,6 +158,8 @@ static bool am_checkpointer = false;
 
 static bool ckpt_active = false;
 
+static int checkpoint_flags = 0;
+
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
 static XLogRecPtr ckpt_start_recptr;
@@ -643,6 +646,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!am_checkpointer)
 		return;
 
+ 	/* Cache this value for a later sync delay */
+ 	checkpoint_flags=flags;
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind
 	 * schedule, in which case we just try to catch up as quickly as possible.
@@ -685,6 +691,72 @@ CheckpointWriteDelay(int flags, double progress)
 }
 
 /*
+ * CheckpointSyncDelay -- control rate of checkpoint sync stage
+ *
+ * This function is called after each relation sync performed by mdsync().
+ * It delays for a fixed period while still making sure to absorb
+ * incoming fsync requests.
+ * 
+ * Due to where this is called with the md layer, it's not practical
+ * for it to be directly passed the checkpoint flags.  It's expected
+ * they will have been stashed within the checkpointer's local state
+ * by a call to CheckpointWriteDelay.
+ *
+ */
+void
+CheckpointSyncDelay(void)
+{
+	static int	absorb_counter = WRITES_PER_ABSORB;
+ 	int			sync_delay_secs = CheckPointSyncPause;
+ 
+	/* Do nothing if checkpoint is being executed by non-checkpointer process */
+	if (!am_checkpointer)
+		return;
+
+	/*
+	 * Perform the usual duties and take a nap if there's time left
+	 */
+	while (!(checkpoint_flags & CHECKPOINT_IMMEDIATE) &&
+		!shutdown_requested &&
+		!ImmediateCheckpointRequested() &&
+		(sync_delay_secs > 0))
+	{
+ 		elog(DEBUG2,"checkpoint sync delay: seconds left=%d",sync_delay_secs);
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+			/* update global shmem state for sync rep */
+			SyncRepUpdateSyncStandbysDefined();
+		}
+
+		AbsorbFsyncRequests();
+		absorb_counter = WRITES_PER_ABSORB;
+
+		CheckArchiveTimeout();
+
+		/*
+		 * Checkpoint sleep used to be connected to bgwriter_delay at 200ms.
+		 * That resulted in more frequent wakeups if not much work to do.
+		 * Checkpointer and bgwriter are no longer related so take the Big Sleep.
+		 */
+		pg_usleep(100000L);
+		sync_delay_secs--;
+	}
+
+	if (--absorb_counter <= 0)
+	{
+		/*
+		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
+		 * operations even when we don't sleep, to prevent overflow of the
+		 * fsync request queue.
+		 */
+		AbsorbFsyncRequests();
+		absorb_counter = WRITES_PER_ABSORB;
+	}
+}
+
+/*
  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
  *		 in time?
  *
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index bfc9f06..dd63535 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1047,6 +1047,8 @@ mdsync(void)
 				absorb_counter = FSYNCS_PER_ABSORB;
 			}
 
+			CheckpointSyncDelay();
+
 			/*
 			 * The fsync table could contain requests to fsync segments that
 			 * have been deleted (unlinked) by the time we get to them. Rather
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5c910dd..6c856c1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1975,6 +1975,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"checkpoint_sync_pause", PGC_SIGHUP, WAL_CHECKPOINTS,
+			gettext_noop("Inserts a delay after each checkpoint file sync operation"),
+			NULL
+		},
+		&CheckPointSyncPause,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 315db46..5fc6476 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -180,6 +180,7 @@
 #checkpoint_timeout = 5min		# range 30s-1h
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_warning = 30s		# 0 disables
+#checkpoint_sync_pause = 0      # in seconds
 
 # - Archiving -
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1ddf4bf..1736ba6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,7 @@ extern bool reachedConsistency;
 
 /* these variables are GUC parameters related to XLOG */
 extern int	CheckPointSegments;
+extern int	CheckPointSyncPause;
 extern int	wal_keep_segments;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 6cc4b62..4d57b4a 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -21,12 +21,14 @@ extern int	BgWriterDelay;
 extern int	CheckPointTimeout;
 extern int	CheckPointWarning;
 extern double CheckPointCompletionTarget;
+extern int	CheckPointSyncPause;
 
 extern void BackgroundWriterMain(void);
 extern void CheckpointerMain(void);
 
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointSyncDelay(void);
 
 extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
 					BlockNumber segno);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
