About this time last year, I submitted an increasingly complicated
series of checkpoint sync spreading patches. I wasn't able to prove any
repeatable drop in sync latency from them. Meanwhile, and continuing
until recently, the production server whose sync latency problems
started all this kept having them. Data collection continued, and new
patches were tried.

Before getting into the complicated patches, Simon and I tried a really
simple triage step: just delay for a few seconds after every single
sync call made during a checkpoint. That approach is still part of that
server's patched PostgreSQL package set, and it still works better than
anything more complicated we've tried so far. The recent split of the
background writer and checkpointer makes the whole thing even easier to
do without rippling out into unexpected consequences.

To tune this usefully, you need to know how many files a typical
checkpoint syncs. That could be available without log scraping via the
"Publish checkpoint timing and sync files summary data to
pg_stat_bgwriter" addition I just submitted. Setting the new
checkpoint_sync_pause value too high can make checkpoints run over
schedule, since the pause is taken once per synced file and so the added
time grows with the file count. You can measure how bad your exposure is
with the new view information.
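
As a rough illustration of that exposure, here is a minimal sketch of the arithmetic. It is not part of the patch: the function names are hypothetical, pause_secs is assumed to be the real wall-clock pause taken per file, and the "slack" formula assumes the write phase uses checkpoint_completion_target of a purely timeout-driven checkpoint interval.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of the patch: time added to the sync
 * phase if the checkpointer pauses once after every file it syncs.
 */
double
sync_pause_overhead_secs(int files_synced, double pause_secs)
{
    assert(files_synced >= 0 && pause_secs >= 0.0);
    return files_synced * pause_secs;
}

/*
 * Hypothetical check: does the added pause time fit into the part of the
 * checkpoint interval that the write phase is not expected to use?
 * Assumption: slack = checkpoint_timeout * (1 - completion_target).
 * Checkpoints triggered early by WAL volume would have less slack.
 */
int
sync_pause_fits(int files_synced, double pause_secs,
                double checkpoint_timeout_secs, double completion_target)
{
    double slack;

    assert(completion_target >= 0.0 && completion_target <= 1.0);
    slack = checkpoint_timeout_secs * (1.0 - completion_target);
    return sync_pause_overhead_secs(files_synced, pause_secs) <= slack;
}
```

For example, sync_pause_overhead_secs(36, 3.0) gives 108 seconds of added pause for a 36-file checkpoint, which fits under the default 5 minute timeout at a 0.5 completion target, while 100 files at the same pause would not.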

I owe the community a lot of data to prove this is useful before I'd
expect it to be taken seriously. I was planning to leave this whole
area alone until 9.3. But since recent submissions may pull me back
into trying various ways of rearranging the write path for 9.2, I wanted
to have my own miniature horse in that race. It works simply:
...
2012-01-16 02:39:01.184 EST [25052]: DEBUG: checkpoint sync: number=34 file=base/16385/11766 time=0.006 msec
2012-01-16 02:39:01.184 EST [25052]: DEBUG: checkpoint sync delay: seconds left=3
2012-01-16 02:39:01.284 EST [25052]: DEBUG: checkpoint sync delay: seconds left=2
2012-01-16 02:39:01.385 EST [25052]: DEBUG: checkpoint sync delay: seconds left=1
2012-01-16 02:39:01.860 EST [25052]: DEBUG: checkpoint sync: number=35 file=global/12007 time=375.710 msec
2012-01-16 02:39:01.860 EST [25052]: DEBUG: checkpoint sync delay: seconds left=3
2012-01-16 02:39:01.961 EST [25052]: DEBUG: checkpoint sync delay: seconds left=2
2012-01-16 02:39:02.061 EST [25052]: DEBUG: checkpoint sync delay: seconds left=1
2012-01-16 02:39:02.161 EST [25052]: DEBUG: checkpoint sync: number=36 file=base/16385/11754 time=0.008 msec
2012-01-16 02:39:02.555 EST [25052]: LOG: checkpoint complete: wrote 2586 buffers (63.1%); 1 transaction log file(s) added, 0 removed, 0 recycled; write=2.422 s, sync=13.282 s, total=16.123 s; sync files=36, longest=1.085 s, average=0.040 s
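
The trace above shows three "seconds left" messages roughly 100 ms apart after each file sync, which matches a loop that decrements its counter once per pg_usleep(100000L) call. Below is a minimal stand-alone model of that pause, with the sleep replaced by a running total so it finishes instantly. The names model_sync_pause_usec and TICK_USEC are hypothetical, and the model leaves out the fsync request absorption, SIGHUP handling, and shutdown checks that the real loop performs.

```c
#include <assert.h>

/* Assumed from the posted patch: each loop iteration sleeps 100 ms. */
#define TICK_USEC 100000L

/*
 * Hypothetical model of the pause taken after one file sync.
 * Returns the microseconds the loop would have slept.  An immediate
 * checkpoint request skips the pause entirely.
 */
long
model_sync_pause_usec(int pause_setting, int immediate_requested)
{
    long slept = 0;
    int  left = pause_setting;

    assert(pause_setting >= 0);

    while (!immediate_requested && left > 0)
    {
        slept += TICK_USEC;     /* stands in for one pg_usleep() call */
        left--;
    }
    return slept;
}
```

With a setting of 3, this model pauses 300000 microseconds per file, so across the 36 files in the sample checkpoint that is about 10.8 s of the reported 13.282 s sync time. Keep that in mind when comparing the "seconds" wording against the timestamps.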

No docs yet; we really need a better guide to tuning checkpoints as they
exist now before there's a place to attach a discussion of this.
--
Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0b792d2..54da69a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -142,6 +142,7 @@ static BgWriterShmemStruct *BgWriterShmem;
int CheckPointTimeout = 300;
int CheckPointWarning = 30;
double CheckPointCompletionTarget = 0.5;
+int CheckPointSyncPause = 0;
/*
* Flags set by interrupt handlers for later service in the main loop.
@@ -157,6 +158,8 @@ static bool am_checkpointer = false;
static bool ckpt_active = false;
+static int checkpoint_flags = 0;
+
/* these values are valid when ckpt_active is true: */
static pg_time_t ckpt_start_time;
static XLogRecPtr ckpt_start_recptr;
@@ -643,6 +646,9 @@ CheckpointWriteDelay(int flags, double progress)
if (!am_checkpointer)
return;
+ /* Cache this value for a later sync delay */
+ checkpoint_flags=flags;
+
/*
* Perform the usual duties and take a nap, unless we're behind
* schedule, in which case we just try to catch up as quickly as possible.
@@ -685,6 +691,72 @@ CheckpointWriteDelay(int flags, double progress)
}
/*
+ * CheckpointSyncDelay -- control rate of checkpoint sync stage
+ *
+ * This function is called after each relation sync performed by mdsync().
+ * It delays for a fixed period while still making sure to absorb
+ * incoming fsync requests.
+ *
+ * Due to where this is called with the md layer, it's not practical
+ * for it to be directly passed the checkpoint flags. It's expected
+ * they will have been stashed within the checkpointer's local state
+ * by a call to CheckpointWriteDelay.
+ *
+ */
+void
+CheckpointSyncDelay(void)
+{
+ static int absorb_counter = WRITES_PER_ABSORB;
+ int sync_delay_secs = CheckPointSyncPause;
+
+ /* Do nothing if checkpoint is being executed by non-checkpointer process */
+ if (!am_checkpointer)
+ return;
+
+ /*
+ * Perform the usual duties and take a nap if there's time left
+ */
+ while (!(checkpoint_flags & CHECKPOINT_IMMEDIATE) &&
+ !shutdown_requested &&
+ !ImmediateCheckpointRequested() &&
+ (sync_delay_secs > 0))
+ {
+ elog(DEBUG2,"checkpoint sync delay: seconds left=%d",sync_delay_secs);
+ if (got_SIGHUP)
+ {
+ got_SIGHUP = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ /* update global shmem state for sync rep */
+ SyncRepUpdateSyncStandbysDefined();
+ }
+
+ AbsorbFsyncRequests();
+ absorb_counter = WRITES_PER_ABSORB;
+
+ CheckArchiveTimeout();
+
+ /*
+ * Checkpoint sleep used to be connected to bgwriter_delay at 200ms.
+ * That resulted in more frequent wakeups if not much work to do.
+ * Checkpointer and bgwriter are no longer related so take the Big Sleep.
+ */
+ pg_usleep(100000L);
+ sync_delay_secs--;
+ }
+
+ if (--absorb_counter <= 0)
+ {
+ /*
+ * Absorb pending fsync requests after each WRITES_PER_ABSORB write
+ * operations even when we don't sleep, to prevent overflow of the
+ * fsync request queue.
+ */
+ AbsorbFsyncRequests();
+ absorb_counter = WRITES_PER_ABSORB;
+ }
+}
+
+/*
* IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
* in time?
*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index bfc9f06..dd63535 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1047,6 +1047,8 @@ mdsync(void)
absorb_counter = FSYNCS_PER_ABSORB;
}
+ CheckpointSyncDelay();
+
/*
* The fsync table could contain requests to fsync segments that
* have been deleted (unlinked) by the time we get to them. Rather
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5c910dd..6c856c1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1975,6 +1975,16 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpoint_sync_pause", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Inserts a delay after each checkpoint file sync operation"),
+ NULL
+ },
+ &CheckPointSyncPause,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 315db46..5fc6476 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -180,6 +180,7 @@
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
+#checkpoint_sync_pause = 0 # in seconds
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1ddf4bf..1736ba6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,7 @@ extern bool reachedConsistency;
/* these variables are GUC parameters related to XLOG */
extern int CheckPointSegments;
+extern int CheckPointSyncPause;
extern int wal_keep_segments;
extern int XLOGbuffers;
extern int XLogArchiveTimeout;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 6cc4b62..4d57b4a 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -21,12 +21,14 @@ extern int BgWriterDelay;
extern int CheckPointTimeout;
extern int CheckPointWarning;
extern double CheckPointCompletionTarget;
+extern int CheckPointSyncPause;
extern void BackgroundWriterMain(void);
extern void CheckpointerMain(void);
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointSyncDelay(void);
extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
BlockNumber segno);
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)