On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <g...@2ndquadrant.com> wrote:
> I've attached an updated version of the initial sync spreading patch here,
> one that applies cleanly on top of HEAD and over top of the sync
> instrumentation patch too.  The conflict that made that hard before is gone
> now.

With the fsync queue compaction patch applied, I think most of this is
now not needed.  Attached please find an attempt to isolate the
portion that looks like it might still be useful.  The basic idea of
what remains here is to make the background writer still do its normal
stuff even when it's checkpointing.  In particular, with this patch
applied, PG will:

1. Absorb fsync requests a lot more often during the sync phase.
2. Still try to run the cleaning scan during the sync phase.
3. Pause for 3 seconds after every fsync.

I suspect that #1 is probably a good idea.  It seems pretty clear
based on your previous testing that the fsync compaction patch should
be sufficient to prevent us from hitting the wall, but if we're going
to any kind of nontrivial work here then cleaning the queue is a
sensible thing to do along the way, and there's little downside.

I also suspect #2 is a good idea.  The fact that we're checkpointing
doesn't mean that the system suddenly doesn't require clean buffers,
and the experimentation I've done recently (see: limiting hint bit
I/O) convinces me that it's pretty expensive from a performance
standpoint when backends have to start writing out their own buffers,
so continuing to do that work during the sync phase of a checkpoint,
just as we do during the write phase, seems pretty sensible.

I think something along the lines of #3 is probably a good idea, but
the current coding doesn't take checkpoint_completion_target into
account.  The underlying problem here is that it's at least somewhat
reasonable to assume that if we write() a whole bunch of blocks, each
write() will take approximately the same amount of time.  But this is
not true at all with respect to fsync() - they neither take the same
amount of time as each other, nor is there any fixed ratio between
write() time and fsync() time to go by.  So if we want the checkpoint
to finish in, say, 20 minutes, we can't know whether the write phase
needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.

One idea I have is to try to get some of the fsyncs out of the queue
at times other than end-of-checkpoint.  Even if this resulted in some
modest increase in the total number of fsync() calls, it might improve
performance by causing data to be flushed to disk in smaller chunks.
For example, suppose we kept an LRU list of pending fsync requests -
every time we remember an fsync request for a particular relation, we
move it to the head (hot end) of the LRU.  And periodically we pull
the tail entry off the list and fsync it - say, after
checkpoint_timeout / (# of items in the list).  That way, when we
arrive at the end of the checkpoint and starting syncing everything,
the syncs hopefully complete more quickly because we've already forced
a bunch of the data down to disk.  That algorithm may well be too
stupid or just not work in real life, but perhaps there's some
variation that would be sensible.  The point is: instead of or in
addition to trying to spread out the sync phase, we might want to
investigate whether it's possible to reduce its size.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4df69c2..36da084 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -726,6 +726,53 @@ CheckpointWriteDelay(int flags, double progress)
 }
 
 /*
+ * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+ *
+ * This function is called after each file sync performed by mdsync().
+ * It is responsible for keeping the bgwriter's normal activities in
+ * progress during a long checkpoint.
+ */
+void
+CheckpointSyncDelay(void)
+{
+	pg_time_t	now;
+ 	pg_time_t	sync_start_time;
+ 	int			sync_delay_secs;
+ 
+ 	/*
+ 	 * Delay after each sync, in seconds.  This could be a parameter.  But
+ 	 * since ideally this will be auto-tuning in the near future, not
+	 * assigning it a GUC setting yet.
+ 	 */
+#define EXTRA_SYNC_DELAY	3
+
+	/* Do nothing if checkpoint is being executed by non-bgwriter process */
+	if (!am_bg_writer)
+		return;
+
+ 	sync_start_time = (pg_time_t) time(NULL);
+
+	/*
+	 * Perform the usual bgwriter duties.
+	 */
+ 	for (;;)
+ 	{
+		AbsorbFsyncRequests();
+ 		BgBufferSync();
+ 		CheckArchiveTimeout();
+ 		BgWriterNap();
+ 
+ 		/*
+ 		 * Are we there yet?
+ 		 */
+ 		now = (pg_time_t) time(NULL);
+ 		sync_delay_secs = now - sync_start_time;
+ 		if (sync_delay_secs >= EXTRA_SYNC_DELAY)
+			break;
+	}
+}
+
+/*
  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
  *		 in time?
  *
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 9d585b6..4de0243 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,9 +31,6 @@
 #include "pg_trace.h"
 
 
-/* interval for calling AbsorbFsyncRequests in mdsync */
-#define FSYNCS_PER_ABSORB		10
-
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
@@ -932,7 +929,6 @@ mdsync(void)
 
 	HASH_SEQ_STATUS hstat;
 	PendingOperationEntry *entry;
-	int			absorb_counter;
 
 	/* Statistics on sync times */
 	int			processed = 0;
@@ -1002,7 +998,6 @@ mdsync(void)
 	mdsync_in_progress = true;
 
 	/* Now scan the hashtable for fsync requests to process */
-	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
@@ -1027,17 +1022,9 @@ mdsync(void)
 			int			failures;
 
 			/*
-			 * If in bgwriter, we want to absorb pending requests every so
-			 * often to prevent overflow of the fsync request queue.  It is
-			 * unspecified whether newly-added entries will be visited by
-			 * hash_seq_search, but we don't care since we don't need to
-			 * process them anyway.
+			 * If in bgwriter, perform normal duties.
 			 */
-			if (--absorb_counter <= 0)
-			{
-				AbsorbFsyncRequests();
-				absorb_counter = FSYNCS_PER_ABSORB;
-			}
+			CheckpointSyncDelay();
 
 			/*
 			 * The fsync table could contain requests to fsync segments that
@@ -1131,10 +1118,9 @@ mdsync(void)
 				pfree(path);
 
 				/*
-				 * Absorb incoming requests and check to see if canceled.
+				 * If in bgwriter, perform normal duties.
 				 */
-				AbsorbFsyncRequests();
-				absorb_counter = FSYNCS_PER_ABSORB;		/* might as well... */
+				CheckpointSyncDelay();
 
 				if (entry->canceled)
 					break;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index eaf2206..6e285cd 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -26,6 +26,7 @@ extern void BackgroundWriterMain(void);
 
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointSyncDelay(void);
 
 extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
 					BlockNumber segno);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to