On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <g...@2ndquadrant.com> wrote: > I've attached an updated version of the initial sync spreading patch here, > one that applies cleanly on top of HEAD and over top of the sync > instrumentation patch too. The conflict that made that hard before is gone > now.
With the fsync queue compaction patch applied, I think most of this is now not needed. Attached please find an attempt to isolate the portion that looks like it might still be useful. The basic idea of what remains here is to make the background writer still do its normal stuff even when it's checkpointing. In particular, with this patch applied, PG will: 1. Absorb fsync requests a lot more often during the sync phase. 2. Still try to run the cleaning scan during the sync phase. 3. Pause for 3 seconds after every fsync. I suspect that #1 is probably a good idea. It seems pretty clear based on your previous testing that the fsync compaction patch should be sufficient to prevent us from hitting the wall, but if we're going to any kind of nontrivial work here then cleaning the queue is a sensible thing to do along the way, and there's little downside. I also suspect #2 is a good idea. The fact that we're checkpointing doesn't mean that the system suddenly doesn't require clean buffers, and the experimentation I've done recently (see: limiting hint bit I/O) convinces me that it's pretty expensive from a performance standpoint when backends have to start writing out their own buffers, so continuing to do that work during the sync phase of a checkpoint, just as we do during the write phase, seems pretty sensible. I think something along the lines of #3 is probably a good idea, but the current coding doesn't take checkpoint_completion_target into account. The underlying problem here is that it's at least somewhat reasonable to assume that if we write() a whole bunch of blocks, each write() will take approximately the same amount of time. But this is not true at all with respect to fsync() - they neither take the same amount of time as each other, nor is there any fixed ratio between write() time and fsync() time to go by. So if we want the checkpoint to finish in, say, 20 minutes, we can't know whether the write phase needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59. One idea I have is to try to get some of the fsyncs out of the queue at times other than end-of-checkpoint. Even if this resulted in some modest increase in the total number of fsync() calls, it might improve performance by causing data to be flushed to disk in smaller chunks. For example, suppose we kept an LRU list of pending fsync requests - every time we remember an fsync request for a particular relation, we move it to the head (hot end) of the LRU. And periodically we pull the tail entry off the list and fsync it - say, after checkpoint_timeout / (# of items in the list). That way, when we arrive at the end of the checkpoint and starting syncing everything, the syncs hopefully complete more quickly because we've already forced a bunch of the data down to disk. That algorithm may well be too stupid or just not work in real life, but perhaps there's some variation that would be sensible. The point is: instead of or in addition to trying to spread out the sync phase, we might want to investigate whether it's possible to reduce its size. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index 4df69c2..36da084 100644 --- a/src/backend/postmaster/bgwriter.c +++ b/src/backend/postmaster/bgwriter.c @@ -726,6 +726,53 @@ CheckpointWriteDelay(int flags, double progress) } /* + * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint + * + * This function is called after each file sync performed by mdsync(). + * It is responsible for keeping the bgwriter's normal activities in + * progress during a long checkpoint. + */ +void +CheckpointSyncDelay(void) +{ + pg_time_t now; + pg_time_t sync_start_time; + int sync_delay_secs; + + /* + * Delay after each sync, in seconds. This could be a parameter. But + * since ideally this will be auto-tuning in the near future, not + * assigning it a GUC setting yet. + */ +#define EXTRA_SYNC_DELAY 3 + + /* Do nothing if checkpoint is being executed by non-bgwriter process */ + if (!am_bg_writer) + return; + + sync_start_time = (pg_time_t) time(NULL); + + /* + * Perform the usual bgwriter duties. + */ + for (;;) + { + AbsorbFsyncRequests(); + BgBufferSync(); + CheckArchiveTimeout(); + BgWriterNap(); + + /* + * Are we there yet? + */ + now = (pg_time_t) time(NULL); + sync_delay_secs = now - sync_start_time; + if (sync_delay_secs >= EXTRA_SYNC_DELAY) + break; + } +} + +/* * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint * in time? * diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 9d585b6..4de0243 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -31,9 +31,6 @@ #include "pg_trace.h" -/* interval for calling AbsorbFsyncRequests in mdsync */ -#define FSYNCS_PER_ABSORB 10 - /* * Special values for the segno arg to RememberFsyncRequest. * @@ -932,7 +929,6 @@ mdsync(void) HASH_SEQ_STATUS hstat; PendingOperationEntry *entry; - int absorb_counter; /* Statistics on sync times */ int processed = 0; @@ -1002,7 +998,6 @@ mdsync(void) mdsync_in_progress = true; /* Now scan the hashtable for fsync requests to process */ - absorb_counter = FSYNCS_PER_ABSORB; hash_seq_init(&hstat, pendingOpsTable); while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) { @@ -1027,17 +1022,9 @@ mdsync(void) int failures; /* - * If in bgwriter, we want to absorb pending requests every so - * often to prevent overflow of the fsync request queue. It is - * unspecified whether newly-added entries will be visited by - * hash_seq_search, but we don't care since we don't need to - * process them anyway. + * If in bgwriter, perform normal duties. */ - if (--absorb_counter <= 0) - { - AbsorbFsyncRequests(); - absorb_counter = FSYNCS_PER_ABSORB; - } + CheckpointSyncDelay(); /* * The fsync table could contain requests to fsync segments that @@ -1131,10 +1118,9 @@ mdsync(void) pfree(path); /* - * Absorb incoming requests and check to see if canceled. + * If in bgwriter, perform normal duties. */ - AbsorbFsyncRequests(); - absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */ + CheckpointSyncDelay(); if (entry->canceled) break; diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h index eaf2206..6e285cd 100644 --- a/src/include/postmaster/bgwriter.h +++ b/src/include/postmaster/bgwriter.h @@ -26,6 +26,7 @@ extern void BackgroundWriterMain(void); extern void RequestCheckpoint(int flags); extern void CheckpointWriteDelay(int flags, double progress); +extern void CheckpointSyncDelay(void); extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum, BlockNumber segno);
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers