On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <g...@2ndquadrant.com> wrote:
> Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the
> time now has made me reconsider how important one potential bit of
> refactoring here would be.  I managed to catch one of the situations where
> really popular relations were being heavily updated in a way that was
> competing with the checkpoint on my test system (which I can happily share
> the logs of), with the instrumentation patch applied but not the spread
> sync one:
>
> LOG: checkpoint starting: xlog
> DEBUG: could not forward fsync request because request queue is full
> CONTEXT: writing block 7747 of relation base/16424/16442
> DEBUG: could not forward fsync request because request queue is full
> CONTEXT: writing block 42688 of relation base/16424/16437
> DEBUG: could not forward fsync request because request queue is full
> CONTEXT: writing block 9723 of relation base/16424/16442
> DEBUG: could not forward fsync request because request queue is full
> CONTEXT: writing block 58117 of relation base/16424/16437
> DEBUG: could not forward fsync request because request queue is full
> CONTEXT: writing block 165128 of relation base/16424/16437
> [330 of these total, all referring to the same two relations]
>
> DEBUG: checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.830000 msec
> DEBUG: checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
> DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
> DEBUG: checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
> DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
> DEBUG: checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
> DEBUG: checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
> DEBUG: checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec
> DEBUG: checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec
> DEBUG: checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
> DEBUG: checkpoint sync: number=11 file=base/16424/16425 time=0.000000 msec
> DEBUG: checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec
> DEBUG: checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec
> LOG: checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log
> file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s,
> total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s
>
> Note here how the checkpoint was hung on trying to get 16448_fsm written
> out, but the backends were issuing constant competing fsync calls to these
> other relations.  This is very similar to the production case this patch
> was written to address, which I hadn't been able to share a good example
> of yet.  That's essentially what it looks like, except with the contention
> going on for minutes instead of seconds.
>
> One of the ideas Simon and I had been considering at one point was adding
> some better de-duplication logic to the fsync absorb code, which I'm
> reminded by the pattern here might be helpful independently of other
> improvements.
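For concreteness, here is the de-duplication idea in miniature: a queued
request can be dropped whenever a later, identical request follows it.
This is a standalone toy, not the real code -- ToyRequest and compact_queue
are hypothetical names, a quadratic scan stands in for a hash table, and
the fields merely mimic the (relation, fork, segment) identity of a real
fsync request:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Toy stand-in for a queued fsync request; no padding between the fields
 * on common ABIs, so memcmp is a safe equality test here.
 */
typedef struct ToyRequest
{
	unsigned	rnode;			/* relation */
	int			forknum;		/* fork */
	unsigned	segno;			/* segment */
} ToyRequest;

/*
 * Drop every request that is followed by a later, identical one, sliding
 * the survivors toward the front; returns the new queue length.
 */
static int
compact_queue(ToyRequest *req, int nreq)
{
	bool		skip[64] = {false};
	int			i,
				j,
				kept = 0;

	for (i = 0; i < nreq; i++)
		for (j = i + 1; j < nreq; j++)
			if (memcmp(&req[i], &req[j], sizeof(ToyRequest)) == 0)
			{
				skip[i] = true;	/* a later duplicate exists */
				break;
			}

	for (i = 0; i < nreq; i++)
		if (!skip[i])
			req[kept++] = req[i];
	return kept;
}

int
main(void)
{
	ToyRequest	q[] = {{1, 0, 7}, {2, 0, 3}, {1, 0, 7}, {2, 0, 3}, {3, 0, 1}};
	int			n = compact_queue(q, 5);
	int			i;

	/* prints (1,0,7) (2,0,3) (3,0,1): five entries compacted to three */
	for (i = 0; i < n; i++)
		printf("(%u,%d,%u)\n", q[i].rnode, q[i].forknum, q[i].segno);
	return 0;
}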
Hopefully I'm not stepping on any toes here, but I thought this was an
awfully good idea and had a chance to take a look at how hard it would be
today while en route from point A to point B.  The answer turned out to be
"not very", so PFA a patch that seems to work.  I tested it by attaching
gdb to the background writer while running pgbench, and it eliminated the
backend fsyncs without even breaking a sweat.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4457df6..f6cd8dc 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -182,6 +182,7 @@ static void CheckArchiveTimeout(void);
 static void BgWriterNap(void);
 static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
+static bool CompactBgwriterRequestQueue(void);
 
 /* Signal handlers */
 
@@ -1090,10 +1091,20 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
 	/* Count all backend writes regardless of if they fit in the queue */
 	BgWriterShmem->num_backend_writes++;
 
+	/*
+	 * If the background writer isn't running or the request queue is full,
+	 * the backend will have to perform its own fsync.  But before forcing
+	 * that to happen, we can try to compact the background writer request
+	 * queue.
+	 */
 	if (BgWriterShmem->bgwriter_pid == 0 ||
-		BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
+		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
+		 && !CompactBgwriterRequestQueue()))
 	{
-		/* Also count the subset where backends have to do their own fsync */
+		/*
+		 * Count the subset of writes where backends have to do their own
+		 * fsync.
+		 */
 		BgWriterShmem->num_backend_fsync++;
 		LWLockRelease(BgWriterCommLock);
 		return false;
@@ -1107,6 +1118,101 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
 }
 
 /*
+ * CompactBgwriterRequestQueue
+ *		Remove duplicates from the request queue to avoid backend fsyncs.
+ *
+ * Trying to do this every time the queue is full could lose if there aren't
+ * any removable entries.  But that's not very likely: there's one queue
+ * entry per shared buffer.
+ */
+static bool
+CompactBgwriterRequestQueue(void)
+{
+	struct BgWriterSlotMapping {
+		BgWriterRequest request;
+		int			slot;
+	};
+
+	int			n,
+				preserve_count,
+				num_skipped = 0;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	bool	   *skip_slot;
+
+	/* must hold BgWriterCommLock in exclusive mode */
+	Assert(LWLockHeldByMe(BgWriterCommLock));
+
+	/* Initialize temporary hash table */
+	MemSet(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(BgWriterRequest);
+	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
+	ctl.hash = tag_hash;
+	htab = hash_create("CompactBgwriterRequestQueue",
+					   BgWriterShmem->num_requests,
+					   &ctl,
+					   HASH_ELEM | HASH_FUNCTION);
+
+	/* Initialize skip_slot array */
+	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
+
+	/*
+	 * The basic idea here is that a request can be skipped if it's followed
+	 * by a later, identical request.  It might seem more sensible to work
+	 * backwards from the end of the queue and check whether a request is
+	 * *preceded* by an earlier, identical request, in the hopes of doing
+	 * less copying.  But that might change the semantics, if there's an
+	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request,
+	 * so we do it this way.  It would be possible to be even smarter if we
+	 * made the code below understand the specific semantics of such
+	 * requests (it could blow away preceding entries that would end up
+	 * being cancelled anyhow), but it's not clear that the extra complexity
+	 * would buy us anything.
+	 */
+	for (n = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		BgWriterRequest *request;
+		struct BgWriterSlotMapping *slotmap;
+		bool		found;
+
+		request = &BgWriterShmem->requests[n];
+		slotmap = hash_search(htab, request, HASH_ENTER, &found);
+		if (found)
+		{
+			skip_slot[slotmap->slot] = true;	/* a later copy exists */
+			++num_skipped;
+		}
+		slotmap->slot = n;		/* remember the latest occurrence */
+	}
+
+	/* Done with the hash table. */
+	hash_destroy(htab);
+
+	/* If no duplicates, we're out of luck. */
+	if (num_skipped == 0)
+	{
+		pfree(skip_slot);
+		return false;
+	}
+
+	/* Hooray, we found some duplicates!  Remove them. */
+	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		if (skip_slot[n])
+			continue;
+		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
+	}
+	ereport(DEBUG1,
+			(errmsg("compacted fsync request queue from %d entries to %d entries",
+					BgWriterShmem->num_requests, preserve_count)));
+	BgWriterShmem->num_requests = preserve_count;
+
+	/* Cleanup. */
+	pfree(skip_slot);
+	return true;
+}
+
+/*
  * AbsorbFsyncRequests
  *		Retrieve queued fsync requests and pass them to local smgr.
  *
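To make the ordering argument in that comment concrete, here is a
hypothetical trace, where A stands for an ordinary fsync request and
FORGET(A) for a FORGET_RELATION_FSYNC that cancels it:

/*
 * queue before:  [ A, FORGET(A), A ]
 * forward scan:  slot 0 is skipped because an identical A follows at slot 2
 * queue after:   [ FORGET(A), A ]
 *
 * The surviving A was enqueued after the FORGET, so it still gets fsync'd,
 * matching what would happen if the queue were absorbed unchanged.  A
 * backward keep-the-first scan would instead leave [ A, FORGET(A) ] and
 * wrongly cancel the surviving request.
 */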