On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <g...@2ndquadrant.com> wrote:
> Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the
> time now has made me reconsider how important one potential bit of
> refactoring here would be.  I managed to catch one of the situations where
> really popular relations were being heavily updated in a way that was
> competing with the checkpoint on my test system (which I can happily share
> the logs of), with the instrumentation patch applied but not the spread
> sync one:
>
> LOG: checkpoint starting: xlog
> DEBUG: could not forward fsync request because request queue is full
> CONTEXT: writing block 7747 of relation base/16424/16442
> DEBUG: could not forward fsync request because request queue is full
> CONTEXT: writing block 42688 of relation base/16424/16437
> DEBUG: could not forward fsync request because request queue is full
> CONTEXT: writing block 9723 of relation base/16424/16442
> DEBUG: could not forward fsync request because request queue is full
> CONTEXT: writing block 58117 of relation base/16424/16437
> DEBUG: could not forward fsync request because request queue is full
> CONTEXT: writing block 165128 of relation base/16424/16437
> [330 of these total, all referring to the same two relations]
>
> DEBUG: checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.830000 msec
> DEBUG: checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
> DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
> DEBUG: checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
> DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
> DEBUG: checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
> DEBUG: checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
> DEBUG: checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec
> DEBUG: checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec
> DEBUG: checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
> DEBUG: checkpoint sync: number=11 file=base/16424/16425 time=0.000000 msec
> DEBUG: checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec
> DEBUG: checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec
> LOG: checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log
> file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s,
> total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s
>
> Note here how the checkpoint was hung on trying to get 16448_fsm written
> out, but the backends were issuing constant competing fsync calls to these
> other relations.  This is very similar to the production case this patch
> was written to address, which I hadn't been able to share a good example
> of yet.  That's essentially what it looks like, except with the contention
> going on for minutes instead of seconds.
>
> One of the ideas Simon and I had been considering at one point was adding
> some better de-duplication logic to the fsync absorb code, which I'm
> reminded by the pattern here might be helpful independently of other
> improvements.
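For concreteness, here is the de-duplication idea in miniature: a queued
request can be dropped whenever a later, identical request follows it.
This is a standalone toy, not the real code -- ToyRequest and compact_queue
are hypothetical names, a quadratic scan stands in for a hash table, and
the fields merely mimic the (relation, fork, segment) identity of a real
fsync request:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Toy stand-in for a queued fsync request; no padding between the fields
 * on common ABIs, so memcmp is a safe equality test here.
 */
typedef struct ToyRequest
{
	unsigned	rnode;			/* relation */
	int			forknum;		/* fork */
	unsigned	segno;			/* segment */
} ToyRequest;

/*
 * Drop every request that is followed by a later, identical one, sliding
 * the survivors toward the front; returns the new queue length.
 */
static int
compact_queue(ToyRequest *req, int nreq)
{
	bool		skip[64] = {false};
	int			i,
				j,
				kept = 0;

	for (i = 0; i < nreq; i++)
		for (j = i + 1; j < nreq; j++)
			if (memcmp(&req[i], &req[j], sizeof(ToyRequest)) == 0)
			{
				skip[i] = true;	/* a later duplicate exists */
				break;
			}

	for (i = 0; i < nreq; i++)
		if (!skip[i])
			req[kept++] = req[i];
	return kept;
}

int
main(void)
{
	ToyRequest	q[] = {{1, 0, 7}, {2, 0, 3}, {1, 0, 7}, {2, 0, 3}, {3, 0, 1}};
	int			n = compact_queue(q, 5);
	int			i;

	/* prints (1,0,7) (2,0,3) (3,0,1): five entries compacted to three */
	for (i = 0; i < n; i++)
		printf("(%u,%d,%u)\n", q[i].rnode, q[i].forknum, q[i].segno);
	return 0;
}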
Hopefully I'm not stepping on any toes here, but I thought this was an
awfully good idea and had a chance to take a look at how hard it would be
today while en route from point A to point B.  The answer turned out to be
"not very", so PFA a patch that seems to work.  I tested it by attaching
gdb to the background writer while running pgbench, and it eliminated the
backend fsyncs without even breaking a sweat.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4457df6..f6cd8dc 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -182,6 +182,7 @@ static void CheckArchiveTimeout(void);
 static void BgWriterNap(void);
 static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
+static bool CompactBgwriterRequestQueue(void);
 
 /* Signal handlers */
 
@@ -1090,10 +1091,20 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
 	/* Count all backend writes regardless of if they fit in the queue */
 	BgWriterShmem->num_backend_writes++;
 
+	/*
+	 * If the background writer isn't running or the request queue is full,
+	 * the backend will have to perform its own fsync.  But before forcing
+	 * that to happen, we can try to compact the background writer request
+	 * queue.
+	 */
 	if (BgWriterShmem->bgwriter_pid == 0 ||
-		BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
+		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
+		 && !CompactBgwriterRequestQueue()))
 	{
-		/* Also count the subset where backends have to do their own fsync */
+		/*
+		 * Count the subset of writes where backends have to do their own
+		 * fsync.
+		 */
 		BgWriterShmem->num_backend_fsync++;
 		LWLockRelease(BgWriterCommLock);
 		return false;
@@ -1107,6 +1118,101 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
 }
 
 /*
+ * CompactBgwriterRequestQueue
+ *		Remove duplicates from the request queue to avoid backend fsyncs.
+ *
+ * Trying to do this every time the queue is full could lose if there aren't
+ * any removable entries.  But that's not very likely: there's one queue
+ * entry per shared buffer.
+ */
+static bool
+CompactBgwriterRequestQueue(void)
+{
+	struct BgWriterSlotMapping {
+		BgWriterRequest request;
+		int			slot;
+	};
+
+	int			n,
+				preserve_count,
+				num_skipped = 0;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	bool	   *skip_slot;
+
+	/* must hold BgWriterCommLock in exclusive mode */
+	Assert(LWLockHeldByMe(BgWriterCommLock));
+
+	/* Initialize temporary hash table */
+	MemSet(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(BgWriterRequest);
+	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
+	ctl.hash = tag_hash;
+	htab = hash_create("CompactBgwriterRequestQueue",
+					   BgWriterShmem->num_requests,
+					   &ctl,
+					   HASH_ELEM | HASH_FUNCTION);
+
+	/* Initialize skip_slot array */
+	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
+
+	/*
+	 * The basic idea here is that a request can be skipped if it's followed
+	 * by a later, identical request.  It might seem more sensible to work
+	 * backwards from the end of the queue and check whether a request is
+	 * *preceded* by an earlier, identical request, in the hopes of doing
+	 * less copying.  But that might change the semantics, if there's an
+	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request,
+	 * so we do it this way.  It would be possible to be even smarter if we
+	 * made the code below understand the specific semantics of such
+	 * requests (it could blow away preceding entries that would end up
+	 * being cancelled anyhow), but it's not clear that the extra complexity
+	 * would buy us anything.
+	 */
+	for (n = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		BgWriterRequest *request;
+		struct BgWriterSlotMapping *slotmap;
+		bool		found;
+
+		request = &BgWriterShmem->requests[n];
+		slotmap = hash_search(htab, request, HASH_ENTER, &found);
+		if (found)
+		{
+			skip_slot[slotmap->slot] = true;	/* a later copy exists */
+			++num_skipped;
+		}
+		slotmap->slot = n;		/* remember the latest occurrence */
+	}
+
+	/* Done with the hash table. */
+	hash_destroy(htab);
+
+	/* If no duplicates, we're out of luck. */
+	if (num_skipped == 0)
+	{
+		pfree(skip_slot);
+		return false;
+	}
+
+	/* Hooray, we found some duplicates!  Remove them. */
+	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		if (skip_slot[n])
+			continue;
+		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
+	}
+	ereport(DEBUG1,
+			(errmsg("compacted fsync request queue from %d entries to %d entries",
+					BgWriterShmem->num_requests, preserve_count)));
+	BgWriterShmem->num_requests = preserve_count;
+
+	/* Cleanup. */
+	pfree(skip_slot);
+	return true;
+}
+
+/*
  * AbsorbFsyncRequests
  *		Retrieve queued fsync requests and pass them to local smgr.
  *
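To make the ordering argument in that comment concrete, here is a
hypothetical trace, where A stands for an ordinary fsync request and
FORGET(A) for a FORGET_RELATION_FSYNC that cancels it:

/*
 * queue before:  [ A, FORGET(A), A ]
 * forward scan:  slot 0 is skipped because an identical A follows at slot 2
 * queue after:   [ FORGET(A), A ]
 *
 * The surviving A was enqueued after the FORGET, so it still gets fsync'd,
 * matching what would happen if the queue were absorbed unchanged.  A
 * backward keep-the-first scan would instead leave [ A, FORGET(A) ] and
 * wrongly cancel the surviving request.
 */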