On Thu, Feb 23, 2012 at 11:17 AM, Greg Smith <g...@2ndquadrant.com> wrote:
> A second fact that's visible from the TPS graphs over the test run, and
> obvious if you think about it, is that BGW writes force data to physical
> disk earlier than they otherwise might go there. That's a subtle pattern
> in the graphs. I expect that though, given one element to "do I write
> this?" in Linux is how old the write is. Wondering about this really
> emphasises that I need to either add graphing of vmstat/iostat data to
> these graphs or switch to a benchmark that does that already. I think
> I've got just enough people using pgbench-tools to justify the feature
> even if I plan to use the program less.

For me, that is the key point. For the test being performed, there is no
value in things being written earlier, since doing so merely
over-exercises the I/O.

We should note that there is no feedback process in the bgwriter to do
writes only when the level of dirty writes by backends is high enough to
warrant the activity. Note that Linux has a demand paging algorithm; it
doesn't just clean all of the time. That's the reason you still see some
swapping: that activity is what wakes the pager. We don't count the
number of dirty writes by backends; we just keep cleaning even when
nobody wants it.

Earlier, I pointed out that the bgwriter is being woken any time a user
marks a buffer dirty. That is overkill. The bgwriter should stay asleep
until a threshold number (TBD) of dirty writes is reached; then it should
wake up and do some cleaning. A continuously active bgwriter is pointless
for some workloads, whereas for others it helps. So a sleeping bgwriter
isn't just a power-management issue; it's a performance issue in some
cases.

/*
 * Even in cases where there's been little or no buffer allocation
 * activity, we want to make a small amount of progress through the buffer
 * cache so that as many reusable buffers as possible are clean after an
 * idle period.
 *
 * (scan_whole_pool_milliseconds / BgWriterDelay) computes how many times
 * the BGW will be called during the scan_whole_pool time; slice the
 * buffer pool into that many sections.
 */

Since scan_whole_pool_milliseconds is set to 2 minutes, we scan the whole
buffer pool every 2 minutes, no matter how big the buffer pool, even when
nothing else is happening. With the default bgwriter_delay of 200 ms,
that is 120000/200 = 600 bgwriter cycles per full scan, so each cycle
must advance over at least NBuffers/600 buffers regardless of demand.
Not cool.

I think it would be sensible to have the bgwriter stop when 10% of
shared_buffers are clean, rather than keep going even when no dirty
writes are happening.

So my suggestion is that we put an additional clause into BgBufferSync()
to allow min_scan_buffers to fall to zero when X% of shared buffers is
clean. After that, the bgwriter should sleep, and be woken again only by
a dirty write by a user backend. (A rough sketch of this is in the P.S.
below.)

That sounds as if the clean ratio would flip between 0 and X%, but the
first dirty write will occur long before we hit zero, so this will cause
the bgwriter to maintain a reasonably steady-state clean ratio.

I would also take a wild guess that the 750 results are due to freelist
contention. To assess that, I post again the patch shown on other
threads, designed to measure the overall level of freelist lwlock
contention.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
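P.S. To make the BgBufferSync() suggestion concrete, here is a rough,
untested sketch of the kind of clause I mean. Everything in it is
hypothetical: EstimateCleanBufferCount() and bgwriter_clean_percent don't
exist anywhere; they stand in for "some cheap estimate of how many shared
buffers are clean" and the X% threshold discussed above.

    /*
     * Hypothetical early-out near the top of BgBufferSync(), ahead of
     * the min_scan_buffers logic.  All names here are placeholders.
     */
    if (EstimateCleanBufferCount() * 100 >= NBuffers * bgwriter_clean_percent)
    {
        /*
         * Enough of shared_buffers is already clean: write nothing this
         * cycle, let min_scan_buffers fall to zero, and go back to
         * sleep.  The first dirty write by a user backend would wake
         * the bgwriter again, long before the clean ratio falls back
         * towards zero.
         */
        return;
    }

The point of the early-out is that the clean ratio, not elapsed time,
decides whether the bgwriter has any work left to do.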
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3e62448..36b0160 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -17,6 +17,7 @@
 
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "utils/timestamp.h"
 
 
 /*
@@ -41,6 +42,21 @@ typedef struct
 	 */
 	uint32		completePasses; /* Complete cycles of the clock sweep */
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
+
+	/*
+	 * Wait Statistics
+	 */
+	long		waitBufferAllocSecs;
+	int			waitBufferAllocUSecs;
+	int			waitBufferAlloc;
+
+	long		waitBufferFreeSecs;
+	int			waitBufferFreeUSecs;
+	int			waitBufferFree;
+
+	long		waitSyncStartSecs;
+	int			waitSyncStartUSecs;
+	int			waitSyncStart;
 } BufferStrategyControl;
 
 /* Pointers to shared state */
@@ -125,7 +141,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 
 	/* Nope, so lock the freelist */
 	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	if (!LWLockConditionalAcquire(BufFreelistLock, LW_EXCLUSIVE))
+	{
+		TimestampTz waitStart = GetCurrentTimestamp();
+		TimestampTz waitEnd;
+		long		wait_secs;
+		int			wait_usecs;
+
+		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
+		waitEnd = GetCurrentTimestamp();
+
+		TimestampDifference(waitStart, waitEnd,
+							&wait_secs, &wait_usecs);
+
+		StrategyControl->waitBufferAllocSecs += wait_secs;
+		StrategyControl->waitBufferAllocUSecs += wait_usecs;
+		if (StrategyControl->waitBufferAllocUSecs >= 1000000)
+		{
+			StrategyControl->waitBufferAllocUSecs -= 1000000;
+			StrategyControl->waitBufferAllocSecs += 1;
+		}
+		StrategyControl->waitBufferAlloc++;
+	}
 
 	/*
 	 * We count buffer allocation requests so that the bgwriter can estimate
@@ -223,7 +261,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	if (!LWLockConditionalAcquire(BufFreelistLock, LW_EXCLUSIVE))
+	{
+		TimestampTz waitStart = GetCurrentTimestamp();
+		TimestampTz waitEnd;
+		long		wait_secs;
+		int			wait_usecs;
+
+		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
+		waitEnd = GetCurrentTimestamp();
+
+		TimestampDifference(waitStart, waitEnd,
+							&wait_secs, &wait_usecs);
+
+		StrategyControl->waitBufferFreeSecs += wait_secs;
+		StrategyControl->waitBufferFreeUSecs += wait_usecs;
+		if (StrategyControl->waitBufferFreeUSecs >= 1000000)
+		{
+			StrategyControl->waitBufferFreeUSecs -= 1000000;
+			StrategyControl->waitBufferFreeSecs += 1;
+		}
+		StrategyControl->waitBufferFree++;
+	}
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -256,7 +316,30 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	if (!LWLockConditionalAcquire(BufFreelistLock, LW_EXCLUSIVE))
+	{
+		TimestampTz waitStart = GetCurrentTimestamp();
+		TimestampTz waitEnd;
+		long		wait_secs;
+		int			wait_usecs;
+
+		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
+		waitEnd = GetCurrentTimestamp();
+
+		TimestampDifference(waitStart, waitEnd,
+							&wait_secs, &wait_usecs);
+
+		StrategyControl->waitSyncStartSecs += wait_secs;
+		StrategyControl->waitSyncStartUSecs += wait_usecs;
+		if (StrategyControl->waitSyncStartUSecs >= 1000000)
+		{
+			StrategyControl->waitSyncStartUSecs -= 1000000;
+			StrategyControl->waitSyncStartSecs += 1;
+		}
+		StrategyControl->waitSyncStart++;
+	}
+
 	result = StrategyControl->nextVictimBuffer;
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
@@ -265,7 +348,59 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
 	}
+	else
+	{
+		long		waitBufferAllocSecs;
+		int			waitBufferAllocUSecs;
+		int			waitBufferAlloc;
+
+		long		waitBufferFreeSecs;
+		int			waitBufferFreeUSecs;
+		int			waitBufferFree;
+
+		long		waitSyncStartSecs;
+		int			waitSyncStartUSecs;
+		int			waitSyncStart;
+
+		waitBufferAllocSecs = StrategyControl->waitBufferAllocSecs;
+		waitBufferAllocUSecs = StrategyControl->waitBufferAllocUSecs;
+		waitBufferAlloc = StrategyControl->waitBufferAlloc;
+
+		waitBufferFreeSecs = StrategyControl->waitBufferFreeSecs;
+		waitBufferFreeUSecs = StrategyControl->waitBufferFreeUSecs;
+		waitBufferFree = StrategyControl->waitBufferFree;
+
+		waitSyncStartSecs = StrategyControl->waitSyncStartSecs;
+		waitSyncStartUSecs = StrategyControl->waitSyncStartUSecs;
+		waitSyncStart = StrategyControl->waitSyncStart;
+
+		StrategyControl->waitBufferAllocSecs = 0;
+		StrategyControl->waitBufferAllocUSecs = 0;
+		StrategyControl->waitBufferAlloc = 0;
+
+		StrategyControl->waitBufferFreeSecs = 0;
+		StrategyControl->waitBufferFreeUSecs = 0;
+		StrategyControl->waitBufferFree = 0;
+
+		StrategyControl->waitSyncStartSecs = 0;
+		StrategyControl->waitSyncStartUSecs = 0;
+		StrategyControl->waitSyncStart = 0;
+
+		LWLockRelease(BufFreelistLock);
+
+		elog(LOG, "BufFreelistLock stats: "
+			 "BufferAlloc waits %d total wait time=%ld.%06d s; "
+			 "BufferFree waits %d total wait time=%ld.%06d s; "
+			 "SyncStart waits %d total wait time=%ld.%06d s; ",
+			 waitBufferAlloc, waitBufferAllocSecs, waitBufferAllocUSecs,
+			 waitBufferFree, waitBufferFreeSecs, waitBufferFreeUSecs,
+			 waitSyncStart, waitSyncStartSecs, waitSyncStartUSecs);
+
+		return result;
+	}
+
 	LWLockRelease(BufFreelistLock);
 
 	return result;
 }
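(A note on using the patch: as written, the accumulated wait statistics
are reported and reset only when StrategySyncStart() is called with a
NULL num_buf_alloc, i.e. something like

    StrategySyncStart(NULL, NULL);	/* logs the wait stats, then resets them */

so a caller has to be arranged to invoke it that way for the LOG line to
appear.)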