On Thu, Feb 23, 2012 at 11:17 AM, Greg Smith <g...@2ndquadrant.com> wrote:

> A second fact that's visible from the TPS graphs over the test run, and
> obvious if you think about it, is that BGW writes force data to physical
> disk earlier than they otherwise might go there.  That's a subtle pattern in
> the graphs.  I expect that though, given one element to "do I write this?"
> in Linux is how old the write is.  Wondering about this really emphasises
> that I need to either add graphing of vmstat/iostat data to these graphs or
> switch to a benchmark that does that already.  I think I've got just enough
> people using pgbench-tools to justify the feature even if I plan to use the
> program less.

For me, that is the key point.

For the test being performed there is no value in things being written
earlier, since doing so merely generates extra, unnecessary I/O.

We should note that there is no feedback process in the bgwriter:
nothing restricts it to writing only when the level of dirty writes by
backends is high enough to warrant the activity. Linux, by contrast,
has a demand paging algorithm; it doesn't just clean all of the time.
That's the reason you still see some swapping: that activity is what
wakes the pager. We don't count the number of dirty writes by
backends; we just keep cleaning even when nobody wants it.

Earlier, I pointed out that the bgwriter is woken any time a user
backend marks a buffer dirty. That is overkill. The bgwriter should
stay asleep until a threshold number (TBD) of dirty writes is reached,
and only then wake up and do some cleaning. A continuously active
bgwriter is pointless for some workloads, though for others it helps.
So a sleeping bgwriter isn't just a power management issue; it's a
performance issue in some cases.
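
A minimal sketch of what I mean, in the style of freelist.c. Every
name here (numDirtyWrites, dirtyWriteLock, BgWriterLatch, the
threshold value) is invented for illustration; none of it is in the
attached patch:

#include "storage/buf_internals.h"
#include "storage/latch.h"
#include "storage/spin.h"

#define BGWRITER_WAKEUP_THRESHOLD	128		/* the "TBD" number above */

extern Latch *BgWriterLatch;	/* hypothetical latch in shared memory */

void
ReportDirtyWrite(void)
{
	bool		wake = false;

	/* hypothetical counter + spinlock added to BufferStrategyControl */
	SpinLockAcquire(&StrategyControl->dirtyWriteLock);
	if (++StrategyControl->numDirtyWrites >= BGWRITER_WAKEUP_THRESHOLD)
	{
		StrategyControl->numDirtyWrites = 0;
		wake = true;
	}
	SpinLockRelease(&StrategyControl->dirtyWriteLock);

	/* wake the bgwriter at most once per threshold's worth of writes */
	if (wake)
		SetLatch(BgWriterLatch);
}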

	/*
	 * Even in cases where there's been little or no buffer allocation
	 * activity, we want to make a small amount of progress through the
	 * buffer cache so that as many reusable buffers as possible are clean
	 * after an idle period.
	 *
	 * (scan_whole_pool_milliseconds / BgWriterDelay) computes how many
	 * times the BGW will be called during the scan_whole_pool time; slice
	 * the buffer pool into that many sections.
	 */

Since scan_whole_pool_milliseconds is hardcoded at 120000, we scan the
whole buffer pool every 2 minutes, no matter how big the buffer pool
is, even when nothing else is happening. Not cool.
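
To put numbers on that, using the defaults (the first line is the
computation from BgBufferSync(); the rest is back-of-the-envelope
arithmetic for an assumed shared_buffers of 8GB):

	/* in BgBufferSync() */
	min_scan_buffers = (int) (NBuffers /
				(scan_whole_pool_milliseconds / BgWriterDelay));

	/*
	 * scan_whole_pool_milliseconds = 120000 (hardcoded) and the default
	 * BgWriterDelay = 200 ms give 600 wakeups per full scan.  With
	 * shared_buffers = 8GB, NBuffers = 1048576 (8 kB pages), so
	 * min_scan_buffers = 1048576 / 600 = ~1747: about 1750 buffers
	 * scanned five times a second on a completely idle system.
	 */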

I think it would be sensible to have the bgwriter stop once 10% of
shared_buffers is clean, rather than keep going even when no dirty
writes are happening.

So my suggestion is that we add a clause to BgBufferSync() to allow
min_scan_buffers to fall to zero once X% of shared buffers is clean.
After that the bgwriter should sleep, and be woken again only by a
dirty write from a user backend. That may sound as though the clean
ratio would flip between 0 and X%, but the first dirty write will
occur long before we hit zero, so in practice the bgwriter will
maintain a reasonably steady clean ratio.
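
Roughly, as a sketch (reusable_buffers_est and upcoming_alloc_est are
the existing locals in BgBufferSync(); the clean-buffer estimate is
invented here for illustration):

	/*
	 * Hypothetical extra clause in BgBufferSync(): once X% (say 10%) of
	 * shared buffers are estimated clean, drop the minimum-progress
	 * requirement and go back to sleep until a backend reports a dirty
	 * write.  recent_clean_buffers_est is invented for this sketch.
	 */
	if (recent_clean_buffers_est >= NBuffers / 10)
		min_scan_buffers = 0;

	if (min_scan_buffers == 0 &&
		reusable_buffers_est >= upcoming_alloc_est)
		return;			/* nothing useful to do; sleep */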



I would also take a wild guess that the 750 results are due to
freelist contention. To assess that, I'm reposting the patch from
other threads, designed to measure the overall level of freelist
lwlock contention.
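
For reference, the patch dumps and resets its counters whenever
StrategySyncStart() is called with a NULL num_buf_alloc, producing log
lines like this one (values invented purely for illustration):

LOG:  BufFreelistLock stats: BufferAlloc waits 1423 total wait time=2.417305 s; BufferFree waits 0 total wait time=0.000000 s; SyncStart waits 12 total wait time=0.031842 s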

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3e62448..36b0160 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -17,6 +17,7 @@
 
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "utils/timestamp.h"
 
 
 /*
@@ -41,6 +42,21 @@ typedef struct
 	 */
 	uint32		completePasses; /* Complete cycles of the clock sweep */
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
+
+	/*
+	 * Wait Statistics
+	 */
+	long	waitBufferAllocSecs;
+	int		waitBufferAllocUSecs;
+	int		waitBufferAlloc;
+
+	long	waitBufferFreeSecs;
+	int		waitBufferFreeUSecs;
+	int		waitBufferFree;
+
+	long	waitSyncStartSecs;
+	int		waitSyncStartUSecs;
+	int		waitSyncStart;
 } BufferStrategyControl;
 
 /* Pointers to shared state */
@@ -125,7 +141,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 
 	/* Nope, so lock the freelist */
 	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	if (!LWLockConditionalAcquire(BufFreelistLock, LW_EXCLUSIVE))
+	{
+		TimestampTz waitStart = GetCurrentTimestamp();
+		TimestampTz waitEnd;
+		long		wait_secs;
+		int			wait_usecs;
+
+		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
+		waitEnd = GetCurrentTimestamp();
+
+		TimestampDifference(waitStart, waitEnd,
+						&wait_secs, &wait_usecs);
+
+		StrategyControl->waitBufferAllocSecs += wait_secs;
+		StrategyControl->waitBufferAllocUSecs += wait_usecs;
+		if (StrategyControl->waitBufferAllocUSecs >= 1000000)
+		{
+			StrategyControl->waitBufferAllocUSecs -= 1000000;
+			StrategyControl->waitBufferAllocSecs += 1;
+		}
+		StrategyControl->waitBufferAlloc++;
+	}
 
 	/*
 	 * We count buffer allocation requests so that the bgwriter can estimate
@@ -223,7 +261,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	if (!LWLockConditionalAcquire(BufFreelistLock, LW_EXCLUSIVE))
+	{
+		TimestampTz waitStart = GetCurrentTimestamp();
+		TimestampTz waitEnd;
+		long		wait_secs;
+		int			wait_usecs;
+
+		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
+		waitEnd = GetCurrentTimestamp();
+
+		TimestampDifference(waitStart, waitEnd,
+						&wait_secs, &wait_usecs);
+
+		StrategyControl->waitBufferFreeSecs += wait_secs;
+		StrategyControl->waitBufferFreeUSecs += wait_usecs;
+		if (StrategyControl->waitBufferFreeUSecs >= 1000000)
+		{
+			StrategyControl->waitBufferFreeUSecs -= 1000000;
+			StrategyControl->waitBufferFreeSecs += 1;
+		}
+		StrategyControl->waitBufferFree++;
+	}
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -256,7 +316,30 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	if (!LWLockConditionalAcquire(BufFreelistLock, LW_EXCLUSIVE))
+	{
+		TimestampTz waitStart = GetCurrentTimestamp();
+		TimestampTz waitEnd;
+		long		wait_secs;
+		int			wait_usecs;
+
+		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
+		waitEnd = GetCurrentTimestamp();
+
+		TimestampDifference(waitStart, waitEnd,
+						&wait_secs, &wait_usecs);
+
+		StrategyControl->waitSyncStartSecs += wait_secs;
+		StrategyControl->waitSyncStartUSecs += wait_usecs;
+		if (StrategyControl->waitSyncStartUSecs >= 1000000)
+		{
+			StrategyControl->waitSyncStartUSecs -= 1000000;
+			StrategyControl->waitSyncStartSecs += 1;
+		}
+		StrategyControl->waitSyncStart++;
+	}
+
 	result = StrategyControl->nextVictimBuffer;
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
@@ -265,7 +348,59 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
 	}
+	else
+	{
+		long	waitBufferAllocSecs;
+		int		waitBufferAllocUSecs;
+		int		waitBufferAlloc;
+
+		long	waitBufferFreeSecs;
+		int		waitBufferFreeUSecs;
+		int		waitBufferFree;
+
+		long	waitSyncStartSecs;
+		int		waitSyncStartUSecs;
+		int		waitSyncStart;
+
+		waitBufferAllocSecs = StrategyControl->waitBufferAllocSecs;
+		waitBufferAllocUSecs = StrategyControl->waitBufferAllocUSecs;
+		waitBufferAlloc = StrategyControl->waitBufferAlloc;
+
+		waitBufferFreeSecs = StrategyControl->waitBufferFreeSecs;
+		waitBufferFreeUSecs = StrategyControl->waitBufferFreeUSecs;
+		waitBufferFree = StrategyControl->waitBufferFree;
+
+		waitSyncStartSecs = StrategyControl->waitSyncStartSecs;
+		waitSyncStartUSecs = StrategyControl->waitSyncStartUSecs;
+		waitSyncStart = StrategyControl->waitSyncStart;
+
+		StrategyControl->waitBufferAllocSecs = 0;
+		StrategyControl->waitBufferAllocUSecs = 0;
+		StrategyControl->waitBufferAlloc = 0;
+
+		StrategyControl->waitBufferFreeSecs = 0;
+		StrategyControl->waitBufferFreeUSecs = 0;
+		StrategyControl->waitBufferFree = 0;
+
+		StrategyControl->waitSyncStartSecs = 0;
+		StrategyControl->waitSyncStartUSecs = 0;
+		StrategyControl->waitSyncStart = 0;
+
+		LWLockRelease(BufFreelistLock);
+
+		elog(LOG, "BufFreelistLock stats: "
+			 "BufferAlloc waits %d total wait time=%ld.%06d s; "
+			 "BufferFree waits %d total wait time=%ld.%06d s; "
+			 "SyncStart waits %d total wait time=%ld.%06d s",
+			waitBufferAlloc, waitBufferAllocSecs, waitBufferAllocUSecs,
+			waitBufferFree, waitBufferFreeSecs, waitBufferFreeUSecs,
+			waitSyncStart, waitSyncStartSecs, waitSyncStartUSecs);
+
+		return result;
+	}
+
 	LWLockRelease(BufFreelistLock);
+
 	return result;
 }
 