Greg Smith wrote:
> I think the right way to compute "relations to sync" is to finish the
> sorted writes patch I already sent a not-quite-right-yet update to.
Attached update now makes much more sense than the misguided patch I
submitted two weeks ago. This takes the original sorted write code,
first adjusting it so it only allocates the memory its tag structure is
stored in once (in a kind of lazy way I can improve on later). It
then computes a bunch of derived statistics from a single walk of the
sorted data on each pass through. Here's an example of what comes out:
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11809.0_0
DEBUG: BufferSync 2 dirty blocks in relation.segment_fork 11811.0_0
DEBUG: BufferSync 3 dirty blocks in relation.segment_fork 11812.0_0
DEBUG: BufferSync 3 dirty blocks in relation.segment_fork 16496.0_0
DEBUG: BufferSync 28 dirty blocks in relation.segment_fork 16499.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11638.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11640.0_0
DEBUG: BufferSync 2 dirty blocks in relation.segment_fork 11641.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11642.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11644.0_0
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11645.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11661.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11663.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11664.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11672.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11685.0_0
DEBUG: BufferSync 2097 buffers to write, 17 total dirty segment file(s)
expected to need sync
This is the first checkpoint after starting to populate a new pgbench
database. The next four show it extending into new segments:
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.1_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s)
expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.2_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s)
expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.3_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s)
expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.4_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s)
expected to need sync
The fact that it's always showing 2048 dirty blocks on these makes me
think I'm still computing something wrong, but the general idea is
working now. I had to use some magic from the md layer to let bufmgr.c
know how its writes were going to get mapped into file segments, and
correspondingly into fsync calls later. I'm not happy about breaking
the API encapsulation there, but I don't see an easy way to compute that
data at the per-segment level otherwise--and it's not like that mapping
is going to change in the near future anyway.
I like this approach for providing a map of how to spread syncs out,
for a couple of reasons:
-It computes data that could be used to drive sync spread timing in a
relatively small amount of simple code.
-You get write sorting at the database level helping out the OS.
Everything I've been seeing recently on benchmarks says Linux, at
least, needs all the help it can get in that regard, even if block
order doesn't necessarily align perfectly with disk order.
-It's obvious how to take this same data and build a future model where
the time allocated for fsyncs is proportional to how much a
particular relation was touched.
Benchmarks of just the impact of the sorting step, and continued bug
swatting, to follow.
--
Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..ef9df7d 100644
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 48,53 ****
--- 48,63 ----
#include "utils/rel.h"
#include "utils/resowner.h"
+ /*
+ * Checkpoint time mapping between the buffer id values and the associated
+ * buffer tags of dirty buffers to write
+ */
+ typedef struct BufAndTag
+ {
+ int buf_id;
+ BufferTag tag;
+ BlockNumber segNum;
+ } BufAndTag;
/* Note: these two macros only work on shared buffers, not local ones! */
#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
*************** int target_prefetch_pages = 0;
*** 78,83 ****
--- 88,96 ----
static volatile BufferDesc *InProgressBuf = NULL;
static bool IsForInput;
+ /* local state for BufferSync */
+ static BufAndTag *BufferTags;
+
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
*************** UnpinBuffer(volatile BufferDesc *buf, bo
*** 1158,1163 ****
--- 1171,1194 ----
}
}
+ static int
+ bufcmp(const void *a, const void *b)
+ {
+ const BufAndTag *lhs = (const BufAndTag *) a;
+ const BufAndTag *rhs = (const BufAndTag *) b;
+ int r;
+
+ r = memcmp(&lhs->tag.rnode, &rhs->tag.rnode, sizeof(lhs->tag.rnode));
+ if (r != 0)
+ return r;
+ if (lhs->tag.blockNum < rhs->tag.blockNum)
+ return -1;
+ else if (lhs->tag.blockNum > rhs->tag.blockNum)
+ return 1;
+ else
+ return 0;
+ }
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
*************** static void
*** 1171,1180 ****
BufferSync(int flags)
{
int buf_id;
- int num_to_scan;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 1202,1221 ----
BufferSync(int flags)
{
int buf_id;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ int dirty_buf;
+ int dirty_segments;
+ int segment_dirty_blocks;
+ Oid last_seen_rel;
+ ForkNumber last_seen_fork;
+ BlockNumber last_seen_seg;
+
+ if (BufferTags == NULL)
+ BufferTags = (BufAndTag *) palloc(sizeof(BufAndTag) * NBuffers);
+
+ Assert(BufferTags != NULL);
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
*************** BufferSync(int flags)
*** 1216,1221 ****
--- 1257,1277 ----
if ((bufHdr->flags & mask) == mask)
{
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+
+ /* Save to the BufferTags list for later sorting and writing */
+ BufferTags[num_to_write].buf_id = buf_id;
+ BufferTags[num_to_write].tag = bufHdr->tag;
+ /*
+ * That the buffer manager knows how the underlying _mdfd_getseg
+ * code in md.c will eventually compute the segment numbers,
+ * breaking files into 1GB segments by default, makes for a leaky
+ * abstraction boundary here. Since the results are only used by
+ * a write-scheduling heuristic and are so simple to compute
+ * directly, it's hard to justify inventing a cleaner API just for
+ * this.
+ */
+ BufferTags[num_to_write].segNum = BufferTags[num_to_write].tag.blockNum
+ / ((BlockNumber) RELSEG_SIZE);
+
num_to_write++;
}
*************** BufferSync(int flags)
*** 1225,1246 ****
if (num_to_write == 0)
return; /* nothing to do */
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
/*
* Loop over all buffers again, and write the ones (still) marked with
! * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
! * since we might as well dump soon-to-be-recycled buffers first.
*
* Note that we don't read the buffer alloc count here --- that should be
* left untouched till the next BgBufferSync() call.
! */
! buf_id = StrategySyncStart(NULL, NULL);
! num_to_scan = NBuffers;
num_written = 0;
! while (num_to_scan-- > 0)
! {
! volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
/*
* We don't need to acquire the lock here, because we're only looking
--- 1281,1351 ----
if (num_to_write == 0)
return; /* nothing to do */
+ /*
+ * Sort the list of buffers to write. It's then straightforward to
+ * count the approximate number of files involved. There may be
+ * some small error from buffers that turn out to be skipped below,
+ * but for the purpose the file count serves, that's acceptable.
+ */
+ qsort(BufferTags, num_to_write, sizeof(*BufferTags), bufcmp);
+
+ /*
+ * Count the number of unique node/fork/segment combinations. This relies
+ * on the sorted order to make sure all matching relation/fork/segment
+ * values, the effective keys here, appear in one contiguous section.
+ */
+
+ /* Initialize with the first entry in the dirty buffer list */
+ last_seen_rel = BufferTags[0].tag.rnode.relNode;
+ last_seen_fork = BufferTags[0].tag.forkNum;
+ last_seen_seg = BufferTags[0].segNum;
+ dirty_segments = 1;
+ segment_dirty_blocks = 1;
+
+ for (dirty_buf = 1; dirty_buf < num_to_write; dirty_buf++)
+ {
+ if ((last_seen_rel != BufferTags[dirty_buf].tag.rnode.relNode) ||
+ (last_seen_fork != BufferTags[dirty_buf].tag.forkNum) ||
+ (last_seen_seg != BufferTags[dirty_buf].segNum))
+ {
+ /* Report on previous set for a segment now that we have a total */
+ elog(DEBUG1,
+ "BufferSync %d dirty blocks in relation.segment_fork %u.%u_%d",
+ segment_dirty_blocks, last_seen_rel, last_seen_seg, last_seen_fork);
+
+ last_seen_rel = BufferTags[dirty_buf].tag.rnode.relNode;
+ last_seen_fork = BufferTags[dirty_buf].tag.forkNum;
+ last_seen_seg = BufferTags[dirty_buf].segNum;
+ dirty_segments++;
+ segment_dirty_blocks = 0;
+ }
+ segment_dirty_blocks++;
+ }
+
+ /* Final reporting on the last entry found */
+ elog(DEBUG1,
+ "BufferSync %d dirty blocks in relation.segment_fork %u.%u_%d",
+ segment_dirty_blocks, last_seen_rel, last_seen_seg, last_seen_fork);
+
+ elog(DEBUG1,
+ "BufferSync %d buffers to write, %d total dirty segment file(s) expected to need sync",
+ num_to_write, dirty_segments);
+
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
/*
* Loop over all buffers again, and write the ones (still) marked with
! * BM_CHECKPOINT_NEEDED.
*
* Note that we don't read the buffer alloc count here --- that should be
* left untouched till the next BgBufferSync() call.
! */
num_written = 0;
! for (dirty_buf = 0; dirty_buf < num_to_write; dirty_buf++)
! {
! volatile BufferDesc *bufHdr;
! buf_id = BufferTags[dirty_buf].buf_id;
! bufHdr = &BufferDescriptors[buf_id];
/*
* We don't need to acquire the lock here, because we're only looking
*************** BufferSync(int flags)
*** 1263,1282 ****
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Perform normal bgwriter duties and sleep to throttle our
* I/O rate.
*/
--- 1368,1373 ----
*************** BufferSync(int flags)
*** 1284,1292 ****
(double) num_written / num_to_write);
}
}
-
- if (++buf_id >= NBuffers)
- buf_id = 0;
}
/*
--- 1375,1380 ----
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)