Greg Smith wrote:
> I think the right way to compute "relations to sync" is to finish the
> sorted writes patch I already sent a not-quite-right-yet update to.
Attached update now makes much more sense than the misguided patch I
submitted two weeks ago. This takes the original sorted write code,
first adjusting it so it only allocates the memory its tag structure is
stored in once (in a kind of lazy way I can improve on later). It
then computes a bunch of derived statistics from a single walk of the
sorted data on each pass through. Here's an example of what comes out:
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11809.0_0
DEBUG: BufferSync 2 dirty blocks in relation.segment_fork 11811.0_0
DEBUG: BufferSync 3 dirty blocks in relation.segment_fork 11812.0_0
DEBUG: BufferSync 3 dirty blocks in relation.segment_fork 16496.0_0
DEBUG: BufferSync 28 dirty blocks in relation.segment_fork 16499.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11638.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11640.0_0
DEBUG: BufferSync 2 dirty blocks in relation.segment_fork 11641.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11642.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11644.0_0
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11645.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11661.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11663.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11664.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11672.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11685.0_0
DEBUG: BufferSync 2097 buffers to write, 17 total dirty segment file(s)
expected to need sync
This is the first checkpoint after starting to populate a new pgbench
database. The next four show it extending into new segments:
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.1_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s)
expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.2_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s)
expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.3_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s)
expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.4_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s)
expected to need sync
The fact that it's always showing 2048 dirty blocks on these makes me
think I'm still computing something wrong, but the general idea is
working now. I had to use some magic from the md layer to let bufmgr.c
know how its writes were going to get mapped into file segments, and
correspondingly into fsync calls later. I'm not happy about breaking
the API encapsulation there, but I don't see an easy way to compute that
data at the per-segment level otherwise--and it's not like that mapping
is going to change in the near future anyway.
I like this approach for providing a map of how to spread syncs out,
for a couple of reasons:
-It computes data that could be used to drive sync spread timing in a
relatively small amount of simple code.
-You get write sorting at the database level helping out the OS.
Everything I've been seeing recently on benchmarks says Linux, at
least, needs all the help it can get in that regard, even if block
order doesn't necessarily align perfectly with disk order.
-It's obvious how to take this same data and build a future model where
the time allocated for fsyncs is proportional to how much a
particular relation was touched.
Benchmarks of just the impact of the sorting step, and continued bug
swatting, to follow.
--
Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..ef9df7d 100644
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 48,53 ****
--- 48,63 ----
#include "utils/rel.h"
#include "utils/resowner.h"
+ /*
+ * Checkpoint time mapping between the buffer id values and the associated
+ * buffer tags of dirty buffers to write
+ */
+ typedef struct BufAndTag
+ {
+ int buf_id;
+ BufferTag tag;
+ BlockNumber segNum;
+ } BufAndTag;
/* Note: these two macros only work on shared buffers, not local ones! */
#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
*************** int target_prefetch_pages = 0;
*** 78,83 ****
--- 88,96 ----
static volatile BufferDesc *InProgressBuf = NULL;
static bool IsForInput;
+ /* local state for BufferSync */
+ static BufAndTag *BufferTags;
+
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
*************** UnpinBuffer(volatile BufferDesc *buf, bo
*** 1158,1163 ****
--- 1171,1194 ----
}
}
+ static int
+ bufcmp(const void *a, const void *b)
+ {
+ const BufAndTag *lhs = (const BufAndTag *) a;
+ const BufAndTag *rhs = (const BufAndTag *) b;
+ int r;
+
+ r = memcmp(&lhs->tag.rnode, &rhs->tag.rnode, sizeof(lhs->tag.rnode));
+ if (r != 0)
+ return r;
+ if (lhs->tag.blockNum < rhs->tag.blockNum)
+ return -1;
+ else if (lhs->tag.blockNum > rhs->tag.blockNum)
+ return 1;
+ else
+ return 0;
+ }
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
*************** static void
*** 1171,1180 ****
BufferSync(int flags)
{
int buf_id;
- int num_to_scan;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 1202,1221 ----
BufferSync(int flags)
{
int buf_id;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ int dirty_buf;
+ int dirty_segments;
+ int segment_dirty_blocks;
+ Oid last_seen_rel;
+ ForkNumber last_seen_fork;
+ BlockNumber last_seen_seg;
+
+ if (BufferTags == NULL)
+ BufferTags = (BufAndTag *) palloc(sizeof(BufAndTag) * NBuffers);
+
+ Assert(BufferTags != NULL);
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
*************** BufferSync(int flags)
*** 1216,1221 ****
--- 1257,1277 ----
if ((bufHdr->flags & mask) == mask)
{
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+
+ /* Save to the BufferTags list for later sorting and writing */
+ BufferTags[num_to_write].buf_id = buf_id;
+ BufferTags[num_to_write].tag = bufHdr->tag;
+ /*
+ * That the buffer manager knows how the underlying _mdfd_getseg
+ * code in md.c will eventually compute the segment numbers,
+ * breaking files into 1GB segments by default, makes for a leaky
+ * abstraction boundary here. Since the results are only used by
+ * a write-scheduling heuristic and are so simple to compute
+ * directly, it's hard to justify inventing a cleaner API just for
+ * this.
+ */
+ BufferTags[num_to_write].segNum = BufferTags[num_to_write].tag.blockNum
+ / ((BlockNumber) RELSEG_SIZE);
+
num_to_write++;
}
*************** BufferSync(int flags)
*** 1225,1246 ****
if (num_to_write == 0)
return; /* nothing to do */
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
/*
* Loop over all buffers again, and write the ones (still) marked with
! * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
! * since we might as well dump soon-to-be-recycled buffers first.
*
* Note that we don't read the buffer alloc count here --- that should be
* left untouched till the next BgBufferSync() call.
! */
! buf_id = StrategySyncStart(NULL, NULL);
! num_to_scan = NBuffers;
num_written = 0;
! while (num_to_scan-- > 0)
! {
! volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
/*
* We don't need to acquire the lock here, because we're only looking
--- 1281,1351 ----
if (num_to_write == 0)
return; /* nothing to do */
+ /*
+ * Sort the list of buffers to write. It's then straightforward to
+ * count the approximate number of files involved. There may be
+ * some small error from buffers that turn out to be skipped below,
+ * but for the purpose the file count serves, that's acceptable.
+ */
+ qsort(BufferTags, num_to_write, sizeof(*BufferTags), bufcmp);
+
+ /*
+ * Count the number of unique node/fork/segment combinations. This relies
+ * on the sorted order to make sure all matching relation/fork/segment
+ * values, the effective keys here, appear in one contiguous section.
+ */
+
+ /* Initialize with the first entry in the dirty buffer list */
+ last_seen_rel = BufferTags[0].tag.rnode.relNode;
+ last_seen_fork = BufferTags[0].tag.forkNum;
+ last_seen_seg = BufferTags[0].segNum;
+ dirty_segments = 1;
+ segment_dirty_blocks = 1;
+
+ for (dirty_buf = 1; dirty_buf < num_to_write; dirty_buf++)
+ {
+ if ((last_seen_rel != BufferTags[dirty_buf].tag.rnode.relNode) ||
+ (last_seen_fork != BufferTags[dirty_buf].tag.forkNum) ||
+ (last_seen_seg != BufferTags[dirty_buf].segNum))
+ {
+ /* Report on previous set for a segment now that we have a total */
+ elog(DEBUG1,
+ "BufferSync %d dirty blocks in relation.segment_fork %u.%u_%d",
+ segment_dirty_blocks, last_seen_rel, last_seen_seg, last_seen_fork);
+
+ last_seen_rel = BufferTags[dirty_buf].tag.rnode.relNode;
+ last_seen_fork = BufferTags[dirty_buf].tag.forkNum;
+ last_seen_seg = BufferTags[dirty_buf].segNum;
+ dirty_segments++;
+ segment_dirty_blocks = 0;
+ }
+ segment_dirty_blocks++;
+ }
+
+ /* Final reporting on the last entry found */
+ elog(DEBUG1,
+ "BufferSync %d dirty blocks in relation.segment_fork %u.%u_%d",
+ segment_dirty_blocks, last_seen_rel, last_seen_seg, last_seen_fork);
+
+ elog(DEBUG1,
+ "BufferSync %d buffers to write, %d total dirty segment file(s) expected to need sync",
+ num_to_write, dirty_segments);
+
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
/*
* Loop over all buffers again, and write the ones (still) marked with
! * BM_CHECKPOINT_NEEDED.
*
* Note that we don't read the buffer alloc count here --- that should be
* left untouched till the next BgBufferSync() call.
! */
num_written = 0;
! for (dirty_buf = 0; dirty_buf < num_to_write; dirty_buf++)
! {
! volatile BufferDesc *bufHdr;
! buf_id = BufferTags[dirty_buf].buf_id;
! bufHdr = &BufferDescriptors[buf_id];
/*
* We don't need to acquire the lock here, because we're only looking
*************** BufferSync(int flags)
*** 1263,1282 ****
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Perform normal bgwriter duties and sleep to throttle our
* I/O rate.
*/
--- 1368,1373 ----
*************** BufferSync(int flags)
*** 1284,1292 ****
(double) num_written / num_to_write);
}
}
-
- if (++buf_id >= NBuffers)
- buf_id = 0;
}
/*
--- 1375,1380 ----
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)