We have a customer who was unable to migrate from a well-known
commercial database product to pg because they have a very large
software base that holds cursors open for very long periods of
time, preventing GlobalXmin values from advancing, leading to
bloat.  They have a standard test that exercises hundreds of client
connections for days at a time, and saw disk space and CPU usage
climbing to unacceptable levels[1].  The other database product had
no such problems.  We suggested the obvious solutions that we
always suggest on the community lists, and they had perfectly good
reasons why those were not viable for them -- we either needed to
prevent bloat under their current software or they could not
migrate.

What they wanted was what happened in the other database product --
if a snapshot got sufficiently old that cleaning up the MVCC data
was a problem *and* the snapshot was used again *and* it read a
page which had been modified far enough back that it was not
possible to return correct data, then they wanted to receive a
"snapshot too old" error.  I wrote a patch to do that, but they
didn't see much improvement, because all (auto)vacuum processes
blocked indefinitely on the btree index pages where a cursor was
parked.  So they still got massive bloat and consequent problems.
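The triggering condition can be sketched in a few lines of C.  This is
a toy model with hypothetical names, not the patch itself; the point is
that merely holding an old snapshot never errors, the error fires only
when the old snapshot actually reads a page modified after it was taken:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* WAL position */

/*
 * Hypothetical sketch of the "snapshot too old" test described above.
 * All three conditions must hold: the snapshot has aged past the
 * cleanup threshold, it is used again, and the page it reads was
 * modified after the snapshot was taken, so correct data can no
 * longer be guaranteed.
 */
static bool
snapshot_too_old(bool snapshot_past_threshold,  /* old enough to matter? */
                 XLogRecPtr page_lsn,           /* page's last modification */
                 XLogRecPtr snapshot_lsn)       /* WAL position at snapshot */
{
    return snapshot_past_threshold && page_lsn > snapshot_lsn;
}
```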

It seemed odd to me that a btree implementation based on the
Lehman & Yao techniques would block anything, because the concept
is that it reads everything it needs off the index page and pauses
"between" pages.  We sorta do that, but the "interlock" between
vacuum processes and other index usages basically involves holding
a pin on the page we just read until we are done using the values
we read off that page, and treating the pin as a lighter lock than
a shared (or READ) lock.  A pin conflicts only with a "super
exclusive" lock, which is taken by acquiring an exclusive lock
only once no other process holds a pin on the page.  This
ensures that at the end of a vacuum pass over a btree index there
are no TIDs in process-local memory from before the start of the
pass, and consequently no scan can read a new tuple assigned to the
same TID value after the rest of the vacuum phases run.  So a pin
on a btree page can block a vacuum process indefinitely.
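To make that concrete, here is a toy model of the super-exclusive
acquisition (hypothetical types, not the actual buffer-manager code)
showing why a parked cursor's pin blocks vacuum:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a shared buffer header. */
typedef struct Buf {
    int  pin_count;       /* pins held by backends, including our own */
    bool exclusive_lock;  /* content lock held exclusively */
} Buf;

/*
 * A "super exclusive" lock: take the exclusive content lock, but only
 * once no other backend holds a pin.  We assume the caller holds
 * exactly one pin (its own), so any additional pin means another scan
 * is parked on the page and vacuum must wait.
 */
static bool
try_super_exclusive(Buf *buf)
{
    if (buf->pin_count > 1)
        return false;          /* another backend is parked here; block */
    buf->exclusive_lock = true;
    return true;
}
```

A cursor that sits on a page indefinitely keeps `pin_count` above one
indefinitely, and in this model vacuum can never proceed.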

Interestingly, the btree README points out that using the old TID
with a new tuple poses no hazard for a scan using an MVCC snapshot,
because the new tuple would not be visible to a snapshot created
that long ago.  The customer's cursors which were causing the
problems all use MVCC snapshots, so I went looking to see whether
we couldn't take advantage of this fact.  For the most part it
seemed we were OK if we dropped pins with the READ lock for a btree
scan which used an MVCC snapshot.  I found that the LP_DEAD hinting
would be a problem with an old TID, but I figured we could work
around that by storing the page LSN into the scan position structure
when the page contents were read, and only doing hinting if that
matched when we were ready to do the hinting.  That wouldn't work
for an index which was not WAL-logged, so the patch still holds
pins for those.  Robert pointed out that the visibility information
for an index-only scan wasn't checked while the index page READ
lock was held, so those scans also still hold the pins.
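A minimal sketch of that LSN check, with simplified types (the real
code compares the LSN saved in the scan position structure against the
page's current LSN after re-reading the buffer):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* WAL position, as in PostgreSQL */

/*
 * After re-acquiring the page without having held a pin, hinting is
 * only safe if the page was not modified in the meantime.  For a
 * WAL-logged index any modification advances the page LSN, so an
 * unchanged LSN proves the TIDs we remembered still match the page.
 * For a non-WAL-logged index the LSN proves nothing, so we can't
 * drop the pin in the first place.
 */
static bool
safe_to_set_lp_dead(XLogRecPtr lsn_at_read, XLogRecPtr lsn_now,
                    bool index_is_wal_logged)
{
    if (!index_is_wal_logged)
        return false;          /* LSN does not advance; can't prove anything */
    return lsn_now == lsn_at_read;
}
```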

There was also a problem with the fact that the code conflated the
notion of having a valid scan position with holding a pin on a
block in the index.  Those two concepts needed to be teased apart,
which I did using some new macros and figuring out which one to use
where.  Without a pin on the block, we also needed to remember what
block we had been processing to be able to do the LP_DEAD hinting;
for that the block number was added to the structure which tracks
scan position.
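The separation can be illustrated with two toy predicates (a
simplified sketch of the idea behind the patch's BTScanPosIsPinned and
BTScanPosIsValid macros):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef int      Buffer;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define InvalidBuffer      0

/* Simplified scan position: block number survives after the pin is dropped. */
typedef struct ScanPos {
    Buffer      buf;       /* valid only while we hold a pin */
    BlockNumber currPage;  /* remembered so we can re-read for hinting */
} ScanPos;

/* "Pinned" requires a live buffer reference... */
static bool
pos_is_pinned(const ScanPos *p)
{
    return p->buf != InvalidBuffer;
}

/* ...but "valid" only requires knowing which block we were on. */
static bool
pos_is_valid(const ScanPos *p)
{
    return p->currPage != InvalidBlockNumber;
}
```

A position can thus be valid but unpinned, which is exactly the state a
parked MVCC scan is left in once the pin is released.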

Finally, there was an "optimization" for marking the buffer position
for a possible restore that was incompatible with releasing the pin.
I use quotes because the optimization adds overhead to every move to
the next page, in order to set some variables in a structure when a
mark is requested, instead of just running two fairly small memcpy()
operations at mark time.  The customer's two-day benchmark showed no
performance hit from removing it, and looking at the code I would be
amazed if the optimization yielded a measurable benefit.  In general,
optimizing a mark operation by adding overhead to every step through
the scan seems dubious.
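Without the optimization, a mark reduces to a bounded copy of the live
position.  A simplified sketch of that pattern (toy struct; the
offsetof arithmetic mirrors what the patch does with BTScanPosData,
copying only the header plus the used prefix of the items array):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct Item { int tid; } Item;

typedef struct Pos {
    int  lastItem;       /* index of last used entry in items[] */
    int  itemIndex;      /* current position within items[] */
    Item items[1024];
} Pos;

/*
 * Copy only the used portion of the position: the fixed header plus
 * items[0..lastItem], rather than the whole 1024-entry array.
 */
static void
mark_position(Pos *mark, const Pos *curr)
{
    memcpy(mark, curr,
           offsetof(Pos, items[1]) + curr->lastItem * sizeof(Item));
}
```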

At some point we could consider building on this patch to recheck
index conditions for heap access when a non-MVCC snapshot is used,
check the visibility map for referenced heap pages when the TIDs
are read for an index-only scan, and/or skip LP_DEAD hinting for
non-WAL-logged indexes.  But all those are speculative future work;
this is a conservative implementation that just didn't modify
pinning where there were any confounding factors.

In combination with the snapshot-too-old patch, the customer's 2-day
benchmark ran without problematic levels of bloat, controlling disk
space growth and keeping CPU usage level.  On its own, this patch
would at least allow autovacuum to maintain statistics, which probably
makes it worthwhile by itself, though it will probably not have a very
big impact on bloat without both patches.

At least two README files and a lot of comments need to be modified
before commit to reflect the change in approach, if this is accepted
by the community.  Since this work is very recent I haven't had time
to do that yet.

git diff --stat tells me this patch has:

4 files changed, 208 insertions(+), 128 deletions(-)

... and the related "snapshot too old" patch has:

16 files changed, 145 insertions(+), 19 deletions(-)

Given that these are going into the last CF and the related patch is
really still at the proof-of-concept phase, it may be too late to
consider them for PostgreSQL 9.5, but I would still like a serious
review now, both in case people feel strongly about controlling bloat
in the face of old snapshots, and so I know whether to continue this
work just for the Postgres Plus Advanced Server fork or for the
community.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


[1] This test maintains a certain transaction rate, with user think
times on the connections, so performance isn't measured by a tps
rate but by how much CPU is consumed to maintain that rate.
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***************
*** 404,410 **** btbeginscan(PG_FUNCTION_ARGS)
  
  	/* allocate private workspace */
  	so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
! 	so->currPos.buf = so->markPos.buf = InvalidBuffer;
  	if (scan->numberOfKeys > 0)
  		so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
  	else
--- 404,411 ----
  
  	/* allocate private workspace */
  	so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
! 	BTScanPosInvalidate(so->currPos);
! 	BTScanPosInvalidate(so->markPos);
  	if (scan->numberOfKeys > 0)
  		so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
  	else
***************
*** 424,431 **** btbeginscan(PG_FUNCTION_ARGS)
  	 * scan->xs_itupdesc whether we'll need it or not, since that's so cheap.
  	 */
  	so->currTuples = so->markTuples = NULL;
- 	so->currPos.nextTupleOffset = 0;
- 	so->markPos.nextTupleOffset = 0;
  
  	scan->xs_itupdesc = RelationGetDescr(rel);
  
--- 425,430 ----
***************
*** 451,467 **** btrescan(PG_FUNCTION_ARGS)
  	{
  		/* Before leaving current page, deal with any killed items */
  		if (so->numKilled > 0)
! 			_bt_killitems(scan, false);
! 		ReleaseBuffer(so->currPos.buf);
! 		so->currPos.buf = InvalidBuffer;
  	}
  
! 	if (BTScanPosIsValid(so->markPos))
! 	{
! 		ReleaseBuffer(so->markPos.buf);
! 		so->markPos.buf = InvalidBuffer;
! 	}
! 	so->markItemIndex = -1;
  
  	/*
  	 * Allocate tuple workspace arrays, if needed for an index-only scan and
--- 450,462 ----
  	{
  		/* Before leaving current page, deal with any killed items */
  		if (so->numKilled > 0)
! 			_bt_killitems(scan);
! 		BTScanPosUnpinIfPinned(so->currPos);
! 		BTScanPosInvalidate(so->currPos);
  	}
  
! 	BTScanPosUnpinIfPinned(so->markPos);
! 	BTScanPosInvalidate(so->markPos);
  
  	/*
  	 * Allocate tuple workspace arrays, if needed for an index-only scan and
***************
*** 515,531 **** btendscan(PG_FUNCTION_ARGS)
  	{
  		/* Before leaving current page, deal with any killed items */
  		if (so->numKilled > 0)
! 			_bt_killitems(scan, false);
! 		ReleaseBuffer(so->currPos.buf);
! 		so->currPos.buf = InvalidBuffer;
  	}
  
! 	if (BTScanPosIsValid(so->markPos))
! 	{
! 		ReleaseBuffer(so->markPos.buf);
! 		so->markPos.buf = InvalidBuffer;
! 	}
! 	so->markItemIndex = -1;
  
  	/* Release storage */
  	if (so->keyData != NULL)
--- 510,522 ----
  	{
  		/* Before leaving current page, deal with any killed items */
  		if (so->numKilled > 0)
! 			_bt_killitems(scan);
! 		BTScanPosUnpinIfPinned(so->currPos);
  	}
  
! 	BTScanPosUnpinIfPinned(so->markPos);
! 
! 	/* No need to invalidate positions, the RAM is about to be freed. */
  
  	/* Release storage */
  	if (so->keyData != NULL)
***************
*** 552,574 **** btmarkpos(PG_FUNCTION_ARGS)
  	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
  	BTScanOpaque so = (BTScanOpaque) scan->opaque;
  
! 	/* we aren't holding any read locks, but gotta drop the pin */
! 	if (BTScanPosIsValid(so->markPos))
! 	{
! 		ReleaseBuffer(so->markPos.buf);
! 		so->markPos.buf = InvalidBuffer;
! 	}
  
! 	/*
! 	 * Just record the current itemIndex.  If we later step to next page
! 	 * before releasing the marked position, _bt_steppage makes a full copy of
! 	 * the currPos struct in markPos.  If (as often happens) the mark is moved
! 	 * before we leave the page, we don't have to do that work.
! 	 */
! 	if (BTScanPosIsValid(so->currPos))
! 		so->markItemIndex = so->currPos.itemIndex;
! 	else
! 		so->markItemIndex = -1;
  
  	/* Also record the current positions of any array keys */
  	if (so->numArrayKeys)
--- 543,557 ----
  	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
  	BTScanOpaque so = (BTScanOpaque) scan->opaque;
  
! 	memcpy(&so->markPos, &so->currPos,
! 		   offsetof(BTScanPosData, items[1]) +
! 		   so->currPos.lastItem * sizeof(BTScanPosItem));
! 	if (so->markTuples)
! 		memcpy(so->markTuples, so->currTuples,
! 			   so->currPos.nextTupleOffset);
  
! 	/* We don't take out an extra pin for the mark position. */
! 	so->markPos.buf = InvalidBuffer;
  
  	/* Also record the current positions of any array keys */
  	if (so->numArrayKeys)
***************
*** 590,602 **** btrestrpos(PG_FUNCTION_ARGS)
  	if (so->numArrayKeys)
  		_bt_restore_array_keys(scan);
  
! 	if (so->markItemIndex >= 0)
  	{
  		/*
  		 * The mark position is on the same page we are currently on. Just
  		 * restore the itemIndex.
  		 */
! 		so->currPos.itemIndex = so->markItemIndex;
  	}
  	else
  	{
--- 573,585 ----
  	if (so->numArrayKeys)
  		_bt_restore_array_keys(scan);
  
! 	if (so->markPos.currPage == so->currPos.currPage)
  	{
  		/*
  		 * The mark position is on the same page we are currently on. Just
  		 * restore the itemIndex.
  		 */
! 		so->currPos.itemIndex = so->markPos.itemIndex;
  	}
  	else
  	{
***************
*** 606,620 **** btrestrpos(PG_FUNCTION_ARGS)
  			/* Before leaving current page, deal with any killed items */
  			if (so->numKilled > 0 &&
  				so->currPos.buf != so->markPos.buf)
! 				_bt_killitems(scan, false);
! 			ReleaseBuffer(so->currPos.buf);
! 			so->currPos.buf = InvalidBuffer;
  		}
  
  		if (BTScanPosIsValid(so->markPos))
  		{
- 			/* bump pin on mark buffer for assignment to current buffer */
- 			IncrBufferRefCount(so->markPos.buf);
  			memcpy(&so->currPos, &so->markPos,
  				   offsetof(BTScanPosData, items[1]) +
  				   so->markPos.lastItem * sizeof(BTScanPosItem));
--- 589,600 ----
  			/* Before leaving current page, deal with any killed items */
  			if (so->numKilled > 0 &&
  				so->currPos.buf != so->markPos.buf)
! 				_bt_killitems(scan);
! 			BTScanPosUnpinIfPinned(so->currPos);
  		}
  
  		if (BTScanPosIsValid(so->markPos))
  		{
  			memcpy(&so->currPos, &so->markPos,
  				   offsetof(BTScanPosData, items[1]) +
  				   so->markPos.lastItem * sizeof(BTScanPosItem));
***************
*** 622,627 **** btrestrpos(PG_FUNCTION_ARGS)
--- 602,609 ----
  				memcpy(so->currTuples, so->markTuples,
  					   so->markPos.nextTupleOffset);
  		}
+ 		else
+ 			BTScanPosInvalidate(so->currPos);
  	}
  
  	PG_RETURN_VOID();
*** a/src/backend/access/nbtree/nbtsearch.c
--- b/src/backend/access/nbtree/nbtsearch.c
***************
*** 22,27 ****
--- 22,28 ----
  #include "storage/predicate.h"
  #include "utils/lsyscache.h"
  #include "utils/rel.h"
+ #include "utils/tqual.h"
  
  
  static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
***************
*** 31,36 **** static void _bt_saveitem(BTScanOpaque so, int itemIndex,
--- 32,67 ----
  static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
  static Buffer _bt_walk_left(Relation rel, Buffer buf);
  static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
+ static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+ 
+ 
+ /*
+  *	_bt_drop_lock_and_maybe_pin()
+  *
+  * Unlock the buffer; and if it is safe to release the pin, do that, too.  It
+  * is safe if the scan is using an MVCC snapshot and the index is WAL-logged.
+  * This will prevent vacuum from stalling in a blocked state trying to read a
+  * page when a cursor is sitting on it -- at least in many important cases.
+  *
+  * Set the buffer to invalid if the pin is released, since the buffer may be
+  * re-used.  If we need to go back to this block (for example, to apply
+  * LP_DEAD hints) we must get a fresh reference to the buffer.  Hopefully it
+  * will remain in shared memory for as long as it takes to scan the index
+  * buffer page.
+  */
+ static void
+ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
+ {
+ 	LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);
+ 
+ 	if (IsMVCCSnapshot(scan->xs_snapshot) &&
+ 		RelationNeedsWAL(scan->indexRelation) &&
+ 		!scan->xs_want_itup)
+ 	{
+ 		ReleaseBuffer(sp->buf);
+ 		sp->buf = InvalidBuffer;
+ 	}
+ }
  
  
  /*
***************
*** 506,511 **** _bt_first(IndexScanDesc scan, ScanDirection dir)
--- 537,544 ----
  	StrategyNumber strat_total;
  	BTScanPosItem *currItem;
  
+ 	Assert(!BTScanPosIsValid(so->currPos));
+ 
  	pgstat_count_index_scan(rel);
  
  	/*
***************
*** 942,950 **** _bt_first(IndexScanDesc scan, ScanDirection dir)
  	/* don't need to keep the stack around... */
  	_bt_freestack(stack);
  
- 	/* remember which buffer we have pinned, if any */
- 	so->currPos.buf = buf;
- 
  	if (!BufferIsValid(buf))
  	{
  		/*
--- 975,980 ----
***************
*** 970,976 **** _bt_first(IndexScanDesc scan, ScanDirection dir)
  		so->currPos.moreRight = false;
  	}
  	so->numKilled = 0;			/* just paranoia */
- 	so->markItemIndex = -1;		/* ditto */
  
  	/* position to the precise item on the page */
  	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
--- 1000,1005 ----
***************
*** 1023,1028 **** _bt_first(IndexScanDesc scan, ScanDirection dir)
--- 1052,1061 ----
  			break;
  	}
  
+ 	/* remember which buffer we have pinned, if any */
+ 	Assert(!BTScanPosIsValid(so->currPos));
+ 	so->currPos.buf = buf;
+ 
  	/*
  	 * Now load data from the first page of the scan.
  	 */
***************
*** 1032,1043 **** _bt_first(IndexScanDesc scan, ScanDirection dir)
  		 * There's no actually-matching data on this page.  Try to advance to
  		 * the next page.  Return false if there's no matching data at all.
  		 */
  		if (!_bt_steppage(scan, dir))
  			return false;
  	}
! 
! 	/* Drop the lock, but not pin, on the current page */
! 	LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
  
  	/* OK, itemIndex says what to return */
  	currItem = &so->currPos.items[so->currPos.itemIndex];
--- 1065,1079 ----
  		 * There's no actually-matching data on this page.  Try to advance to
  		 * the next page.  Return false if there's no matching data at all.
  		 */
+ 		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
  		if (!_bt_steppage(scan, dir))
  			return false;
  	}
! 	else
! 	{
! 		/* Drop the lock, and maybe the pin, on the current page */
! 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
! 	}
  
  	/* OK, itemIndex says what to return */
  	currItem = &so->currPos.items[so->currPos.itemIndex];
***************
*** 1051,1058 **** _bt_first(IndexScanDesc scan, ScanDirection dir)
  /*
   *	_bt_next() -- Get the next item in a scan.
   *
!  *		On entry, so->currPos describes the current page, which is pinned
!  *		but not locked, and so->currPos.itemIndex identifies which item was
   *		previously returned.
   *
   *		On successful exit, scan->xs_ctup.t_self is set to the TID of the
--- 1087,1094 ----
  /*
   *	_bt_next() -- Get the next item in a scan.
   *
!  *		On entry, so->currPos describes the current page, which may be pinned
!  *		but is not locked, and so->currPos.itemIndex identifies which item was
   *		previously returned.
   *
   *		On successful exit, scan->xs_ctup.t_self is set to the TID of the
***************
*** 1076,1101 **** _bt_next(IndexScanDesc scan, ScanDirection dir)
  	{
  		if (++so->currPos.itemIndex > so->currPos.lastItem)
  		{
- 			/* We must acquire lock before applying _bt_steppage */
- 			Assert(BufferIsValid(so->currPos.buf));
- 			LockBuffer(so->currPos.buf, BT_READ);
  			if (!_bt_steppage(scan, dir))
  				return false;
- 			/* Drop the lock, but not pin, on the new page */
- 			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
  		}
  	}
  	else
  	{
  		if (--so->currPos.itemIndex < so->currPos.firstItem)
  		{
- 			/* We must acquire lock before applying _bt_steppage */
- 			Assert(BufferIsValid(so->currPos.buf));
- 			LockBuffer(so->currPos.buf, BT_READ);
  			if (!_bt_steppage(scan, dir))
  				return false;
- 			/* Drop the lock, but not pin, on the new page */
- 			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
  		}
  	}
  
--- 1112,1127 ----
***************
*** 1135,1141 **** _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
  	IndexTuple	itup;
  	bool		continuescan;
  
! 	/* we must have the buffer pinned and locked */
  	Assert(BufferIsValid(so->currPos.buf));
  
  	page = BufferGetPage(so->currPos.buf);
--- 1161,1170 ----
  	IndexTuple	itup;
  	bool		continuescan;
  
! 	/*
! 	 * We must have the buffer pinned and locked, but the usual macro can't be
! 	 * used here; this function is what establishes a valid currPos.
! 	 */
  	Assert(BufferIsValid(so->currPos.buf));
  
  	page = BufferGetPage(so->currPos.buf);
***************
*** 1144,1149 **** _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
--- 1173,1191 ----
  	maxoff = PageGetMaxOffsetNumber(page);
  
  	/*
+ 	 * We note the buffer's block number so that we can release the pin later.
+ 	 * This allows us to re-read the buffer if it is needed again for hinting.
+ 	 */
+ 	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+ 
+ 	/*
+ 	 * We save the LSN of the page as we read it, so that we know whether it
+ 	 * safe to apply LP_DEAD hints to the page later.  This allows us to drop
+ 	 * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ 	 */
+ 	so->currPos.lsn = PageGetLSN(page);
+ 
+ 	/*
  	 * we must save the page's right-link while scanning it; this tells us
  	 * where to step right to after we're done with these items.  There is no
  	 * corresponding need for the left-link, since splits always go right.
***************
*** 1153,1158 **** _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
--- 1195,1206 ----
  	/* initialize tuple workspace to empty */
  	so->currPos.nextTupleOffset = 0;
  
+ 	/*
+ 	 * Now that the current page has been made consistent, the macro should be
+ 	 * good.
+ 	 */
+ 	Assert(BTScanPosIsPinned(so->currPos));
+ 
  	if (ScanDirectionIsForward(dir))
  	{
  		/* load items[] in ascending order */
***************
*** 1241,1251 **** _bt_saveitem(BTScanOpaque so, int itemIndex,
  /*
   *	_bt_steppage() -- Step to next page containing valid data for scan
   *
!  * On entry, so->currPos.buf must be pinned and read-locked.  We'll drop
!  * the lock and pin before moving to next page.
   *
!  * On success exit, we hold pin and read-lock on the next interesting page,
!  * and so->currPos is updated to contain data from that page.
   *
   * If there are no more matching records in the given direction, we drop all
   * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
--- 1289,1302 ----
  /*
   *	_bt_steppage() -- Step to next page containing valid data for scan
   *
!  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
!  * if pinned, we'll drop the pin before moving to next page.  The buffer is
!  * not locked on entry.
   *
!  * On success exit, so->currPos is updated to contain data from the next
!  * interesting page.  For success on a scan using a non-MVCC snapshot we hold
!  * a pin, but not a read lock, on that page.  If we do not hold the pin, we
!  * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
   *
   * If there are no more matching records in the given direction, we drop all
   * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
***************
*** 1254,1289 **** static bool
  _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  {
  	BTScanOpaque so = (BTScanOpaque) scan->opaque;
! 	Relation	rel;
  	Page		page;
  	BTPageOpaque opaque;
  
! 	/* we must have the buffer pinned and locked */
! 	Assert(BufferIsValid(so->currPos.buf));
  
  	/* Before leaving current page, deal with any killed items */
  	if (so->numKilled > 0)
! 		_bt_killitems(scan, true);
! 
! 	/*
! 	 * Before we modify currPos, make a copy of the page data if there was a
! 	 * mark position that needs it.
! 	 */
! 	if (so->markItemIndex >= 0)
! 	{
! 		/* bump pin on current buffer for assignment to mark buffer */
! 		IncrBufferRefCount(so->currPos.buf);
! 		memcpy(&so->markPos, &so->currPos,
! 			   offsetof(BTScanPosData, items[1]) +
! 			   so->currPos.lastItem * sizeof(BTScanPosItem));
! 		if (so->markTuples)
! 			memcpy(so->markTuples, so->currTuples,
! 				   so->currPos.nextTupleOffset);
! 		so->markPos.itemIndex = so->markItemIndex;
! 		so->markItemIndex = -1;
! 	}
! 
! 	rel = scan->indexRelation;
  
  	if (ScanDirectionIsForward(dir))
  	{
--- 1305,1319 ----
  _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  {
  	BTScanOpaque so = (BTScanOpaque) scan->opaque;
! 	Relation	rel = scan->indexRelation;
  	Page		page;
  	BTPageOpaque opaque;
  
! 	Assert(BTScanPosIsValid(so->currPos));
  
  	/* Before leaving current page, deal with any killed items */
  	if (so->numKilled > 0)
! 		_bt_killitems(scan);
  
  	if (ScanDirectionIsForward(dir))
  	{
***************
*** 1294,1307 **** _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  		/* Remember we left a page with data */
  		so->currPos.moreLeft = true;
  
  		for (;;)
  		{
- 			/* release the previous buffer */
- 			_bt_relbuf(rel, so->currPos.buf);
- 			so->currPos.buf = InvalidBuffer;
  			/* if we're at end of scan, give up */
  			if (blkno == P_NONE || !so->currPos.moreRight)
  				return false;
  			/* check for interrupts while we're not holding any buffer lock */
  			CHECK_FOR_INTERRUPTS();
  			/* step right one page */
--- 1324,1340 ----
  		/* Remember we left a page with data */
  		so->currPos.moreLeft = true;
  
+ 		/* release the previous buffer, if pinned */
+ 		BTScanPosUnpinIfPinned(so->currPos);
+ 
  		for (;;)
  		{
  			/* if we're at end of scan, give up */
  			if (blkno == P_NONE || !so->currPos.moreRight)
+ 			{
+ 				BTScanPosInvalidate(so->currPos);
  				return false;
+ 			}
  			/* check for interrupts while we're not holding any buffer lock */
  			CHECK_FOR_INTERRUPTS();
  			/* step right one page */
***************
*** 1317,1324 **** _bt_steppage(IndexScanDesc scan, ScanDirection dir)
--- 1350,1359 ----
  				if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
  					break;
  			}
+ 
  			/* nope, keep going */
  			blkno = opaque->btpo_next;
+ 			_bt_relbuf(rel, so->currPos.buf);
  		}
  	}
  	else
***************
*** 1326,1331 **** _bt_steppage(IndexScanDesc scan, ScanDirection dir)
--- 1361,1372 ----
  		/* Remember we left a page with data */
  		so->currPos.moreRight = true;
  
+ 		/* FIXME: Walking left needs to be lighter on the locking and pins? */
+ 		if (BTScanPosIsPinned(so->currPos))
+ 			LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE);
+ 		else
+ 			so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
+ 
  		/*
  		 * Walk left to the next page with data.  This is much more complex
  		 * than the walk-right case because of the possibility that the page
***************
*** 1339,1345 **** _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  			if (!so->currPos.moreLeft)
  			{
  				_bt_relbuf(rel, so->currPos.buf);
! 				so->currPos.buf = InvalidBuffer;
  				return false;
  			}
  
--- 1380,1386 ----
  			if (!so->currPos.moreLeft)
  			{
  				_bt_relbuf(rel, so->currPos.buf);
! 				BTScanPosInvalidate(so->currPos);
  				return false;
  			}
  
***************
*** 1348,1354 **** _bt_steppage(IndexScanDesc scan, ScanDirection dir)
--- 1389,1398 ----
  
  			/* if we're physically at end of index, return failure */
  			if (so->currPos.buf == InvalidBuffer)
+ 			{
+ 				BTScanPosInvalidate(so->currPos);
  				return false;
+ 			}
  
  			/*
  			 * Okay, we managed to move left to a non-deleted page. Done if
***************
*** 1368,1373 **** _bt_steppage(IndexScanDesc scan, ScanDirection dir)
--- 1412,1420 ----
  		}
  	}
  
+ 	/* Drop the lock, and maybe the pin, on the current page */
+ 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ 
  	return true;
  }
  
***************
*** 1604,1610 **** _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
  		 * exists.
  		 */
  		PredicateLockRelation(rel, scan->xs_snapshot);
! 		so->currPos.buf = InvalidBuffer;
  		return false;
  	}
  
--- 1651,1657 ----
  		 * exists.
  		 */
  		PredicateLockRelation(rel, scan->xs_snapshot);
! 		BTScanPosInvalidate(so->currPos);
  		return false;
  	}
  
***************
*** 1647,1653 **** _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
  		so->currPos.moreRight = false;
  	}
  	so->numKilled = 0;			/* just paranoia */
- 	so->markItemIndex = -1;		/* ditto */
  
  	/*
  	 * Now load data from the first page of the scan.
--- 1694,1699 ----
***************
*** 1658,1669 **** _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
  		 * There's no actually-matching data on this page.  Try to advance to
  		 * the next page.  Return false if there's no matching data at all.
  		 */
  		if (!_bt_steppage(scan, dir))
  			return false;
  	}
! 
! 	/* Drop the lock, but not pin, on the current page */
! 	LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
  
  	/* OK, itemIndex says what to return */
  	currItem = &so->currPos.items[so->currPos.itemIndex];
--- 1704,1718 ----
  		 * There's no actually-matching data on this page.  Try to advance to
  		 * the next page.  Return false if there's no matching data at all.
  		 */
+ 		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
  		if (!_bt_steppage(scan, dir))
  			return false;
  	}
! 	else
! 	{
! 		/* Drop the lock, and maybe the pin, on the current page */
! 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
! 	}
  
  	/* OK, itemIndex says what to return */
  	currItem = &so->currPos.items[so->currPos.itemIndex];
*** a/src/backend/access/nbtree/nbtutils.c
--- b/src/backend/access/nbtree/nbtutils.c
***************
*** 1724,1732 **** _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
   * scan->so contains information about the current page and killed tuples
   * thereon (generally, this should only be called if so->numKilled > 0).
   *
!  * The caller must have pin on so->currPos.buf, but may or may not have
!  * read-lock, as indicated by haveLock.  Note that we assume read-lock
!  * is sufficient for setting LP_DEAD status (which is only a hint).
   *
   * We match items by heap TID before assuming they are the right ones to
   * delete.  We cope with cases where items have moved right due to insertions.
--- 1724,1732 ----
   * scan->so contains information about the current page and killed tuples
   * thereon (generally, this should only be called if so->numKilled > 0).
   *
!  * The caller may or may not have a pin on so->currPos.buf.  Note that we
!  * assume read-lock is sufficient for setting LP_DEAD status (which is only a
!  * hint).
   *
   * We match items by heap TID before assuming they are the right ones to
   * delete.  We cope with cases where items have moved right due to insertions.
***************
*** 1741,1747 **** _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
   * recycled.)
   */
  void
! _bt_killitems(IndexScanDesc scan, bool haveLock)
  {
  	BTScanOpaque so = (BTScanOpaque) scan->opaque;
  	Page		page;
--- 1741,1747 ----
   * recycled.)
   */
  void
! _bt_killitems(IndexScanDesc scan)
  {
  	BTScanOpaque so = (BTScanOpaque) scan->opaque;
  	Page		page;
***************
*** 1749,1767 **** _bt_killitems(IndexScanDesc scan, bool haveLock)
  	OffsetNumber minoff;
  	OffsetNumber maxoff;
  	int			i;
  	bool		killedsomething = false;
  
! 	Assert(BufferIsValid(so->currPos.buf));
  
! 	if (!haveLock)
  		LockBuffer(so->currPos.buf, BT_READ);
  
! 	page = BufferGetPage(so->currPos.buf);
  	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  	minoff = P_FIRSTDATAKEY(opaque);
  	maxoff = PageGetMaxOffsetNumber(page);
  
! 	for (i = 0; i < so->numKilled; i++)
  	{
  		int			itemIndex = so->killedItems[i];
  		BTScanPosItem *kitem = &so->currPos.items[itemIndex];
--- 1749,1800 ----
  	OffsetNumber minoff;
  	OffsetNumber maxoff;
  	int			i;
+ 	int			numKilled = so->numKilled;
  	bool		killedsomething = false;
  
! 	Assert(BTScanPosIsValid(so->currPos));
  
! 	/*
! 	 * Always reset the scan state, so we don't look for same items on other
! 	 * pages.
! 	 */
! 	so->numKilled = 0;
! 
! 	if (BTScanPosIsPinned(so->currPos))
! 	{
! 		/* The buffer is still pinned, but not locked.  Lock it now. */
  		LockBuffer(so->currPos.buf, BT_READ);
  
! 		/* Since the else condition needs page, get it here, too. */
! 		page = BufferGetPage(so->currPos.buf);
! 	}
! 	else
! 	{
! 		Buffer		buf;
! 
! 		/* Attempt to re-read the buffer, getting pin and lock. */
! 		buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
! 
! 		/* It might not exist anymore, in which case we can't hint it. */
! 		if (!BufferIsValid(buf))
! 			return;
! 
! 		page = BufferGetPage(buf);
! 		if (PageGetLSN(page) == so->currPos.lsn)
! 			so->currPos.buf = buf;
! 		else
! 		{
! 			/* Modified while not pinned means hinting is not safe. */
! 			_bt_relbuf(scan->indexRelation, buf);
! 			return;
! 		}
! 	}
! 
  	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  	minoff = P_FIRSTDATAKEY(opaque);
  	maxoff = PageGetMaxOffsetNumber(page);
  
! 	for (i = 0; i < numKilled; i++)
  	{
  		int			itemIndex = so->killedItems[i];
  		BTScanPosItem *kitem = &so->currPos.items[itemIndex];
***************
*** 1799,1812 **** _bt_killitems(IndexScanDesc scan, bool haveLock)
  		MarkBufferDirtyHint(so->currPos.buf, true);
  	}
  
! 	if (!haveLock)
! 		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
! 
! 	/*
! 	 * Always reset the scan state, so we don't look for same items on other
! 	 * pages.
! 	 */
! 	so->numKilled = 0;
  }
  
  
--- 1832,1838 ----
  		MarkBufferDirtyHint(so->currPos.buf, true);
  	}
  
! 	LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
  }
  
  
*** a/src/include/access/nbtree.h
--- b/src/include/access/nbtree.h
***************
*** 518,523 **** typedef struct BTScanPosData
--- 518,525 ----
  {
  	Buffer		buf;			/* if valid, the buffer is pinned */
  
+ 	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
+ 	BlockNumber currPage;		/* page referenced by items array */
  	BlockNumber nextPage;		/* page's right link when we scanned it */
  
  	/*
***************
*** 551,557 **** typedef struct BTScanPosData
  
  typedef BTScanPosData *BTScanPos;
  
! #define BTScanPosIsValid(scanpos) BufferIsValid((scanpos).buf)
  
  /* We need one of these for each equality-type SK_SEARCHARRAY scan key */
  typedef struct BTArrayKeyInfo
--- 553,589 ----
  
  typedef BTScanPosData *BTScanPos;
  
! #define BTScanPosIsPinned(scanpos) \
! ( \
! 	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
! 				!BufferIsValid((scanpos).buf)), \
! 	BufferIsValid((scanpos).buf) \
! )
! #define BTScanPosUnpin(scanpos) \
! 	do { \
! 		ReleaseBuffer((scanpos).buf); \
! 		(scanpos).buf = InvalidBuffer; \
! 	} while (0)
! #define BTScanPosUnpinIfPinned(scanpos) \
! 	do { \
! 		if (BTScanPosIsPinned(scanpos)) \
! 			BTScanPosUnpin(scanpos); \
! 	} while (0)
! 
! #define BTScanPosIsValid(scanpos) \
! ( \
! 	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
! 				!BufferIsValid((scanpos).buf)), \
! 	BlockNumberIsValid((scanpos).currPage) \
! )
! #define BTScanPosInvalidate(scanpos) \
! 	do { \
! 		(scanpos).currPage = InvalidBlockNumber; \
! 		(scanpos).nextPage = InvalidBlockNumber; \
! 		(scanpos).buf = InvalidBuffer; \
! 		(scanpos).lsn = InvalidXLogRecPtr; \
! 		(scanpos).nextTupleOffset = 0; \
! 	} while (0)
  
  /* We need one of these for each equality-type SK_SEARCHARRAY scan key */
  typedef struct BTArrayKeyInfo
***************
*** 589,603 **** typedef struct BTScanOpaqueData
  	char	   *currTuples;		/* tuple storage for currPos */
  	char	   *markTuples;		/* tuple storage for markPos */
  
- 	/*
- 	 * If the marked position is on the same page as current position, we
- 	 * don't use markPos, but just keep the marked itemIndex in markItemIndex
- 	 * (all the rest of currPos is valid for the mark position). Hence, to
- 	 * determine if there is a mark, first look at markItemIndex, then at
- 	 * markPos.
- 	 */
- 	int			markItemIndex;	/* itemIndex, or -1 if not valid */
- 
  	/* keep these last in struct for efficiency */
  	BTScanPosData currPos;		/* current position data */
  	BTScanPosData markPos;		/* marked position, if any */
--- 621,626 ----
***************
*** 697,703 **** extern void _bt_preprocess_keys(IndexScanDesc scan);
  extern IndexTuple _bt_checkkeys(IndexScanDesc scan,
  			  Page page, OffsetNumber offnum,
  			  ScanDirection dir, bool *continuescan);
! extern void _bt_killitems(IndexScanDesc scan, bool haveLock);
  extern BTCycleId _bt_vacuum_cycleid(Relation rel);
  extern BTCycleId _bt_start_vacuum(Relation rel);
  extern void _bt_end_vacuum(Relation rel);
--- 720,726 ----
  extern IndexTuple _bt_checkkeys(IndexScanDesc scan,
  			  Page page, OffsetNumber offnum,
  			  ScanDirection dir, bool *continuescan);
! extern void _bt_killitems(IndexScanDesc scan);
  extern BTCycleId _bt_vacuum_cycleid(Relation rel);
  extern BTCycleId _bt_start_vacuum(Relation rel);
  extern void _bt_end_vacuum(Relation rel);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
