Hi,
On Tue, Apr 4, 2017 at 3:56 PM, Amit Kapila <[email protected]> wrote:
> On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <[email protected]> wrote:
>>
>> Please note that these patches need to be applied on top of [1].
>>
>
> Few more review comments:
>
> 1.
> - page = BufferGetPage(so->hashso_curbuf);
> + blkno = so->currPos.currPage;
> + if (so->hashso_bucket_buf == so->currPos.buf)
> + {
> + buf = so->currPos.buf;
> + LockBuffer(buf, BUFFER_LOCK_SHARE);
> + page = BufferGetPage(buf);
> + }
>
> Here, you are assuming that only the bucket page can be pinned, but I
> think that assumption is not right. When _hash_kill_items() is called
> before moving to the next page, there could be a pin on the overflow
> page. You need some logic to check if the buffer is pinned and then
> just lock it. I think once you do that, it might not be convenient to
> release the pin at the end of this function.
Yes, there are a few cases where we might have a pin on overflow pages
too, and we need to handle such cases in _hash_kill_items. I have taken
care of this in the attached v7 patch. Thanks.
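For clarity, the locking logic in the v7 _hash_kill_items now looks
roughly like this (a condensed sketch of the code in the attached patch;
see the hashutil.c hunk below for the full version):

    if (so->hashso_bucket_buf == so->currPos.buf)
    {
        /* primary bucket page: we always hold a pin, so just lock it */
        buf = so->currPos.buf;
        LockBuffer(buf, BUFFER_LOCK_SHARE);
    }
    else if (havePin)
    {
        /* overflow page on which we still hold a pin: just lock it */
        buf = so->currPos.buf;
        LockBuffer(buf, BUFFER_LOCK_SHARE);
    }
    else
    {
        /* no pin held: re-read the page and check it hasn't changed */
        buf = _hash_getbuf(rel, so->currPos.currPage, HASH_READ,
                           LH_OVERFLOW_PAGE);
        if (PageGetLSN(BufferGetPage(buf)) != so->currPos.lsn)
        {
            /* page was modified; applying LP_DEAD hints is unsafe */
            _hash_relbuf(rel, buf);
            return;
        }
    }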
>
> Add some comments on top of _hash_kill_items to explain the new
> changes, or say something like "See _bt_killitems for details"
Added some more comments on the new changes.
>
> 2.
> + /*
> + * We save the LSN of the page as we read it, so that we know whether it
> + * safe to apply LP_DEAD hints to the page later. This allows us to drop
> + * the pin for MVCC scans, which allows vacuum to avoid blocking.
> + */
> + so->currPos.lsn = PageGetLSN(page);
> +
>
> The second part of the above comment doesn't make sense because vacuum
> will still be blocked, as we hold the pin on the primary bucket page.
That's right. It doesn't make sense because we won't allow vacuum to
start. I have removed it.
>
> 3.
> + {
> + /*
> + * No more matching tuples were found. return FALSE
> + * indicating the same. Also, remember the prev and
> + * next block number so that if fetching tuples using
> + * cursor we remember the page from where to start the
> + * scan.
> + */
> + so->currPos.prevPage = (opaque)->hasho_prevblkno;
> + so->currPos.nextPage = (opaque)->hasho_nextblkno;
>
> You can't read opaque without holding a lock, and by this time it has
> already been released.
I have corrected it. Please refer to the attached v7 patch.
> Also, I think if you want to save the position for
> cursor movement, then you need to save the position of the last bucket
> when the scan completes the overflow chain; however, as you have
> written it, it will always be an invalid block number. I think there
> is a similar problem with the prev block number.
Did you mean the last bucket or the last page? If it is the last page,
then I am already storing it.
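For reference, when the scan runs off the end of the bucket chain
without finding a match, the v7 patch records the position while still
on the last page of the chain (condensed from _hash_readpage in the
attached patch):

    /*
     * Reached the last page of the bucket chain without a match;
     * remember where we are so a cursor can resume from this page.
     */
    if (!BlockNumberIsValid(opaque->hasho_nextblkno))
    {
        so->currPos.prevPage = opaque->hasho_prevblkno;
        so->currPos.nextPage = InvalidBlockNumber;
    }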
>
> 4.
> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber
> offnum,
> + OffsetNumber maxoff, ScanDirection dir)
> +{
> + HashScanOpaque so = (HashScanOpaque) scan->opaque;
> + IndexTuple itup;
> + int itemIndex;
> +
> + if (ScanDirectionIsForward(dir))
> + {
> + /* load items[] in ascending order */
> + itemIndex = 0;
> +
> + /* new page, relocate starting position by binary search */
> + offnum = _hash_binsearch(page, so->hashso_sk_hash);
>
> What is the need to find the offset number when this function already
> has it as an input parameter?
It's not required. I have removed it.
>
> 5.
> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber
> offnum,
> + OffsetNumber maxoff, ScanDirection dir)
>
> I think maxoff is not required to be passed as a parameter to this
> function, as you need it only for the forward scan. You can compute it
> locally.
It is required for both forward and backward scans. However, I am no
longer passing it to _hash_load_qualified_items(); it is now computed
locally there.
>
> 6.
> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber
> offnum,
> + OffsetNumber maxoff, ScanDirection dir)
> {
> ..
> + if (ScanDirectionIsForward(dir))
> + {
> ..
> + while (offnum <= maxoff)
> {
> ..
> + if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
> + _hash_checkqual(scan, itup))
> + {
> + /* tuple is qualified, so remember it */
> + _hash_saveitem(so, itemIndex, offnum, itup);
> + itemIndex++;
> + }
> +
> + offnum = OffsetNumberNext(offnum);
> ..
>
> Why are you traversing the whole page even when there is no match?
> There is a similar problem in the backward scan. I think this is a
> blunder.
Fixed. Please check the attached
'0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev7.patch'
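The fix stops the page scan at the first item that no longer qualifies;
since items on a hash page are kept in hash-key order, nothing beyond
that point can match. In outline (from the attached patch):

    if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
        _hash_checkqual(scan, itup))
    {
        /* tuple is qualified, so remember it */
        _hash_saveitem(so, itemIndex, offnum, itup);
        itemIndex++;
    }
    else
        break;    /* no further item on this page can match */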
>
> 7.
> + if (so->currPos.buf == so->hashso_bucket_buf ||
> + so->currPos.buf == so->hashso_split_bucket_buf)
> + {
> + so->currPos.prevPage = InvalidBlockNumber;
> + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
> + }
> + else
> + {
> + so->currPos.prevPage = (opaque)->hasho_prevblkno;
> + _hash_relbuf(rel, so->currPos.buf);
> + }
> +
> + so->currPos.nextPage = (opaque)->hasho_nextblkno;
>
> What makes you think it is safe to read opaque after releasing the lock?
Nothing; it was a mistake, and I have corrected it. Please check the
attached v7 patch.
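The corrected code copies the block numbers out of the page's special
space while the content lock is still held, and only then unlocks or
releases the buffer (condensed from _hash_readpage in the attached v7
patch):

    if (so->currPos.buf == so->hashso_bucket_buf ||
        so->currPos.buf == so->hashso_split_bucket_buf)
    {
        /* bucket page: read opaque under the lock, keep the pin */
        so->currPos.prevPage = InvalidBlockNumber;
        so->currPos.nextPage = opaque->hasho_nextblkno;
        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
    }
    else
    {
        /* overflow page: read opaque first, then drop lock and pin */
        so->currPos.prevPage = opaque->hasho_prevblkno;
        so->currPos.nextPage = opaque->hasho_nextblkno;
        _hash_relbuf(rel, so->currPos.buf);
        so->currPos.buf = InvalidBuffer;
    }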
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
From c500a917a881948ccd373dcf65942b796abb6dda Mon Sep 17 00:00:00 2001
From: ashu <[email protected]>
Date: Mon, 8 May 2017 18:21:03 +0530
Subject: [PATCH] Rewrite hash index scans to work a page at a timev7
Patch by Ashutosh Sharma <[email protected]>
---
src/backend/access/hash/README | 25 +-
src/backend/access/hash/hash.c | 156 +++--------
src/backend/access/hash/hashpage.c | 10 +-
src/backend/access/hash/hashsearch.c | 491 ++++++++++++++++++++++++++++++++---
src/backend/access/hash/hashutil.c | 70 ++++-
src/include/access/hash.h | 50 +++-
6 files changed, 609 insertions(+), 193 deletions(-)
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..eef7d66 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
-- then, per read request:
reacquire content lock on current page
step to next page if necessary (no chaining of content locks, but keep
- the pin on the primary bucket throughout the scan; we also maintain
- a pin on the page currently being scanned)
- get tuple
- release content lock
+ the pin on the primary bucket throughout the scan)
+ save all the matching tuples from the current index page into an items array
+ release pin and content lock (but if it is the primary bucket page, retain
+ its pin till the end of scan)
+ get tuple from the items array
-- at scan shutdown:
release all pins still held
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
the reader's current-tuple pointer from being invalidated by splits or
compactions. (Of course, other buckets can still be split or compacted.)
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page. Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned. Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage. The heap tuple IDs are then processed while
+not holding any page lock within the index, thereby allowing concurrent
+insertions into the same index page without requiring the reader to re-find
+its current scan position. We do continue to hold a pin on the bucket page, to
+protect against concurrent deletions and bucket splits.
To allow for scans during a bucket split, if at the start of the scan, the
bucket is marked as bucket-being-populated, it scans all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 3eb5b1d..4c60af1 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,22 @@ bool
hashgettuple(IndexScanDesc scan, ScanDirection dir)
{
HashScanOpaque so = (HashScanOpaque) scan->opaque;
- Relation rel = scan->indexRelation;
- Buffer buf;
- Page page;
- OffsetNumber offnum;
- ItemPointer current;
bool res;
+ HashScanPosItem *currItem;
/* Hash indexes are always lossy since we store only the hash code */
scan->xs_recheck = true;
/*
- * We hold pin but not lock on current buffer while outside the hash AM.
- * Reacquire the read lock here.
- */
- if (BufferIsValid(so->hashso_curbuf))
- LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
- /*
* If we've already initialized this scan, we can just advance it in the
* appropriate direction. If we haven't done so yet, we call a routine to
* get the first item in the scan.
*/
- current = &(so->hashso_curpos);
- if (ItemPointerIsValid(current))
+ if (!HashScanPosIsValid(so->currPos))
+ res = _hash_first(scan, dir);
+ else
{
/*
- * An insertion into the current index page could have happened while
- * we didn't have read lock on it. Re-find our position by looking
- * for the TID we previously returned. (Because we hold a pin on the
- * primary bucket page, no deletions or splits could have occurred;
- * therefore we can expect that the TID still exists in the current
- * index page, at an offset >= where we were.)
- */
- OffsetNumber maxoffnum;
-
- buf = so->hashso_curbuf;
- Assert(BufferIsValid(buf));
- page = BufferGetPage(buf);
-
- /*
- * We don't need test for old snapshot here as the current buffer is
- * pinned, so vacuum can't clean the page.
- */
- maxoffnum = PageGetMaxOffsetNumber(page);
- for (offnum = ItemPointerGetOffsetNumber(current);
- offnum <= maxoffnum;
- offnum = OffsetNumberNext(offnum))
- {
- IndexTuple itup;
-
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
- break;
- }
- if (offnum > maxoffnum)
- elog(ERROR, "failed to re-find scan position within index \"%s\"",
- RelationGetRelationName(rel));
- ItemPointerSetOffsetNumber(current, offnum);
-
- /*
* Check to see if we should kill the previously-fetched tuple.
*/
if (scan->kill_prior_tuple)
@@ -341,16 +297,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
* instead, we just forget any excess entries.
*/
if (so->killedItems == NULL)
- so->killedItems = palloc(MaxIndexTuplesPerPage *
- sizeof(HashScanPosItem));
+ so->killedItems = (int *)
+ palloc(MaxIndexTuplesPerPage * sizeof(int));
if (so->numKilled < MaxIndexTuplesPerPage)
- {
- so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
- so->killedItems[so->numKilled].indexOffset =
- ItemPointerGetOffsetNumber(&(so->hashso_curpos));
- so->numKilled++;
- }
+ so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
/*
@@ -358,30 +309,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
*/
res = _hash_next(scan, dir);
}
- else
- res = _hash_first(scan, dir);
-
- /*
- * Skip killed tuples if asked to.
- */
- if (scan->ignore_killed_tuples)
- {
- while (res)
- {
- offnum = ItemPointerGetOffsetNumber(current);
- page = BufferGetPage(so->hashso_curbuf);
- if (!ItemIdIsDead(PageGetItemId(page, offnum)))
- break;
- res = _hash_next(scan, dir);
- }
- }
-
- /* Release read lock on current buffer, but keep it pinned */
- if (BufferIsValid(so->hashso_curbuf))
- LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
/* Return current heap TID on success */
- scan->xs_ctup.t_self = so->hashso_heappos;
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
return res;
}
@@ -396,35 +327,22 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
bool res;
int64 ntids = 0;
+ HashScanPosItem *currItem;
res = _hash_first(scan, ForwardScanDirection);
while (res)
{
- bool add_tuple;
+ currItem = &so->currPos.items[so->currPos.itemIndex];
/*
- * Skip killed tuples if asked to.
+ * _hash_first() and _hash_next() never return
+ * dead tuples. Therefore, we can always add
+ * the tuples to the TIDBitmap without checking
+ * whether a tuple is dead.
*/
- if (scan->ignore_killed_tuples)
- {
- Page page;
- OffsetNumber offnum;
-
- offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
- page = BufferGetPage(so->hashso_curbuf);
- add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
- }
- else
- add_tuple = true;
-
- /* Save tuple ID, and continue scanning */
- if (add_tuple)
- {
- /* Note we mark the tuple ID as requiring recheck */
- tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
- ntids++;
- }
+ tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+ ntids++;
res = _hash_next(scan, ForwardScanDirection);
}
@@ -448,12 +366,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
scan = RelationGetIndexScan(rel, nkeys, norderbys);
so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
- so->hashso_curbuf = InvalidBuffer;
+ HashScanPosInvalidate(so->currPos);
so->hashso_bucket_buf = InvalidBuffer;
so->hashso_split_bucket_buf = InvalidBuffer;
- /* set position invalid (this will cause _hash_first call) */
- ItemPointerSetInvalid(&(so->hashso_curpos));
- ItemPointerSetInvalid(&(so->hashso_heappos));
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
@@ -476,23 +391,16 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- /*
- * Before leaving current page, deal with any killed items.
- * Also, ensure that we acquire lock on current page before
- * calling _hash_kill_items.
- */
- if (so->numKilled > 0)
+ if (HashScanPosIsValid(so->currPos))
{
- LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
- _hash_kill_items(scan);
- LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan, false);
+ _hash_dropscanbuf(rel, so);
}
- _hash_dropscanbuf(rel, so);
-
/* set position invalid (this will cause _hash_first call) */
- ItemPointerSetInvalid(&(so->hashso_curpos));
- ItemPointerSetInvalid(&(so->hashso_heappos));
+ HashScanPosInvalidate(so->currPos);
/* Update scan key, if a new one is given */
if (scankey && scan->numberOfKeys > 0)
@@ -515,20 +423,14 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- /*
- * Before leaving current page, deal with any killed items.
- * Also, ensure that we acquire lock on current page before
- * calling _hash_kill_items.
- */
- if (so->numKilled > 0)
+ if (HashScanPosIsValid(so->currPos))
{
- LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
- _hash_kill_items(scan);
- LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan, false);
+ _hash_dropscanbuf(rel, so);
}
- _hash_dropscanbuf(rel, so);
-
if (so->killedItems != NULL)
pfree(so->killedItems);
pfree(so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 3cd4daa..5861b82 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
{
/* release pin we hold on primary bucket page */
if (BufferIsValid(so->hashso_bucket_buf) &&
- so->hashso_bucket_buf != so->hashso_curbuf)
+ so->hashso_bucket_buf != so->currPos.buf)
_hash_dropbuf(rel, so->hashso_bucket_buf);
so->hashso_bucket_buf = InvalidBuffer;
/* release pin we hold on primary bucket page of bucket being split */
if (BufferIsValid(so->hashso_split_bucket_buf) &&
- so->hashso_split_bucket_buf != so->hashso_curbuf)
+ so->hashso_split_bucket_buf != so->currPos.buf)
_hash_dropbuf(rel, so->hashso_split_bucket_buf);
so->hashso_split_bucket_buf = InvalidBuffer;
/* release any pin we still hold */
- if (BufferIsValid(so->hashso_curbuf))
- _hash_dropbuf(rel, so->hashso_curbuf);
- so->hashso_curbuf = InvalidBuffer;
+ if (BufferIsValid(so->currPos.buf))
+ _hash_dropbuf(rel, so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
/* reset split scan */
so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..a108ba1 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,156 @@
#include "pgstat.h"
#include "utils/rel.h"
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+ ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+ OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+ OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+ Page *pagep, HashPageOpaque *opaquep);
/*
* _hash_next() -- Get the next item in a scan.
*
- * On entry, we have a valid hashso_curpos in the scan, and a
- * pin and read lock on the page that contains that item.
- * We find the next item in the scan, if any.
- * On success exit, we have the page containing the next item
- * pinned and locked.
+ * On entry, so->currPos describes the current page, which may
+ * be pinned but not locked, and so->currPos.itemIndex identifies
+ * which item was previously returned.
+ *
+ * On successful exit, scan->xs_ctup.t_self is set to the TID
+ * of the next heap tuple, and if requested, scan->xs_itup
+ * points to a copy of the index tuple. so->currPos is updated
+ * as needed.
+ *
+ * On failure exit (no more tuples), we return FALSE with no
+ * pins or locks held.
*/
bool
_hash_next(IndexScanDesc scan, ScanDirection dir)
{
Relation rel = scan->indexRelation;
HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ HashScanPosItem *currItem;
+ BlockNumber blkno;
Buffer buf;
Page page;
- OffsetNumber offnum;
- ItemPointer current;
- IndexTuple itup;
-
- /* we still have the buffer pinned and read-locked */
- buf = so->hashso_curbuf;
- Assert(BufferIsValid(buf));
+ HashPageOpaque opaque;
+ bool end_of_scan = false;
/*
- * step to next valid tuple.
+ * Advance to the next tuple on the current page; or, if there are no
+ * more, try to read data from the next or previous page based on the
+ * scan direction. Before moving to the next or previous page, make
+ * sure that we deal with all the killed items.
*/
- if (!_hash_step(scan, &buf, dir))
+ if (ScanDirectionIsForward(dir))
+ {
+ if (++so->currPos.itemIndex > so->currPos.lastItem)
+ {
+ if (so->numKilled > 0)
+ _hash_kill_items(scan, false);
+
+ blkno = so->currPos.nextPage;
+ if (BlockNumberIsValid(blkno))
+ {
+ buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+ so->currPos.buf = buf;
+ TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+ if (!_hash_readpage(scan, &buf, dir))
+ end_of_scan = true;
+ }
+ else if (so->hashso_buc_populated && !so->hashso_buc_split)
+ {
+ /*
+ * end of bucket, scan bucket being split if there was a
+ * split in progress at the start of scan.
+ */
+ buf = so->currPos.buf = so->hashso_split_bucket_buf;
+ Assert(BufferIsValid(buf));
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ /*
+ * setting hashso_buc_split to true indicates that we are
+ * scanning bucket being split.
+ */
+ so->hashso_buc_split = true;
+
+ TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+ if (!_hash_readpage(scan, &buf, dir))
+ end_of_scan = true;
+ }
+ else
+ end_of_scan = true;
+ }
+ }
+ else
+ {
+ if (--so->currPos.itemIndex < so->currPos.firstItem)
+ {
+ if (so->numKilled > 0)
+ _hash_kill_items(scan, false);
+
+ blkno = so->currPos.prevPage;
+ if (BlockNumberIsValid(blkno))
+ {
+ buf = _hash_getbuf(rel, blkno, HASH_READ,
+ LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+ so->currPos.buf = buf;
+ TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+ /*
+ * We always maintain the pin on the bucket page for the whole scan
+ * operation, so release the additional pin we have acquired here.
+ */
+ if (buf == so->hashso_bucket_buf || buf == so->hashso_split_bucket_buf)
+ _hash_dropbuf(rel, buf);
+
+ if (!_hash_readpage(scan, &buf, dir))
+ end_of_scan = true;
+ }
+ else if (so->hashso_buc_populated && so->hashso_buc_split)
+ {
+ /*
+ * end of bucket, scan bucket being populated if there was a
+ * split in progress at the start of scan.
+ */
+ buf = so->currPos.buf = so->hashso_bucket_buf;
+ Assert(BufferIsValid(buf));
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ /* move to the end of bucket chain */
+ while (BlockNumberIsValid(opaque->hasho_nextblkno))
+ _hash_readnext(scan, &buf, &page, &opaque);
+
+ /*
+ * setting hashso_buc_split to false indicates that we are
+ * scanning bucket being populated.
+ */
+
+ so->hashso_buc_split = false;
+
+ if (!_hash_readpage(scan, &buf, dir))
+ end_of_scan = true;
+ }
+ else
+ end_of_scan = true;
+ }
+ }
+
+ if (end_of_scan)
+ {
+ _hash_dropscanbuf(rel, so);
+ HashScanPosInvalidate(so->currPos);
return false;
+ }
- /* if we're here, _hash_step found a valid tuple */
- current = &(so->hashso_curpos);
- offnum = ItemPointerGetOffsetNumber(current);
- _hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
- page = BufferGetPage(buf);
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- so->hashso_heappos = itup->t_tid;
+ /* OK, itemIndex says what to return */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_ctup.t_self = currItem->heapTid;
return true;
}
@@ -212,11 +324,15 @@ _hash_readprev(IndexScanDesc scan,
/*
* _hash_first() -- Find the first item in a scan.
*
- * Find the first item in the index that
- * satisfies the qualification associated with the scan descriptor. On
- * success, the page containing the current index tuple is read locked
- * and pinned, and the scan's opaque data entry is updated to
- * include the buffer.
+ * We find the first item (or, if backward scan, the last item) in
+ * the index that satisfies the qualification associated with the
+ * scan descriptor. On success, the page containing the current
+ * index tuple is read locked and pinned, and data about the
+ * matching tuple(s) on the page has been loaded into so->currPos,
+ * and scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ * If there are no matching items in the index, we return FALSE,
+ * with no pins or locks held.
*/
bool
_hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +345,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
Buffer buf;
Page page;
HashPageOpaque opaque;
- IndexTuple itup;
- ItemPointer current;
- OffsetNumber offnum;
pgstat_count_index_scan(rel);
- current = &(so->hashso_curpos);
- ItemPointerSetInvalid(current);
-
/*
* We do not support hash scans with no index qualification, because we
* would have to read the whole index rather than just one bucket. That
@@ -356,17 +466,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
_hash_readnext(scan, &buf, &page, &opaque);
}
- /* Now find the first tuple satisfying the qualification */
- if (!_hash_step(scan, &buf, dir))
- return false;
+ /* remember which buffer we have pinned, if any */
+ Assert(BufferIsInvalid(so->currPos.buf));
+ so->currPos.buf = buf;
- /* if we're here, _hash_step found a valid tuple */
- offnum = ItemPointerGetOffsetNumber(current);
- _hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
- page = BufferGetPage(buf);
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
- so->hashso_heappos = itup->t_tid;
+ /* Now find all the tuples on this page that satisfy the qualification */
+ if (!_hash_readpage(scan, &buf, dir))
+ return false;
+ /* if we're here, _hash_readpage found at least one matching tuple */
return true;
}
@@ -575,3 +683,304 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
ItemPointerSet(current, blkno, offnum);
return true;
}
+
+/*
+ * _hash_readpage() -- Load data from current index page into so->currPos
+ *
+ * We scan all the items in the current index page and save those that
+ * satisfy the qualification into so->currPos. If no matching items
+ * are found in the current page, we move to the next or previous page
+ * in the bucket chain, as indicated by the scan direction.
+ *
+ * Return true if any matching items are found, else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+ Relation rel = scan->indexRelation;
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ Buffer buf;
+ Page page;
+ HashPageOpaque opaque;
+ OffsetNumber offnum;
+ uint16 itemIndex;
+
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ buf = *bufP;
+ Assert(BufferIsValid(buf));
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ /*
+ * We save the LSN of the page as we read it, so that we
+ * know whether it is safe to apply LP_DEAD hints to the
+ * page later.
+ */
+ so->currPos.lsn = PageGetLSN(page);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* new page, locate starting position by binary search */
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+ while (itemIndex == 0)
+ {
+ /*
+ * Could not find any matching tuples in the current page, move
+ * to the next page. Before leaving the current page, also deal
+ * with any killed items.
+ */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan, true);
+
+ /*
+ * Remember the prev and next block numbers along with the
+ * current block number so that, when tuples are fetched
+ * through a cursor, we know the page to start from. This
+ * is for the case where we have reached the end of the
+ * bucket chain without finding any matching tuples. See
+ * comments in the else part below.
+ */
+ if (!BlockNumberIsValid((opaque)->hasho_nextblkno))
+ {
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ so->currPos.nextPage = InvalidBlockNumber;
+ }
+
+ _hash_readnext(scan, &buf, &page, &opaque);
+ if (BufferIsValid(buf))
+ {
+ so->currPos.buf = buf;
+ so->currPos.currPage = BufferGetBlockNumber(buf);
+ so->currPos.lsn = PageGetLSN(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+ itemIndex = _hash_load_qualified_items(scan, page,
+ offnum, dir);
+ }
+ else
+ {
+ /*
+ * No more matching tuples were found, so return false.
+ */
+ so->currPos.buf = buf;
+ return false;
+ }
+ }
+
+ if (so->currPos.buf == so->hashso_bucket_buf ||
+ so->currPos.buf == so->hashso_split_bucket_buf)
+ {
+ so->currPos.prevPage = InvalidBlockNumber;
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;
+ _hash_relbuf(rel, so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ so->currPos.firstItem = 0;
+ so->currPos.lastItem = itemIndex - 1;
+ so->currPos.itemIndex = 0;
+ }
+ else
+ {
+ /* new page, locate starting position by binary search */
+ offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+ itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+ while (itemIndex == MaxIndexTuplesPerPage)
+ {
+ /*
+ * Could not find any matching tuples in the current page, move
+ * to the prev page. Before leaving the current page, also deal
+ * with any killed items.
+ */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan, true);
+
+ /*
+ * Remember the prev and next block numbers along with the
+ * current block number so that, when tuples are fetched
+ * through a cursor, we know the page to start from. This
+ * is for the case where we have reached the bucket page
+ * without finding any matching tuples. See comments in
+ * the else part below.
+ */
+ if (so->currPos.buf == so->hashso_bucket_buf ||
+ so->currPos.buf == so->hashso_split_bucket_buf)
+ {
+ so->currPos.prevPage = InvalidBlockNumber;
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;
+ }
+
+ _hash_readprev(scan, &buf, &page, &opaque);
+ if (BufferIsValid(buf))
+ {
+ so->currPos.buf = buf;
+ so->currPos.currPage = BufferGetBlockNumber(buf);
+ so->currPos.lsn = PageGetLSN(page);
+ offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+ itemIndex = _hash_load_qualified_items(scan, page,
+ offnum, dir);
+ }
+ else
+ {
+ /*
+ * No more matching tuples were found, so return false.
+ */
+ so->currPos.buf = buf;
+ return false;
+ }
+ }
+
+ if (so->currPos.buf == so->hashso_bucket_buf ||
+ so->currPos.buf == so->hashso_split_bucket_buf)
+ {
+ so->currPos.prevPage = InvalidBlockNumber;
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;
+ _hash_relbuf(rel, so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ so->currPos.firstItem = itemIndex;
+ so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ }
+
+ return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from the current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+ OffsetNumber offnum, ScanDirection dir)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ IndexTuple itup;
+ int itemIndex;
+ OffsetNumber maxoff;
+
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ while (offnum <= maxoff)
+ {
+ Assert(offnum >= FirstOffsetNumber);
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * Skip tuples moved by a split operation if the scan
+ * started while the split was in progress. Also skip
+ * tuples that are marked as dead.
+ */
+ if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+ (scan->ignore_killed_tuples &&
+ (ItemIdIsDead(PageGetItemId(page, offnum)))))
+ {
+ offnum = OffsetNumberNext(offnum); /* move forward */
+ continue;
+ }
+
+ if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+ _hash_checkqual(scan, itup))
+ {
+ /* tuple is qualified, so remember it */
+ _hash_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ /*
+ * No more matching tuples exist on this page, so exit
+ * the while loop.
+ */
+ break;
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ Assert(itemIndex <= MaxIndexTuplesPerPage);
+ return itemIndex;
+ }
+ else
+ {
+ /* load items[] in descending order */
+ itemIndex = MaxIndexTuplesPerPage;
+
+ while (offnum >= FirstOffsetNumber)
+ {
+ Assert(offnum <= maxoff);
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * Skip tuples moved by a split operation if the scan
+ * started while the split was in progress. Also skip
+ * tuples that are marked as dead.
+ */
+ if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+ (scan->ignore_killed_tuples &&
+ (ItemIdIsDead(PageGetItemId(page, offnum)))))
+ {
+ offnum = OffsetNumberPrev(offnum); /* move back */
+ continue;
+ }
+
+ if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+ _hash_checkqual(scan, itup))
+ {
+ itemIndex--;
+ /* tuple is qualified, so remember it */
+ _hash_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ /*
+ * No more matching tuples exist on this page, so exit
+ * the while loop.
+ */
+ break;
+
+ offnum = OffsetNumberPrev(offnum);
+ }
+
+ Assert(itemIndex >= 0);
+ return itemIndex;
+ }
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+ OffsetNumber offnum, IndexTuple itup)
+{
+ HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = itup->t_tid;
+ currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 9f832f2..8053072 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,28 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
* current page and killed tuples thereon (generally, this should only be
* called if so->numKilled > 0).
*
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer. Note that a read lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have a pin on the bucket buffer, but may or may not have
+ * a pin on the overflow buffer, as indicated by havePin.
+ *
* We match items by heap TID before assuming they are the right ones to
* delete.
+ *
+ * Note that we keep a pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from the page.
+ *
+ * See _bt_killitems() for more details.
*/
void
-_hash_kill_items(IndexScanDesc scan)
+_hash_kill_items(IndexScanDesc scan, bool havePin)
{
HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ BlockNumber blkno;
+ Buffer buf;
Page page;
HashPageOpaque opaque;
OffsetNumber offnum, maxoff;
@@ -538,6 +553,7 @@ _hash_kill_items(IndexScanDesc scan)
Assert(so->numKilled > 0);
Assert(so->killedItems != NULL);
+ Assert(HashScanPosIsValid(so->currPos));
/*
* Always reset the scan state, so we don't look for same
@@ -545,20 +561,58 @@ _hash_kill_items(IndexScanDesc scan)
*/
so->numKilled = 0;
- page = BufferGetPage(so->hashso_curbuf);
+ blkno = so->currPos.currPage;
+ if (so->hashso_bucket_buf == so->currPos.buf)
+ {
+ buf = so->currPos.buf;
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+ }
+ else
+ {
+ if (!havePin)
+ buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+ else
+ {
+ buf = so->currPos.buf;
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ }
+
+ /* It might not exist anymore, in which case we can't hint it. */
+ if (!BufferIsValid(buf))
+ return;
+
+ /*
+ * If the page LSN differs, the page was modified since we last read
+ * it; killedItems may no longer be valid, so applying LP_DEAD hints
+ * is not safe.
+ */
+ page = BufferGetPage(buf);
+ if (PageGetLSN(page) != so->currPos.lsn)
+ {
+ _hash_relbuf(rel, buf);
+ return;
+ }
+ }
+
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
for (i = 0; i < numKilled; i++)
{
- offnum = so->killedItems[i].indexOffset;
+ int itemIndex = so->killedItems[i];
+ HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+ offnum = currItem->indexOffset;
+
+ Assert(itemIndex >= so->currPos.firstItem &&
+ itemIndex <= so->currPos.lastItem);
while (offnum <= maxoff)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+ if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -577,6 +631,12 @@ _hash_kill_items(IndexScanDesc scan)
if (killedsomething)
{
opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
- MarkBufferDirtyHint(so->hashso_curbuf, true);
+ MarkBufferDirtyHint(buf, true);
}
+
+ if (so->hashso_bucket_buf == so->currPos.buf ||
+ havePin)
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ else
+ _hash_relbuf(rel, buf);
}
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index adba224..ab9ebda 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,46 @@ typedef struct HashScanPosItem /* what we remember about each match */
OffsetNumber indexOffset; /* index item's location within page */
} HashScanPosItem;
+typedef struct HashScanPosData
+{
+ Buffer buf; /* if valid, the buffer is pinned */
+ XLogRecPtr lsn; /* pos in the WAL stream when page was read */
+ BlockNumber currPage; /* current hash index page */
+ BlockNumber nextPage; /* next overflow page */
+ BlockNumber prevPage; /* prev overflow or bucket page */
+
+ /*
+ * The items array is always ordered in index order (ie, increasing
+ * indexoffset). When scanning backwards it is convenient to fill the
+ * array back-to-front, so we start at the last slot and fill downwards.
+ * Hence we need both a first-valid-entry and a last-valid-entry counter.
+ * itemIndex is a cursor showing which entry was last returned to caller.
+ */
+ int firstItem; /* first valid index in items[] */
+ int lastItem; /* last valid index in items[] */
+ int itemIndex; /* current index in items[] */
+
+ HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+} HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+ AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+ !BufferIsValid((scanpos).buf)), \
+ BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+ do { \
+ (scanpos).buf = InvalidBuffer; \
+ (scanpos).lsn = InvalidXLogRecPtr; \
+ (scanpos).currPage = InvalidBlockNumber; \
+ (scanpos).nextPage = InvalidBlockNumber; \
+ (scanpos).prevPage = InvalidBlockNumber; \
+ (scanpos).firstItem = 0; \
+ (scanpos).lastItem = 0; \
+ (scanpos).itemIndex = 0; \
} while (0)
/*
* HashScanOpaqueData is private state for a hash index scan.
@@ -145,8 +185,14 @@ typedef struct HashScanOpaqueData
*/
bool hashso_buc_split;
/* info about killed items if any (killedItems is NULL if never used) */
- HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
+ int *killedItems; /* currPos.items indexes of killed items */
int numKilled; /* number of currently stored items */
+
+ /*
+ * Identify all the matching items on a page and save them
+ * in HashScanPosData
+ */
+ HashScanPosData currPos; /* current position data */
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -411,7 +457,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket);
-extern void _hash_kill_items(IndexScanDesc scan);
+extern void _hash_kill_items(IndexScanDesc scan, bool havePin);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
--
1.8.3.1
From 1ec664ac796b5b631fc772849c6de1aa1737f309 Mon Sep 17 00:00:00 2001
From: ashu <[email protected]>
Date: Sun, 12 Feb 2017 10:52:22 +0530
Subject: [PATCH] Remove redundant function _hash_step() and some of the unused
 members of HashScanOpaqueData. The function _hash_step(), used to find the
 next qualifying tuple in the index page, is no longer required, as the new
 hash index scan works a page at a time, reading all the qualifying tuples in
 a page at once with the help of the new function _hash_readpage().
Patch by Ashutosh Sharma.
---
src/backend/access/hash/hashsearch.c | 206 -----------------------------------
src/include/access/hash.h | 15 ---
2 files changed, 221 deletions(-)
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 96da9b5..913a996 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -410,212 +410,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
}
/*
- * _hash_step() -- step to the next valid item in a scan in the bucket.
- *
- * If no valid record exists in the requested direction, return
- * false. Else, return true and set the hashso_curpos for the
- * scan to the right thing.
- *
- * Here we need to ensure that if the scan has started during split, then
- * skip the tuples that are moved by split while scanning bucket being
- * populated and then scan the bucket being split to cover all such
- * tuples. This is done to ensure that we don't miss tuples in the scans
- * that are started during split.
- *
- * 'bufP' points to the current buffer, which is pinned and read-locked.
- * On success exit, we have pin and read-lock on whichever page
- * contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
- Relation rel = scan->indexRelation;
- HashScanOpaque so = (HashScanOpaque) scan->opaque;
- ItemPointer current;
- Buffer buf;
- Page page;
- HashPageOpaque opaque;
- OffsetNumber maxoff;
- OffsetNumber offnum;
- BlockNumber blkno;
- IndexTuple itup;
-
- current = &(so->hashso_curpos);
-
- buf = *bufP;
- _hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
- page = BufferGetPage(buf);
- opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
- /*
- * If _hash_step is called from _hash_first, current will not be valid, so
- * we can't dereference it. However, in that case, we presumably want to
- * start at the beginning/end of the page...
- */
- maxoff = PageGetMaxOffsetNumber(page);
- if (ItemPointerIsValid(current))
- offnum = ItemPointerGetOffsetNumber(current);
- else
- offnum = InvalidOffsetNumber;
-
- /*
- * 'offnum' now points to the last tuple we examined (if any).
- *
- * continue to step through tuples until: 1) we get to the end of the
- * bucket chain or 2) we find a valid tuple.
- */
- do
- {
- switch (dir)
- {
- case ForwardScanDirection:
- if (offnum != InvalidOffsetNumber)
- offnum = OffsetNumberNext(offnum); /* move forward */
- else
- {
- /* new page, locate starting position by binary search */
- offnum = _hash_binsearch(page, so->hashso_sk_hash);
- }
-
- for (;;)
- {
- /*
- * check if we're still in the range of items with the
- * target hash key
- */
- if (offnum <= maxoff)
- {
- Assert(offnum >= FirstOffsetNumber);
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
- /*
- * skip the tuples that are moved by split operation
- * for the scan that has started when split was in
- * progress
- */
- if (so->hashso_buc_populated && !so->hashso_buc_split &&
- (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
- {
- offnum = OffsetNumberNext(offnum); /* move forward */
- continue;
- }
-
- if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
- break; /* yes, so exit for-loop */
- }
-
- /* Before leaving current page, deal with any killed items */
- if (so->numKilled > 0)
- _hash_kill_items(scan);
-
- /*
- * ran off the end of this page, try the next
- */
- _hash_readnext(scan, &buf, &page, &opaque);
- if (BufferIsValid(buf))
- {
- maxoff = PageGetMaxOffsetNumber(page);
- offnum = _hash_binsearch(page, so->hashso_sk_hash);
- }
- else
- {
- itup = NULL;
- break; /* exit for-loop */
- }
- }
- break;
-
- case BackwardScanDirection:
- if (offnum != InvalidOffsetNumber)
- offnum = OffsetNumberPrev(offnum); /* move back */
- else
- {
- /* new page, locate starting position by binary search */
- offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
- }
-
- for (;;)
- {
- /*
- * check if we're still in the range of items with the
- * target hash key
- */
- if (offnum >= FirstOffsetNumber)
- {
- Assert(offnum <= maxoff);
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
- /*
- * skip the tuples that are moved by split operation
- * for the scan that has started when split was in
- * progress
- */
- if (so->hashso_buc_populated && !so->hashso_buc_split &&
- (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
- {
- offnum = OffsetNumberPrev(offnum); /* move back */
- continue;
- }
-
- if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
- break; /* yes, so exit for-loop */
- }
-
- /* Before leaving current page, deal with any killed items */
- if (so->numKilled > 0)
- _hash_kill_items(scan);
-
- /*
- * ran off the end of this page, try the next
- */
- _hash_readprev(scan, &buf, &page, &opaque);
- if (BufferIsValid(buf))
- {
- TestForOldSnapshot(scan->xs_snapshot, rel, page);
- maxoff = PageGetMaxOffsetNumber(page);
- offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
- }
- else
- {
- itup = NULL;
- break; /* exit for-loop */
- }
- }
- break;
-
- default:
- /* NoMovementScanDirection */
- /* this should not be reached */
- itup = NULL;
- break;
- }
-
- if (itup == NULL)
- {
- /*
- * We ran off the end of the bucket without finding a match.
- * Release the pin on bucket buffers. Normally, such pins are
- * released at end of scan, however scrolling cursors can
- * reacquire the bucket lock and pin in the same scan multiple
- * times.
- */
- *bufP = so->hashso_curbuf = InvalidBuffer;
- ItemPointerSetInvalid(current);
- _hash_dropscanbuf(rel, so);
- return false;
- }
-
- /* check the tuple quals, loop around if not met */
- } while (!_hash_checkqual(scan, itup));
-
- /* if we made it to here, we've found a valid tuple */
- blkno = BufferGetBlockNumber(buf);
- *bufP = so->hashso_curbuf = buf;
- ItemPointerSet(current, blkno, offnum);
- return true;
-}
-
-/*
* _hash_readpage() -- Load data from current index page into so->currPos
*
* We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 4efed52..4056da5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -150,14 +150,6 @@ typedef struct HashScanOpaqueData
/* Hash value of the scan key, ie, the hash key we seek */
uint32 hashso_sk_hash;
- /*
- * We also want to remember which buffer we're currently examining in the
- * scan. We keep the buffer pinned (but not locked) across hashgettuple
- * calls, in order to avoid doing a ReadBuffer() for every tuple in the
- * index.
- */
- Buffer hashso_curbuf;
-
/* remember the buffer associated with primary bucket */
Buffer hashso_bucket_buf;
@@ -168,12 +160,6 @@ typedef struct HashScanOpaqueData
*/
Buffer hashso_split_bucket_buf;
- /* Current position of the scan, as an index TID */
- ItemPointerData hashso_curpos;
-
- /* Current position of the scan, as a heap TID */
- ItemPointerData hashso_heappos;
-
/* Whether scan starts on bucket being populated due to split */
bool hashso_buc_populated;
@@ -409,7 +395,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
/* hashsearch.c */
extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
/* hashsort.c */
typedef struct HSpool HSpool; /* opaque struct in hashsort.c */
--
1.8.3.1
From 21894a693904a6ec270906fee403880768ef3db5 Mon Sep 17 00:00:00 2001
From: ashu <[email protected]>
Date: Sun, 2 Apr 2017 03:43:20 +0530
Subject: [PATCH] Improve locking strategy during VACUUM in Hash Index v3
---
src/backend/access/hash/README | 2 +-
src/backend/access/hash/hash.c | 21 ++++++++++-----------
src/backend/access/hash/hashovfl.c | 4 +---
3 files changed, 12 insertions(+), 15 deletions(-)
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 063656d..a3f2445 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -380,8 +380,8 @@ The fourth operation is garbage collection (bulk deletion):
mark the target page dirty
write WAL for deleting tuples from target page
if this is the last bucket page, break out of loop
- pin and x-lock next page
release prior lock and pin (except keep pin on primary bucket page)
+ pin and x-lock next page
if the page we have locked is not the primary bucket page:
release lock and take exclusive lock on primary bucket page
if there are no other pins on the primary bucket page:
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 582163a..4c82868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -670,11 +670,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
* that the next valid TID will be greater than or equal to the current
* valid TID. There can't be any concurrent scans in progress when we first
* enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be. We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan. To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page. (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be. We
+ * need not worry about that, however, as hash index scans now work a
+ * page at a time.
*
* We need to retain a pin on the primary bucket to ensure that no concurrent
* split can start.
@@ -843,19 +841,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
if (!BlockNumberIsValid(blkno))
break;
- next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
- LH_OVERFLOW_PAGE,
- bstrategy);
-
/*
- * release the lock on previous page after acquiring the lock on next
- * page
+ * As hash index scans work a page at a time, vacuum can
+ * release the lock on the previous page before acquiring
+ * the lock on the next page.
*/
if (retain_pin)
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
else
_hash_relbuf(rel, buf);
+ next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+ LH_OVERFLOW_PAGE,
+ bstrategy);
+
buf = next_buf;
}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..dc119a3 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -778,9 +778,7 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
* Caller must acquire cleanup lock on the primary page of the target
* bucket to exclude any scans that are in progress, which could easily
* be confused into returning the same tuple more than once or some tuples
- * not at all by the rearrangement we are performing here. To prevent
- * any concurrent scan to cross the squeeze scan we use lock chaining
- * similar to hasbucketcleanup. Refer comments atop hashbucketcleanup.
+ * not at all by the rearrangement we are performing here.
*
* We need to retain a pin on the primary bucket to ensure that no concurrent
* split can start.
--
1.8.3.1
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers