On 01/14/2014 05:35 PM, Alexander Korotkov wrote:
Attached version is rebased against last version of packed posting lists.

Thanks!

I think we're missing a trick with multi-key queries. We know that when multiple scan keys are used, they are ANDed together, so we can do the skip optimization even without the new tri-state consistent function.

To get started, I propose the three attached patches. These only implement the optimization for the multi-key case, which doesn't require any changes to the consistent functions and hence no catalog changes. Admittedly this isn't anywhere near as useful in practice as the single key case, but let's go for the low-hanging fruit first. This nevertheless introduces some machinery that will be needed by the full patch anyway.

I structured the code somewhat differently than your patch. There is no separate fast-path for the case where the optimization applies. Instead, I'm passing the advancePast variable all the way down to where the next batch of items are loaded from the posting tree. keyGetItem is now responsible for advancing the entry streams, and the logic in scanGetItem has been refactored so that it advances advancePast aggressively, as soon as one of the key streams let us conclude that no items < a certain point can match.

scanGetItem might yet need to be refactored when we get to the full preconsistent check stuff, but one step at a time.


The first patch is the most interesting one, and contains the scanGetItem changes. The second patch allows seeking to the right segment in a posting tree page, and the third allows starting the posting tree scan from root, when skipping items (instead of just following the right-links).

Here are some simple performance test results, demonstrating the effect of each of these patches. This is a best-case scenario. I don't think these patches has any adverse effects even in the worst-case scenario, although I haven't actually tried hard to measure that. The used this to create a test table:

create table foo (intarr int[]);
-- Every row contains 0 (frequent term), and a unique number.
insert into foo select array[0,g] from generate_series(1, 10000000) g;
-- Add another tuple with 0, 1 combo physically to the end of the table.
insert into foo values (array[0,1]);

The query I used is this:

postgres=# select count(*) from foo where intarr @> array[0] and intarr @> array[1];
 count
-------
     2
(1 row)

I measured the time that query takes, and the number of pages hit, using "explain (analyze, buffers true) ...".

patches         time (ms)       buffers
---
unpatched       650             1316
patch 1         0.52            1316
patches 1+2     0.50            1316
patches 1+2+3   0.13            15

So, the second patch isn't doing much in this particular case. But it's trivial, and I think it will make a difference in other queries where you have the opportunity skip, but return a lot of tuples overall.

In summary, these are fairly small patches, and useful on their, so I think these should be committed now. But please take a look and see if the logic in scanGetItem/keyGetItem looks correct to you. After this, I think the main fast scan logic will go into keyGetItem.

PS. I find it a bit surprising that in your patch, you're completely bailing out if there are any partial-match keys involved. Is there some fundamental reason for that, or just not implemented?

- Heikki
>From 53e33c931c41f5ff8bb22ecfc011e717d2dbb9fd Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Thu, 23 Jan 2014 15:41:43 +0200
Subject: [PATCH 1/3] Optimize GIN multi-key queries.

In a multi-key search, ie. something like "col @> 'foo' AND col @> 'bar'",
as soon as we find the next item that matches the first criteria, we don't
need to check the second criteria for TIDs smaller the first match. That saves
a lot of effort, especially if one of the first term is rare, while the
second occurs very frequently.

Based on ideas from Alexander Korotkov's fast scan patch
---
 src/backend/access/gin/ginget.c | 465 ++++++++++++++++++++++------------------
 1 file changed, 255 insertions(+), 210 deletions(-)

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 4bdbd45..4de7a10 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -68,29 +68,6 @@ callConsistentFn(GinState *ginstate, GinScanKey key)
 }
 
 /*
- * Tries to refind previously taken ItemPointer on a posting page.
- */
-static bool
-needToStepRight(Page page, ItemPointer item)
-{
-	if (GinPageGetOpaque(page)->flags & GIN_DELETED)
-		/* page was deleted by concurrent vacuum */
-		return true;
-
-	if (ginCompareItemPointers(item, GinDataPageGetRightBound(page)) > 0
-			&& !GinPageRightMost(page))
-	{
-		/*
-		 * the item we're looking is > the right bound of the page, so it
-		 * can't be on this page.
-		 */
-		return true;
-	}
-
-	return false;
-}
-
-/*
  * Goes to the next page if current offset is outside of bounds
  */
 static bool
@@ -447,8 +424,7 @@ restartScanEntry:
 			page = BufferGetPage(entry->buffer);
 
 			/*
-			 * Copy page content to memory to avoid keeping it locked for
-			 * a long time.
+			 * Load the first page into memory.
 			 */
 			entry->list = GinDataLeafPageGetItems(page, &entry->nlist);
 
@@ -518,90 +494,87 @@ startScan(IndexScanDesc scan)
 }
 
 /*
- * Gets next ItemPointer from PostingTree. Note, that we copy
- * page into GinScanEntry->list array and unlock page, but keep it pinned
- * to prevent interference with vacuum
+ * Load the next batch of item pointers from a posting tree.
+ *
+ * Note that we copy the page into GinScanEntry->list array and unlock it, but
+ * keep it pinned to prevent interference with vacuum.
  */
 static void
-entryGetNextItem(GinState *ginstate, GinScanEntry entry)
+entryLoadMoreItems(GinState *ginstate, GinScanEntry entry, ItemPointerData advancePast)
 {
 	Page		page;
 	int			i;
 
+	LockBuffer(entry->buffer, GIN_SHARE);
+	page = BufferGetPage(entry->buffer);
 	for (;;)
 	{
-		if (entry->offset < entry->nlist)
+		entry->offset = InvalidOffsetNumber;
+		if (entry->list)
 		{
-			entry->curItem = entry->list[entry->offset++];
-			return;
+			pfree(entry->list);
+			entry->list = NULL;
+			entry->nlist = 0;
 		}
 
-		LockBuffer(entry->buffer, GIN_SHARE);
-		page = BufferGetPage(entry->buffer);
-		for (;;)
+		/*
+		 * We've processed all the entries on this page. If it was the last
+		 * page in the tree, we're done.
+		 */
+		if (GinPageRightMost(page))
 		{
-			/*
-			 * It's needed to go by right link. During that we should refind
-			 * first ItemPointer greater that stored
-			 */
-			if (GinPageRightMost(page))
-			{
-				UnlockReleaseBuffer(entry->buffer);
-				ItemPointerSetInvalid(&entry->curItem);
-				entry->buffer = InvalidBuffer;
-				entry->isFinished = TRUE;
-				return;
-			}
+			UnlockReleaseBuffer(entry->buffer);
+			entry->buffer = InvalidBuffer;
+			entry->isFinished = TRUE;
+			return;
+		}
 
-			entry->buffer = ginStepRight(entry->buffer,
-										 ginstate->index,
-										 GIN_SHARE);
-			page = BufferGetPage(entry->buffer);
+		if (GinPageGetOpaque(page)->flags & GIN_DELETED)
+			continue;		/* page was deleted by concurrent vacuum */
 
-			entry->offset = InvalidOffsetNumber;
-			if (entry->list)
-			{
-				pfree(entry->list);
-				entry->list = NULL;
-			}
+		/*
+		 * Step to next page, following the right link. then find the first
+		 * ItemPointer greater than advancePast.
+		 */
+		entry->buffer = ginStepRight(entry->buffer,
+									 ginstate->index,
+									 GIN_SHARE);
+		page = BufferGetPage(entry->buffer);
 
+		/*
+		 * The first item > advancePast might not be on this page, but
+		 * somewhere to the right, if the page was split. Keep following
+		 * the right-links until we re-find the correct page.
+		 */
+		if (!GinPageRightMost(page) &&
+			ginCompareItemPointers(&advancePast, GinDataPageGetRightBound(page)) >= 0)
+		{
 			/*
-			 * If the page was concurrently split, we have to re-find the
-			 * item we were stopped on. If the page was split more than once,
-			 * the item might not be on this page, but somewhere to the right.
-			 * Keep following the right-links until we re-find the correct
-			 * page.
+			 * the item we're looking is > the right bound of the page, so it
+			 * can't be on this page.
 			 */
-			if (ItemPointerIsValid(&entry->curItem) &&
-				needToStepRight(page, &entry->curItem))
-			{
-				continue;
-			}
+			continue;
+		}
 
-			entry->list = GinDataLeafPageGetItems(page, &entry->nlist);
+		entry->list = GinDataLeafPageGetItems(page, &entry->nlist);
 
-			/* re-find the item we were stopped on. */
-			if (ItemPointerIsValid(&entry->curItem))
+		if (ItemPointerIsValid(&advancePast))
+		{
+			for (i = 0; i < entry->nlist; i++)
 			{
-				for (i = 0; i < entry->nlist; i++)
+				if (ginCompareItemPointers(&advancePast, &entry->list[i]) < 0)
 				{
-					if (ginCompareItemPointers(&entry->curItem,
-											   &entry->list[i]) < 0)
-					{
-						LockBuffer(entry->buffer, GIN_UNLOCK);
-						entry->offset = i + 1;
-						entry->curItem = entry->list[entry->offset - 1];
-						return;
-					}
+					LockBuffer(entry->buffer, GIN_UNLOCK);
+					entry->offset = i;
+					return;
 				}
 			}
-			else
-			{
-				LockBuffer(entry->buffer, GIN_UNLOCK);
-				entry->offset = 1; /* scan all items on the page. */
-				entry->curItem = entry->list[entry->offset - 1];
-				return;
-			}
+		}
+		else
+		{
+			LockBuffer(entry->buffer, GIN_UNLOCK);
+			entry->offset = 0; /* scan all items on the page. */
+			return;
 		}
 	}
 }
@@ -610,10 +583,10 @@ entryGetNextItem(GinState *ginstate, GinScanEntry entry)
 #define dropItem(e) ( gin_rand() > ((double)GinFuzzySearchLimit)/((double)((e)->predictNumberResult)) )
 
 /*
- * Sets entry->curItem to next heap item pointer for one entry of one scan key,
- * or sets entry->isFinished to TRUE if there are no more.
+ * Sets entry->curItem to next heap item pointer > advancePast, for one entry
+ * of one scan key, or sets entry->isFinished to TRUE if there are no more.
  *
- * Item pointers must be returned in ascending order.
+ * Item pointers are returned in ascending order.
  *
  * Note: this can return a "lossy page" item pointer, indicating that the
  * entry potentially matches all items on that heap page.  However, it is
@@ -623,12 +596,20 @@ entryGetNextItem(GinState *ginstate, GinScanEntry entry)
  * current implementation this is guaranteed by the behavior of tidbitmaps.
  */
 static void
-entryGetItem(GinState *ginstate, GinScanEntry entry)
+entryGetItem(GinState *ginstate, GinScanEntry entry,
+			 ItemPointerData advancePast)
 {
 	Assert(!entry->isFinished);
 
+	Assert(!ItemPointerIsValid(&entry->curItem) ||
+		   ginCompareItemPointers(&entry->curItem, &advancePast) <= 0);
+
 	if (entry->matchBitmap)
 	{
+		/* A bitmap result */
+		BlockNumber advancePastBlk = GinItemPointerGetBlockNumber(&advancePast);
+		OffsetNumber advancePastOff = GinItemPointerGetOffsetNumber(&advancePast);
+
 		do
 		{
 			if (entry->matchResult == NULL ||
@@ -646,6 +627,18 @@ entryGetItem(GinState *ginstate, GinScanEntry entry)
 				}
 
 				/*
+				 * If all the matches on this page are <= advancePast, skip
+				 * to next page.
+				 */
+				if (entry->matchResult->blockno < advancePastBlk ||
+					(entry->matchResult->blockno == advancePastBlk &&
+					 entry->matchResult->offsets[entry->offset] <= advancePastOff))
+				{
+					entry->offset = entry->matchResult->ntuples;
+					continue;
+				}
+
+				/*
 				 * Reset counter to the beginning of entry->matchResult. Note:
 				 * entry->offset is still greater than matchResult->ntuples if
 				 * matchResult is lossy.  So, on next call we will get next
@@ -670,6 +663,17 @@ entryGetItem(GinState *ginstate, GinScanEntry entry)
 				break;
 			}
 
+			if (entry->matchResult->blockno == advancePastBlk)
+			{
+				/*
+				 * Skip to the right offset on this page. We already checked
+				 * in above loop that there is at least one item > advancePast
+				 * on the page.
+				 */
+				while (entry->matchResult->offsets[entry->offset] <= advancePastOff)
+					entry->offset++;
+			}
+
 			ItemPointerSet(&entry->curItem,
 						   entry->matchResult->blockno,
 						   entry->matchResult->offsets[entry->offset]);
@@ -678,29 +682,48 @@ entryGetItem(GinState *ginstate, GinScanEntry entry)
 	}
 	else if (!BufferIsValid(entry->buffer))
 	{
-		entry->offset++;
-		if (entry->offset <= entry->nlist)
-			entry->curItem = entry->list[entry->offset - 1];
-		else
+		/* A posting list from an entry tuple  */
+		do
 		{
-			ItemPointerSetInvalid(&entry->curItem);
-			entry->isFinished = TRUE;
-		}
+			if (entry->offset >= entry->nlist)
+			{
+				ItemPointerSetInvalid(&entry->curItem);
+				entry->isFinished = TRUE;
+				break;
+			}
+
+			entry->curItem = entry->list[entry->offset++];
+		} while (ginCompareItemPointers(&entry->curItem, &advancePast) <= 0);
+		/* XXX: shouldn't we apply the fuzzy search limit here? */
 	}
 	else
 	{
+		/* A posting tree */
 		do
 		{
-			entryGetNextItem(ginstate, entry);
-		} while (entry->isFinished == FALSE &&
-				 entry->reduceResult == TRUE &&
-				 dropItem(entry));
+			/* If we've processed the current batch, load more items */
+			while (entry->offset >= entry->nlist)
+			{
+				entryLoadMoreItems(ginstate, entry, advancePast);
+
+				if (entry->isFinished)
+				{
+					ItemPointerSetInvalid(&entry->curItem);
+					return;
+				}
+			}
+
+			entry->curItem = entry->list[entry->offset++];
+
+		} while (ginCompareItemPointers(&entry->curItem, &advancePast) <= 0 &&
+				 entry->reduceResult == TRUE && dropItem(entry));
 	}
 }
 
 /*
- * Identify the "current" item among the input entry streams for this scan key,
- * and test whether it passes the scan key qual condition.
+ * Identify the "current" item among the input entry streams for this scan key
+ * that is greater than advancePast, and test whether it passes the scan key
+ * qual condition.
  *
  * The current item is the smallest curItem among the inputs.  key->curItem
  * is set to that value.  key->curItemMatches is set to indicate whether that
@@ -719,7 +742,8 @@ entryGetItem(GinState *ginstate, GinScanEntry entry)
  * logic in scanGetItem.)
  */
 static void
-keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key)
+keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key,
+		   ItemPointerData advancePast)
 {
 	ItemPointerData minItem;
 	ItemPointerData curPageLossy;
@@ -729,11 +753,20 @@ keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key)
 	GinScanEntry entry;
 	bool		res;
 	MemoryContext oldCtx;
+	bool		allFinished;
 
 	Assert(!key->isFinished);
 
 	/*
-	 * Find the minimum of the active entry curItems.
+	 * We might have already tested this item; if so, no need to repeat work.
+	 * (Note: the ">" case can happen, if minItem is exact but we previously
+	 * had to set curItem to a lossy-page pointer.)
+	 */
+	if (ginCompareItemPointers(&key->curItem, &advancePast) > 0)
+		return;
+
+	/*
+	 * Find the minimum item > advancePast among the active entry streams.
 	 *
 	 * Note: a lossy-page entry is encoded by a ItemPointer with max value for
 	 * offset (0xffff), so that it will sort after any exact entries for the
@@ -741,16 +774,33 @@ keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key)
 	 * pointers, which is good.
 	 */
 	ItemPointerSetMax(&minItem);
-
+	allFinished = true;
 	for (i = 0; i < key->nentries; i++)
 	{
 		entry = key->scanEntry[i];
-		if (entry->isFinished == FALSE &&
-			ginCompareItemPointers(&entry->curItem, &minItem) < 0)
-			minItem = entry->curItem;
+
+		/*
+		 * Advance this stream if necessary.
+		 *
+		 * In particular, since entry->curItem was initialized with
+		 * ItemPointerSetMin, this ensures we fetch the first item for each
+		 * entry on the first call.
+		 */
+		while (entry->isFinished == FALSE &&
+			   ginCompareItemPointers(&entry->curItem, &advancePast) <= 0)
+		{
+			entryGetItem(ginstate, entry, advancePast);
+		}
+
+		if (!entry->isFinished)
+		{
+			allFinished = FALSE;
+			if (ginCompareItemPointers(&entry->curItem, &minItem) < 0)
+				minItem = entry->curItem;
+		}
 	}
 
-	if (ItemPointerIsMax(&minItem))
+	if (allFinished)
 	{
 		/* all entries are finished */
 		key->isFinished = TRUE;
@@ -758,15 +808,7 @@ keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key)
 	}
 
 	/*
-	 * We might have already tested this item; if so, no need to repeat work.
-	 * (Note: the ">" case can happen, if minItem is exact but we previously
-	 * had to set curItem to a lossy-page pointer.)
-	 */
-	if (ginCompareItemPointers(&key->curItem, &minItem) >= 0)
-		return;
-
-	/*
-	 * OK, advance key->curItem and perform consistentFn test.
+	 * OK, set key->curItem and perform consistentFn test.
 	 */
 	key->curItem = minItem;
 
@@ -895,117 +937,120 @@ keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key)
  * keyGetItem() the combination logic is known only to the consistentFn.
  */
 static bool
-scanGetItem(IndexScanDesc scan, ItemPointer advancePast,
+scanGetItem(IndexScanDesc scan, ItemPointerData advancePast,
 			ItemPointerData *item, bool *recheck)
 {
 	GinScanOpaque so = (GinScanOpaque) scan->opaque;
-	GinState   *ginstate = &so->ginstate;
-	ItemPointerData myAdvancePast = *advancePast;
 	uint32		i;
-	bool		allFinished;
 	bool		match;
 
-	for (;;)
+	/*----------
+	 * Advance the scan keys in lock-step, until we find an item that
+	 * matches all the keys. If any key reports isFinished, meaning its
+	 * subset of the entries is exhausted, we can stop.  Otherwise, set
+	 * *item to the next matching item.
+	 *
+	 * Now *item contains the first ItemPointer after previous result that
+	 * passed the consistentFn check for that exact TID, or a lossy reference
+	 * to the same page.
+	 *
+	 * This logic works only if a keyGetItem stream can never contain both
+	 * exact and lossy pointers for the same page.	Else we could have a
+	 * case like
+	 *
+	 *		stream 1		stream 2
+	 *		...				...
+	 *		42/6			42/7
+	 *		50/1			42/0xffff
+	 *		...				...
+	 *
+	 * We would conclude that 42/6 is not a match and advance stream 1,
+	 * thus never detecting the match to the lossy pointer in stream 2.
+	 * (keyGetItem has a similar problem versus entryGetItem.)
+	 *----------
+	 */
+	ItemPointerSetMin(item);
+	do
 	{
-		/*
-		 * Advance any entries that are <= myAdvancePast.  In particular,
-		 * since entry->curItem was initialized with ItemPointerSetMin, this
-		 * ensures we fetch the first item for each entry on the first call.
-		 */
-		allFinished = TRUE;
-
-		for (i = 0; i < so->totalentries; i++)
-		{
-			GinScanEntry entry = so->entries[i];
-
-			while (entry->isFinished == FALSE &&
-				   ginCompareItemPointers(&entry->curItem,
-										  &myAdvancePast) <= 0)
-				entryGetItem(ginstate, entry);
-
-			if (entry->isFinished == FALSE)
-				allFinished = FALSE;
-		}
-
-		if (allFinished)
-		{
-			/* all entries exhausted, so we're done */
-			return false;
-		}
-
-		/*
-		 * Perform the consistentFn test for each scan key.  If any key
-		 * reports isFinished, meaning its subset of the entries is exhausted,
-		 * we can stop.  Otherwise, set *item to the minimum of the key
-		 * curItems.
-		 */
-		ItemPointerSetMax(item);
-
-		for (i = 0; i < so->nkeys; i++)
+		match = true;
+		for (i = 0; i < so->nkeys && match; i++)
 		{
 			GinScanKey	key = so->keys + i;
 
-			keyGetItem(&so->ginstate, so->tempCtx, key);
+			/* Fetch the next item for this key. */
+			keyGetItem(&so->ginstate, so->tempCtx, key, advancePast);
 
 			if (key->isFinished)
-				return false;	/* finished one of keys */
-
-			if (ginCompareItemPointers(&key->curItem, item) < 0)
-				*item = key->curItem;
-		}
+				return false;
 
-		Assert(!ItemPointerIsMax(item));
+			/*
+			 * If it's not a match, we can immediately conclude that nothing
+			 * <= this item matches, without checking the rest of the keys.
+			 */
+			if (!key->curItemMatches)
+			{
+				advancePast = key->curItem;
+				match = false;
+				break;
+			}
 
-		/*----------
-		 * Now *item contains first ItemPointer after previous result.
-		 *
-		 * The item is a valid hit only if all the keys succeeded for either
-		 * that exact TID, or a lossy reference to the same page.
-		 *
-		 * This logic works only if a keyGetItem stream can never contain both
-		 * exact and lossy pointers for the same page.	Else we could have a
-		 * case like
-		 *
-		 *		stream 1		stream 2
-		 *		...				...
-		 *		42/6			42/7
-		 *		50/1			42/0xffff
-		 *		...				...
-		 *
-		 * We would conclude that 42/6 is not a match and advance stream 1,
-		 * thus never detecting the match to the lossy pointer in stream 2.
-		 * (keyGetItem has a similar problem versus entryGetItem.)
-		 *----------
-		 */
-		match = true;
-		for (i = 0; i < so->nkeys; i++)
-		{
-			GinScanKey	key = so->keys + i;
+			/*
+			 * It's a match. We can conclude that nothing < matches, so
+			 * the other key streams can skip to this item.
+			 * Beware of lossy pointers, though; for a lossy pointer, we
+			 * can only conclude that nothing smaller than this *page*
+			 * matches.
+			 */
+			advancePast = key->curItem;
+			if (ItemPointerIsLossyPage(&advancePast))
+			{
+				advancePast.ip_posid = 0;
+			}
+			else
+			{
+				Assert(advancePast.ip_posid > 0);
+				advancePast.ip_posid--;
+			}
 
-			if (key->curItemMatches)
+			/*
+			 * If this is the first key, remember this location as a
+			 * potential match.
+			 *
+			 * Otherwise, check if this is the same item that we checked the
+			 * previous keys for (or a lossy pointer for the same page). If
+			 * not, loop back to check the previous keys for this item (we
+			 * will check this key again too, but keyGetItem returns quickly
+			 * for that)
+			 */
+			if (i == 0)
 			{
-				if (ginCompareItemPointers(item, &key->curItem) == 0)
-					continue;
-				if (ItemPointerIsLossyPage(&key->curItem) &&
-					GinItemPointerGetBlockNumber(&key->curItem) ==
-					GinItemPointerGetBlockNumber(item))
-					continue;
+				*item = key->curItem;
+			}
+			else
+			{
+				if (ItemPointerIsLossyPage(&key->curItem) ||
+					ItemPointerIsLossyPage(item))
+				{
+					Assert (GinItemPointerGetBlockNumber(&key->curItem) >= GinItemPointerGetBlockNumber(item));
+					match = (GinItemPointerGetBlockNumber(&key->curItem) ==
+							 GinItemPointerGetBlockNumber(item));
+				}
+				else
+				{
+					Assert(ginCompareItemPointers(&key->curItem, item) >= 0);
+					match = (ginCompareItemPointers(&key->curItem, item) == 0);
+				}
 			}
-			match = false;
-			break;
 		}
+	} while (!match);
 
-		if (match)
-			break;
-
-		/*
-		 * No hit.	Update myAdvancePast to this TID, so that on the next pass
-		 * we'll move to the next possible entry.
-		 */
-		myAdvancePast = *item;
-	}
+	Assert(!ItemPointerIsMin(item));
 
 	/*
+	 * Now *item contains the first ItemPointer after previous result that
+	 * passed the consistentFn check for that exact TID, or a lossy reference
+	 * to the same page.
+	 *
 	 * We must return recheck = true if any of the keys are marked recheck.
 	 */
 	*recheck = false;
@@ -1536,7 +1581,7 @@ gingetbitmap(PG_FUNCTION_ARGS)
 	{
 		CHECK_FOR_INTERRUPTS();
 
-		if (!scanGetItem(scan, &iptr, &iptr, &recheck))
+		if (!scanGetItem(scan, iptr, &iptr, &recheck))
 			break;
 
 		if (ItemPointerIsLossyPage(&iptr))
-- 
1.8.5.2

>From fdd605789c069e40b7a001c689222a906ebdd6f5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Thu, 23 Jan 2014 15:47:54 +0200
Subject: [PATCH 2/3] Further optimize the multi-key GIN searches.

If we're skipping past a certain TID, avoid decoding posting list segments
that only contain smaller TIDs.
---
 src/backend/access/gin/gindatapage.c | 32 +++++++++++++++++++++++++++++---
 src/backend/access/gin/ginget.c      |  6 ++++--
 src/include/access/gin_private.h     |  2 +-
 3 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 8504f4c..a339028 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -97,18 +97,44 @@ static void dataPlaceToPageLeafSplit(Buffer buf,
 
 /*
  * Read all TIDs from leaf data page to single uncompressed array.
+ *
+ * If advancePast is valid, the caller is only interested in TIDs > advancePast.
+ * This function can still return items smaller than that, so the caller
+ * must still check them, but passing it allows this function to skip some
+ * items as an optimization.
  */
 ItemPointer
-GinDataLeafPageGetItems(Page page, int *nitems)
+GinDataLeafPageGetItems(Page page, int *nitems, ItemPointerData advancePast)
 {
 	ItemPointer result;
 
 	if (GinPageIsCompressed(page))
 	{
-		GinPostingList *ptr = GinDataLeafPageGetPostingList(page);
+		GinPostingList *seg = GinDataLeafPageGetPostingList(page);
 		Size		len = GinDataLeafPageGetPostingListSize(page);
+		Pointer		endptr = ((Pointer) seg) + len;
+		GinPostingList *next;
 
-		result = ginPostingListDecodeAllSegments(ptr, len, nitems);
+		/* Skip to the segment containing advancePast+1 */
+		if (ItemPointerIsValid(&advancePast))
+		{
+			next = GinNextPostingListSegment(seg);
+			while ((Pointer) next < endptr &&
+				   ginCompareItemPointers(&next->first, &advancePast) <= 0)
+			{
+				seg = next;
+				next = GinNextPostingListSegment(seg);
+			}
+			len = endptr - (Pointer) seg;
+		}
+
+		if (len > 0)
+			result = ginPostingListDecodeAllSegments(seg, len, nitems);
+		else
+		{
+			result = palloc(0);
+			*nitems = 0;
+		}
 	}
 	else
 	{
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 4de7a10..5d7738f 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -400,6 +400,7 @@ restartScanEntry:
 			BlockNumber rootPostingTree = GinGetPostingTree(itup);
 			GinBtreeStack *stack;
 			Page		page;
+			ItemPointerData minItem;
 
 			/*
 			 * We should unlock entry page before touching posting tree to
@@ -426,7 +427,8 @@ restartScanEntry:
 			/*
 			 * Load the first page into memory.
 			 */
-			entry->list = GinDataLeafPageGetItems(page, &entry->nlist);
+			ItemPointerSetMin(&minItem);
+			entry->list = GinDataLeafPageGetItems(page, &entry->nlist, minItem);
 
 			entry->predictNumberResult = stack->predictNumber * entry->nlist;
 
@@ -556,7 +558,7 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry, ItemPointerData advan
 			continue;
 		}
 
-		entry->list = GinDataLeafPageGetItems(page, &entry->nlist);
+		entry->list = GinDataLeafPageGetItems(page, &entry->nlist, advancePast);
 
 		if (ItemPointerIsValid(&advancePast))
 		{
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 3f92c37..8c350b9 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -692,7 +692,7 @@ extern ItemPointer ginReadTuple(GinState *ginstate, OffsetNumber attnum,
 			 IndexTuple itup, int *nitems);
 
 /* gindatapage.c */
-extern ItemPointer GinDataLeafPageGetItems(Page page, int *nitems);
+extern ItemPointer GinDataLeafPageGetItems(Page page, int *nitems, ItemPointerData advancePast);
 extern int GinDataLeafPageGetItemsToTbm(Page page, TIDBitmap *tbm);
 extern BlockNumber createPostingTree(Relation index,
 				  ItemPointerData *items, uint32 nitems,
-- 
1.8.5.2

>From a030774cc5fd9720c988e65b500e8ab8a5fb4d7e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Thu, 23 Jan 2014 16:55:51 +0200
Subject: [PATCH 3/3] Further optimize GIN multi-key searches.

When skipping over some items in a posting tree, re-find the new location
by descending the tree from root, rather than walking the right links.
This can save a lot of I/O.
---
 src/backend/access/gin/gindatapage.c |  9 ++--
 src/backend/access/gin/ginget.c      | 90 +++++++++++++++++++++++++++---------
 src/include/access/gin_private.h     |  3 +-
 3 files changed, 75 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index a339028..2961e5c 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -1619,16 +1619,15 @@ ginInsertItemPointers(Relation index, BlockNumber rootBlkno,
  * Starts a new scan on a posting tree.
  */
 GinBtreeStack *
-ginScanBeginPostingTree(Relation index, BlockNumber rootBlkno)
+ginScanBeginPostingTree(GinBtree btree, Relation index, BlockNumber rootBlkno)
 {
-	GinBtreeData btree;
 	GinBtreeStack *stack;
 
-	ginPrepareDataScan(&btree, index, rootBlkno);
+	ginPrepareDataScan(btree, index, rootBlkno);
 
-	btree.fullScan = TRUE;
+	btree->fullScan = TRUE;
 
-	stack = ginFindLeafPage(&btree, TRUE);
+	stack = ginFindLeafPage(btree, TRUE);
 
 	return stack;
 }
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 5d7738f..72c9e61 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -99,12 +99,13 @@ static void
 scanPostingTree(Relation index, GinScanEntry scanEntry,
 				BlockNumber rootPostingTree)
 {
+	GinBtreeData btree;
 	GinBtreeStack *stack;
 	Buffer		buffer;
 	Page		page;
 
 	/* Descend to the leftmost leaf page */
-	stack = ginScanBeginPostingTree(index, rootPostingTree);
+	stack = ginScanBeginPostingTree(&btree, index, rootPostingTree);
 	buffer = stack->buffer;
 	IncrBufferRefCount(buffer); /* prevent unpin in freeGinBtreeStack */
 
@@ -412,7 +413,8 @@ restartScanEntry:
 			LockBuffer(stackEntry->buffer, GIN_UNLOCK);
 			needUnlock = FALSE;
 
-			stack = ginScanBeginPostingTree(ginstate->index, rootPostingTree);
+			stack = ginScanBeginPostingTree(&entry->btree, ginstate->index,
+											rootPostingTree);
 			entry->buffer = stack->buffer;
 
 			/*
@@ -506,8 +508,50 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry, ItemPointerData advan
 {
 	Page		page;
 	int			i;
+	bool		stepright;
+
+	/*
+	 * We have two strategies for finding the correct page: step right from
+	 * the current page, or descend the tree again from the root. If
+	 * advancePast equals the current item, the next matching item should be
+	 * on the next page, so we step right. Otherwise, descend from root.
+	 */
+	if (ginCompareItemPointers(&entry->curItem, &advancePast) == 0)
+	{
+		stepright = true;
+		LockBuffer(entry->buffer, GIN_SHARE);
+	}
+	else
+	{
+		GinBtreeStack *stack;
+
+		ReleaseBuffer(entry->buffer);
+
+		/*
+		 * Set the search key, and find the correct leaf page.
+		 *
+		 * XXX: This is off by one, we're searching for an item > advancePast,
+		 * but we're asking the tree for the next item >= advancePast. It only
+		 * makes a difference in the corner case that advancePast is the
+		 * right bound of a page, in which case we'll scan one page
+		 * unnecessarily. Other than that it's harmless.
+		 */
+		entry->btree.itemptr = advancePast;
+		entry->btree.fullScan = false;
+		stack = ginFindLeafPage(&entry->btree, true);
+
+		/* we don't need the stack, just the buffer. */
+		entry->buffer = stack->buffer;
+		IncrBufferRefCount(entry->buffer);
+		freeGinBtreeStack(stack);
+		stepright = false;
+	}
+
+	elog(LOG, "entryLoadMoreItems, %u/%u, skip: %d",
+		 ItemPointerGetBlockNumber(&advancePast),
+		 ItemPointerGetOffsetNumber(&advancePast),
+		 !stepright);
 
-	LockBuffer(entry->buffer, GIN_SHARE);
 	page = BufferGetPage(entry->buffer);
 	for (;;)
 	{
@@ -519,31 +563,35 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry, ItemPointerData advan
 			entry->nlist = 0;
 		}
 
-		/*
-		 * We've processed all the entries on this page. If it was the last
-		 * page in the tree, we're done.
-		 */
-		if (GinPageRightMost(page))
+		if (stepright)
 		{
-			UnlockReleaseBuffer(entry->buffer);
-			entry->buffer = InvalidBuffer;
-			entry->isFinished = TRUE;
-			return;
+			/*
+			 * We've processed all the entries on this page. If it was the last
+			 * page in the tree, we're done.
+			 */
+			if (GinPageRightMost(page))
+			{
+				UnlockReleaseBuffer(entry->buffer);
+				entry->buffer = InvalidBuffer;
+				entry->isFinished = TRUE;
+				return;
+			}
+
+			/*
+			 * Step to next page, following the right link. then find the first
+			 * ItemPointer greater than advancePast.
+			 */
+			entry->buffer = ginStepRight(entry->buffer,
+										 ginstate->index,
+										 GIN_SHARE);
+			page = BufferGetPage(entry->buffer);
 		}
+		stepright = true;
 
 		if (GinPageGetOpaque(page)->flags & GIN_DELETED)
 			continue;		/* page was deleted by concurrent vacuum */
 
 		/*
-		 * Step to next page, following the right link. then find the first
-		 * ItemPointer greater than advancePast.
-		 */
-		entry->buffer = ginStepRight(entry->buffer,
-									 ginstate->index,
-									 GIN_SHARE);
-		page = BufferGetPage(entry->buffer);
-
-		/*
 		 * The first item > advancePast might not be on this page, but
 		 * somewhere to the right, if the page was split. Keep following
 		 * the right-links until we re-find the correct page.
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 8c350b9..a12dfc3 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -703,7 +703,7 @@ extern void GinPageDeletePostingItem(Page page, OffsetNumber offset);
 extern void ginInsertItemPointers(Relation index, BlockNumber rootBlkno,
 					  ItemPointerData *items, uint32 nitem,
 					  GinStatsData *buildStats);
-extern GinBtreeStack *ginScanBeginPostingTree(Relation index, BlockNumber rootBlkno);
+extern GinBtreeStack *ginScanBeginPostingTree(GinBtree btree, Relation index, BlockNumber rootBlkno);
 extern void ginDataFillRoot(GinBtree btree, Page root, BlockNumber lblkno, Page lpage, BlockNumber rblkno, Page rpage);
 extern void ginPrepareDataScan(GinBtree btree, Relation index, BlockNumber rootBlkno);
 
@@ -803,6 +803,7 @@ typedef struct GinScanEntryData
 	bool		isFinished;
 	bool		reduceResult;
 	uint32		predictNumberResult;
+	GinBtreeData btree;
 }	GinScanEntryData;
 
 typedef struct GinScanOpaqueData
-- 
1.8.5.2

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to