On Sun, Sep 17, 2023 at 4:47 PM Peter Geoghegan <p...@bowt.ie> wrote:
> Attached is v2, which makes all array key advancement take place using
> the "next index tuple" approach (using binary searches to find array
> keys using index tuple values).

Attached is v3, which fixes bitrot caused by today's bugfix commit
714780dc. No notable changes here compared to v2.

--
Peter Geoghegan
From 2cff1dadb7903d49a2338b64b27178fa0bc51456 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 17 Jun 2023 17:03:36 -0700
Subject: [PATCH v3] Enhance nbtree ScalarArrayOp execution.

Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively.  This works by pushing additional context about the arrays
down into the nbtree index AM, as index quals.  This information
enabled nbtree to execute multiple primitive index scans as part of an
index scan executor node that was treated as one continuous index scan.
The motivation behind this earlier work was enabling index-only scans
with ScalarArrayOpExpr clauses (SAOP quals are traditionally executed
via BitmapOr nodes, an approach that is largely index-AM-agnostic, but
that always requires heap access).

The general idea of giving the index AM this additional context can be
pushed a lot further, though.  Teach nbtree SAOP index scans to
dynamically advance array scan keys using information about the
characteristics of the index, determined at runtime.  The array key
state machine advances the current array keys using the next index
tuple in line to be scanned, at the point where the scan reaches the
end of the last set of array keys.  This approach is far more flexible,
and can be far more efficient.  Cases that previously required hundreds
(even thousands) of primitive index scans now require as few as a
single primitive index scan.

Also remove all restrictions on generating pathkeys for nbtree index
scans that happen to have ScalarArrayOpExpr quals.  Bugfix commit
807a40c5 taught the planner to avoid generating unsafe pathkeys:
pathkeys on a multicolumn index path, with a SAOP clause on any
attribute beyond the first/most significant attribute.  These cases are
now safe.  Now nbtree index scans with an inequality clause on a high
order column and a SAOP clause on a lower order column are executed as
a single primitive index scan, since that is the most efficient way to
do it.
Non-required equality type SAOP quals are executed by nbtree using
almost the same approach used for required equality type SAOP quals.
nbtree is now strictly guaranteed to avoid all repeat accesses to any
individual leaf page, even in cases with inequalities on high order
columns (except when the scan direction changes, or the scan restarts).
We now have strong guarantees about the worst case, which is very
useful when costing index scans with SAOP clauses.  The cost profile of
index paths with multiple SAOP clauses is now a lot closer to other
cases; more selective index scans will now generally have lower costs
than less selective index scans.  The added cost from repeatedly
descending the index still matters, but it can never dominate.

An important goal of this work is to remove all ScalarArrayOpExpr
clause special cases from the planner -- ScalarArrayOpExpr clauses can
now be thought of as a generalization of simple equality clauses
(except when costing index scans, perhaps).  The planner no longer
needs to generate alternative index paths with filter quals/qpquals.
We assume that true SAOP index quals are strictly better than
filter/qpquals, since the work in nbtree guarantees that they'll be at
least slightly faster.

Many of the queries sped up by the work from this commit don't directly
benefit from the nbtree/executor enhancements.  They benefit
indirectly.  The planner no longer shows any restraint around making
SAOP clauses into true nbtree index quals, which tends to result in
significant savings on heap page accesses.  In general we never need
visibility checks to evaluate true index quals, whereas filter quals
often need to perform extra heap accesses, just to eliminate
non-matching tuples (expression evaluation is only safe with
known-visible tuples).
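The array advancement scheme described above relies on each array having
been sorted and deduplicated up front: given the value that the next
index tuple holds in the array's column, a binary search locates the
first array element that could still match for the current scan
direction.  The following is an illustrative sketch only, not patch
code -- the function name, the use of plain ints in place of Datums,
and the forward-scan-only shape are all simplifying assumptions:

```c
/*
 * Illustrative sketch (not from the patch): for a forward scan, find
 * the index of the first element of a sorted, deduplicated array that
 * is >= the value taken from the next index tuple.  Elements before
 * that point cover key space the scan has already passed, so they can
 * never match again.  Returns nelems when every element is behind the
 * scan (arrays exhausted for this direction).
 */
static int
advance_array_key(const int *elems, int nelems, int tupdatum)
{
    int low = 0,
        high = nelems;

    while (low < high)
    {
        int mid = low + (high - low) / 2;

        if (elems[mid] < tupdatum)
            low = mid + 1;      /* element already passed by the scan */
        else
            high = mid;         /* element could still match */
    }
    return low;
}
```

For example, with elements {2, 5, 9}: a tuple value of 5 lands on the
matching element at index 1 (the scan just continues), a value of 6
advances to index 2 (element 9, where the next primitive index scan
would begin), and a value of 10 returns 3, meaning the array is
exhausted in this direction.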
Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com --- src/include/access/nbtree.h | 39 +- src/backend/access/nbtree/nbtree.c | 65 +- src/backend/access/nbtree/nbtsearch.c | 62 +- src/backend/access/nbtree/nbtutils.c | 1367 ++++++++++++++++++-- src/backend/optimizer/path/indxpath.c | 64 +- src/backend/utils/adt/selfuncs.c | 123 +- doc/src/sgml/monitoring.sgml | 13 + src/test/regress/expected/create_index.out | 61 +- src/test/regress/expected/join.out | 5 +- src/test/regress/sql/create_index.sql | 20 +- 10 files changed, 1506 insertions(+), 313 deletions(-) diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 6345e16d7..33db9b648 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -1043,13 +1043,13 @@ typedef struct BTScanOpaqueData /* workspace for SK_SEARCHARRAY support */ ScanKey arrayKeyData; /* modified copy of scan->keyData */ - bool arraysStarted; /* Started array keys, but have yet to "reach - * past the end" of all arrays? */ int numArrayKeys; /* number of equality-type array keys (-1 if * there are any unsatisfiable array keys) */ - int arrayKeyCount; /* count indicating number of array scan keys - * processed */ + bool needPrimScan; /* Perform another primitive scan? 
*/ BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */ + FmgrInfo *orderProcs; /* ORDER procs for equality constraint keys */ + int numPrimScans; /* Running tally of # primitive index scans + * (used to coordinate parallel workers) */ MemoryContext arrayContext; /* scan-lifespan context for array data */ /* info about killed items if any (killedItems is NULL if never used) */ @@ -1080,6 +1080,29 @@ typedef struct BTScanOpaqueData typedef BTScanOpaqueData *BTScanOpaque; +/* + * _bt_readpage state used across _bt_checkkeys calls for a page + * + * When _bt_readpage is called during a forward scan that has one or more + * equality-type SK_SEARCHARRAY scan keys, it has an extra responsibility: to + * set up information about the page high key. This must happen before the + * first call to _bt_checkkeys. _bt_checkkeys uses this information to manage + * advancement of the scan's array keys. + */ +typedef struct BTReadPageState +{ + /* Input parameters, set by _bt_readpage */ + ScanDirection dir; /* current scan direction */ + IndexTuple highkey; /* page high key, set by forward scans */ + + /* Output parameters, set by _bt_checkkeys */ + bool continuescan; /* Terminate ongoing (primitive) index scan? */ + + /* Private _bt_checkkeys-managed state */ + bool highkeychecked; /* high key checked against current + * SK_SEARCHARRAY array keys? */ +} BTReadPageState; + /* * We use some private sk_flags bits in preprocessed scan keys. We're allowed * to use bits 16-31 (see skey.h). 
The uppermost bits are copied from the @@ -1157,7 +1180,7 @@ extern bool btcanreturn(Relation index, int attno); extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno); extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page); extern void _bt_parallel_done(IndexScanDesc scan); -extern void _bt_parallel_advance_array_keys(IndexScanDesc scan); +extern void _bt_parallel_next_primitive_scan(IndexScanDesc scan); /* * prototypes for functions in nbtdedup.c @@ -1250,12 +1273,12 @@ extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup); extern void _bt_freestack(BTStack stack); extern void _bt_preprocess_array_keys(IndexScanDesc scan); extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir); -extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir); +extern bool _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir); extern void _bt_mark_array_keys(IndexScanDesc scan); extern void _bt_restore_array_keys(IndexScanDesc scan); extern void _bt_preprocess_keys(IndexScanDesc scan); -extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, - int tupnatts, ScanDirection dir, bool *continuescan); +extern bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, + IndexTuple tuple, bool finaltup); extern void _bt_killitems(IndexScanDesc scan); extern BTCycleId _bt_vacuum_cycleid(Relation rel); extern BTCycleId _bt_start_vacuum(Relation rel); diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 6c5b5c69c..27fbb86d0 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -48,8 +48,8 @@ * BTPARALLEL_IDLE indicates that no backend is currently advancing the scan * to a new page; some process can start doing that. * - * BTPARALLEL_DONE indicates that the scan is complete (including error exit). - * We reach this state once for every distinct combination of array keys. 
+ * BTPARALLEL_DONE indicates that the primitive index scan is complete + * (including error exit). Reached once per primitive index scan. */ typedef enum { @@ -69,8 +69,8 @@ typedef struct BTParallelScanDescData BTPS_State btps_pageStatus; /* indicates whether next page is * available for scan. see above for * possible states of parallel scan. */ - int btps_arrayKeyCount; /* count indicating number of array scan - * keys processed by parallel scan */ + int btps_numPrimScans; /* count indicating number of primitive + * index scans (used with array keys) */ slock_t btps_mutex; /* protects above variables */ ConditionVariable btps_cv; /* used to synchronize parallel scan */ } BTParallelScanDescData; @@ -276,7 +276,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir) if (res) break; /* ... otherwise see if we have more array keys to deal with */ - } while (so->numArrayKeys && _bt_advance_array_keys(scan, dir)); + } while (so->numArrayKeys && _bt_array_keys_remain(scan, dir)); return res; } @@ -334,7 +334,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm) } } /* Now see if we have more array keys to deal with */ - } while (so->numArrayKeys && _bt_advance_array_keys(scan, ForwardScanDirection)); + } while (so->numArrayKeys && _bt_array_keys_remain(scan, ForwardScanDirection)); return ntids; } @@ -364,9 +364,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys) so->keyData = NULL; so->arrayKeyData = NULL; /* assume no array keys for now */ - so->arraysStarted = false; so->numArrayKeys = 0; + so->needPrimScan = false; so->arrayKeys = NULL; + so->orderProcs = NULL; so->arrayContext = NULL; so->killedItems = NULL; /* until needed */ @@ -406,7 +407,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, } so->markItemIndex = -1; - so->arrayKeyCount = 0; + so->needPrimScan = false; + so->numPrimScans = 0; BTScanPosUnpinIfPinned(so->markPos); BTScanPosInvalidate(so->markPos); @@ -587,7 +589,7 @@ btinitparallelscan(void *target) 
SpinLockInit(&bt_target->btps_mutex); bt_target->btps_scanPage = InvalidBlockNumber; bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED; - bt_target->btps_arrayKeyCount = 0; + bt_target->btps_numPrimScans = 0; ConditionVariableInit(&bt_target->btps_cv); } @@ -613,7 +615,7 @@ btparallelrescan(IndexScanDesc scan) SpinLockAcquire(&btscan->btps_mutex); btscan->btps_scanPage = InvalidBlockNumber; btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED; - btscan->btps_arrayKeyCount = 0; + btscan->btps_numPrimScans = 0; SpinLockRelease(&btscan->btps_mutex); } @@ -624,7 +626,17 @@ btparallelrescan(IndexScanDesc scan) * * The return value is true if we successfully seized the scan and false * if we did not. The latter case occurs if no pages remain for the current - * set of scankeys. + * primitive index scan. + + * When array scan keys are in use, each worker process independently advances + * its array keys. It's crucial that each worker process never be allowed to + * scan a page from before the current scan position. + + * XXX This particular aspect of the patch is still at the proof of concept + * stage. Having this much available for review at least suggests that it'll + * be feasible to port the existing parallel scan array scan key stuff over to + * using a primitive index scan counter (as opposed to an array key counter) + * for the top-level scan. I have yet to really put this code through its paces. * * If the return value is true, *pageno returns the next or current page * of the scan (depending on the scan direction). An invalid block number @@ -655,16 +667,17 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno) SpinLockAcquire(&btscan->btps_mutex); pageStatus = btscan->btps_pageStatus; - if (so->arrayKeyCount < btscan->btps_arrayKeyCount) + if (so->numPrimScans < btscan->btps_numPrimScans) { - /* Parallel scan has already advanced to a new set of scankeys.
*/ + /* Top-level scan already moved on to next primitive index scan */ status = false; } else if (pageStatus == BTPARALLEL_DONE) { /* - * We're done with this set of scankeys. This may be the end, or - * there could be more sets to try. + * We're done with this primitive index scan. This might have + * been the final primitive index scan required, or the top-level + * index scan might require additional primitive scans. */ status = false; } @@ -696,9 +709,12 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno) void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page) { + BTScanOpaque so PG_USED_FOR_ASSERTS_ONLY = (BTScanOpaque) scan->opaque; ParallelIndexScanDesc parallel_scan = scan->parallel_scan; BTParallelScanDesc btscan; + Assert(!so->needPrimScan); + btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan, parallel_scan->ps_offset); @@ -732,12 +748,11 @@ _bt_parallel_done(IndexScanDesc scan) parallel_scan->ps_offset); /* - * Mark the parallel scan as done for this combination of scan keys, - * unless some other process already did so. See also - * _bt_advance_array_keys. + * Mark the primitive index scan as done, unless some other process + * already did so. See also _bt_array_keys_remain. */ SpinLockAcquire(&btscan->btps_mutex); - if (so->arrayKeyCount >= btscan->btps_arrayKeyCount && + if (so->numPrimScans >= btscan->btps_numPrimScans && btscan->btps_pageStatus != BTPARALLEL_DONE) { btscan->btps_pageStatus = BTPARALLEL_DONE; @@ -751,14 +766,14 @@ _bt_parallel_done(IndexScanDesc scan) } /* - * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array - * keys. + * _bt_parallel_next_primitive_scan() -- Advances parallel primitive scan + * counter when array keys are in use. * - * Updates the count of array keys processed for both local and parallel + * Updates the count of primitive index scans for both local and parallel * scans. 
*/ void -_bt_parallel_advance_array_keys(IndexScanDesc scan) +_bt_parallel_next_primitive_scan(IndexScanDesc scan) { BTScanOpaque so = (BTScanOpaque) scan->opaque; ParallelIndexScanDesc parallel_scan = scan->parallel_scan; @@ -767,13 +782,13 @@ _bt_parallel_advance_array_keys(IndexScanDesc scan) btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan, parallel_scan->ps_offset); - so->arrayKeyCount++; + so->numPrimScans++; SpinLockAcquire(&btscan->btps_mutex); if (btscan->btps_pageStatus == BTPARALLEL_DONE) { btscan->btps_scanPage = InvalidBlockNumber; btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED; - btscan->btps_arrayKeyCount++; + btscan->btps_numPrimScans++; } SpinLockRelease(&btscan->btps_mutex); } diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c index 17ad89749..f15cd0870 100644 --- a/src/backend/access/nbtree/nbtsearch.c +++ b/src/backend/access/nbtree/nbtsearch.c @@ -893,7 +893,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) */ if (!so->qual_ok) { - /* Notify any other workers that we're done with this scan key. */ + /* Notify any other workers that this primitive scan is done */ _bt_parallel_done(scan); return false; } @@ -952,6 +952,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) * one we use --- by definition, they are either redundant or * contradictory. * + * When SK_SEARCHARRAY keys are in use, _bt_tuple_before_array_keys is + * used to avoid prematurely stopping the scan when an array equality qual + * has its array keys advanced. + * * Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier. 
* If the index stores nulls at the end of the index we'll be starting * from, and we have no boundary key for the column (which means the key @@ -1536,9 +1540,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) BTPageOpaque opaque; OffsetNumber minoff; OffsetNumber maxoff; + BTReadPageState pstate; int itemIndex; - bool continuescan; - int indnatts; /* * We must have the buffer pinned and locked, but the usual macro can't be @@ -1558,8 +1561,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf)); } - continuescan = true; /* default assumption */ - indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation); + pstate.dir = dir; + pstate.highkey = NULL; + pstate.continuescan = true; /* default assumption */ + pstate.highkeychecked = false; + minoff = P_FIRSTDATAKEY(opaque); maxoff = PageGetMaxOffsetNumber(page); @@ -1594,6 +1600,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) if (ScanDirectionIsForward(dir)) { + /* SK_SEARCHARRAY scans must provide high key up front */ + if (so->numArrayKeys && !P_RIGHTMOST(opaque)) + { + ItemId iid = PageGetItemId(page, P_HIKEY); + + pstate.highkey = (IndexTuple) PageGetItem(page, iid); + } + /* load items[] in ascending order */ itemIndex = 0; @@ -1616,7 +1630,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) itup = (IndexTuple) PageGetItem(page, iid); - if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan)) + if (_bt_checkkeys(scan, &pstate, itup, false)) { /* tuple passes all scan key conditions */ if (!BTreeTupleIsPosting(itup)) @@ -1649,7 +1663,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) } } /* When !continuescan, there can't be any more matches, so stop */ - if (!continuescan) + if (!pstate.continuescan) break; offnum = OffsetNumberNext(offnum); @@ -1666,17 +1680,23 @@ _bt_readpage(IndexScanDesc scan, 
ScanDirection dir, OffsetNumber offnum) * only appear on non-pivot tuples on the right sibling page are * common. */ - if (continuescan && !P_RIGHTMOST(opaque)) + if (pstate.continuescan && !P_RIGHTMOST(opaque)) { - ItemId iid = PageGetItemId(page, P_HIKEY); - IndexTuple itup = (IndexTuple) PageGetItem(page, iid); - int truncatt; + IndexTuple itup; - truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation); - _bt_checkkeys(scan, itup, truncatt, dir, &continuescan); + if (pstate.highkey) + itup = pstate.highkey; + else + { + ItemId iid = PageGetItemId(page, P_HIKEY); + + itup = (IndexTuple) PageGetItem(page, iid); + } + + _bt_checkkeys(scan, &pstate, itup, true); } - if (!continuescan) + if (!pstate.continuescan) so->currPos.moreRight = false; Assert(itemIndex <= MaxTIDsPerBTreePage); @@ -1697,6 +1717,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) IndexTuple itup; bool tuple_alive; bool passes_quals; + bool finaltup = (offnum == minoff); /* * If the scan specifies not to return killed tuples, then we @@ -1707,12 +1728,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) * tuple on the page, we do check the index keys, to prevent * uselessly advancing to the page to the left. This is similar * to the high key optimization used by forward scans. + * + * Separately, _bt_checkkeys actually requires that we call it + * with the final non-pivot tuple from the page, if there's one + * (final processed tuple, or first tuple in offset number terms). + * We must indicate which particular tuple comes last, too. 
*/ if (scan->ignore_killed_tuples && ItemIdIsDead(iid)) { Assert(offnum >= P_FIRSTDATAKEY(opaque)); - if (offnum > P_FIRSTDATAKEY(opaque)) + if (!finaltup) { + Assert(offnum > minoff); offnum = OffsetNumberPrev(offnum); continue; } @@ -1724,8 +1751,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) itup = (IndexTuple) PageGetItem(page, iid); - passes_quals = _bt_checkkeys(scan, itup, indnatts, dir, - &continuescan); + passes_quals = _bt_checkkeys(scan, &pstate, itup, finaltup); if (passes_quals && tuple_alive) { /* tuple passes all scan key conditions */ @@ -1764,7 +1790,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) } } } - if (!continuescan) + if (!pstate.continuescan) { /* there can't be any more matches, so stop */ so->currPos.moreLeft = false; diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c index e4528db47..292d2e322 100644 --- a/src/backend/access/nbtree/nbtutils.c +++ b/src/backend/access/nbtree/nbtutils.c @@ -33,7 +33,7 @@ typedef struct BTSortArrayContext { - FmgrInfo flinfo; + FmgrInfo *orderproc; Oid collation; bool reverse; } BTSortArrayContext; @@ -41,15 +41,33 @@ typedef struct BTSortArrayContext static Datum _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey, StrategyNumber strat, Datum *elems, int nelems); +static void _bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey); static int _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey, bool reverse, Datum *elems, int nelems); static int _bt_compare_array_elements(const void *a, const void *b, void *arg); +static inline int32 _bt_compare_array_skey(ScanKey cur, FmgrInfo *orderproc, + Datum datum, bool null, + Datum arrdatum); +static int _bt_binsrch_array_skey(ScanDirection dir, bool cur_elem_start, + BTArrayKeyInfo *array, ScanKey cur, + FmgrInfo *orderproc, Datum datum, bool null, + int32 *final_result); +static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, + BTReadPageState 
*pstate, + IndexTuple tuple); +static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate, + IndexTuple tuple, bool skrequiredtrigger); +static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir); +static void _bt_advance_array_keys_to_end(IndexScanDesc scan, ScanDirection dir); static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op, ScanKey leftarg, ScanKey rightarg, bool *result); static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption); static void _bt_mark_scankey_required(ScanKey skey); +static bool _bt_check_compare(ScanDirection dir, ScanKey keyData, int keysz, + IndexTuple tuple, int tupnatts, TupleDesc tupdesc, + bool *continuescan, bool *skrequiredtrigger); static bool _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts, TupleDesc tupdesc, ScanDirection dir, bool *continuescan); @@ -202,6 +220,21 @@ _bt_freestack(BTStack stack) * array keys, it's sufficient to find the extreme element value and replace * the whole array with that scalar value. * + * In the worst case, the number of primitive index scans will equal the + * number of array elements (or the product of the number of array keys when + * there are multiple arrays/columns involved). It's also possible that the + * total number of primitive index scans will be far less than that. + * + * We always sort and deduplicate arrays up-front for equality array keys. + * ScalarArrayOpExpr execution need only visit leaf pages that might contain + * matches exactly once, while preserving the sort order of the index. This + * isn't just about performance; it also avoids needing duplicate elimination + * of matching TIDs (we prefer deduplicating search keys once, up-front). + * Equality SK_SEARCHARRAY keys are disjuncts that we always process in + * index/key space order, which makes this general approach feasible. 
Every + * index tuple will match no more than one single distinct combination of + * equality-constrained keys (array keys and other equality keys). + * * Note: the reason we need so->arrayKeyData, rather than just scribbling * on scan->keyData, is that callers are permitted to call btrescan without * supplying a new set of scankey data. @@ -212,6 +245,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan) BTScanOpaque so = (BTScanOpaque) scan->opaque; int numberOfKeys = scan->numberOfKeys; int16 *indoption = scan->indexRelation->rd_indoption; + int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(scan->indexRelation); int numArrayKeys; ScanKey cur; int i; @@ -265,6 +299,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan) /* Allocate space for per-array data in the workspace context */ so->arrayKeys = (BTArrayKeyInfo *) palloc0(numArrayKeys * sizeof(BTArrayKeyInfo)); + so->orderProcs = (FmgrInfo *) palloc(nkeyatts * sizeof(FmgrInfo)); /* Now process each array key */ numArrayKeys = 0; @@ -281,6 +316,17 @@ _bt_preprocess_array_keys(IndexScanDesc scan) int j; cur = &so->arrayKeyData[i]; + + /* + * Attributes with equality-type scan keys (including but not limited + * to array scan keys) will need a 3-way comparison function. + * + * XXX Clean this up some more. This repeats some of the same work + * when there are multiple scan keys for the same key column. + */ + if (cur->sk_strategy == BTEqualStrategyNumber) + _bt_sort_cmp_func_setup(scan, cur); + if (!(cur->sk_flags & SK_SEARCHARRAY)) continue; @@ -436,6 +482,42 @@ _bt_find_extreme_element(IndexScanDesc scan, ScanKey skey, return result; } +/* + * Look up the appropriate comparison function in the opfamily. + * + * Note: it's possible that this would fail, if the opfamily is incomplete, + * but it seems quite unlikely that an opfamily would omit non-cross-type + * support functions for any datatype that it supports at all. 
+ */ +static void +_bt_sort_cmp_func_setup(IndexScanDesc scan, ScanKey skey) +{ + BTScanOpaque so = (BTScanOpaque) scan->opaque; + Relation rel = scan->indexRelation; + Oid elemtype; + RegProcedure cmp_proc; + FmgrInfo *orderproc = &so->orderProcs[skey->sk_attno - 1]; + + /* + * Determine the nominal datatype of the array elements. We have to + * support the convention that sk_subtype == InvalidOid means the opclass + * input type; this is a hack to simplify life for ScanKeyInit(). + */ + elemtype = skey->sk_subtype; + if (elemtype == InvalidOid) + elemtype = rel->rd_opcintype[skey->sk_attno - 1]; + + cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1], + rel->rd_opcintype[skey->sk_attno - 1], + elemtype, + BTORDER_PROC); + if (!RegProcedureIsValid(cmp_proc)) + elog(ERROR, "missing support function %d(%u,%u) in opfamily %u", + BTORDER_PROC, elemtype, elemtype, + rel->rd_opfamily[skey->sk_attno - 1]); + fmgr_info_cxt(cmp_proc, orderproc, so->arrayContext); +} + /* * _bt_sort_array_elements() -- sort and de-dup array elements * @@ -450,42 +532,14 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey, bool reverse, Datum *elems, int nelems) { - Relation rel = scan->indexRelation; - Oid elemtype; - RegProcedure cmp_proc; + BTScanOpaque so = (BTScanOpaque) scan->opaque; BTSortArrayContext cxt; if (nelems <= 1) return nelems; /* no work to do */ - /* - * Determine the nominal datatype of the array elements. We have to - * support the convention that sk_subtype == InvalidOid means the opclass - * input type; this is a hack to simplify life for ScanKeyInit(). - */ - elemtype = skey->sk_subtype; - if (elemtype == InvalidOid) - elemtype = rel->rd_opcintype[skey->sk_attno - 1]; - - /* - * Look up the appropriate comparison function in the opfamily. 
- * - * Note: it's possible that this would fail, if the opfamily is - * incomplete, but it seems quite unlikely that an opfamily would omit - * non-cross-type support functions for any datatype that it supports at - * all. - */ - cmp_proc = get_opfamily_proc(rel->rd_opfamily[skey->sk_attno - 1], - elemtype, - elemtype, - BTORDER_PROC); - if (!RegProcedureIsValid(cmp_proc)) - elog(ERROR, "missing support function %d(%u,%u) in opfamily %u", - BTORDER_PROC, elemtype, elemtype, - rel->rd_opfamily[skey->sk_attno - 1]); - /* Sort the array elements */ - fmgr_info(cmp_proc, &cxt.flinfo); + cxt.orderproc = &so->orderProcs[skey->sk_attno - 1]; cxt.collation = skey->sk_collation; cxt.reverse = reverse; qsort_arg(elems, nelems, sizeof(Datum), @@ -507,7 +561,7 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg) BTSortArrayContext *cxt = (BTSortArrayContext *) arg; int32 compare; - compare = DatumGetInt32(FunctionCall2Coll(&cxt->flinfo, + compare = DatumGetInt32(FunctionCall2Coll(cxt->orderproc, cxt->collation, da, db)); if (cxt->reverse) @@ -515,6 +569,171 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg) return compare; } +/* + * Comparator used to search for the next array element when array keys need + * to be advanced via one or more binary searches + * + * This code is loosely based on _bt_compare. However, there are some + * important differences. + * + * It is convenient to think of calling _bt_compare as comparing caller's + * insertion scankey to an index tuple. But our callers are not searching + * through the index at all -- they're searching through a local array of + * datums associated with a scan key (using values they've taken from an index + * tuple). This is a complete reversal of how things usually work, which can + * be confusing. + * + * Callers of this function should think of it as comparing "datum" (as well + * as "null") to "arrdatum".
This is the same approach that _bt_compare takes + * in that both functions compare the value that they're searching for to one + * particular item used as a binary search pivot. (But it's the wrong way + * around if you think of it as "tuple values vs scan key values". So don't.) +*/ +static inline int32 +_bt_compare_array_skey(ScanKey cur, + FmgrInfo *orderproc, + Datum datum, + bool null, + Datum arrdatum) +{ + int32 result = 0; + + Assert(cur->sk_strategy == BTEqualStrategyNumber); + + if (cur->sk_flags & SK_ISNULL) /* array/scan key is NULL */ + { + if (null) + result = 0; /* NULL "=" NULL */ + else if (cur->sk_flags & SK_BT_NULLS_FIRST) + result = 1; /* NULL "<" NOT_NULL */ + else + result = -1; /* NULL ">" NOT_NULL */ + } + else if (null) /* array/scan key is NOT_NULL and tuple item + * is NULL */ + { + if (cur->sk_flags & SK_BT_NULLS_FIRST) + result = -1; /* NOT_NULL ">" NULL */ + else + result = 1; /* NOT_NULL "<" NULL */ + } + else + { + /* + * Like _bt_compare, we need to be careful of cross-type comparisons, + * so the left value has to be the value that came from an index + * tuple. (Array scan keys cannot be cross-type, but other required + * scan keys that use an equal operator can be.) + */ + result = DatumGetInt32(FunctionCall2Coll(orderproc, cur->sk_collation, + datum, arrdatum)); + + /* + * Unlike _bt_compare, we flip the sign when column is a DESC column + * (and *not* when column is ASC). This matches the approach taken by + * _bt_check_rowcompare, which performs similar three-way comparisons. + */ + if (cur->sk_flags & SK_BT_DESC) + INVERT_COMPARE_RESULT(result); + } + + return result; +} + +/* + * _bt_binsrch_array_skey() -- Binary search for next matching array key + * + * cur_elem_start indicates if the binary search should begin at the array's + * current element (or have the current element as an upper bound if it's a + * backward scan). 
This allows searches against required scan key arrays to + * reuse the work of earlier searches, at least in many important cases. + * Array keys covering key space that the index scan already processed cannot + * possibly contain any matches. + * + * XXX There are several fairly obvious optimizations that we could apply here + * (e.g., precheck searches for earlier subsets of a larger array would help). + * Revisit this during the next round of performance validation. + * + * Returns an index to the first array element >= caller's datum argument. + * Also sets *final_result to whatever _bt_compare_array_skey returned when we + * directly compared the returned array element to searched-for datum. + */ +static int +_bt_binsrch_array_skey(ScanDirection dir, bool cur_elem_start, + BTArrayKeyInfo *array, ScanKey cur, + FmgrInfo *orderproc, Datum datum, bool null, + int32 *final_result) +{ + int low_elem, + high_elem, + first_elem_dir, + result = 0; + bool knownequal = false; + + Assert(cur->sk_flags & SK_SEARCHARRAY); + Assert(cur->sk_strategy == BTEqualStrategyNumber); + + if (ScanDirectionIsForward(dir)) + { + first_elem_dir = 0; + low_elem = array->cur_elem; + high_elem = array->num_elems - 1; + if (cur_elem_start) + low_elem = 0; + } + else + { + first_elem_dir = array->num_elems - 1; + low_elem = 0; + high_elem = array->cur_elem; + if (cur_elem_start) + { + low_elem = 0; + high_elem = first_elem_dir; + } + } + + while (high_elem > low_elem) + { + int mid_elem = low_elem + ((high_elem - low_elem) / 2); + Datum arrdatum = array->elem_values[mid_elem]; + + result = _bt_compare_array_skey(cur, orderproc, datum, null, arrdatum); + + if (result == 0) + { + /* + * Each array was deduplicated during initial preprocessing, so + * each element is guaranteed to be unique. We can quit as + * soon as we see an equal array element, saving ourselves an extra + * comparison or two...
+ */ + low_elem = mid_elem; + knownequal = true; + break; + } + + if (result > 0) + low_elem = mid_elem + 1; + else + high_elem = mid_elem; + } + + /* + * ... but our caller also cares about the position of the searched-for + * datum relative to the low_elem match we'll return. Make sure that we + * set *final_result to the result that comes from comparing low_elem's + * key value to the datum that caller had us search for. + */ + if (!knownequal) + result = _bt_compare_array_skey(cur, orderproc, datum, null, + array->elem_values[low_elem]); + + *final_result = result; + + return low_elem; +} + /* * _bt_start_array_keys() -- Initialize array keys at start of a scan * @@ -539,82 +758,22 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) curArrayKey->cur_elem = 0; skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem]; } - - so->arraysStarted = true; -} - -/* - * _bt_advance_array_keys() -- Advance to next set of array elements - * - * Returns true if there is another set of values to consider, false if not. - * On true result, the scankeys are initialized with the next set of values. - */ -bool -_bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir) -{ - BTScanOpaque so = (BTScanOpaque) scan->opaque; - bool found = false; - int i; - - /* - * We must advance the last array key most quickly, since it will - * correspond to the lowest-order index column among the available - * qualifications. This is necessary to ensure correct ordering of output - * when there are multiple array keys. 
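As a review aside: the "first element >= searched-for datum" scheme that _bt_binsrch_array_skey implements can be modeled outside of nbtree. The following is a hedged, standalone sketch (not the patch's code): hypothetical names, plain ints standing in for Datums and the orderproc comparator, ascending sort order only. It searches a sorted, deduplicated array and also reports the final three-way comparison result, as the patch's function does via *final_result:

```c
#include <assert.h>

/*
 * Hypothetical model of _bt_binsrch_array_skey's core loop.  The array is
 * sorted ascending and deduplicated, as nbtree guarantees after initial
 * preprocessing.  Returns the index of the first element >= value; stores
 * the three-way result of comparing value against that element.
 */
static int
binsrch_array_sketch(const int *elems, int nelems, int value,
                     int *final_result)
{
    int low = 0,
        high = nelems - 1,
        result;

    while (high > low)
    {
        int mid = low + ((high - low) / 2);

        result = (value > elems[mid]) - (value < elems[mid]);
        if (result == 0)
        {
            /* elements are unique, so this must be the only match */
            *final_result = 0;
            return mid;
        }
        if (result > 0)
            low = mid + 1;      /* value sorts after elems[mid] */
        else
            high = mid;         /* elems[mid] remains a candidate */
    }

    /* report how value compares against the element we settled on */
    *final_result = (value > elems[low]) - (value < elems[low]);
    return low;
}
```

When the searched-for value is past the final element, the sketch (like the patch) still returns the last index, with a nonzero final result telling the caller that wraparound handling is needed.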
- */ - for (i = so->numArrayKeys - 1; i >= 0; i--) - { - BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i]; - ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key]; - int cur_elem = curArrayKey->cur_elem; - int num_elems = curArrayKey->num_elems; - - if (ScanDirectionIsBackward(dir)) - { - if (--cur_elem < 0) - { - cur_elem = num_elems - 1; - found = false; /* need to advance next array key */ - } - else - found = true; - } - else - { - if (++cur_elem >= num_elems) - { - cur_elem = 0; - found = false; /* need to advance next array key */ - } - else - found = true; - } - - curArrayKey->cur_elem = cur_elem; - skey->sk_argument = curArrayKey->elem_values[cur_elem]; - if (found) - break; - } - - /* advance parallel scan */ - if (scan->parallel_scan != NULL) - _bt_parallel_advance_array_keys(scan); - - /* - * When no new array keys were found, the scan is "past the end" of the - * array keys. _bt_start_array_keys can still "restart" the array keys if - * a rescan is required. - */ - if (!found) - so->arraysStarted = false; - - return found; } /* * _bt_mark_array_keys() -- Handle array keys during btmarkpos * * Save the current state of the array keys as the "mark" position. + * + * XXX The current set of array keys are not independent of the current scan + * position, so why treat them that way? + * + * We shouldn't even bother remembering the current array keys when btmarkpos + * is called. The array keys should be handled lazily instead. If and when + * btrestrpos is called, it can just set every array's cur_elem to the first + * element for the current scan direction. When _bt_advance_array_keys is + * reached (during the first call to _bt_checkkeys that follows), it will + * automatically search for the relevant array keys using caller's tuple. */ void _bt_mark_array_keys(IndexScanDesc scan) @@ -661,13 +820,8 @@ _bt_restore_array_keys(IndexScanDesc scan) * If we changed any keys, we must redo _bt_preprocess_keys. 
That might * sound like overkill, but in cases with multiple keys per index column * it seems necessary to do the full set of pushups. - * - * Also do this whenever the scan's set of array keys "wrapped around" at - * the end of the last primitive index scan. There won't have been a call - * to _bt_preprocess_keys from some other place following wrap around, so - * we do it for ourselves. */ - if (changed || !so->arraysStarted) + if (changed) { _bt_preprocess_keys(scan); /* The mark should have been set on a consistent set of keys... */ @@ -675,6 +829,785 @@ _bt_restore_array_keys(IndexScanDesc scan) } } +/* + * Routine to determine if a continuescan=false tuple (set that way by an + * initial call to _bt_check_compare) might need to advance the scan's array + * keys. + * + * Returns true when caller passes a tuple that is < the current set of array + * keys for the most significant non-equal column/scan key (or > for backwards + * scans). This means that it cannot possibly be time to advance the array + * keys just yet. _bt_checkkeys caller should suppress its _bt_check_compare + * call, and return -- the tuple is treated as not satisfying our indexquals. + * + * Returns false when caller's tuple is >= the current array keys (or <=, in + * the case of backwards scans). This means that it might be time for our + * caller to advance the array keys to the next set. + * + * Note: advancing the array keys may be required when every attribute value + * from caller's tuple is equal to the corresponding scan key/array datums. See + * comments at the start of _bt_advance_array_keys for more.
+ */ +static bool +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate, + IndexTuple tuple) +{ + BTScanOpaque so = (BTScanOpaque) scan->opaque; + Relation rel = scan->indexRelation; + ScanDirection dir = pstate->dir; + TupleDesc itupdesc = RelationGetDescr(rel); + bool tuple_before_array_keys = false; + ScanKey cur; + int ntupatts = BTreeTupleGetNAtts(tuple, rel), + ikey; + + Assert(so->qual_ok); + Assert(so->numArrayKeys > 0); + Assert(so->numberOfKeys > 0); + Assert(!so->needPrimScan); + + for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++) + { + int attnum = cur->sk_attno; + FmgrInfo *orderproc; + Datum datum; + bool null, + skrequired; + int32 result; + + /* + * We only deal with equality strategy scan keys. We leave handling + * of inequalities up to _bt_check_compare. + */ + if (cur->sk_strategy != BTEqualStrategyNumber) + continue; + + /* + * Determine if this scan key is required in the current scan + * direction + */ + skrequired = ((ScanDirectionIsForward(dir) && + (cur->sk_flags & SK_BT_REQFWD)) || + (ScanDirectionIsBackward(dir) && + (cur->sk_flags & SK_BT_REQBKWD))); + + /* + * Unlike _bt_advance_array_keys, we never deal with any non-required + * array keys. Cases where skrequiredtrigger is set to false by + * _bt_check_compare should never lead to a call here. We are only called + * after _bt_check_compare provisionally indicated that the scan should be + * terminated due to a _required_ scan key not being satisfied. + * + * We expect _bt_check_compare to notice and report required scan keys + * before non-required ones. _bt_advance_array_keys might still have + * to advance non-required array keys in passing for a tuple that we + * were called for, but _bt_advance_array_keys doesn't rely on us to + * give it advance notice of that.
+ */ + if (!skrequired) + break; + + if (attnum > ntupatts) + { + /* + * When we reach a high key's truncated attribute, assume that the + * tuple attribute's value is >= the scan's search-type scan keys + */ + break; + } + + datum = index_getattr(tuple, attnum, itupdesc, &null); + + orderproc = &so->orderProcs[attnum - 1]; + result = _bt_compare_array_skey(cur, orderproc, + datum, null, + cur->sk_argument); + + if (result != 0) + { + if (ScanDirectionIsForward(dir)) + tuple_before_array_keys = result < 0; + else + tuple_before_array_keys = result > 0; + + break; + } + } + + return tuple_before_array_keys; +} + +/* + * _bt_array_keys_remain() -- Start another primitive index scan? + * + * Returns true if _bt_checkkeys determined that another primitive index scan + * must take place by calling _bt_first. Otherwise returns false, indicating + * that caller's top-level scan is now past the point where further matching + * index tuples can be found (for the current scan direction). + * + * Only call here during scans with one or more equality type array scan keys. + * All other scans should just call _bt_first once, no matter what. + * + * Top-level index scans executed via multiple primitive index scans must not + * fail to output index tuples in the usual order for the index -- just like + * any other index scan would. The state machine that manages the scan's + * array keys must only start primitive index scans when they cover key space + * strictly greater than the key space for tuples that the scan has already + * returned (or strictly less in the backwards scan case). Otherwise the scan + * could output the same index tuples more than once, or in the wrong order. + * + * This is managed by limiting the cases that can trigger new primitive index + * scans to those involving required array scan keys and/or other required + * scan keys that use the equality strategy. 
In particular, the state machine + * must not allow high order required scan keys using an inequality strategy + * (which are only required in one scan direction) to directly trigger a new + * primitive index scan that advances low order non-required array scan keys. + * For example, a query such as "SELECT thousand, tenthous FROM tenk1 WHERE + * thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand" whose execution + * involves a scan of an index on "(thousand, tenthous)" must perform no more + * than a single primitive index scan. Otherwise we risk outputting tuples in + * the wrong order. Array key values for the non-required scan key on the + * "tenthous" column must not dictate top-level scan order. Primitive index + * scans mustn't scan tuples already scanned by some earlier primitive scan. + * + * In fact, nbtree makes a stronger guarantee than is strictly necessary here: + * it guarantees that the top-level scan won't repeat any leaf page reads. + * (Actually, that can still happen when the scan is repositioned, or the scan + * direction changes -- but that's just as true with other types of scans.) + */ +bool +_bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir) +{ + BTScanOpaque so = (BTScanOpaque) scan->opaque; + + Assert(so->numArrayKeys); + + /* + * Array keys are advanced within _bt_checkkeys when the scan reaches the + * leaf level (more precisely, they're advanced when the scan reaches the + * end of each distinct set of array elements). This process avoids + * repeat access to leaf pages (across multiple primitive index scans) by + * opportunistically advancing the scan's array keys when it allows the + * primitive index scan to find nearby matching tuples (or to eliminate + * array keys with no matching tuples from further consideration). + * + * _bt_checkkeys sets a simple flag variable that we check here. This + * tells us if we need to perform another primitive index scan for the + * now-current array keys or not. 
We'll unset the flag once again to + * acknowledge having started a new primitive scan (or we'll see that it + * isn't set and end the top-level scan right away). + * + * We cannot rely on _bt_first always reaching _bt_checkkeys here. There + * are various scenarios where that won't happen. For example, if the + * index is completely empty, then _bt_first won't get as far as calling + * _bt_readpage/_bt_checkkeys. + * + * We also don't expect _bt_checkkeys to be reached when searching for a + * non-existent value that happens to be higher than any existing value in + * the index. No _bt_checkkeys are expected when _bt_readpage reads the + * rightmost page during such a scan -- even a _bt_checkkeys call against + * the high key won't happen. There is an analogous issue for backwards + * scans that search for a value lower than all existing index tuples. + * + * We don't actually require special handling for these cases -- we don't + * need to be explicitly instructed to _not_ perform another primitive + * index scan. This is correct for all of the cases we've listed so far, + * which all involve primitive index scans that access pages "near the + * boundaries of the key space" (the leftmost page, the rightmost page, or + * an imaginary empty leaf root page). If _bt_checkkeys cannot be reached + * by a primitive index scan for one set of array keys, it follows that it + * also won't be reached for any later set of array keys. + * + * There is one exception: the case where _bt_first's _bt_preprocess_keys + * call determined that the scan's input scan keys can never be satisfied. + * That might be true for one set of array keys, but not the next set. + */ + if (!so->qual_ok) + { + /* + * Qual can never be satisfied. Advance our array keys incrementally. + */ + so->needPrimScan = false; + if (_bt_advance_array_keys_increment(scan, dir)) + return true; + } + + /* Time for another primitive index scan? 
*/ + if (so->needPrimScan) + { + /* Begin primitive index scan */ + so->needPrimScan = false; + + if (scan->parallel_scan != NULL) + _bt_parallel_next_primitive_scan(scan); + + return true; + } + + /* + * No more primitive index scans. Just terminate the top-level scan. + */ + _bt_advance_array_keys_to_end(scan, dir); + + if (scan->parallel_scan != NULL) + _bt_parallel_done(scan); + + return false; +} + +/* + * _bt_advance_array_keys() -- Advance array elements using a tuple + * + * Returns true if all required equality-type scan keys (in particular, those + * that are array keys) now have exact matching values to those from tuple. + * Returns false when the tuple isn't an exact match in this sense. + * + * Sets pstate.continuescan for caller when we return false. When we return + * true it's up to caller to call _bt_check_compare to recheck the tuple. It + * is okay to let the second call set pstate.continuescan=false without + * further intervention, since we know that it can only be for a scan key that + * is required in one direction. + * + * When called with skrequiredtrigger set to false, we don't expect to have to + * advance any required scan keys. We'll always set pstate.continuescan in + * that case, since a non-required scan key can never terminate the scan. + * + * Required array keys are always advanced to the lowest element >= the + * corresponding tuple attribute value for the most significant non-equal + * column (or to the highest element <= the tuple value during backwards + * scans). If we reach the end of the array keys for the current scan + * direction, we end the top-level index scan. + * + * _bt_tuple_before_array_skeys is responsible for determining if the current + * place in the scan is >= the current array keys (or <= during backward + * scans). This must be established first, before calling here.
+ * + * Note that we may sometimes need to advance the array keys in spite of the + * existing array keys already being an exact match for every corresponding + * value from caller's tuple. We fall back on "incrementally" advancing the + * array keys in these cases, which involve inequality strategy scan keys. + * For example, with a composite index on (a, b) and a qual "WHERE a IN (3,5) + * AND b < 42", we'll be called for both "a" array keys (keys 3 and 5) when the + * scan reaches tuples where "b >= 42". Even though "a" array keys continue + * to have exact matches for tuples where "b >= 42" (for both array key + * groupings), we will still advance the array for "a" via our fallback on + * incremental advancement each time we're called. The first time we're called + * (when the scan reaches a tuple >= "(3, 42)"), we advance the array key (from + * 3 to 5). This gives our caller the option of starting a new primitive index + * scan that quickly locates the start of tuples > "(5, -inf)". The second time + * we're called (when the scan reaches a tuple >= "(5, 42)"), we incrementally + * advance the keys a second time. This second call ends the top-level scan. + * + * Note also that we deal with all required equality-type scan keys here; it's + * not limited to array scan keys. We need to handle non-array equality cases + * here because they're equality constraints for the scan, in the same way + * that array scan keys are. We must not suppress cases where a call to + * _bt_check_compare sets continuescan=false for a required scan key that uses + * the equality strategy (only inequality-type scan keys get that treatment). + * We don't want to suppress the scan's termination when it's inappropriate.
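To make the tuple-driven advancement described above concrete, here is a heavily simplified, hypothetical model (not the patch's _bt_advance_array_keys): forward scans only, plain ints for array elements and tuple values, a linear scan standing in for the binary search, and no handling of the wraparound or incremental-fallback paths. The cascade of lower-order arrays back to their first element is the part being illustrated:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of tuple-driven required array key advancement.
 * arrays[i] holds array i's sorted elements, nelems[i] its length, and
 * cur[i] its current element's index; arrays are ordered from most
 * significant column (0) to least significant.  For the most significant
 * column whose tuple value isn't an exact match, pick the first element
 * >= the tuple value; all lower-order arrays then cascade back to their
 * first element, since the scan is entering a new part of the key space.
 * Returns false when the searched array has no element >= the tuple
 * value (the arrays are exhausted in this simplified model).
 */
static bool
advance_arrays_with_tuple_sketch(int **arrays, const int *nelems,
                                 int *cur, int narrays, const int *tupvals)
{
    for (int i = 0; i < narrays; i++)
    {
        if (tupvals[i] == arrays[i][cur[i]])
            continue;           /* exact match; keep this element */

        /* linear stand-in for the patch's binary search */
        int j = 0;

        while (j < nelems[i] && arrays[i][j] < tupvals[i])
            j++;
        if (j == nelems[i])
            return false;       /* tuple is past this array's key space */
        cur[i] = j;

        /* cascade: reset every lower-order array to its first element */
        for (int k = i + 1; k < narrays; k++)
            cur[k] = 0;
        return true;
    }
    return true;                /* all columns already an exact match */
}
```

For a qual like "a IN (3,5) AND b IN (10,20)", a tuple "(4, 10)" advances "a" from 3 to 5 and resets "b" to 10, mirroring the cascade the patch's comments describe.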
+ */ +static bool +_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate, + IndexTuple tuple, bool skrequiredtrigger) +{ + BTScanOpaque so = (BTScanOpaque) scan->opaque; + Relation rel = scan->indexRelation; + ScanDirection dir = pstate->dir; + TupleDesc itupdesc = RelationGetDescr(rel); + ScanKey cur; + int ikey, + arrayidx = 0, + ntupatts = BTreeTupleGetNAtts(tuple, rel); + bool arrays_advanced = false, + arrays_done = false, + all_skrequired_atts_wrapped = skrequiredtrigger, + all_atts_equal = true; + + Assert(so->numberOfKeys > 0); + Assert(so->numArrayKeys > 0); + Assert(so->qual_ok); + + /* + * Try to advance array keys via a series of binary searches. + * + * Loop iterates through the current scankeys (so->keyData, which were + * output by _bt_preprocess_keys earlier) and then sets input scan keys + * (so->arrayKeyData scan keys) to new array values. This sets things up + * for our call to _bt_preprocess_keys, which is where the current scan + * keys actually change. + * + * We need to do things this way because only current/preprocessed scan + * keys will be marked as required. It's also possible that the previous + * call to _bt_preprocess_keys eliminated one or more input scan keys + * (possibly array type scan keys) that were deemed to be redundant. + */ + for (cur = so->keyData, ikey = 0; ikey < so->numberOfKeys; cur++, ikey++) + { + BTArrayKeyInfo *array = NULL; + ScanKey skeyarray = NULL; + FmgrInfo *orderproc; + int attnum = cur->sk_attno, + first_elem_dir, + final_elem_dir, + set_elem; + Datum datum; + bool skrequired, + null; + int32 result; + + /* + * We only deal with equality strategy scan keys. We leave handling + * of inequalities up to _bt_check_compare. 
+ */ + if (cur->sk_strategy != BTEqualStrategyNumber) + continue; + + /* + * Determine if this scan key is required in the current scan + * direction + */ + skrequired = ((ScanDirectionIsForward(dir) && + (cur->sk_flags & SK_BT_REQFWD)) || + (ScanDirectionIsBackward(dir) && + (cur->sk_flags & SK_BT_REQBKWD))); + + /* + * Optimization: we don't have to advance remaining non-required array + * keys when we already know that tuple won't be returned by the scan. + * + * Deliberately check this both here and after the binary search. + */ + if (!skrequired && !all_atts_equal) + break; + + /* + * We need to check required non-array scan keys (that use the equal + * strategy), as well as required and non-required array scan keys + * (also limited to those that use the equal strategy, since array + * inequalities degenerate into a simple comparison). + * + * Perform initial set up for this scan key. If it is backed by an + * array then we need to set variables describing the current position + * in the array. + */ + orderproc = &so->orderProcs[attnum - 1]; + first_elem_dir = final_elem_dir = 0; /* keep compiler quiet */ + if (cur->sk_flags & SK_SEARCHARRAY) + { + /* Set up array comparison function */ + Assert(arrayidx < so->numArrayKeys); + array = &so->arrayKeys[arrayidx++]; + skeyarray = &so->arrayKeyData[array->scan_key]; + + /* + * It's possible that _bt_preprocess_keys determined that an + * individual array scan key wasn't required in so->keyData for + * the ongoing primitive index scan due to it being redundant or + * contradictory (the current array value might be redundant next + * to some other scan key on the same attribute). Deal with that. 
+ */ + if (unlikely(skeyarray->sk_attno != attnum)) + { + bool found PG_USED_FOR_ASSERTS_ONLY = false; + + for (; arrayidx < so->numArrayKeys; arrayidx++) + { + array = &so->arrayKeys[arrayidx]; + skeyarray = &so->arrayKeyData[array->scan_key]; + if (skeyarray->sk_attno == attnum) + { + found = true; + break; + } + } + + Assert(found); + } + + /* Proactively set up state used to handle array wraparound */ + if (ScanDirectionIsForward(dir)) + { + first_elem_dir = 0; + final_elem_dir = array->num_elems - 1; + } + else + { + first_elem_dir = array->num_elems - 1; + final_elem_dir = 0; + } + } + else if (attnum > ntupatts) + { + /* + * Nothing needs to be done when we have a truncated attribute + * (possible when caller's tuple is a page high key) and a + * non-array scan key + */ + Assert(ScanDirectionIsForward(dir)); + continue; + } + + /* + * Here we perform steps for any required scan keys after the first + * non-equal required scan key. The first scan key must have been set + * to a value > the value from the tuple back when we dealt with it + * (or, for a backwards scan, to a value < the value from the tuple). + * That needs to "cascade" to lower-order array scan keys. They must + * be set to the first array element for the current scan direction. + * + * We're still setting the keys to values >= the tuple here -- it just + * needs to work for the tuple as a whole. For example, when a tuple + * "(a, b) = (42, 5)" advances the array keys on "a" from 40 to 45, we + * must also set "b" to whatever the first array element for "b" is. + * It would be wrong to allow "b" to be set to a value from the tuple, + * since the value is actually from a different part of the key space. + * + * Also defensively do this with truncated attributes when caller's + * tuple is a page high key. 
+ */ + if (array && ((arrays_advanced && !all_atts_equal) || + attnum > ntupatts)) + { + /* Shouldn't reach this far for a non-required scan key */ + Assert(skrequired && skrequiredtrigger && attnum > 1); + + /* + * We set the array to the first element (if needed) here, and we + * don't unset all_skrequired_atts_wrapped. This array therefore + * counts as a wrapped array when we go on to determine if all of + * the required arrays have wrapped (after this loop). + */ + if (array->cur_elem != first_elem_dir) + { + array->cur_elem = first_elem_dir; + skeyarray->sk_argument = array->elem_values[first_elem_dir]; + arrays_advanced = true; + } + + continue; + } + + /* + * Going to compare scan key to corresponding tuple attribute value + */ + datum = index_getattr(tuple, attnum, itupdesc, &null); + + if (!array) + { + if (!skrequired || !all_atts_equal) + continue; + + /* + * This is a required non-array scan key that uses the equal + * strategy. See header comments for an explanation of why we + * need to do this. + */ + result = _bt_compare_array_skey(cur, orderproc, datum, null, + cur->sk_argument); + + if (result != 0) + { + /* + * tuple attribute value is > scan key value (or < scan key + * value in the backward scan case). + */ + all_atts_equal = false; + break; + } + + continue; + } + + /* + * Binary search for an array key >= the tuple value, which we'll then + * set as our current array key (or <= the tuple value if this is a + * backward scan). + * + * The binary search excludes array keys that we've already processed + * from consideration, except with a non-required scan key's array. + * This is not just an optimization -- it's important for correctness. + * It is crucial that required array scan keys only have their array + * keys advanced in the current scan direction. We need to advance + * required array keys in lock step with the index scan.
+ * + * Note in particular that arrays_advanced must only be set when the + * array is advanced to a key >= the existing key, or <= for a + * backwards scan. (Though see notes about wraparound below.) + */ + set_elem = _bt_binsrch_array_skey(dir, (!skrequired || arrays_advanced), + array, cur, orderproc, datum, null, + &result); + + /* + * Maintain the state that tracks whether all attributes from the tuple + * are equal to the array keys that we've set as current (or existing + * array keys set during earlier calls here). + */ + if (result != 0) + all_atts_equal = false; + + /* + * Optimization: we don't have to advance remaining non-required array + * keys when we already know that tuple won't be returned by the scan. + * Quit before setting the array keys to avoid _bt_preprocess_keys. + * + * Deliberately check this both before and after the binary search. + */ + if (!skrequired && !all_atts_equal) + break; + + /* + * If the binary search indicates that the key space for this tuple + * attribute value is > the key value from the final element in the + * array (final for the current scan direction), we handle it by + * wrapping around to the first element of the array. + * + * Wrapping around simplifies advancement with a multi-column index by + * allowing us to treat wrapping a column as advancing the column. We + * preserve the invariant that a required scan key's array may only be + * ratcheted forward (backwards when the scan direction is backwards), + * while still always being able to "advance" the array at this point.
+ */ + if (set_elem == final_elem_dir && + ((ScanDirectionIsForward(dir) && result > 0) || + (ScanDirectionIsBackward(dir) && result < 0))) + { + /* Perform wraparound */ + set_elem = first_elem_dir; + } + else if (skrequired) + { + /* Won't call _bt_advance_array_keys_to_end later */ + all_skrequired_atts_wrapped = false; + } + + Assert(set_elem >= 0 && set_elem < array->num_elems); + if (array->cur_elem != set_elem) + { + array->cur_elem = set_elem; + skeyarray->sk_argument = array->elem_values[set_elem]; + arrays_advanced = true; + + /* + * We shouldn't have to advance a required array when called due + * to _bt_check_compare determining that a non-required array + * needs to be advanced. We expect _bt_check_compare to notice + * and report required scan keys before non-required ones. + */ + Assert(skrequiredtrigger || !skrequired); + } + } + + if (!skrequiredtrigger) + { + /* + * Failing to satisfy a non-required array scan key shouldn't ever + * result in terminating the (primitive) index scan + */ + } + else if (all_skrequired_atts_wrapped) + { + /* + * The binary searches for each tuple's attribute value in the scan + * key's corresponding SK_SEARCHARRAY array all found that the tuple's + * values are "past the end" of the key space covered by each array + */ + _bt_advance_array_keys_to_end(scan, dir); + arrays_done = true; + all_atts_equal = false; /* at least not now */ + } + else if (!arrays_advanced) + { + /* + * We must always advance the array keys by at least one increment + * (except when called to advance a non-required scan key's array). + * + * We need this fallback for cases where the existing array keys and + * existing required equal-strategy scan keys were fully equal to the + * tuple. _bt_check_compare may have set continuescan=false due to an + * inequality terminating the scan, which we don't deal with directly. + * (See function's header comments for an example.)
+ */ + if (_bt_advance_array_keys_increment(scan, dir)) + arrays_advanced = true; + else + arrays_done = true; + all_atts_equal = false; /* at least not now */ + } + + /* + * Might make sense to recheck the high key later on in cases where we + * just advanced the keys (unless we were just called to advance the + * scan's non-required array keys) + */ + if (arrays_advanced && skrequiredtrigger) + pstate->highkeychecked = false; + + /* + * If we changed the array keys without exhausting all array keys then we + * need to preprocess our search-type scan keys once more + */ + Assert(skrequiredtrigger || !arrays_done); + if (arrays_advanced && !arrays_done) + { + /* + * XXX Think about buffer-lock-held hazards here some more. + * + * In almost all interesting cases we only really need to copy over + * the array values (from "so->arrayKeyData" to "so->keyData"). But + * there are at least some cases where performing the full set of push + * ups here (or close to it) might add value over just doing it for + * the main _bt_first call. + */ + _bt_preprocess_keys(scan); + } + + /* Are we now done with the top-level scan (barring a btrescan)? */ + Assert(!so->needPrimScan); + if (!so->qual_ok) + { + /* + * Increment array keys and start a new primitive index scan if + * _bt_preprocess_keys() discovered that the scan keys can never be + * satisfied (eg, x == 2 AND x in (1, 2, 3) for array keys 1 and 3). + * + * Note: There is similar handling in _bt_array_keys_remain, which + * must advance the array keys without consulting us in this one case.
+ */ + Assert(skrequiredtrigger); + + pstate->continuescan = false; + pstate->highkeychecked = true; + all_atts_equal = false; /* at least not now */ + + if (_bt_advance_array_keys_increment(scan, dir)) + so->needPrimScan = true; + } + else if (!skrequiredtrigger) + { + /* Not when we failed to satisfy a non-required scan key, ever */ + Assert(!arrays_done); + pstate->continuescan = true; + } + else if (arrays_done) + { + /* + * Yep -- this primitive scan was our last + */ + Assert(!all_atts_equal); + pstate->continuescan = false; + } + else if (!all_atts_equal) + { + /* + * Not done. The top-level index scan (and primitive index scan) will + * continue, since the array keys advanced. + */ + Assert(arrays_advanced); + pstate->continuescan = true; + + /* + * Some required array keys might have wrapped around during this + * call, but it can't have been the most significant array scan key. + */ + Assert(!all_skrequired_atts_wrapped); + } + else + { + /* + * Not done. A second call to _bt_check_compare must now take place. + * It will make the final decision on setting continuescan. + */ + } + + return all_atts_equal; +} + +/* + * Advance the array keys by a single increment in the current scan direction + */ +static bool +_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir) +{ + BTScanOpaque so = (BTScanOpaque) scan->opaque; + bool found = false; + int i; + + Assert(!so->needPrimScan); + + /* + * We must advance the last array key most quickly, since it will + * correspond to the lowest-order index column among the available + * qualifications. This is necessary to ensure correct ordering of output + * when there are multiple array keys. 
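As a side note for reviewers, the incremental advancement that _bt_advance_array_keys_increment retains from the old code behaves like a mixed-radix odometer over the arrays. A standalone sketch (hypothetical names, forward-scan case only, not the patch's code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical odometer model of incremental array key advancement.
 * cur[i] is array i's current element index and nelems[i] its size, with
 * arrays ordered from most significant (0) to least significant
 * (narrays - 1).  The least significant array advances fastest; a wrap
 * "carries" into the next more significant array.  Returns false once
 * every array has wrapped, i.e. no array key combinations remain.
 */
static bool
advance_arrays_increment_sketch(int *cur, const int *nelems, int narrays)
{
    for (int i = narrays - 1; i >= 0; i--)
    {
        if (++cur[i] < nelems[i])
            return true;        /* advanced without wrapping */
        cur[i] = 0;             /* wrapped; carry into next array */
    }
    return false;               /* all arrays wrapped: keys exhausted */
}
```

With two arrays of 2 and 3 elements, the sketch yields the expected 2 * 3 = 6 combinations before reporting exhaustion, matching the "advance the last array key most quickly" ordering requirement.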
+ */ + for (i = so->numArrayKeys - 1; i >= 0; i--) + { + BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i]; + ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key]; + int cur_elem = curArrayKey->cur_elem; + int num_elems = curArrayKey->num_elems; + + if (ScanDirectionIsBackward(dir)) + { + if (--cur_elem < 0) + { + cur_elem = num_elems - 1; + found = false; /* need to advance next array key */ + } + else + found = true; + } + else + { + if (++cur_elem >= num_elems) + { + cur_elem = 0; + found = false; /* need to advance next array key */ + } + else + found = true; + } + + curArrayKey->cur_elem = cur_elem; + skey->sk_argument = curArrayKey->elem_values[cur_elem]; + if (found) + break; + } + + return found; +} + +/* + * Perform final steps when the "end point" is reached on the leaf level + * without any call to _bt_checkkeys setting *continuescan to false. + */ +static void +_bt_advance_array_keys_to_end(IndexScanDesc scan, ScanDirection dir) +{ + BTScanOpaque so = (BTScanOpaque) scan->opaque; + + Assert(so->numArrayKeys); + Assert(!so->needPrimScan); + + for (int i = 0; i < so->numArrayKeys; i++) + { + BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i]; + ScanKey skey = &so->arrayKeyData[curArrayKey->scan_key]; + int reset_elem; + + if (ScanDirectionIsForward(dir)) + reset_elem = curArrayKey->num_elems - 1; + else + reset_elem = 0; + + if (curArrayKey->cur_elem != reset_elem) + { + curArrayKey->cur_elem = reset_elem; + skey->sk_argument = curArrayKey->elem_values[reset_elem]; + } + } +} /* * _bt_preprocess_keys() -- Preprocess scan keys @@ -1360,38 +2293,204 @@ _bt_mark_scankey_required(ScanKey skey) * * Return true if so, false if not. If the tuple fails to pass the qual, * we also determine whether there's any need to continue the scan beyond - * this tuple, and set *continuescan accordingly. See comments for + * this tuple, and set pstate.continuescan accordingly. See comments for * _bt_preprocess_keys(), above, about how this is done. 
 *
- * Forward scan callers can pass a high key tuple in the hopes of having
- * us set *continuescan to false, and avoiding an unnecessary visit to
- * the page to the right.
+ * Forward scan callers can pass a high key tuple in the hopes of having us
+ * set pstate.continuescan to false, and avoiding an unnecessary visit to the
+ * page to the right.
+ *
+ * Forward scan callers with equality type array scan keys are obligated to
+ * set up page state in a way that makes it possible for us to check the high
+ * key early, before we've expended too much effort on comparing tuples that
+ * cannot possibly be matches for any set of array keys.  This is just an
+ * optimization.
+ *
+ * Advances the current set of array keys for SK_SEARCHARRAY scans where
+ * appropriate.  These callers are required to initialize the page level high
+ * key in pstate before the first call here for the page (when the scan
+ * direction is forwards).  Note that we rely on _bt_readpage calling here in
+ * page offset number order (for its scan direction).  Any other order will
+ * lead to inconsistent array key state.
 *
 * scan: index scan descriptor (containing a search-type scankey)
+ * pstate: Page level input and output parameters
 * tuple: index tuple to test
- * tupnatts: number of attributes in tupnatts (high key may be truncated)
- * dir: direction we are scanning in
- * continuescan: output parameter (will be set correctly in all cases)
+ * finaltup: Is tuple the final one we'll be called with for this page?
 */
 bool
-_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
-              ScanDirection dir, bool *continuescan)
+_bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate,
+              IndexTuple tuple, bool finaltup)
+{
+    TupleDesc   tupdesc = RelationGetDescr(scan->indexRelation);
+    int         natts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
+    BTScanOpaque so = (BTScanOpaque) scan->opaque;
+    bool        res;
+    bool        skrequiredtrigger;
+
+    Assert(so->qual_ok);
+    Assert(pstate->continuescan);
+    Assert(!so->needPrimScan);
+
+    res = _bt_check_compare(pstate->dir, so->keyData, so->numberOfKeys,
+                            tuple, natts, tupdesc,
+                            &pstate->continuescan, &skrequiredtrigger);
+
+    /*
+     * Only one _bt_check_compare call is required in the common case where
+     * there are no equality-type array scan keys.
+     *
+     * When there are array scan keys then we can still accept the first
+     * answer we get from _bt_check_compare when continuescan wasn't unset.
+     */
+    if (!so->numArrayKeys || pstate->continuescan)
+        return res;
+
+    /*
+     * _bt_check_compare set continuescan=false in the presence of equality
+     * type array keys.  It's possible that we haven't reached the start of
+     * the array keys just yet.  It's also possible that we need to advance
+     * the array keys now.  (Or perhaps we really do need to terminate the
+     * top-level scan.)
+     */
+    pstate->continuescan = true;    /* new initial assumption */
+
+    if (skrequiredtrigger && _bt_tuple_before_array_skeys(scan, pstate, tuple))
+    {
+        /*
+         * Tuple is still < the current array scan key values (as well as
+         * other equality type scan keys) if this is a forward scan.
+         * (Backwards scans reach here with a tuple > equality constraints.)
+         * We must now consider how to proceed with the ongoing primitive
+         * index scan.
+         *
+         * Should _bt_readpage continue with this page for now, in the hope of
+         * finding tuples whose key space is covered by the current array keys
+         * before too long?  Or, should it give up and start a new primitive
+         * index scan instead?
+         *
+         * Our policy is to terminate the primitive index scan at the end of
+         * the current page if the current (most recently advanced) array keys
+         * don't cover the final tuple from the page.  This policy is fairly
+         * conservative.
+         *
+         * Note: In some cases we're effectively speculating that the next
+         * sibling leaf page will have tuples that are covered by the key
+         * space of our array keys (the current set or some nearby set), based
+         * on a cue from the current page's final tuple.  There is at least a
+         * non-zero risk of wasting a page access -- we could gamble and lose.
+         * The details of all this are handled within _bt_advance_array_keys.
+         */
+        if (finaltup || (!pstate->highkeychecked && pstate->highkey &&
+                         _bt_tuple_before_array_skeys(scan, pstate,
+                                                      pstate->highkey)))
+        {
+            /*
+             * This is the final tuple (the high key for forward scans, or the
+             * tuple at the first offset number for backward scans), but it is
+             * still before the current array keys.  As such, we're unwilling
+             * to allow the current primitive index scan to continue to the
+             * next leaf page.
+             *
+             * Start a new primitive index scan.  The next primitive index
+             * scan (in the next _bt_first call) is expected to reposition the
+             * scan to some much later leaf page.  (If we had a good reason to
+             * think that the next leaf page that will be scanned will turn
+             * out to be close to our current position, then we wouldn't be
+             * starting another primitive index scan.)
+             *
+             * Note: _bt_readpage stashes the page high key, which allows us
+             * to make this check early (for forward scans).  We thereby avoid
+             * scanning very many extra tuples on the page.  This is just an
+             * optimization; skipping these useless comparisons should never
+             * change our final conclusion about what the scan should do next.
+             */
+            pstate->continuescan = false;
+            so->needPrimScan = true;
+        }
+        else if (!finaltup && pstate->highkey)
+        {
+            /*
+             * Remember that the high key has been checked with this
+             * particular set of array keys.
+             *
+             * It might make sense to check the same high key again at some
+             * point during the ongoing _bt_readpage-wise scan of this page.
+             * But it is definitely wasteful to repeat the same high key check
+             * before the array keys are advanced by some later tuple.
+             */
+            pstate->highkeychecked = true;
+        }
+
+        /*
+         * In any case, this indextuple doesn't match the qual
+         */
+        return false;
+    }
+
+    /*
+     * Caller's tuple is >= the current set of array keys and other equality
+     * constraint scan keys (or <= if this is a backwards scan).
+     *
+     * It might be time to advance the array keys to the next set.  Try doing
+     * that now, while determining in passing if the tuple matches the newly
+     * advanced set of array keys (if we've any left).
+     *
+     * This call will also set continuescan for us (or tell us to perform
+     * another _bt_check_compare call, which then sets continuescan for us).
+     */
+    if (!_bt_advance_array_keys(scan, pstate, tuple, skrequiredtrigger))
+    {
+        /*
+         * Tuple doesn't match any later array keys, either (for one or more
+         * array type scan keys marked as required).  Give up on this tuple
+         * being a match.  (Call may have also terminated the primitive scan,
+         * or the top-level scan.)
+         */
+        return false;
+    }
+
+    /*
+     * Advanced array keys to values that are exact matches for corresponding
+     * attribute values from the tuple.
+     *
+     * It's fairly likely that the tuple satisfies all index scan conditions
+     * at this point, but we need confirmation of that.  We also need to give
+     * _bt_check_compare a real opportunity to end the top-level index scan by
+     * setting continuescan=false.  (_bt_advance_array_keys cannot deal with
+     * inequality strategy scan keys; we need _bt_check_compare for those.)
+     */
+    return _bt_check_compare(pstate->dir, so->keyData, so->numberOfKeys,
+                             tuple, natts, tupdesc,
+                             &pstate->continuescan, &skrequiredtrigger);
+}
+
+/*
+ * Test whether an indextuple satisfies current scan condition.
+ *
+ * Return true if so, false if not.  If not, also clear *continuescan if
+ * it's not possible for any future tuples in the current scan direction to
+ * pass the qual with the current set of array keys.
+ *
+ * This is a subroutine for _bt_checkkeys.  It is written with the assumption
+ * that reaching the end of each distinct set of array keys terminates the
+ * ongoing primitive index scan.  It is up to our caller (that has more
+ * context than we have available here) to override that initial determination
+ * when it makes more sense to advance the array keys and continue with
+ * further tuples from the same leaf page.
+ */
+static bool
+_bt_check_compare(ScanDirection dir, ScanKey keyData, int keysz,
+                  IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
+                  bool *continuescan, bool *skrequiredtrigger)
 {
-    TupleDesc   tupdesc;
-    BTScanOpaque so;
-    int         keysz;
     int         ikey;
     ScanKey     key;
 
-    Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
-
     *continuescan = true;       /* default assumption */
+    *skrequiredtrigger = true;  /* default assumption */
 
-    tupdesc = RelationGetDescr(scan->indexRelation);
-    so = (BTScanOpaque) scan->opaque;
-    keysz = so->numberOfKeys;
-
-    for (key = so->keyData, ikey = 0; ikey < keysz; key++, ikey++)
+    for (key = keyData, ikey = 0; ikey < keysz; key++, ikey++)
     {
         Datum       datum;
         bool        isNull;
@@ -1512,6 +2611,10 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
         * qual fails, it is critical that equality quals be used for the
         * initial positioning in _bt_first() when they are available.  See
         * comments in _bt_first().
+        *
+        * Scans with equality-type array scan keys run into a similar
+        * problem whenever they advance the array keys.  Our caller uses
+        * _bt_tuple_before_array_skeys to avoid the problem there.
         */
        if ((key->sk_flags & SK_BT_REQFWD) &&
            ScanDirectionIsForward(dir))
@@ -1520,6 +2623,14 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
                 ScanDirectionIsBackward(dir))
            *continuescan = false;
 
+       if ((key->sk_flags & SK_SEARCHARRAY) &&
+           key->sk_strategy == BTEqualStrategyNumber)
+       {
+           if (*continuescan)
+               *skrequiredtrigger = false;
+           *continuescan = false;
+       }
+
        /*
         * In any case, this indextuple doesn't match the qual.
         */
@@ -1538,7 +2649,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 * it's not possible for any future tuples in the current scan direction
 * to pass the qual.
 *
- * This is a subroutine for _bt_checkkeys, which see for more info.
+ * This is a subroutine for _bt_check_compare/_bt_checkkeys_compare.
 */
 static bool
 _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 6a93d767a..f04ca1ee9 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -106,8 +106,7 @@ static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
                   IndexOptInfo *index, IndexClauseSet *clauses,
                   bool useful_predicate,
                   ScanTypeControl scantype,
-                  bool *skip_nonnative_saop,
-                  bool *skip_lower_saop);
+                  bool *skip_nonnative_saop);
 static List *build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel,
                 List *clauses, List *other_clauses);
 static List *generate_bitmap_or_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -706,8 +705,6 @@ eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
 * index AM supports them natively, we should just include them in simple
 * index paths.  If not, we should exclude them while building simple index
 * paths, and then make a separate attempt to include them in bitmap paths.
- * Furthermore, we should consider excluding lower-order ScalarArrayOpExpr
- * quals so as to create ordered paths.
 */
 static void
 get_index_paths(PlannerInfo *root, RelOptInfo *rel,
@@ -716,37 +713,17 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 {
     List       *indexpaths;
     bool        skip_nonnative_saop = false;
-    bool        skip_lower_saop = false;
     ListCell   *lc;
 
     /*
      * Build simple index paths using the clauses.  Allow ScalarArrayOpExpr
-     * clauses only if the index AM supports them natively, and skip any such
-     * clauses for index columns after the first (so that we produce ordered
-     * paths if possible).
+     * clauses only if the index AM supports them natively.
      */
     indexpaths = build_index_paths(root, rel,
                                    index, clauses,
                                    index->predOK,
                                    ST_ANYSCAN,
-                                   &skip_nonnative_saop,
-                                   &skip_lower_saop);
-
-    /*
-     * If we skipped any lower-order ScalarArrayOpExprs on an index with an AM
-     * that supports them, then try again including those clauses.  This will
-     * produce paths with more selectivity but no ordering.
-     */
-    if (skip_lower_saop)
-    {
-        indexpaths = list_concat(indexpaths,
-                                 build_index_paths(root, rel,
-                                                   index, clauses,
-                                                   index->predOK,
-                                                   ST_ANYSCAN,
-                                                   &skip_nonnative_saop,
-                                                   NULL));
-    }
+                                   &skip_nonnative_saop);
 
     /*
      * Submit all the ones that can form plain IndexScan plans to add_path. (A
@@ -784,7 +761,6 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
                                            index, clauses,
                                            false,
                                            ST_BITMAPSCAN,
-                                           NULL,
                                            NULL);
         *bitindexpaths = list_concat(*bitindexpaths, indexpaths);
     }
@@ -817,27 +793,19 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 * to true if we found any such clauses (caller must initialize the variable
 * to false).  If it's NULL, we do not ignore ScalarArrayOpExpr clauses.
 *
- * If skip_lower_saop is non-NULL, we ignore ScalarArrayOpExpr clauses for
- * non-first index columns, and we set *skip_lower_saop to true if we found
- * any such clauses (caller must initialize the variable to false).  If it's
- * NULL, we do not ignore non-first ScalarArrayOpExpr clauses, but they will
- * result in considering the scan's output to be unordered.
- *
 * 'rel' is the index's heap relation
 * 'index' is the index for which we want to generate paths
 * 'clauses' is the collection of indexable clauses (IndexClause nodes)
 * 'useful_predicate' indicates whether the index has a useful predicate
 * 'scantype' indicates whether we need plain or bitmap scan support
 * 'skip_nonnative_saop' indicates whether to accept SAOP if index AM doesn't
- * 'skip_lower_saop' indicates whether to accept non-first-column SAOP
 */
 static List *
 build_index_paths(PlannerInfo *root, RelOptInfo *rel,
                   IndexOptInfo *index, IndexClauseSet *clauses,
                   bool useful_predicate,
                   ScanTypeControl scantype,
-                  bool *skip_nonnative_saop,
-                  bool *skip_lower_saop)
+                  bool *skip_nonnative_saop)
 {
     List       *result = NIL;
     IndexPath  *ipath;
@@ -848,7 +816,6 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
     List       *orderbyclausecols;
     List       *index_pathkeys;
     List       *useful_pathkeys;
-    bool        found_lower_saop_clause;
     bool        pathkeys_possibly_useful;
     bool        index_is_ordered;
     bool        index_only_scan;
@@ -880,19 +847,11 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
     * on by btree and possibly other places.)  The list can be empty, if the
     * index AM allows that.
     *
-     * found_lower_saop_clause is set true if we accept a ScalarArrayOpExpr
-     * index clause for a non-first index column.  This prevents us from
-     * assuming that the scan result is ordered.  (Actually, the result is
-     * still ordered if there are equality constraints for all earlier
-     * columns, but it seems too expensive and non-modular for this code to be
-     * aware of that refinement.)
-     *
     * We also build a Relids set showing which outer rels are required by the
     * selected clauses.  Any lateral_relids are included in that, but not
     * otherwise accounted for.
     */
    index_clauses = NIL;
-   found_lower_saop_clause = false;
    outer_relids = bms_copy(rel->lateral_relids);
    for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
    {
@@ -917,16 +876,6 @@
                /* Caller had better intend this only for bitmap scan */
                Assert(scantype == ST_BITMAPSCAN);
            }
-           if (indexcol > 0)
-           {
-               if (skip_lower_saop)
-               {
-                   /* Caller doesn't want to lose index ordering */
-                   *skip_lower_saop = true;
-                   continue;
-               }
-               found_lower_saop_clause = true;
-           }
        }
 
        /* OK to include this clause */
@@ -956,11 +905,9 @@
    /*
     * 2. Compute pathkeys describing index's ordering, if any, then see how
     * many of them are actually useful for this query.  This is not relevant
-    * if we are only trying to build bitmap indexscans, nor if we have to
-    * assume the scan is unordered.
+    * if we are only trying to build bitmap indexscans.
     */
    pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
-                               !found_lower_saop_clause &&
                                has_useful_pathkeys(root, rel));
    index_is_ordered = (index->sortopfamily != NULL);
    if (index_is_ordered && pathkeys_possibly_useful)
@@ -1212,7 +1159,6 @@
                                        index, &clauseset,
                                        useful_predicate,
                                        ST_BITMAPSCAN,
-                                       NULL,
                                        NULL);
        result = list_concat(result, indexpaths);
    }
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076..c796b53a6 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6444,8 +6444,6 @@ genericcostestimate(PlannerInfo *root,
    double      numIndexTuples;
    double      spc_random_page_cost;
    double      num_sa_scans;
-   double      num_outer_scans;
-   double      num_scans;
    double      qual_op_cost;
    double      qual_arg_cost;
    List       *selectivityQuals;
@@ -6460,7 +6458,7 @@ genericcostestimate(PlannerInfo *root,
    /*
     * Check for ScalarArrayOpExpr index quals, and estimate the number of
-    * index scans that will be performed.
+    * primitive index scans that will be performed for caller
     */
    num_sa_scans = 1;
    foreach(l, indexQuals)
@@ -6490,19 +6488,8 @@
     */
    numIndexTuples = costs->numIndexTuples;
    if (numIndexTuples <= 0.0)
-   {
        numIndexTuples = indexSelectivity * index->rel->tuples;
 
-       /*
-        * The above calculation counts all the tuples visited across all
-        * scans induced by ScalarArrayOpExpr nodes.  We want to consider the
-        * average per-indexscan number, so adjust.  This is a handy place to
-        * round to integer, too.  (If caller supplied tuple estimate, it's
-        * responsible for handling these considerations.)
-        */
-       numIndexTuples = rint(numIndexTuples / num_sa_scans);
-   }
-
    /*
     * We can bound the number of tuples by the index size in any case. Also,
     * always estimate at least one tuple is touched, even when
@@ -6540,27 +6527,31 @@
     *
     * The above calculations are all per-index-scan.  However, if we are in a
     * nestloop inner scan, we can expect the scan to be repeated (with
-    * different search keys) for each row of the outer relation.  Likewise,
-    * ScalarArrayOpExpr quals result in multiple index scans.  This creates
-    * the potential for cache effects to reduce the number of disk page
-    * fetches needed.  We want to estimate the average per-scan I/O cost in
-    * the presence of caching.
+    * different search keys) for each row of the outer relation.  This
+    * creates the potential for cache effects to reduce the number of disk
+    * page fetches needed.  We want to estimate the average per-scan I/O cost
+    * in the presence of caching.
     *
     * We use the Mackert-Lohman formula (see costsize.c for details) to
     * estimate the total number of page fetches that occur.  While this
     * wasn't what it was designed for, it seems a reasonable model anyway.
     * Note that we are counting pages not tuples anymore, so we take N = T =
     * index size, as if there were one "tuple" per page.
+    *
+    * Note: we assume that there will be no repeat index page fetches across
+    * ScalarArrayOpExpr primitive scans from the same logical index scan.
+    * This is guaranteed to be true for btree indexes, but is very optimistic
+    * with index AMs that cannot natively execute ScalarArrayOpExpr quals.
+    * However, these same index AMs also accept our default pessimistic
+    * approach to counting num_sa_scans (btree caller caps this), so we don't
+    * expect the final indexTotalCost to be wildly over-optimistic.
     */
-   num_outer_scans = loop_count;
-   num_scans = num_sa_scans * num_outer_scans;
-
-   if (num_scans > 1)
+   if (loop_count > 1)
    {
        double      pages_fetched;
 
        /* total page fetches ignoring cache effects */
-       pages_fetched = numIndexPages * num_scans;
+       pages_fetched = numIndexPages * loop_count;
 
        /* use Mackert and Lohman formula to adjust for cache effects */
        pages_fetched = index_pages_fetched(pages_fetched,
@@ -6570,11 +6561,9 @@
        /*
         * Now compute the total disk access cost, and then report a pro-rated
-        * share for each outer scan.  (Don't pro-rate for ScalarArrayOpExpr,
-        * since that's internal to the indexscan.)
+        * share for each outer scan
         */
-       indexTotalCost = (pages_fetched * spc_random_page_cost)
-           / num_outer_scans;
+       indexTotalCost = (pages_fetched * spc_random_page_cost) / loop_count;
    }
    else
    {
@@ -6590,10 +6579,8 @@
    * evaluated once at the start of the scan to reduce them to runtime keys
    * to pass to the index AM (see nodeIndexscan.c).  We model the per-tuple
    * CPU costs as cpu_index_tuple_cost plus one cpu_operator_cost per
-   * indexqual operator.  Because we have numIndexTuples as a per-scan
-   * number, we have to multiply by num_sa_scans to get the correct result
-   * for ScalarArrayOpExpr cases.  Similarly add in costs for any index
-   * ORDER BY expressions.
+   * indexqual operator.  Similarly add in costs for any index ORDER BY
+   * expressions.
    *
    * Note: this neglects the possible costs of rechecking lossy operators.
    * Detecting that that might be needed seems more expensive than it's
@@ -6606,7 +6593,7 @@
    indexStartupCost = qual_arg_cost;
    indexTotalCost += qual_arg_cost;
 
-   indexTotalCost += numIndexTuples * num_sa_scans * (cpu_index_tuple_cost + qual_op_cost);
+   indexTotalCost += numIndexTuples * (cpu_index_tuple_cost + qual_op_cost);
 
    /*
     * Generic assumption about index correlation: there isn't any.
@@ -6684,7 +6671,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
    bool        eqQualHere;
    bool        found_saop;
    bool        found_is_null_op;
-   double      num_sa_scans;
    ListCell   *lc;
 
    /*
@@ -6699,17 +6685,12 @@
    *
    * For a RowCompareExpr, we consider only the first column, just as
    * rowcomparesel() does.
-   *
-   * If there's a ScalarArrayOpExpr in the quals, we'll actually perform N
-   * index scans not one, but the ScalarArrayOpExpr's operator can be
-   * considered to act the same as it normally does.
    */
    indexBoundQuals = NIL;
    indexcol = 0;
    eqQualHere = false;
    found_saop = false;
    found_is_null_op = false;
-   num_sa_scans = 1;
    foreach(lc, path->indexclauses)
    {
        IndexClause *iclause = lfirst_node(IndexClause, lc);
@@ -6749,14 +6730,9 @@
            else if (IsA(clause, ScalarArrayOpExpr))
            {
                ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
-               Node       *other_operand = (Node *) lsecond(saop->args);
-               int         alength = estimate_array_length(other_operand);
 
                clause_op = saop->opno;
                found_saop = true;
-               /* count number of SA scans induced by indexBoundQuals only */
-               if (alength > 1)
-                   num_sa_scans *= alength;
            }
            else if (IsA(clause, NullTest))
            {
@@ -6805,9 +6781,9 @@
        Selectivity btreeSelectivity;
 
        /*
-        * If the index is partial, AND the index predicate with the
-        * index-bound quals to produce a more accurate idea of the number of
-        * rows covered by the bound conditions.
+        * AND the index predicate with the index-bound quals to produce a
+        * more accurate idea of the number of rows covered by the bound
+        * conditions
         */
        selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
 
@@ -6816,13 +6792,6 @@
                                                  JOIN_INNER,
                                                  NULL);
        numIndexTuples = btreeSelectivity * index->rel->tuples;
-
-       /*
-        * As in genericcostestimate(), we have to adjust for any
-        * ScalarArrayOpExpr quals included in indexBoundQuals, and then round
-        * to integer.
-        */
-       numIndexTuples = rint(numIndexTuples / num_sa_scans);
    }
 
    /*
@@ -6832,6 +6801,43 @@
    genericcostestimate(root, path, loop_count, &costs);
 
+   /*
+    * Now compensate for btree's ability to efficiently execute scans with
+    * SAOP clauses.
+    *
+    * btree automatically combines individual ScalarArrayOpExpr primitive
+    * index scans whenever the tuples covered by the next set of array keys
+    * are close to tuples covered by the current set.  This makes the final
+    * number of descents particularly difficult to estimate.  However, btree
+    * scans never visit any single leaf page more than once.  That puts a
+    * natural floor under the worst case number of descents.
+    *
+    * It's particularly important that we not wildly overestimate the number
+    * of descents needed for a clause list with several SAOPs -- the costs
+    * really aren't multiplicative in the way genericcostestimate expects. In
+    * general, most distinct combinations of SAOP keys will tend to not find
+    * any matching tuples.  Furthermore, btree scans search for the next set
+    * of array keys using the next tuple in line, and so won't even need a
+    * direct comparison to eliminate most non-matching sets of array keys.
+    *
+    * Clamp the number of descents to the estimated number of leaf page
+    * visits.  This is still fairly pessimistic, but tends to result in more
+    * accurate costing of scans with several SAOP clauses -- especially when
+    * each array has more than a few elements.  The cost of adding additional
+    * array constants to a low-order SAOP column should saturate past a
+    * certain point (except where selectivity estimates continue to shift).
+    *
+    * Also clamp the number of descents to 1/3 the number of index pages.
+    * This avoids implausibly high estimates with low selectivity paths,
+    * where scans frequently require no more than one or two descents.
+    */
+   if (costs.num_sa_scans > 1)
+   {
+       costs.num_sa_scans = Min(costs.num_sa_scans, costs.numIndexPages);
+       costs.num_sa_scans = Min(costs.num_sa_scans, index->pages / 3);
+       costs.num_sa_scans = Max(costs.num_sa_scans, 1);
+   }
+
    /*
     * Add a CPU-cost component to represent the costs of initial btree
     * descent.  We don't charge any I/O cost for touching upper btree levels,
@@ -6839,9 +6845,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
    * comparisons to descend a btree of N leaf tuples.  We charge one
    * cpu_operator_cost per comparison.
    *
-   * If there are ScalarArrayOpExprs, charge this once per SA scan.  The
-   * ones after the first one are not startup cost so far as the overall
-   * plan is concerned, so add them only to "total" cost.
+   * If there are ScalarArrayOpExprs, charge this once per estimated
+   * primitive SA scan.  The ones after the first one are not startup cost
+   * so far as the overall plan goes, so just add them to "total" cost.
    */
    if (index->tuples > 1)      /* avoid computing log(0) */
    {
@@ -6858,7 +6864,8 @@
    * in cases where only a single leaf page is expected to be visited.  This
    * cost is somewhat arbitrarily set at 50x cpu_operator_cost per page
    * touched.  The number of such pages is btree tree height plus one (ie,
-   * we charge for the leaf page too).  As above, charge once per SA scan.
+   * we charge for the leaf page too).  As above, charge once per estimated
+   * primitive SA scan.
    */
    descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost;
    costs.indexStartupCost += descentCost;
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9c4930e9a..a431a7543 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4005,6 +4005,19 @@ description | Waiting for a newly initialized WAL file to reach durable storage
    </para>
   </note>
 
+  <note>
+   <para>
+    Every time an index is searched, the index's
+    <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield>
+    field is incremented.  This usually happens once per index scan node
+    execution, but might take place several times during execution of a scan
+    that searches for multiple values together.  Only queries that use certain
+    <acronym>SQL</acronym> constructs to search for rows matching any value
+    out of a list (or an array) of multiple scalar values are affected.  See
+    <xref linkend="functions-comparisons"/> for details.
+   </para>
+  </note>
+
 </sect2>
 
 <sect2 id="monitoring-pg-statio-all-tables-view">
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4..84c068ae3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1910,7 +1910,7 @@ SELECT count(*) FROM dupindexcols
 (1 row)
 
 --
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
 --
 explain (costs off)
 SELECT unique1 FROM tenk1
@@ -1936,12 +1936,11 @@ explain (costs off)
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
-                       QUERY PLAN                        
--------------------------------------------------------
+                                   QUERY PLAN                                   
+--------------------------------------------------------------------------------
  Index Only Scan using tenk1_thous_tenthous on tenk1
-   Index Cond: (thousand < 2)
-   Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+   Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
 
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1952,18 +1951,35 @@ ORDER BY thousand;
        1 |     1001
 (2 rows)
 
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+                                   QUERY PLAN                                   
+--------------------------------------------------------------------------------
+ Index Only Scan Backward using tenk1_thous_tenthous on tenk1
+   Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous 
+----------+----------
+        1 |     1001
+        0 |     3000
+(2 rows)
+
 SET enable_indexonlyscan = OFF;
 explain (costs off)
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
-                                      QUERY PLAN                                      
---------------------------------------------------------------------------------------
- Sort
-   Sort Key: thousand
-   ->  Index Scan using tenk1_thous_tenthous on tenk1
-         Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
-(4 rows)
+                                   QUERY PLAN                                   
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+   Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
 
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1974,6 +1990,25 @@ ORDER BY thousand;
        1 |     1001
 (2 rows)
 
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+                                   QUERY PLAN                                   
+--------------------------------------------------------------------------------
+ Index Scan Backward using tenk1_thous_tenthous on tenk1
+   Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+ thousand | tenthous 
+----------+----------
+        1 |     1001
+        0 |     3000
+(2 rows)
+
 RESET enable_indexonlyscan;
 --
 -- Check elimination of constant-NULL subexpressions
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 9b8638f28..20b69ff87 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -7797,10 +7797,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]);
    Merge Cond: (j1.id1 = j2.id1)
    Join Filter: (j2.id2 = j1.id2)
    ->  Index Scan using j1_id1_idx on j1
-   ->  Index Only Scan using j2_pkey on j2
+   ->  Index Scan using j2_id1_idx on j2
          Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
-         Filter: ((id1 % 1000) = 1)
-(7 rows)
+(6 rows)
 
 select * from j1
 inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f30..41b955a27 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -753,7 +753,7 @@ SELECT count(*) FROM dupindexcols
   WHERE f1 BETWEEN 'WA' AND 'ZZZ' and id < 1000 and f1 ~<~ 'YX';
 
 --
--- Check ordering of =ANY indexqual results (bug in 9.2.0)
+-- Check that index scans with =ANY indexquals return rows in index order
 --
 
 explain (costs off)
@@ -774,6 +774,15 @@ SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
 
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
 SET enable_indexonlyscan = OFF;
 
 explain (costs off)
@@ -785,6 +794,15 @@ SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
 
+explain (costs off)
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
+SELECT thousand, tenthous FROM tenk1
+WHERE thousand < 2 AND tenthous IN (1001,3000)
+ORDER BY thousand DESC, tenthous DESC;
+
 RESET enable_indexonlyscan;
 --
-- 
2.40.1