Having nothing better to do over the holiday weekend, I decided to
pursue a number of ideas for improving performance that I thought
about a long time ago. These include:

* Pre-fetching list node pointers. This looks to be moderately
promising, but I'm certainly not going to be the one to land it, given
present obligations. Stephen Frost may wish to pick it up, given his
previous interest in the matter. This is slightly controversial,
because it uses a GCC intrinsic (__builtin_prefetch), but also because
the Linux kernel removed this optimization from their generic list
data structure [1]. However, that list was what we'd call an embedded
list, so we probably shouldn't be totally deterred. The amount of
effort that I put into this was, frankly, pretty low. A motivated
person, willing to do the appropriate analysis, could probably take it
further. For one thing, every single foreach() has a call to this
intrinsic, even where the list doesn't store pointers (which isn't
undefined behavior, but is pointless). At the very least that's going
to bloat things up, frequently for no conceivable gain, and yet with
the patch applied we're still able to see quite tangible benefits,
even if it isn't exactly a stellar improvement. I have an idea that
prefetching the last element at the start of the loop could be much
better than what I did, because we know that those lists are mostly
pretty small in practice, and that's likely to help pipelining -
prefetching too late (or even too early) makes the optimization
useless, because you may still get a cache miss. A minimal sketch of
the basic prefetching idea appears after this list.

* Optimizing index scans - I noticed that binary searching accounted
for many cache misses during a pgbench select.sql benchmark,
instrumented with "perf record -e cache-misses". This warranted
further investigation.
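
To make the prefetching idea concrete, here is a minimal,
self-contained sketch of what prefetching during a list walk looks
like in general terms. This is not the patch itself (which touches the
foreach()/ListCell machinery); the node layout and the prefetch
distance of one cell ahead are just assumptions made for illustration:

#include <stddef.h>

/* Simplified singly linked list cell, standing in for ListCell */
typedef struct Cell
{
	void	   *data;			/* payload pointer */
	struct Cell *next;
} Cell;

/*
 * Walk the list, hinting to the CPU that the next cell's payload will
 * be read soon.  __builtin_prefetch() is a GCC/Clang intrinsic, so it
 * would have to be compiled out on other compilers.
 */
static void
walk_list(Cell *head, void (*visit) (void *))
{
	Cell	   *cell;

	for (cell = head; cell != NULL; cell = cell->next)
	{
#if defined(__GNUC__) || defined(__clang__)
		if (cell->next != NULL)
			__builtin_prefetch(cell->next->data, 0, 1); /* read, low locality */
#endif
		visit(cell->data);
	}
}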

I won't say anything further about the former optimization, except to
note that it's included for comparative purposes in the set of
benchmarks I've run (I haven't included a patch). The benchmark
results are here:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/results

I took two approaches to the latter. This was the more interesting
piece of work. Test sets include:

* Master baseline (green)

* List optimization (as mentioned above, not really relevant to the
main topic of this mail) (red)

* "fib btree", earlier patch, please disregard (blue)

* "Fixed fib patch", Fibonacci search, no specialization (purple)

* The interesting one, Fibonacci search + specialization - "fib + no
jump" (turquoise, see below for details)

Initially, I had a little bit of success with Fibonacci search [2] in
place of binary search, in the hope that it would better take
advantage of CPU cache characteristics - Fibonacci search is said to
have advantages where non-uniform memory access is an issue, because
it minimizes the final search interval. I wasn't all that optimistic that it
would work that well given the smallish size of BLCKSZ relative to
modern CPU L1 cache sizes [3], but it did make an appreciable dent on
its own. I suppose old habits die hard, because next I hacked up
_bt_compare and had it do a btint4cmp directly, in the event of
encountering a scankey that had as its comparator the relevant pg_proc
oid. This is very much in the style (and the spirit) of the grotty
early draft patches for the inlining-comparators-for-sorting patch.
Patch is attached. This is a draft, a POC, posted only to facilitate
discussion and to allow my results to be independently
duplicated/verified. Note that there is a bug (attributable to the new
search code) that causes the regression tests to fail in exactly one
place (i.e. one line of difference). I didn't think it was worth
deferring discussion to deal with that, though, since I don't think it
undermines anything.
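
For reference, the basic shape of Fibonacci search over a sorted array
looks something like the sketch below. This is a textbook-style sketch
rather than the code in the attached patch (which has to deal with
OffsetNumbers, the nextkey convention and duplicate keys); it only
shows how the probe positions are driven by Fibonacci numbers instead
of by halving the interval:

/*
 * Return the index of key in sorted array a[0..n-1], or -1 if absent.
 */
static int
fib_search(const int *a, int n, int key)
{
	int		fib2 = 0;			/* (k-2)'th Fibonacci number */
	int		fib1 = 1;			/* (k-1)'th Fibonacci number */
	int		fibk = fib2 + fib1;	/* k'th Fibonacci number */
	int		offset = -1;		/* everything at or before offset is < key */

	/* Find the smallest Fibonacci number >= n */
	while (fibk < n)
	{
		fib2 = fib1;
		fib1 = fibk;
		fibk = fib2 + fib1;
	}

	while (fibk > 1)
	{
		/* Probe fib2 elements past the eliminated prefix, within bounds */
		int		i = (offset + fib2 < n - 1) ? offset + fib2 : n - 1;

		if (a[i] < key)
		{
			/* key is after i: move window right, drop one Fibonacci step */
			fibk = fib1;
			fib1 = fib2;
			fib2 = fibk - fib1;
			offset = i;
		}
		else if (a[i] > key)
		{
			/* key is before i: drop two Fibonacci steps */
			fibk = fib2;
			fib1 = fib1 - fib2;
			fib2 = fibk - fib1;
		}
		else
			return i;
	}

	/* One candidate element may remain */
	if (fib1 && offset + 1 < n && a[offset + 1] == key)
		return offset + 1;

	return -1;
}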

I'm not sure how valuable the comparator trick is if we stick with
binary search - I didn't test that. I'm sure it's a question that must
be considered, though.

I have a fairly huge amount of data here, having run plenty of
benchmarks over several nights. The short version is that the 'select'
benchmark has just over 18% greater throughput on this machine at some
client counts (in particular, when there are 4 clients - there are 4
cores, but 8 logical cores) with the attached patch. There is a 3.5%
regression with one client, which is certainly not accounted for by
noise. Note, however, that latency appears consistently better with
the patch applied. This is a machine running on dedicated hardware: a
4-core i7-4770. The working set easily fits in its 32GiB of DDR3 RAM
at all pgbench scales tested (1, 10 and 100). The kernel used is
"3.8.0-31-generic #46~precise1-Ubuntu SMP". Postgres settings are
typical for this kind of thing (6GiB shared_buffers), but you can
refer to my pgbench-tools results for full details (drill down to an
individual pgbench run for that - they're all the same). I'm kind of
curious as to what this benchmark would look like on a server with
many more cores.

I guess I could write a proper patch to have code setting up a scankey
also set a flag that indicated that it was acceptable to assume that
the special built-in comparator would do fine. I don't think that
anything as involved as the sortsupport infrastructure is required
here, because that presumed (perhaps questionably) that it was widely
useful to provide more direct comparators, and that an extensible
framework was warranted, whereas what I've done is likely to have very
limited type-applicability. In general, people don't often put indexes
on floating point columns (where we need to handle NaN comparisons in
a special way in order to provide behavior consistent with various
invariants that btree operator classes need to respect). But I'm
reasonably confident that the majority of primary indexes in the wild
are int4 or composites of int4, maybe even the large majority.

I'd be happy with a scheme that had only one built-in comparator, and
that allowed a few types to be cataloged as being safe to compare with
that "built-in" comparator - knowledge that could be passed to
_bt_compare via the scankey. I'm thinking of just int4, and maybe date
and a few other such int4 "covariant" types.
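
Roughly speaking, I'm imagining that the hard-coded pg_proc OID test
in the attached patch would be replaced by a flag test, something like
the fragment below. To be clear, this is only a sketch of the shape of
the fast path: SK_BT_INT4CMP is an invented flag name, the fragment is
not compilable on its own, and where the flag would actually get set
(presumably wherever the insertion scankey is built from the index
Relation) is hand-waved here:

/*
 * Hypothetical fast path inside _bt_compare(): if the scankey was
 * flagged as safe for built-in int4 comparison, skip the fmgr call.
 * SK_BT_INT4CMP is a made-up flag for illustration; it doesn't exist.
 */
if (scankey->sk_flags & SK_BT_INT4CMP)
{
	int32		a = DatumGetInt32(datum);
	int32		b = DatumGetInt32(scankey->sk_argument);

	result = (a > b) ? 1 : ((a < b) ? -1 : 0);
}
else
{
	/* Existing path: call the comparator through the fmgr trampoline */
	result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
											 scankey->sk_collation,
											 datum,
											 scankey->sk_argument));
}

if (!(scankey->sk_flags & SK_BT_DESC))
	result = -result;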

I don't think that this has to be morally equivalent to violating the
btree/AM abstraction by giving it direct knowledge of types - rather,
I'm only proposing that we give it the facility to know that for the
purposes of scankey comparison, a given type (well, some scankey that
has been appropriately hinted by the index Relation or something like
that) has Datums that compare like int4 Datums. Theoretically, it has
nothing in particular to do with the cataloged int4 type as such - it
has more to do with the fact that a comparison of 32-bit integers is a
single instruction on popular ISAs. More practically speaking, this
optimization seems likely to appreciably help many common cases.
Obviously, I'm not proposing that it be committed more or less as-is,
since it's entirely possible that the applicability of what I've done
isn't universal; I don't presume that it's never counter-productive.
How this could be most usefully applied is a discussion well worth
having, though. To be frank: hopefully we won't have a protracted
discussion this time. This stuff is not a high priority for me these
days.

Having mentioned CPU instructions, I should also say: as with the
sorting stuff, this optimization has little to nothing to do with
shaving some instructions from not having to go through the fmgr
trampoline. The true cost for that trampoline is almost certainly paid
in cache misses and, perhaps to a lesser extent, missed opportunities
for successful branch prediction.

Thoughts?

[1] http://lwn.net/Articles/444336/

[2] http://www.mpri.lsu.edu/textbook/Chapter5-a.htm#fibonacci

[3] http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/
-- 
Peter Geoghegan
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
new file mode 100644
index a452fea..8f92dfc
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
*************** top:
*** 157,163 ****
  	{
  		TransactionId xwait;
  
! 		offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
  		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
  								 checkUnique, &is_unique);
  
--- 157,163 ----
  	{
  		TransactionId xwait;
  
! 		offset = _bt_page_srch(rel, buf, natts, itup_scankey, false);
  		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
  								 checkUnique, &is_unique);
  
*************** _bt_findinsertloc(Relation rel,
*** 639,645 ****
  	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
  		newitemoff = firstlegaloff;
  	else
! 		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
  
  	*bufptr = buf;
  	*offsetptr = newitemoff;
--- 639,645 ----
  	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
  		newitemoff = firstlegaloff;
  	else
! 		newitemoff = _bt_page_srch(rel, buf, keysz, scankey, false);
  
  	*bufptr = buf;
  	*offsetptr = newitemoff;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
new file mode 100644
index ac98589..391b055
*** a/src/backend/access/nbtree/nbtsearch.c
--- b/src/backend/access/nbtree/nbtsearch.c
***************
*** 24,29 ****
--- 24,32 ----
  #include "utils/rel.h"
  
  
+ static int _bt_getfibnumber(OffsetNumber pagehigh);
+ static int _bt_fib_srch(Relation rel, int keysz, ScanKey scankey, Page page,
+ 						int low, int high);
  static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
  			 OffsetNumber offnum);
  static void _bt_saveitem(BTScanOpaque so, int itemIndex,
*************** static bool _bt_endpoint(IndexScanDesc s
*** 34,39 ****
--- 37,53 ----
  
  
  /*
+  * Precomputed table of Fibonacci numbers, used as search pivot offsets.
+  *
+  * There are enough constants here to cover all possible values of BLCKSZ.
+  */
+ static const OffsetNumber fibonacci[] = {
+ 	 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144,
+ 	 233, 377, 610, 987, 1597, 2584, 4181, 6765,
+ 	 10946, 17711, 28657, 46368, USHRT_MAX
+ };
+ 
+ /*
   *	_bt_search() -- Search the tree for a particular scankey,
   *		or more precisely for the first leaf page it could be on.
   *
*************** _bt_search(Relation rel, int keysz, Scan
*** 95,101 ****
  		 * Find the appropriate item on the internal page, and get the child
  		 * page that it points to.
  		 */
! 		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
  		itemid = PageGetItemId(page, offnum);
  		itup = (IndexTuple) PageGetItem(page, itemid);
  		blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
--- 109,115 ----
  		 * Find the appropriate item on the internal page, and get the child
  		 * page that it points to.
  		 */
! 		offnum = _bt_page_srch(rel, *bufP, keysz, scankey, nextkey);
  		itemid = PageGetItemId(page, offnum);
  		itup = (IndexTuple) PageGetItem(page, itemid);
  		blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
*************** _bt_moveright(Relation rel,
*** 204,210 ****
  }
  
  /*
!  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
   *
   * The passed scankey must be an insertion-type scankey (see nbtree/README),
   * but it can omit the rightmost column(s) of the index.
--- 218,224 ----
  }
  
  /*
!  *	_bt_page_srch() -- Do a search for a key on a particular page.
   *
   * The passed scankey must be an insertion-type scankey (see nbtree/README),
   * but it can omit the rightmost column(s) of the index.
*************** _bt_moveright(Relation rel,
*** 213,224 ****
   * item >= scankey.  When nextkey is true, we are looking for the first
   * item strictly greater than scankey.
   *
!  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
   * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
   * particular, this means it is possible to return a value 1 greater than the
   * number of keys on the page, if the scankey is > all keys on the page.)
   *
!  * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
   * of the last key < given scankey, or last key <= given scankey if nextkey
   * is true.  (Since _bt_compare treats the first data key of such a page as
   * minus infinity, there will be at least one key < scankey, so the result
--- 227,238 ----
   * item >= scankey.  When nextkey is true, we are looking for the first
   * item strictly greater than scankey.
   *
!  * On a leaf page, _bt_page_srch() returns the OffsetNumber of the first
   * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
   * particular, this means it is possible to return a value 1 greater than the
   * number of keys on the page, if the scankey is > all keys on the page.)
   *
!  * On an internal (non-leaf) page, _bt_page_srch() returns the OffsetNumber
   * of the last key < given scankey, or last key <= given scankey if nextkey
   * is true.  (Since _bt_compare treats the first data key of such a page as
   * minus infinity, there will be at least one key < scankey, so the result
*************** _bt_moveright(Relation rel,
*** 227,237 ****
   * (or leaf keys > given scankey when nextkey is true).
   *
   * This procedure is not responsible for walking right, it just examines
!  * the given page.	_bt_binsrch() has no lock or refcount side effects
   * on the buffer.
   */
  OffsetNumber
! _bt_binsrch(Relation rel,
  			Buffer buf,
  			int keysz,
  			ScanKey scankey,
--- 241,251 ----
   * (or leaf keys > given scankey when nextkey is true).
   *
   * This procedure is not responsible for walking right, it just examines
!  * the given page.	_bt_page_srch() has no lock or refcount side effects
   * on the buffer.
   */
  OffsetNumber
! _bt_page_srch(Relation rel,
  			Buffer buf,
  			int keysz,
  			ScanKey scankey,
*************** _bt_binsrch(Relation rel,
*** 274,293 ****
  	 */
  	high++;						/* establish the loop invariant for high */
  
! 	cmpval = nextkey ? 0 : 1;	/* select comparison value */
! 
! 	while (high > low)
  	{
! 		OffsetNumber mid = low + ((high - low) / 2);
  
! 		/* We have low <= mid < high, so mid points at a real slot */
  
! 		result = _bt_compare(rel, keysz, scankey, page, mid);
  
! 		if (result >= cmpval)
! 			low = mid + 1;
! 		else
! 			high = mid;
  	}
  
  	/*
--- 288,314 ----
  	 */
  	high++;						/* establish the loop invariant for high */
  
! 	if (nextkey)
  	{
! 		cmpval = nextkey ? 0 : 1;	/* select comparison value */
  
! 		while (high > low)
! 		{
! 			OffsetNumber mid = low + ((high - low) / 2);
  
! 			/* We have low <= mid < high, so mid points at a real slot */
  
! 			result = _bt_compare(rel, keysz, scankey, page, mid);
! 
! 			if (result >= cmpval) /* for !nextKey, if result > 0 */
! 				low = mid + 1;
! 			else
! 				high = mid;
! 		}
! 	}
! 	else
! 	{
! 		low = _bt_fib_srch(rel, keysz, scankey, page, low, high);
  	}
  
  	/*
*************** _bt_compare(Relation rel,
*** 395,412 ****
  		}
  		else
  		{
! 			/*
! 			 * The sk_func needs to be passed the index value as left arg and
! 			 * the sk_argument as right arg (they might be of different
! 			 * types).	Since it is convenient for callers to think of
! 			 * _bt_compare as comparing the scankey to the index item, we have
! 			 * to flip the sign of the comparison result.  (Unless it's a DESC
! 			 * column, in which case we *don't* flip the sign.)
! 			 */
! 			result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
! 													 scankey->sk_collation,
! 													 datum,
! 													 scankey->sk_argument));
  
  			if (!(scankey->sk_flags & SK_BT_DESC))
  				result = -result;
--- 416,448 ----
  		}
  		else
  		{
! 			if (scankey->sk_func.fn_oid == 351)
! 			{
! 				int32		a = DatumGetInt32(datum);
! 				int32		b = DatumGetInt32(scankey->sk_argument);
! 
! 				if (a > b)
! 					result = Int32GetDatum(1);
! 				else if (a == b)
! 					result = Int32GetDatum(0);
! 				else
! 					result = Int32GetDatum(-1);
! 			}
! 			else
! 			{
! 				/*
! 				 * The sk_func needs to be passed the index value as left arg and
! 				 * the sk_argument as right arg (they might be of different
! 				 * types).	Since it is convenient for callers to think of
! 				 * _bt_compare as comparing the scankey to the index item, we have
! 				 * to flip the sign of the comparison result.  (Unless it's a DESC
! 				 * column, in which case we *don't* flip the sign.)
! 				 */
! 				result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
! 														 scankey->sk_collation,
! 														 datum,
! 														 scankey->sk_argument));
! 			}
  
  			if (!(scankey->sk_flags & SK_BT_DESC))
  				result = -result;
*************** _bt_first(IndexScanDesc scan, ScanDirect
*** 803,809 ****
  	 * where we need to start the scan, and set flag variables to control the
  	 * code below.
  	 *
! 	 * If nextkey = false, _bt_search and _bt_binsrch will locate the first
  	 * item >= scan key.  If nextkey = true, they will locate the first
  	 * item > scan key.
  	 *
--- 839,845 ----
  	 * where we need to start the scan, and set flag variables to control the
  	 * code below.
  	 *
! 	 * If nextkey = false, _bt_search and _bt_page_srch will locate the first
  	 * item >= scan key.  If nextkey = true, they will locate the first
  	 * item > scan key.
  	 *
*************** _bt_first(IndexScanDesc scan, ScanDirect
*** 929,935 ****
  	so->markItemIndex = -1;		/* ditto */
  
  	/* position to the precise item on the page */
! 	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
  
  	/*
  	 * If nextkey = false, we are positioned at the first item >= scan key, or
--- 965,971 ----
  	so->markItemIndex = -1;		/* ditto */
  
  	/* position to the precise item on the page */
! 	offnum = _bt_page_srch(rel, buf, keysCount, scankeys, nextkey);
  
  	/*
  	 * If nextkey = false, we are positioned at the first item >= scan key, or
*************** _bt_next(IndexScanDesc scan, ScanDirecti
*** 1038,1043 ****
--- 1074,1207 ----
  }
  
  /*
+  * Returns the index of the smallest Fibonacci number greater than or equal
+  * to a max offset value (e.g. if 12 is passed, 7 is returned, which can be
+  * used to subscript fibonacci[] to obtain the 8th number in the sequence, 13).
+  */
+ static int
+ _bt_getfibnumber(OffsetNumber pagehigh)
+ {
+ 	static const int nfibs = sizeof(fibonacci) / sizeof(OffsetNumber);
+ 
+ 	int low, mid, high, geq;
+ 
+ 	low = 0;
+ 	high = nfibs - 1;
+ 	geq = -1;
+ 
+ 	if (pagehigh <= 1)
+ 		return 1;
+ 
+ 	/* Perform simple binary search */
+ 	while (low <= high)
+ 	{
+ 		mid = (low + high) / 2;
+ 
+ 		if (pagehigh < fibonacci[mid])
+ 		{
+ 			high = mid - 1;
+ 			geq = mid;
+ 		}
+ 		else if (pagehigh > fibonacci[mid])
+ 			low = mid + 1;
+ 		else
+ 			return mid;
+ 	}
+ 
+ 	if (geq == -1)
+ 		elog(ERROR, "could not find fibonacci number for offset %d", pagehigh);
+ 
+ 	/* Return smallest index greater than or equal to pagehigh */
+ 	return geq;
+ }
+ 
+ /*
+  * Fibonacci search for a key on a particular page.
+  *
+  * This is often preferred over the binary search otherwise performed by
+  * _bt_page_srch, because it is thought to take advantage of CPU cache
+  * characteristics.  However, it is only safe to call this function for the
+  * nextkey case (i.e. when searching for the first item >= scankey).
+  */
+ static int
+ _bt_fib_srch(Relation rel, int keysz, ScanKey scankey, Page page,
+ 			 int low, int high)
+ {
+ 	int32					result;
+ 	int						k;				/* Kth fib number < high */
+ 	static int				prevk = -1;		/* Previous k, cached */
+ 	static int				prevhigh = -1;	/* Previous val, cached for re-use */
+ 	int						idx = 0, offs = 0, last;
+ 
+ 	/*
+ 	 * Find value cached in static memory, or search for value and cache it
+ 	 * there for this high value.
+ 	 */
+ 	if (high != prevhigh)
+ 	{
+ 		prevk = k = _bt_getfibnumber(high);
+ 		prevhigh = high;
+ 	}
+ 	else
+ 	{
+ 		high = prevhigh;
+ 		k = prevk;
+ 		Assert(k >= 1);
+ 	}
+ 
+ 	do
+ 	{
+ 		/* Lose the greater than "high" value in first iteration */
+ 		idx = offs + fibonacci[--k];
+ 
+ 		Assert(fibonacci[k] < high);
+ 		Assert(low < high);
+ 
+ 		if (idx < low)
+ 		{
+ 			/* val in 2nd part */
+ 			offs = low;
+ 			--k;
+ 			continue;
+ 		}
+ 		else if (idx >= high) /* Went too far */
+ 			continue;
+ 
+ 		result = _bt_compare(rel, keysz, scankey, page, idx);
+ 
+ 		/* Note that at this point k has already been decremented once */
+ 		if (result < 0)
+ 		{
+ 			continue;
+ 		}
+ 		else if (result > 0)
+ 		{
+ 			/* val in 2nd part */
+ 			offs = idx;
+ 			--k;
+ 		}
+ 		else
+ 		{
+ 			/* Found, but may not be the first such key */
+ 			while (result == 0)
+ 			{
+ 				/* TODO: Think harder about many duplicates case */
+ 				last = idx;
+ 				if (idx <= low)
+ 					return low;
+ 				result = _bt_compare(rel, keysz, scankey, page, --idx);
+ 			}
+ 
+ 			return last;
+ 		}
+ 	}
+ 	while (k > 0);
+ 
+ 	/* Not found */
+ 	return idx + 1;
+ }
+ 
+ /*
   *	_bt_readpage() -- Load data from current index page into so->currPos
   *
   * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
new file mode 100644
index 3997f94..dc36346
*** a/src/include/access/nbtree.h
--- b/src/include/access/nbtree.h
*************** extern BTStack _bt_search(Relation rel,
*** 649,655 ****
  		   Buffer *bufP, int access);
  extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
  			  ScanKey scankey, bool nextkey, int access);
! extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
  			ScanKey scankey, bool nextkey);
  extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
  			Page page, OffsetNumber offnum);
--- 649,655 ----
  		   Buffer *bufP, int access);
  extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
  			  ScanKey scankey, bool nextkey, int access);
! extern OffsetNumber _bt_page_srch(Relation rel, Buffer buf, int keysz,
  			ScanKey scankey, bool nextkey);
  extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
  			Page page, OffsetNumber offnum);
-- 
