Dynamic prefix truncation for nbtree, Lanin & Shasha design issue

Peter Geoghegan Mon, 03 Dec 2018 19:12:02 -0800

Attached is a prototype patch for dynamic prefix truncation that
applies on top of v9 of the nbtree patch series [1] I've been working
on. This results in considerable performance improvements in some
cases, since it's (almost!) safe to skip attributes that we know must
be redundant/equal based on our descent of the B-Tree so far -- we can
reason about that without any extra comparison work. There are
problems, though, which hint at an underlying issue with the broader
nbtree design, which is what I really want to talk about. This thread
may not go anywhere, but I feel an obligation to put my thoughts on
how we do the Lanin & Shasha deletion stuff on the record, available
from the archives. I think that there may be some wisdom in the L&S
design that was overlooked.

First, I will say a bit more about the merits of the dynamic prefix
truncation patch, to establish the practical relevance of what I'm
about to go into. I'll show an improvement in bulk insertion
performance for the UK land registry data [2]. There is one entry in
the table for every property sale in the UK going back 2 or 3 decades
-- that's about 21.5 million rows. I have an index defined against a
3GB table, where most cycles are spent during insertions:

pg@land:5432 [25978]=# \d composite
Unlogged index "public.composite"
Column │ Type │ Key? │ Definition
──────────┼──────┼──────┼────────────
county │ text │ yes │ county
city │ text │ yes │ city
locality │ text │ yes │ locality
btree, for table "public.land2"

The total index size is 1101 MB after a CREATE INDEX. As you'd
imagine, this is a medium to low cardinality index, because there are
naturally quite a lot of individual property sales in each locality,
especially those in and around London. The query "INSERT INTO land2
SELECT * FROM land_registry_price_paid_uk" does a bulk insertion into
the unlogged table land2. Here is the impact on throughput/overall
duration:

Duration on master branch (best of 3): 01:17.165

Duration with just v9 of my patch (best of 3): 01:12.736

Duration with v9 + dynamic prefix truncation (best of 3): 01:04.038

Clearly dynamic prefix truncation adds quite a lot here -- a serial
REINDEX takes about 45 - 50 seconds on the same machine. Note that I'm
not cherry-picking my test-case; I've seen similar improvements while
bulk loading a TPC-E database in a similar manner. Clearly this is a
pretty good optimization, that complements the suffix truncation stuff
well. Also, the skip scan patch could make good use of this, since
that repeatedly descends a tree with a low cardinality leading
attribute. The optimization would be particularly helpful there.

Problem
=======

I don't think this latest experimental patch can go anywhere for the
foreseeable future, even though the idea is workable. There is an
_incredibly_ subtle race condition. I want to talk about why this is,
since it seems like it's down to a design decision, which likely has
broad implications:

"""
(Note: Lanin and Shasha prefer to make the key space move left, but their
argument for doing so hinges on not having left-links, which we have
anyway. So we simplify the algorithm by moving key space right.)
"""

(I'm quoting the nbtree README.)

IOW, nbtree page deletion is not the precise opposite of a page split,
unlike an implementation that sticks closely to what Lanin & Shasha
describe: a concurrent would-be insertion into a deleted page ends up
inserting into the deletion target's right sibling, rather than the
deletion target's left sibling. This nbtree README sentence seems
misleading to me. While the README is technically correct to say L&S
don't have left links, they do have something called "outlinks", which
seem like a special case of left link to me [3].

Comparing their "procedure move-right()" pseudo-code on page 16 to our
own _bt_moveright() routine is informative: you see that they
distinguish between an ignorable/half-dead page and a page that just
had a concurrent page split in a way that we clearly don't. We only
ever 1) stay put, or 2) move right on concurrent split or concurrent
deletion. They either 1) stay put, 2) move right when the high key is
less than the value being searched for, or 3) follow the "outlink"
when the page was marked ignorable (half-dead or dead) by a concurrent
deletion.

I think that the nbtree README means that nbtree does what you see in
L&S's "Figure 3. (b)", despite the fact that Lanin & Shasha find it
"inconvenient". L&S also say of Fig 3. (b):

"""
Let us consider various implementations of merging a node n with its
right neighbor n' in the B-link structure. If we move the data in n'
to n (Fig. 3a), there is no path that processes can follow from n' to
the data moved out. This may, for example, lead a process to conclude
that c is not in the structure.' If we move the data in n to n' (Fig.
3b), we will need to access the left neighbor of n to remove n from
the linked list. Finding the left neighbor requires either much time
or a doubly linked list, which increases complexity and the locking
overhead (see [Sh84] for example).
"""

To be fair, whether or not the outlink is a special case of a leftlink
as I suggest, or a separate thing is perhaps debatable -- I'll need to
think about backwards scans some more to develop an opinion on that.
Either way, having an extra link to handle concurrent page deletions
makes a page deletion truly the opposite of a page split. Following a
downlink becomes much closer to following a right link, in the sense
that we can reason about the next page down/to the right having no key
space overlap with the current/about-to-be-previous page.

Advantages of making our design closer to Lanin & Shasha's in this way
may include:

* More elegant design -- moving right to find something that should
actually be to the left is arguably pretty ugly.

* Being able to reliably notice concurrent page deletions that could
affect what our insertion scan key can expect to have to compare next
would give me an easy way to make dynamic prefix truncation work
correctly.

We can safely invalidate the skip attribute hint when an ignorable
(half-dead or dead) page is found, starting afresh from the page
pointed to by the half-dead page's outlink -- a page deletion that we
cannot possibly notice to the immediate left of the child causes no
problems. In particular, no concurrent insertion that inserts into a
part of the key space dynamic prefix truncation reasoned that it had
already eliminated could actually occur. The race condition can be
reliably noticed and recovered from as part of handling a concurrent
page deletion/following ignorable page's "outlink".

* Dynamic prefix truncation is not that different to actual prefix
truncation, and is therefore probably going to also block on solving
this problem.

If we ever want a configurable leaf level compression feature, we'll
probably have to confront the same issue. We'd probably end up adding
a "low key" at the leaf level when the DBA turned that feature on. How
is that supposed to work with the current design?

* Merging not-fully-empty nodes seems much easier if we wanted to go that way.

The L&S paper is very clear about the fact that it supports merging
pages that aren't yet completely empty -- they have no restrictions on
the page being totally empty. I'm pretty sure that this is the only
reason for nbtree's restriction on page deletion. I'm not actually
interested in pursuing this one myself, but I include it to aid
understanding of what I'm talking about. I remember Kevin Grittner
(CC'd) expressing interest in this project.

* We could make amcheck's bt_index_parent_check() SQL-callable
function only need an AccessShareLock, perhaps, because concurrent
VACUUM/page deletion becomes a problem it can reason about and recover
from.

The race that makes dynamic prefix truncation unworkable is exactly
the same as the race I go on about at length within amcheck's
bt_downlink_check() function -- I guess you could say that I noticed
this problem a few years ago, but I'm only now connecting the dots.
Concurrent VACUUMs could instead be dealt with there by making use of
a trick that's similar to the existing cross-page invariant trick
that's explained at length within bt_right_page_check_scankey(). The
general idea, once again, is that seeing a half-dead child page
becomes a reliable canary condition. See also: the enormously detailed
comments I wrote within bt_target_page_check() about why this is okay
to do this across siblings in the case of bt_index_check(), the
SQL-callable verification function that already only needs an
AccessShareLock.

[1]
https://postgr.es/m/CAH2-Wz=apbKyaFhEfRN3UK_yXZ8DSE4Ybr0A3D87=4jwyy1...@mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Sample_Databases
[3] https://archive.org/stream/symmetricconcurr00lani
--
Peter Geoghegan

From 1837d2404ebfadeed824e904ffe4c1588b14a49b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 27 Nov 2018 16:54:17 -0800
Subject: [PATCH v9 7/7] POC: Add dynamic prefix truncation to nbtree.

There is an extremely rare and subtle race condition in this commit,
that seems avoidable by rethinking the exact approach taken to
implementing page deletion.  Note that assert-enabled builds don't trust
dynamic prefix truncation, and instead verify that it's correct.  That
doesn't seem to cause any trouble in practice, though only because the
race is remarkably narrow.
---
 src/backend/access/nbtree/README      | 53 +++++++++++++++++++++++
 src/backend/access/nbtree/nbtsearch.c | 62 ++++++++++++++++++++++++++-
 src/backend/access/nbtree/nbtutils.c  |  2 +
 src/include/access/nbtree.h           | 11 +++++
 4 files changed, 127 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 700b940b79..83112fb61c 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -747,6 +747,59 @@ groups of duplicates align with page boundaries).  In any case, getting
 better locality of access for index scans is more important than getting
 optimal space utilization.
 
+Notes about dynamic prefix truncation
+-------------------------------------
+
+We don't implement canonical prefix B-Trees from the Bayer & Unterauer
+paper (only "simple prefix B-Trees"/suffix truncation), as truncating a
+prefix from keys on the leaf level is not compatible with index-only
+scans, and would probably negatively impact the performance of selective
+index scans.  We can nevertheless exploit the fact that there is often
+redundancy in prefixes that are common to all possible tuples on a page,
+especially at the leaf level.
+
+Binary searches on internal pages can eliminate earlier attributes from
+consideration from their remaining search interval, and from the entire
+subtree that the search is about to descend to the root of.  Insertion
+scan keys store mutable state that remembers which attribute future
+comparisons can skip straight to.  This remains valid throughout the
+entire lifetime of the insertion scan key, without any need to treat
+concurrent page splits and page deletions as special cases.  This is
+called dynamic prefix truncation.
+
+Recall that an insertion scan key is only used to find the initial leaf
+page of an index scan.  In general, it can be thought of as a structure
+that searches for one exact location equal to or just after some existing
+leaf level non-pivot tuple.  The only exception is insertions into unique
+indexes, where the exact position searched for varies over time to suit
+the purposes of unique checking (a scantid is filled-in after uniqueness
+is established).  That special case is the main reason why we don't
+continue to eliminate attributes at leaf level binary searches (we could
+be more selective, but that doesn't seem worthwhile -- note that we can
+still skip attributes at the leaf level in all cases).   The limited and
+well-defined purpose of insertion scan keys makes it safe to eliminate
+whole attributes from consideration.
+
+Concurrent page splits cannot invalidate the skip hint, because pivot
+tuples in the parent are to the left of values in child pages; the new
+high key in the left half of the split must still have equal attributes
+before the attribute that we skip straight to, so the decision about
+whether or not we move right is still correct.
+
+Concurrent deletion of a child page cannot invalidate the skip hint,
+either.  Upon finding an ignorable page we'll move right, and find a page
+whose keyspace is _smaller_ than expected.  If the page to the right of a
+child page is concurrently deleted, we'll keep its original downlink as
+the child's high key, so nothing changes.
+
+XXX: What about the case where the page to the left of a child page we
+land on is concurrently deleted?  Doesn't that have basically the same
+race condition as the one described at length in contrib/amcheck's
+bt_downlink_check() function? It seems like there might have been some
+wisdom in Lanin & Shasha's idea of making the key space move left rather
+than move right during page deletion; that makes page deletion the exact
+opposite of a page split, which we don't quite manage.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c1a483e8d1..105e1f4d87 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -365,7 +365,10 @@ _bt_binsrch(Relation rel,
 				savehigh;
 	int32		result,
 				cmpval;
-	bool		isleaf;
+	bool		isleaf,
+				saveskipatt;
+	int			lastcomparedattlow;
+	int			lastcomparedatthigh;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -430,6 +433,12 @@ _bt_binsrch(Relation rel,
 
 	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 	savehigh = high;
+	/* Set up state for skipping comparisons of entire attributes */
+	key->lastcomparedatt =
+		lastcomparedattlow =
+		lastcomparedatthigh =
+		key->skiptoatt;
+	saveskipatt = !isleaf && key->heapkeyspace;
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
@@ -442,10 +451,14 @@ _bt_binsrch(Relation rel,
 			result = _bt_nonpivot_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
+		{
 			low = mid + 1;
+			lastcomparedattlow = key->lastcomparedatt;
+		}
 		else
 		{
 			high = mid;
+			lastcomparedatthigh = key->lastcomparedatt;
 
 			/*
 			 * high can only be reused by more restrictive binary search when
@@ -454,6 +467,31 @@ _bt_binsrch(Relation rel,
 			if (result != 0)
 				savehigh = high;
 		}
+
+		if (saveskipatt)
+		{
+			int			skiptoatt;
+
+			/*
+			 * On an internal page, caller is about to descend a downlink
+			 * between two "separator keys" -- one in the pivot tuple that
+			 * contains the downlink, and the other to its right (may be the
+			 * separator in the high key).  Since these separators provide
+			 * bounds on the keyspace of the subtree whose root is the child
+			 * page we're about to descend to, we may be able to safely skip
+			 * comparisons of earlier attributes in that subtree.  We can even
+			 * safely eliminate earlier attributes from the remaining search
+			 * interval of the current page.  Remember the attribute after the
+			 * last attribute that is definitely equal to the insertion scan
+			 * key in the case of both separators.  This should never decrease
+			 * throughout the entire lifetime of any insertion scan key.
+			 *
+			 * This is explained at length in the nbtree README.
+			 */
+			skiptoatt = Min(lastcomparedattlow, lastcomparedatthigh);
+			Assert(key->skiptoatt <= skiptoatt);
+			key->skiptoatt = skiptoatt;
+		}
 	}
 
 	if (key->savebinsrch)
@@ -522,7 +560,10 @@ _bt_compare(Relation rel,
 	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+	{
+		key->lastcomparedatt = key->skiptoatt;
 		return 1;
+	}
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	ntupatts = BTreeTupleGetNAtts(itup, rel);
@@ -591,8 +632,14 @@ _bt_tuple_compare(Relation rel,
 	 */
 
 	ncmpkey = Min(ntupatts, key->keysz);
+	/* Don't skip attributes on assert-enabled builds */
+#ifndef USE_ASSERT_CHECKING
+	scankey = key->scankeys + (key->skiptoatt - 1);
+	for (i = key->skiptoatt; i <= ncmpkey; i++)
+#else
 	scankey = key->scankeys;
 	for (i = 1; i <= ncmpkey; i++)
+#endif
 	{
 		Datum		datum;
 		bool		isNull;
@@ -636,6 +683,11 @@ _bt_tuple_compare(Relation rel,
 				INVERT_COMPARE_RESULT(result);
 		}
 
+		/* Assert that key->skiptoatt is correct */
+		Assert(i >= key->skiptoatt || result == 0);
+		/* Record for caller that this attribute was compared */
+		key->lastcomparedatt = i;
+
 		/* if the keys are unequal, return the difference */
 		if (result != 0)
 			return result;
@@ -643,6 +695,12 @@ _bt_tuple_compare(Relation rel,
 		scankey++;
 	}
 
+	/*
+	 * Record for caller that comparison is resolved at a truncated/negative
+	 * infinity attribute, or heap TID attribute
+	 */
+	key->lastcomparedatt = ncmpkey + 1;
+
 	/*
 	 * Use the number of attributes as a tie-breaker, in order to treat
 	 * truncated attributes in index as minus infinity
@@ -1253,6 +1311,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	key.heapkeyspace = _bt_heapkeyspace(rel);
 	key.savebinsrch = key.restorebinsrch = false;
 	key.low = key.high = InvalidOffsetNumber;
+	key.lastcomparedatt = 1;
+	key.skiptoatt = 1;
 	key.nextkey = nextkey;
 	key.keysz = keysCount;
 	key.scantid = NULL;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index a4964dc22c..3637459902 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,6 +111,8 @@ _bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace)
 	res->heapkeyspace = assumeheapkeyspace || _bt_heapkeyspace(rel);
 	res->savebinsrch = res->restorebinsrch = false;
 	res->low = res->high = InvalidOffsetNumber;
+	res->lastcomparedatt = 1;
+	res->skiptoatt = 1;
 	res->nextkey = false;
 	res->keysz = Min(indnkeyatts, tupnatts);
 	res->scantid = res->heapkeyspace ? BTreeTupleGetHeapTID(itup) : NULL;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d7fa9e8c49..a2d9675259 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -407,6 +407,17 @@ typedef BTStackData *BTStack;
 
 typedef struct BTScanInsertData
 {
+	/*
+	 * Mutable state used used by _bt_binsrch() for "dynamic prefix
+	 * truncation", an optimization that allows later _bt_search() comparisons
+	 * to skip earlier attributes that can no longer be unequal to scankey
+	 * values.  Used by both _bt_first() and _bt_doinsert().  "skiptoatt" may
+	 * increase when a descent of a tree eliminates additional whole
+	 * attributes from consideration.  It can never decrease.
+	 */
+	int			 lastcomparedatt;
+	int			 skiptoatt;
+
 	/*
 	 * Mutable state used by _bt_binsrch() to inexpensively repeat a binary
 	 * search on the leaf level when only scantid has changed.  Only used for
-- 
2.17.1

Dynamic prefix truncation for nbtree, Lanin & Shasha design issue

Reply via email to