Attached is a prototype patch for dynamic prefix truncation that applies on top of v9 of the nbtree patch series [1] I've been working on. This results in considerable performance improvements in some cases, since it's (almost!) safe to skip attributes that we know must be redundant/equal based on our descent of the B-Tree so far -- we can reason about that without any extra comparison work. There are problems, though, and they hint at an underlying issue with the broader nbtree design, which is what I really want to talk about. This thread may not go anywhere, but I feel an obligation to put my thoughts on how we do the Lanin & Shasha deletion stuff on the record, so that they're at least available in the archives. I think that there may be some wisdom in the L&S design that was overlooked.
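To make the core idea concrete before getting into the numbers, here is a minimal, self-contained sketch of the comparison trick (the struct and helper below are invented for illustration -- they are not the types the patch actually uses). The descent remembers the first attribute that could still differ from the scan key, given the separator keys that bound the subtree being descended to, and later comparisons simply start there:

/* Toy types, invented for illustration only -- not real nbtree code */
typedef struct ToyInsertionKey
{
    int     natts;          /* number of key attributes */
    int     skiptoatt;      /* 1-based first attribute still in doubt */
    int     atts[8];        /* stand-ins for real datums */
} ToyInsertionKey;

/*
 * Compare the scan key against one index tuple, starting at skiptoatt.
 * Attributes before skiptoatt were already proven equal on both separator
 * keys bounding this subtree during the descent, so they cannot change the
 * outcome here, and are skipped outright.
 */
static int
toy_tuple_compare(const ToyInsertionKey *key, const int *tupatts, int tupnatts)
{
    int     ncmp = (tupnatts < key->natts) ? tupnatts : key->natts;

    for (int i = key->skiptoatt; i <= ncmp; i++)
    {
        int     a = key->atts[i - 1];
        int     b = tupatts[i - 1];

        if (a != b)
            return (a > b) ? 1 : -1;
    }
    return 0;               /* all compared attributes are equal */
}

In the attached patch the analogous state lives in the insertion scan key (the skiptoatt and lastcomparedatt fields in the diff below), and it only ever advances based on comparisons against both separators that bound the downlink being followed.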
First, I will say a bit more about the merits of the dynamic prefix truncation patch, to establish the practical relevance of what I'm about to go into. I'll show an improvement in bulk insertion performance for the UK land registry data [2]. There is one entry in the table for every property sale in the UK going back 2 or 3 decades -- that's about 21.5 million rows. I have an index defined against a 3GB table, and that index is where most cycles are spent during insertions:

pg@land:5432 [25978]=# \d composite
   Unlogged index "public.composite"
  Column  │ Type │ Key? │ Definition
──────────┼──────┼──────┼────────────
 county   │ text │ yes  │ county
 city     │ text │ yes  │ city
 locality │ text │ yes  │ locality
btree, for table "public.land2"

The total index size is 1101 MB after a CREATE INDEX. As you'd imagine, this is a medium to low cardinality index, because there are naturally quite a lot of individual property sales in each locality, especially those in and around London.

The query "INSERT INTO land2 SELECT * FROM land_registry_price_paid_uk" does a bulk insertion into the unlogged table land2. Here is the impact on throughput/overall duration:

Duration on master branch (best of 3): 01:17.165
Duration with just v9 of my patch (best of 3): 01:12.736
Duration with v9 + dynamic prefix truncation (best of 3): 01:04.038

Clearly dynamic prefix truncation adds quite a lot here -- a serial REINDEX takes about 45-50 seconds on the same machine. Note that I'm not cherry-picking my test case; I've seen similar improvements while bulk loading a TPC-E database in the same way. All in all, this is a pretty good optimization that complements the suffix truncation stuff well. The skip scan patch could also make good use of this, since it repeatedly descends a tree with a low cardinality leading attribute; the optimization would be particularly helpful there.

Problem
=======

I don't think this latest experimental patch can go anywhere for the foreseeable future, even though the idea is workable. There is an _incredibly_ subtle race condition. I want to talk about why that is, since it seems to come down to a design decision that likely has broad implications:

"""
(Note: Lanin and Shasha prefer to make the key space move left, but their
argument for doing so hinges on not having left-links, which we have
anyway.  So we simplify the algorithm by moving key space right.)
"""

(I'm quoting the nbtree README.)

IOW, nbtree page deletion is not the precise opposite of a page split, unlike an implementation that sticks closely to what Lanin & Shasha describe: a concurrent would-be insertion into a deleted page ends up inserting into the deletion target's right sibling, rather than the deletion target's left sibling.

This nbtree README sentence seems misleading to me. While the README is technically correct to say that L&S don't have left links, they do have something called "outlinks", which seem to me like a special case of left link [3]. Comparing their "procedure move-right()" pseudo-code on page 16 to our own _bt_moveright() routine is informative: they distinguish between an ignorable/half-dead page and a page that just had a concurrent page split in a way that we clearly don't. We only ever 1) stay put, or 2) move right on a concurrent split or a concurrent deletion. They either 1) stay put, 2) move right when the high key is less than the value being searched for, or 3) follow the "outlink" when the page was marked ignorable (half-dead or dead) by a concurrent deletion.
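To make the contrast concrete, here is a rough, hypothetical sketch of the two disciplines. The struct and field names (including "outlink") are invented for illustration; nbtree has nothing like an outlink today, and the real routine is of course _bt_moveright():

#include <stdbool.h>

/* Toy page state, invented for illustration -- not the real BTPageOpaque */
typedef struct ToyPage
{
    bool    ignorable;      /* half-dead or dead (deleted)? */
    int     rightlink;      /* block number of right sibling */
    int     outlink;        /* L&S only: where the deleted keyspace went */
    int     highkey;        /* stand-in for the page's high key */
} ToyPage;

/* nbtree-style: concurrent splits and deletions both mean "move right" */
static int
nbtree_style_moveright(const ToyPage *page, int blkno, int searchval)
{
    if (page->ignorable || searchval > page->highkey)
        return page->rightlink;
    return blkno;               /* stay put */
}

/*
 * L&S-style: a concurrent split still means "move right", but a concurrent
 * deletion is noticed as a distinct event, and recovered from by following
 * the outlink.
 */
static int
lanin_shasha_style_moveright(const ToyPage *page, int blkno, int searchval)
{
    if (page->ignorable)
        return page->outlink;   /* concurrent deletion, noticed explicitly */
    if (searchval > page->highkey)
        return page->rightlink; /* concurrent page split */
    return blkno;               /* stay put */
}

The second version can tell the difference between "I arrived during a split" and "I arrived during a deletion", which is exactly the information a skip-attribute hint needs in order to be invalidated safely.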
I think that the nbtree README means that nbtree does what you see in L&S's "Figure 3. (b)", despite the fact that Lanin & Shasha find it "inconvenient". L&S also say of Fig 3. (b):

"""
Let us consider various implementations of merging a node n with its
right neighbor n' in the B-link structure.  If we move the data in n' to
n (Fig. 3a), there is no path that processes can follow from n' to the
data moved out.  This may, for example, lead a process to conclude that c
is not in the structure.  If we move the data in n to n' (Fig. 3b), we
will need to access the left neighbor of n to remove n from the linked
list.  Finding the left neighbor requires either much time or a doubly
linked list, which increases complexity and the locking overhead (see
[Sh84] for example).
"""

To be fair, whether the outlink is a special case of a left link, as I suggest, or a separate thing is perhaps debatable -- I'll need to think about backwards scans some more before I have an opinion on that. Either way, having an extra link to handle concurrent page deletions makes a page deletion truly the opposite of a page split. Following a downlink becomes much closer to following a right link, in the sense that we can reason about the next page down/to the right having no key space overlap with the current/about-to-be-previous page.

Advantages of making our design closer to Lanin & Shasha's in this way may include:

* More elegant design -- moving right to find something that should actually be to the left is arguably pretty ugly.

* Being able to reliably notice concurrent page deletions that could affect what our insertion scan key can expect to have to compare next would give me an easy way to make dynamic prefix truncation work correctly. We can safely invalidate the skip attribute hint when an ignorable (half-dead or dead) page is found, starting afresh from the page pointed to by the half-dead page's outlink -- a page deletion that we cannot possibly notice to the immediate left of the child causes no problems. In particular, no concurrent insertion could actually land in a part of the key space that dynamic prefix truncation reasoned it had already eliminated. The race condition can be reliably noticed and recovered from as part of handling a concurrent page deletion/following the ignorable page's "outlink" (a rough sketch follows this list).

* Dynamic prefix truncation is not that different from actual prefix truncation, which is therefore probably also going to be blocked on solving this problem. If we ever want a configurable leaf level compression feature, we'll probably have to confront the same issue. We'd probably end up adding a "low key" at the leaf level when the DBA turned that feature on. How is that supposed to work with the current design?

* Merging not-fully-empty nodes seems much easier if we ever wanted to go that way. The L&S paper is very clear about the fact that it supports merging pages that aren't yet completely empty -- they don't require that a page be totally empty before it can be deleted. I'm pretty sure that this is the only reason for nbtree's restriction that a page must be completely empty before it can be deleted. I'm not actually interested in pursuing this one myself, but I include it to aid understanding of what I'm talking about. I remember Kevin Grittner (CC'd) expressing interest in this project.

* We could perhaps make amcheck's bt_index_parent_check() SQL-callable function only need an AccessShareLock, because concurrent VACUUM/page deletion becomes a problem it can reason about and recover from.
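As a rough illustration of the second bullet point, here is a hypothetical sketch of how a descent could recover its skip hint upon landing on an ignorable page, reusing the toy types from the sketches above (again, the outlink and everything else here is invented -- none of it exists in nbtree today):

/*
 * Hypothetical descent step.  Seeing an ignorable (half-dead or dead) page
 * is the reliable canary: a concurrent page deletion definitely happened,
 * so discard whatever prefix the descent had eliminated so far and follow
 * the outlink.
 */
static int
toy_descend_step(const ToyPage *page, int blkno, ToyInsertionKey *key)
{
    if (page->ignorable)
    {
        key->skiptoatt = 1;     /* start afresh from the first attribute */
        return page->outlink;   /* keyspace moved left during deletion */
    }

    /*
     * Normal case: the binary search on this page may advance
     * key->skiptoatt again, based on the separators bounding the downlink
     * that gets chosen.
     */
    return blkno;
}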
The race that makes dynamic prefix truncation unworkable is exactly the same as the race I go on about at length within amcheck's bt_downlink_check() function -- I guess you could say that I noticed this problem a few years ago, but I'm only now connecting the dots. Concurrent VACUUMs could instead be dealt with there by making use of a trick that's similar to the existing cross-page invariant trick that's explained at length within bt_right_page_check_scankey(). The general idea, once again, is that seeing a half-dead child page becomes a reliable canary condition. See also: the enormously detailed comments I wrote within bt_target_page_check() about why it's okay to do this across siblings in the case of bt_index_check(), the SQL-callable verification function that already only needs an AccessShareLock.

[1] https://postgr.es/m/CAH2-Wz=apbKyaFhEfRN3UK_yXZ8DSE4Ybr0A3D87=4jwyy1...@mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Sample_Databases
[3] https://archive.org/stream/symmetricconcurr00lani
--
Peter Geoghegan
From 1837d2404ebfadeed824e904ffe4c1588b14a49b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 27 Nov 2018 16:54:17 -0800
Subject: [PATCH v9 7/7] POC: Add dynamic prefix truncation to nbtree.

There is an extremely rare and subtle race condition in this commit that
seems avoidable by rethinking the exact approach taken to implementing
page deletion.

Note that assert-enabled builds don't trust dynamic prefix truncation,
and instead verify that it's correct.  That doesn't seem to cause any
trouble in practice, though only because the race is remarkably narrow.
---
 src/backend/access/nbtree/README      | 53 +++++++++++++++++++++++
 src/backend/access/nbtree/nbtsearch.c | 62 ++++++++++++++++++++++++++-
 src/backend/access/nbtree/nbtutils.c  |  2 +
 src/include/access/nbtree.h           | 11 +++++
 4 files changed, 127 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 700b940b79..83112fb61c 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -747,6 +747,59 @@ groups of duplicates align with page boundaries).  In any case, getting
 better locality of access for index scans is more important than getting
 optimal space utilization.
 
+Notes about dynamic prefix truncation
+-------------------------------------
+
+We don't implement canonical prefix B-Trees from the Bayer & Unterauer
+paper (only "simple prefix B-Trees"/suffix truncation), as truncating a
+prefix from keys on the leaf level is not compatible with index-only
+scans, and would probably negatively impact the performance of selective
+index scans.  We can nevertheless exploit the fact that there is often
+redundancy in prefixes that are common to all possible tuples on a page,
+especially at the leaf level.
+
+Binary searches on internal pages can eliminate earlier attributes from
+consideration from their remaining search interval, and from the entire
+subtree that the search is about to descend to the root of.  Insertion
+scan keys store mutable state that remembers which attribute future
+comparisons can skip straight to.  This remains valid throughout the
+entire lifetime of the insertion scan key, without any need to treat
+concurrent page splits and page deletions as special cases.  This is
+called dynamic prefix truncation.
+
+Recall that an insertion scan key is only used to find the initial leaf
+page of an index scan.  In general, it can be thought of as a structure
+that searches for one exact location equal to or just after some existing
+leaf level non-pivot tuple.  The only exception is insertions into unique
+indexes, where the exact position searched for varies over time to suit
+the purposes of unique checking (a scantid is filled-in after uniqueness
+is established).  That special case is the main reason why we don't
+continue to eliminate attributes at leaf level binary searches (we could
+be more selective, but that doesn't seem worthwhile -- note that we can
+still skip attributes at the leaf level in all cases).  The limited and
+well-defined purpose of insertion scan keys makes it safe to eliminate
+whole attributes from consideration.
+
+Concurrent page splits cannot invalidate the skip hint, because pivot
+tuples in the parent are to the left of values in child pages; the new
+high key in the left half of the split must still have equal attributes
+before the attribute that we skip straight to, so the decision about
+whether or not we move right is still correct.
+
+Concurrent deletion of a child page cannot invalidate the skip hint,
+either.  Upon finding an ignorable page we'll move right, and find a page
+whose keyspace is _smaller_ than expected.  If the page to the right of a
+child page is concurrently deleted, we'll keep its original downlink as
+the child's high key, so nothing changes.
+
+XXX: What about the case where the page to the left of a child page we
+land on is concurrently deleted?  Doesn't that have basically the same
+race condition as the one described at length in contrib/amcheck's
+bt_downlink_check() function?  It seems like there might have been some
+wisdom in Lanin & Shasha's idea of making the key space move left rather
+than move right during page deletion; that makes page deletion the exact
+opposite of a page split, which we don't quite manage.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c1a483e8d1..105e1f4d87 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -365,7 +365,10 @@ _bt_binsrch(Relation rel,
 				savehigh;
 	int32		result,
 				cmpval;
-	bool		isleaf;
+	bool		isleaf,
+				saveskipatt;
+	int			lastcomparedattlow;
+	int			lastcomparedatthigh;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -430,6 +433,12 @@ _bt_binsrch(Relation rel,
 	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 	savehigh = high;
 
+	/* Set up state for skipping comparisons of entire attributes */
+	key->lastcomparedatt =
+		lastcomparedattlow =
+		lastcomparedatthigh =
+		key->skiptoatt;
+	saveskipatt = !isleaf && key->heapkeyspace;
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
@@ -442,10 +451,14 @@ _bt_binsrch(Relation rel,
 		result = _bt_nonpivot_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
+		{
 			low = mid + 1;
+			lastcomparedattlow = key->lastcomparedatt;
+		}
 		else
 		{
 			high = mid;
+			lastcomparedatthigh = key->lastcomparedatt;
 
 			/*
 			 * high can only be reused by more restrictive binary search when
@@ -454,6 +467,31 @@ _bt_binsrch(Relation rel,
 			if (result != 0)
 				savehigh = high;
 		}
+
+		if (saveskipatt)
+		{
+			int			skiptoatt;
+
+			/*
+			 * On an internal page, caller is about to descend a downlink
+			 * between two "separator keys" -- one in the pivot tuple that
+			 * contains the downlink, and the other to its right (may be the
+			 * separator in the high key).  Since these separators provide
+			 * bounds on the keyspace of the subtree whose root is the child
+			 * page we're about to descend to, we may be able to safely skip
+			 * comparisons of earlier attributes in that subtree.  We can even
+			 * safely eliminate earlier attributes from the remaining search
+			 * interval of the current page.  Remember the attribute after the
+			 * last attribute that is definitely equal to the insertion scan
+			 * key in the case of both separators.  This should never decrease
+			 * throughout the entire lifetime of any insertion scan key.
+			 *
+			 * This is explained at length in the nbtree README.
+			 */
+			skiptoatt = Min(lastcomparedattlow, lastcomparedatthigh);
+			Assert(key->skiptoatt <= skiptoatt);
+			key->skiptoatt = skiptoatt;
+		}
 	}
 
 	if (key->savebinsrch)
@@ -522,7 +560,10 @@ _bt_compare(Relation rel,
 	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+	{
+		key->lastcomparedatt = key->skiptoatt;
 		return 1;
+	}
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	ntupatts = BTreeTupleGetNAtts(itup, rel);
@@ -591,8 +632,14 @@ _bt_tuple_compare(Relation rel,
 	 */
 	ncmpkey = Min(ntupatts, key->keysz);
 
+	/* Don't skip attributes on assert-enabled builds */
+#ifndef USE_ASSERT_CHECKING
+	scankey = key->scankeys + (key->skiptoatt - 1);
+	for (i = key->skiptoatt; i <= ncmpkey; i++)
+#else
 	scankey = key->scankeys;
 	for (i = 1; i <= ncmpkey; i++)
+#endif
 	{
 		Datum		datum;
 		bool		isNull;
@@ -636,6 +683,11 @@ _bt_tuple_compare(Relation rel,
 				INVERT_COMPARE_RESULT(result);
 		}
 
+		/* Assert that key->skiptoatt is correct */
+		Assert(i >= key->skiptoatt || result == 0);
+		/* Record for caller that this attribute was compared */
+		key->lastcomparedatt = i;
+
 		/* if the keys are unequal, return the difference */
 		if (result != 0)
 			return result;
@@ -643,6 +695,12 @@ _bt_tuple_compare(Relation rel,
 		scankey++;
 	}
 
+	/*
+	 * Record for caller that comparison is resolved at a truncated/negative
+	 * infinity attribute, or heap TID attribute
+	 */
+	key->lastcomparedatt = ncmpkey + 1;
+
 	/*
 	 * Use the number of attributes as a tie-breaker, in order to treat
 	 * truncated attributes in index as minus infinity
@@ -1253,6 +1311,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	key.heapkeyspace = _bt_heapkeyspace(rel);
 	key.savebinsrch = key.restorebinsrch = false;
 	key.low = key.high = InvalidOffsetNumber;
+	key.lastcomparedatt = 1;
+	key.skiptoatt = 1;
 	key.nextkey = nextkey;
 	key.keysz = keysCount;
 	key.scantid = NULL;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index a4964dc22c..3637459902 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,6 +111,8 @@ _bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace)
 	res->heapkeyspace = assumeheapkeyspace || _bt_heapkeyspace(rel);
 	res->savebinsrch = res->restorebinsrch = false;
 	res->low = res->high = InvalidOffsetNumber;
+	res->lastcomparedatt = 1;
+	res->skiptoatt = 1;
 	res->nextkey = false;
 	res->keysz = Min(indnkeyatts, tupnatts);
 	res->scantid = res->heapkeyspace ? BTreeTupleGetHeapTID(itup) : NULL;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d7fa9e8c49..a2d9675259 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -407,6 +407,17 @@ typedef BTStackData *BTStack;
 
 typedef struct BTScanInsertData
 {
+	/*
+	 * Mutable state used by _bt_binsrch() for "dynamic prefix
+	 * truncation", an optimization that allows later _bt_search() comparisons
+	 * to skip earlier attributes that can no longer be unequal to scankey
+	 * values.  Used by both _bt_first() and _bt_doinsert().  "skiptoatt" may
+	 * increase when a descent of a tree eliminates additional whole
+	 * attributes from consideration.  It can never decrease.
+	 */
+	int			lastcomparedatt;
+	int			skiptoatt;
+
 	/*
 	 * Mutable state used by _bt_binsrch() to inexpensively repeat a binary
 	 * search on the leaf level when only scantid has changed.  Only used for
-- 
2.17.1