On 04/12/2018 05:10, Peter Geoghegan wrote:
Attached is v9, ...

I spent some time reviewing this. I skipped the first patch, to add a column to pg_depend, and I got through patches 2, 3 and 4. Impressive results, and the code looks sane.

I wrote a laundry list of little comments on minor things, suggested rewordings of comments etc. I hope they're useful, but feel free to ignore/override my opinions on any of those, as you see best.

But first, a few slightly bigger (medium-sized?) issues that caught my eye:

1. How about doing the BTScanInsertData refactoring as a separate commit, first? It seems like a good thing for readability on its own, and would slim the big main patch. (And make sure to credit Andrey for that idea in the commit message.)


2. In the "Treat heap TID as part of the nbtree key space" patch:

  *             Build an insertion scan key that contains comparison data from itup
  *             as well as comparator routines appropriate to the key datatypes.
  *
+ *             When itup is a non-pivot tuple, the returned insertion scan key is
+ *             suitable for finding a place for it to go on the leaf level.  When
+ *             itup is a pivot tuple, the returned insertion scankey is suitable
+ *             for locating the leaf page with the pivot as its high key (there
+ *             must have been one like it at some point if the pivot tuple
+ *             actually came from the tree).
+ *
+ *             Note that we may occasionally have to share lock the metapage, in
+ *             order to determine whether or not the keys in the index are expected
+ *             to be unique (i.e. whether or not heap TID is treated as a tie-breaker
+ *             attribute).  Callers that cannot tolerate this can request that we
+ *             assume that this is a heapkeyspace index.
+ *
  *             The result is intended for use with _bt_compare().
  */
-ScanKey
-_bt_mkscankey(Relation rel, IndexTuple itup)
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace)

This 'assumeheapkeyspace' flag feels awkward. What if the caller knows that it is a v3 index? There's no way to tell _bt_mkscankey() that. (There's no need for that, currently, but seems a bit weird.)

_bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which calls _bt_mkscankey(). It's holding a lock on the page being split. Do we risk deadlock by locking the metapage at the same time?

I don't have any great ideas on what to do about this, but it's awkward as it is. Can we get away without the new argument? Could we somehow arrange things so that rd_amcache would be guaranteed to already be set?


3. In the "Pick nbtree split points discerningly" patch

I find the different modes and the logic in _bt_findsplitloc() very hard to understand. I've spent a while looking at it now, and I think I have a vague understanding of what things it takes into consideration, but I don't understand why it performs those multiple stages, what each stage does, and how that leads to an overall strategy. I think a rewrite would be in order, to make that more understandable. I'm not sure what exactly it should look like, though.

If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or SINGLE_VALUE modes, it has to redo a lot of the work that was done in the DEFAULT mode already. That's probably not a big deal in practice, performance-wise, but I feel that it's another hint that some refactoring would be in order.

One idea on how to restructure that:

Make a single pass over all the offset numbers, considering a split at that location. Like the current code does. For each offset, calculate a "penalty" based on two factors:

* free space on each side
* the number of attributes in the pivot tuple, and whether it needs to store the heap TID

Define the penalty function so that having to add a heap TID to the pivot tuple is considered very expensive, more expensive than anything else, and truncating away other attributes gives a reward of some size.

However, naively computing the penalty upfront for every offset would be a bit wasteful. Instead, start from the middle of the page, and walk "outwards" towards both ends, until you find a "good enough" penalty.

Or something like that...


Now, the laundry list of smaller items:

----- laundry list begins -----

1st commit's commit message:

Make nbtree treat all index tuples as having a heap TID trailing key
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID, at least in principle.

What do you mean by "at least in principle"?

Secondary index insertions will descend
straight to the leaf page that they'll insert on to (unless there is a
concurrent page split).

What is a "Secondary" index insertion?

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will generally truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially when there are several attributes in an index.

Suggestion: "when there are several attributes in an index" -> "in a multi-column index"

+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+       Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+

What is "low-context fashion"?

+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  Every attribute should have an
+ * entry during insertion, though not necessarily when a regular index scan
+ * uses an insertion scankey to find an initial leaf page.

Suggestion: Reword to something like "During insertion, there must be a scan key for every attribute, but when starting a regular index scan, some can be omitted."

+typedef struct BTScanInsertData
+{
+       /*
+        * Mutable state used by _bt_binsrch() to inexpensively repeat a binary
+        * search on the leaf level when only scantid has changed.  Only used for
+        * insertions where _bt_check_unique() is called.
+        */
+       bool            savebinsrch;
+       bool            restorebinsrch;
+       OffsetNumber low;
+       OffsetNumber high;
+
+       /* State used to locate a position at the leaf level */
+       bool            heapkeyspace;
+       bool            nextkey;
+       ItemPointer scantid;            /* tiebreaker for scankeys */
+       int                     keysz;                  /* Size of scankeys */
+       ScanKeyData scankeys[INDEX_MAX_KEYS];   /* Must appear last */
+} BTScanInsertData;

It would feel more natural to me, to have the mutable state *after* the other fields. Also, it'd feel less error-prone to have 'scantid' be ItemPointerData, rather than a pointer to somewhere else. The 'heapkeyspace' name isn't very descriptive. I understand that it means that the heap TID is part of the keyspace. Not sure what to suggest instead, though.

+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap item
+pointer order.

Suggestion: "item pointer" -> TID, to use consistent terms.

We don't use btree keys to disambiguate downlinks from the
+internal pages during a page split, though: only one entry in the parent
+level will be pointing at the page we just split, so the link fields can be
+used to re-find downlinks in the parent via a linear search.  (This is
+actually a legacy of when heap TID was not treated as part of the keyspace,
+but it does no harm to keep things that way.)

I don't understand this paragraph.

+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to distinguish tuples which don't point to heap tuples, that are
+used only for tree navigation.  Pivot tuples include all tuples on
+non-leaf pages and high keys on leaf pages.

Suggestion: reword to "All tuples on non-leaf pages, and high keys on leaf pages, are pivot tuples"

Note that pivot tuples are
+only used to represent which part of the key space belongs on each page,
+and can have attribute values copied from non-pivot tuples that were
+deleted and killed by VACUUM some time ago.  A pivot tuple may contain a
+"separator" key and downlink, just a separator key (in practice the
+downlink will be garbage), or just a downlink.

Rather than store garbage, set it to zeros?

+Lehman and Yao require that the key range for a subtree S is described by
+Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page.
+A search where the scan key is equal to a pivot tuple in an upper tree
+level must descend to the left of that pivot to ensure it finds any equal
+keys.  Pivot tuples are always a _strict_ lower bound on items on their
+downlink page; the equal item(s) being searched for must therefore be to
+the left of that downlink page on the next level down.  (It's possible to
+arrange for internal page tuples to be strict lower bounds in all cases
+because their values come from leaf tuples, which are guaranteed unique by
+the use of heap TID as a tiebreaker.  We also make use of hard-coded
+negative infinity values in internal pages.  Rightmost pages don't have a
+high key, though they conceptually have a positive infinity high key).  A
+handy property of this design is that there is never any need to
+distinguish between equality in the case where all attributes/keys are used
+in a scan from equality where only some prefix is used.

"distinguish between ... from ..." doesn't sound like correct grammar. Suggestion: "distinguish between ... and ...", or just "distinguish ... from ...". Or rephrase the sentence some other way.

+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split when the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left page,
+and the first index tuple on the post-split right page as belonging on the
+right page.

That's a very long sentence.

                         * Since the truncated tuple is probably smaller than the
                         * original, it cannot just be copied in place (besides, we want
                         * to actually save space on the leaf page).  We delete the
                         * original high key, and add our own truncated high key at the
                         * same offset.  It's okay if the truncated tuple is slightly
                         * larger due to containing a heap TID value, since pivot tuples
                         * are treated as a special case by _bt_check_third_page().

By "treated as a special case", I assume that _bt_check_third_page() always reserves some space for that? Maybe clarify that somehow.

_bt_truncate():
This is possible when there are
 * attributes that follow an attribute in firstright that is not equal to the
 * corresponding attribute in lastleft (equal according to insertion scan key
 * semantics).

I can't comprehend that sentence. Simpler English, maybe add an example, please.

/*
 * _bt_leave_natts - how many key attributes to leave when truncating.
 *
 * Caller provides two tuples that enclose a split point.  CREATE INDEX
 * callers must pass build = true so that we may avoid metapage access.  (This
 * is okay because CREATE INDEX always creates an index on the latest btree
 * version.)
 *
 * This can return a number of attributes that is one greater than the
 * number of key attributes for the index relation.  This indicates that the
 * caller must use a heap TID as a unique-ifier in new pivot tuple.
 */
static int
_bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
                                bool build)

IMHO "keep" would sound better here than "leave".

+       if (needheaptidspace)
+               ereport(ERROR,
+                               (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+                                errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+                                               itemsz, BTREE_VERSION, BTMaxItemSize(page),
+                                               RelationGetRelationName(rel)),
+                                errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+                                                  ItemPointerGetBlockNumber(&newtup->t_tid),
+                                                  ItemPointerGetOffsetNumber(&newtup->t_tid),
+                                                  RelationGetRelationName(heap)),
+                                errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+                                                "Consider a function index of an MD5 hash of the value, "
+                                                "or use full text indexing."),
+                                errtableconstraint(heap,
+                                                                       RelationGetRelationName(rel))));
+       else
+               ereport(ERROR,
+                               (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+                                errmsg("index row size %zu exceeds btree version 3 maximum %zu for index \"%s\"",
+                                               itemsz, BTMaxItemSizeNoHeapTid(page),
+                                               RelationGetRelationName(rel)),
+                                errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+                                                  ItemPointerGetBlockNumber(&newtup->t_tid),
+                                                  ItemPointerGetOffsetNumber(&newtup->t_tid),
+                                                  RelationGetRelationName(heap)),
+                                errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+                                                "Consider a function index of an MD5 hash of the value, "
+                                                "or use full text indexing."),
+                                errtableconstraint(heap,
+                                                                       RelationGetRelationName(rel))));

Could restructure this to avoid having two almost identical strings to translate.

 #define BTREE_METAPAGE 0               /* first page is meta */
 #define BTREE_MAGIC            0x053162        /* magic number of btree pages */
-#define BTREE_VERSION  3               /* current version number */
+#define BTREE_VERSION  4               /* current version number */
 #define BTREE_MIN_VERSION      2       /* minimal supported version number */
+#define BTREE_META_VERSION     3       /* minimal version with all meta fields */

BTREE_META_VERSION is a strange name for version 3. I think this deserves a more verbose comment, above these #defines, to list all the versions and their differences.

v9-0003-Pick-nbtree-split-points-discerningly.patch commit message:
Add infrastructure to determine where the earliest difference appears
among a pair of tuples enclosing a candidate split point.

I don't understand this sentence.

_bt_findsplitloc() is also taught to care about the case where there are
many duplicates, making it hard to find a distinguishing split point.
_bt_findsplitloc() may even conclude that it isn't possible to avoid
filling a page entirely with duplicates, in which case it packs pages
full of duplicates very tightly.

Hmm. Is the assumption here that if a page is full of duplicates, there will be no more insertions into that page? Why?

The number of cycles added is not very noticeable, which is important,
since _bt_findsplitloc() is run while an exclusive (leaf page) buffer
lock is held.  We avoid using authoritative insertion scankey
comparisons, unlike suffix truncation proper.

What do you do instead, then? memcmp? (Reading the patch, yes.) Suggestion: "We use a faster binary comparison, instead of a proper datatype-aware comparison, for speed".

Aside from performance, it would feel inappropriate to call user-defined code while holding a buffer lock, anyway.

+There is sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already, without provoking a split.

I'd leave out the ", without provoking a split" part. Or maybe reword to "if you pretend that the incoming tuple fit and was placed on the page already".

+Choosing the split point between two index tuples with differences that
+appear as early as possible results in truncating away as many suffix
+attributes as possible.

It took me a while to understand what the "appear as early as possible" means here. It's talking about a multi-column index, and about finding a difference in one of the leading key columns. Not, for example, about finding a split point early in the page.

An array of acceptable candidate split points
+(points that balance free space on either side of the split sufficiently
+well) is assembled in a pass over the page to be split, sorted by delta.
+An optimal split point is chosen during a pass over the assembled array.
+There are often several split points that allow the maximum number of
+attributes to be truncated away -- we choose whichever one has the lowest
+free space delta.

Perhaps we should leave out these details in the README, and explain this in the comments of the picksplit-function itself? In the README, I think a more high-level description of what things are taken into account when picking the split point, would be enough.

+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.

Suggestion: reword to "... , but that isn't the only benefit" ?

There are cases where suffix truncation can
+leave a B-Tree significantly smaller in size than it would have otherwise
+been without actually making any pivot tuple smaller due to restrictions
+relating to alignment.

Suggestion: reword to "... smaller in size than it would otherwise be, without ..."

and "without making any pivot tuple *physically* smaller, due to alignment".

This sentence is a bit of a cliffhanger: what are those cases, and how is that possible?

The criteria for choosing a leaf page split point
+for suffix truncation is also predictive of future space utilization.

How so? What does this mean?

+Furthermore, even truncation that doesn't make pivot tuples smaller still
+prevents pivot tuples from being more restrictive than truly necessary in
+how they describe which values belong on which pages.

Ok, I guess these sentences resolve the cliffhanger I complained about. But this still feels like magic. When you split a page, all of the keyspace must belong on the left or the right page. Why does it make a difference to space utilization, where exactly you split the key space?

+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  The split point that implies a downlink be inserted in
+the parent that's the smallest one available within an acceptable range of
+the fillfactor-wise optimal split point is chosen.  This idea also comes
+from the Prefix B-Tree paper.  This process has much in common with to what
+happens at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller and less
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, less discriminating
+pivot tuples end up in the root page, delaying root page splits.

Ok, so this explains it further, I guess. I find this paragraph difficult to understand, though. The important thing here is the idea that some split points are more "discriminating" than others, but I think it needs some further explanation. What makes a split point more discriminating? Maybe add an example.

+Suffix truncation may make a pivot tuple *larger* than the non-pivot/leaf
+tuple that it's based on (the first item on the right page), since a heap
+TID must be appended when nothing else distinguishes each side of a leaf
+split.  Truncation cannot simply reuse the leaf level representation: we
+must append an additional attribute, rather than incorrectly leaving a heap
+TID in the generic IndexTuple item pointer field.  (The field is already
+used by pivot tuples to store their downlink, plus some additional
+metadata.)

That's not really the fault of suffix truncation as such, but the process of turning a leaf tuple into a pivot tuple. It would happen even if you didn't truncate anything.

I think this point, that we have to store the heap TID differently in pivot tuples, would deserve a comment somewhere else, too. While reading the patch, I didn't realize that that's what we're doing, until I read this part of the README, even though I saw the new code to deal with heap TIDs elsewhere in the code. Not sure where, maybe in BTreeTupleGetHeapTID().

+Adding a heap TID attribute during a leaf page split should only occur when
+the page to be split is entirely full of duplicates (the new item must also
+be a duplicate).  The logic for selecting a split point goes to great
+lengths to avoid heap TIDs in pivots --- "many duplicates" mode almost
+always manages to pick a split point between two user-key-distinct tuples,
+accepting a completely lopsided split if it must.

This is the first mention of "many duplicates" mode. Maybe just say "_bt_findsplitloc() almost always ..." or "The logic for selecting a split point goes to great lengths to avoid heap TIDs in pivots, and almost always manages to pick a split point between two user-key-distinct tuples, accepting a completely lopsided split if it must."

Once appending a heap
+TID to a split's pivot becomes completely unavoidable, there is a fallback
+strategy --- "single value" mode is used, which makes page splits pack the
+new left half full by using a high fillfactor.  Single value mode leads to
+better overall space utilization when a large number of duplicates are the
+norm, and thereby also limits the total number of pivot tuples with an
+untruncated heap TID attribute.

This assumes that tuples are inserted in increasing TID order, right? Seems like a valid assumption, no complaints there, but it's an assumption nevertheless.

I'm not sure if this level of detail is worthwhile in the README. This logic on deciding the split point is all within the _bt_findsplitloc() function, so maybe put this explanation there. In the README, a more high-level explanation of what things _bt_findsplitloc() considers, should be enough.

_bt_findsplitloc(), and all its helper structs and subroutines, are about 1000 lines of code now, and big part of nbtinsert.c. Perhaps it would be a good idea to move it to a whole new nbtsplitloc.c file? It's a very isolated piece of code.

In the comment on _bt_leave_natts_fast():

+ * Testing has shown that an approach involving treating the tuple as a
+ * decomposed binary string would work almost as well as the approach taken
+ * here.  It would also be faster.  It might actually be necessary to go that
+ * way in the future, if suffix truncation is made sophisticated enough to
+ * truncate at a finer granularity (i.e. truncate within an attribute, rather
+ * than just truncating away whole attributes).  The current approach isn't
+ * markedly slower, since it works particularly well with the "perfect
+ * penalty" optimization (there are fewer, more expensive calls here).  It
+ * also works with INCLUDE indexes (indexes with non-key attributes) without
+ * any special effort.

That's an interesting tidbit, but I'd suggest just removing this comment altogether. It's not really helping to understand the current implementation.

v9-0005-Add-high-key-continuescan-optimization.patch commit message:

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.

.. but we're missing the other optimizations that make it more effective, so it probably won't do much for v3 indexes. Does it make them slower? It's probably acceptable, even if there's a tiny regression, but I'm curious.

----- laundry list ends -----

- Heikki
