On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas <hlinn...@iki.fi> wrote: > Some comments on > v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below. > Mostly about code comments. In general, I think a round of copy-editing > the comments, to use simpler language, would do good. The actual code > changes look good to me.
I'm delighted that the code looks good to you, and makes sense overall. I worked hard to make the patch a natural adjunct to the existing code, which wasn't easy. > Seems confusing to first say assertively that "*bufptr contains the page > that the new tuple unambiguously belongs to", and then immediately go on > to list a whole bunch of exceptions. Maybe just remove "unambiguously". This is fixed in v14 of the patch series. > This happens very seldom, because you only get an incomplete split if > you crash in the middle of a page split, and that should be very rare. I > don't think we need to list more fine-grained conditions here, that just > confuses the reader. Fixed in v14. > > /* > > * _bt_useduplicatepage() -- Settle for this page of duplicates? > So, this function is only used for legacy pg_upgraded indexes. The > comment implies that, but doesn't actually say it. I made that more explicit in v14. > > /* > > * Get tie-breaker heap TID attribute, if any. Macro works with both pivot > > * and non-pivot tuples, despite differences in how heap TID is represented. > > #define BTreeTupleGetHeapTID(itup) ... I fixed up the comments above BTreeTupleGetHeapTID() significantly. > The comment claims that "all pivot tuples must be as of BTREE_VERSION > 4". I thought that all internal tuples are called pivot tuples, even on > version 3. In my mind, "pivot tuple" is a term that describes any tuple that contains a separator key, which could apply to any nbtree version. It's useful to have a distinct term (to not just say "separator key tuple") because Lehman and Yao think of separator keys as separate and distinct from downlinks. Internal page splits actually split *between* a separator key and a downlink. So nbtree internal page splits must split "inside a pivot tuple", leaving its separator on the left hand side (new high key), and its downlink on the right hand side (new minus infinity tuple). Pivot tuples may contain a separator key and a downlink, just a downlink, or just a separator key (sometimes this is implicit, and the block number is garbage). I am particular about the terminology because the pivot tuple vs. downlink vs. separator key thing causes a lot of confusion, particularly when you're using Lehman and Yao (or Lanin and Shasha) to understand how things work in Postgres. We wan't to have a broad term that refers to the tuples that describe the keyspace (pivot tuples), since it's often helpful to refer to them collectively, without seeming to contradict Lehman and Yao. > I think what this means to say is that this macro is only > used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only > have a heap TID in BTREE_VERSION 4 indexes. My high level approach to pg_upgrade/versioning is for index scans to *pretend* that every nbtree index (even on v2 and v3) has a heap attribute that actually makes the keys unique. The difference is that v4 gets to use a scantid, and actually rely on the sort order of heap TIDs, whereas pg_upgrade'd indexes "are not allowed to look at the heap attribute", and must never provide a scantid (they also cannot use the !minusinfkey optimization, but this is only an optimization that v4 indexes don't truly need). They always do the right thing (move left) on otherwise-equal pivot tuples, since they have no scantid. That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring about the version the index uses. It might be NULL for a pivot tuple in a v3 index, even though we imagine/pretend that it should have a value set. But that doesn't matter, because higher level code knows that !heapkeyspace indexes should never get a scantid (_bt_compare() does Assert() that they got that detail right, though). We "have no reason to peak", because we don't have a scantid, so index scans work essentially the same way, regardless of the version in use. There are a few specific cross-version things that we need think about outside of making sure that there is never a scantid (and !minusinfkey optimization is unused) in < v4 indexes, but these are all related to unique indexes. "Pretending that all indexes have a heap TID" is a very useful mental model. Nothing really changes, even though you might guess that changing the classic "Subtree S is described by Ki < v <= Ki+1" invariant would need to break code in _bt_binsrch()/_bt_compare(). Just pretend that the classic invariant was there since the Berkeley days, and don't do anything that breaks the useful illusion on versions before v4. > This macro (and many others in nbtree.h) is quite complicated. A static > inline function might be easier to read. I agree that the macros are complicated, but that seems to be because the rules are complicated. I'd rather leave the macros in place, and improve the commentary on the rules. > 'xlmeta.version' is set incorrectly. Oops. Fixed in v14. > I find this comment difficult to read. I suggest rewriting it to: > > /* > * The current Btree version is 4. That's what you'll get when you create > * a new index. I used your wording for this in v14, almost verbatim. > Now that the index tuple format becomes more complicated, I feel that > there should be some kind of an overview explaining the format. All the > information is there, in the comments in nbtree.h, but you have to piece > together all the details to get the overall picture. I wrote this to > keep my head straight: v14 uses your diagrams in nbtree.h, and expands some existing discussion of INCLUDE indexes/non-key attributes/tuple format. Let me know what you think. -- Peter Geoghegan