On Wed, Apr 4, 2018 at 3:09 PM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> Thank you for review! Revised patchset is attached.

Cool.

* btree_xlog_split() still has this code:

    /*
     * On leaf level, the high key of the left page is equal to the first key
     * on the right page.
     */
    if (isleaf)
    {
        ItemId      hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

        left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
        left_hikeysz = ItemIdGetLength(hiItemId);
    }

However, we never fail to store the high key now, even at the leaf
level, because of this change to the corresponding point in
_bt_split():

> -    /* Log left page */
> -    if (!isleaf)
> -    {
> -        /*
> -         * We must also log the left page's high key, because the right
> -         * page's leftmost key is suppressed on non-leaf levels. Show it
> -         * as belonging to the left page buffer, so that it is not stored
> -         * if XLogInsert decides it needs a full-page image of the left
> -         * page.
> -         */
> -        itemid = PageGetItemId(origpage, P_HIKEY);
> -        item = (IndexTuple) PageGetItem(origpage, itemid);
> -        XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
> -    }
> +    /*
> +     * We must also log the left page's high key. There are two reasons
> +     * for that: right page's leftmost key is suppressed on non-leaf levels,
> +     * in covering indexes, included columns are truncated from high keys.
> +     * For simplicity, we don't distinguish these cases, but log the high
> +     * key every time. Show it as belonging to the left page buffer, so
> +     * that it is not stored if XLogInsert decides it needs a full-page
> +     * image of the left page.
> +     */
> +    itemid = PageGetItemId(origpage, P_HIKEY);
> +    item = (IndexTuple) PageGetItem(origpage, itemid);
> +    XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));

So should we remove the first block of code?

Note also that this existing comment has been made obsolete:

    /* don't release the buffer yet; we touch right page's first item below */

    /* Now reconstruct left (original) sibling page */
    if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)

Maybe we *should* release the right sibling buffer at the point of the
comment now?
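To be concrete, here is a rough sketch of the direction I have in mind
for the redo side. It is only an illustration, not tested: it assumes
the high key can always be taken from the data registered for block 0,
it leaves out skipping over a new item that may have been registered
ahead of the high key, and it is wrapped in a hypothetical helper just
to keep the sketch self-contained:

    #include "postgres.h"
    #include "access/itup.h"
    #include "access/xlogreader.h"

    /*
     * Hypothetical helper, for illustration only: with _bt_split() always
     * registering the left page's high key, btree_xlog_split() could take
     * it from the block 0 data in every case, instead of peeking at the
     * right page when isleaf.
     */
    static void
    get_left_hikey_from_record(XLogReaderState *record,
                               IndexTuple *left_hikey, Size *left_hikeysz)
    {
        char       *datapos;
        Size        datalen;

        datapos = XLogRecGetBlockData(record, 0, &datalen);

        /* ... skip the new item here, if one was logged for the left page ... */

        *left_hikey = (IndexTuple) datapos;
        *left_hikeysz = MAXALIGN(IndexTupleSize(*left_hikey));
        Assert(*left_hikeysz <= datalen);
    }

If something along those lines works out, the right sibling buffer
really could be released at the point of the now-obsolete comment.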
* _bt_mkscankey() should assert that the IndexTuple has the correct
number of attributes. I don't expect you to change routines like
_bt_mkscankey() so they actually respect the number of attributes from
BTreeTupGetNAtts(), rather than just relying on
IndexRelationGetNumberOfKeyAttributes(). However, an assertion seems
well worthwhile. It's a big reason for having BTreeTupGetNAtts(). This
also lets you get rid of at least one assertion from _bt_doinsert(), I
think.

* _bt_isequal() should assert that the IndexTuple was not truncated.

* The order could be switched here:

> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
>      if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
>          return 1;
>
> +    /*
> +     * Check tuple has correct number of attributes.
> +     */
> +    if (unlikely(!_bt_check_natts(rel, page, offnum)))
> +    {
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INTERNAL_ERROR),
> +                 errmsg("tuple has wrong number of attributes in index \"%s\"",
> +                        RelationGetRelationName(rel))));
> +    }

In principle, we should also check _bt_check_natts() for "minus
infinity" items, just like you did within verify_nbtree.c. Also, there
is no need for parentheses here.
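For clarity, here is roughly how I picture the reordered check -- just
my sketch, not the patch, reusing the names already in scope in
_bt_compare() and keeping your error message unchanged:

    /*
     * Verify the attribute count before the "minus infinity" fast path,
     * so that the first data item on an internal page is covered too.
     * The braces around the single ereport() statement aren't needed.
     */
    if (unlikely(!_bt_check_natts(rel, page, offnum)))
        ereport(ERROR,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("tuple has wrong number of attributes in index \"%s\"",
                        RelationGetRelationName(rel))));

    /* first data item on an internal page is treated as "minus infinity" */
    if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
        return 1;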
* Maybe _bt_truncate_tuple() should assert that the caller has not
tried to truncate a tuple that has already been truncated. I'm not
sure if our assertion should be quite that strong, but I think that
might be good, because in general we only need to truncate on the leaf
level -- truncating at any other level of the tree (e.g. doing
traditional suffix truncation) is always subtly wrong. What we
definitely should do, at a minimum, is make sure that attempting to
truncate a tuple to 2 attributes when it already has 0 attributes
fails with an assertion failure. Can you try adding the strong
assertion (truncate only once) to _bt_truncate_tuple()? Maybe that's
not possible, but it seems worth a try.

* I suggest we invent a new flag for 0x2000 within itup.h, to replace
"/* bit 0x2000 is reserved for index-AM specific usage */". We can
call it INDEX_AM_RESERVED_BIT. Then, we can change INDEX_ALT_TID_MASK
to use this rather than a raw 0x2000. We can do the same for
INDEX_MOVED_BY_SPLIT_MASK within hash.h, too. I find this neater.

* We should "use" one of the 4 new status bits that are available from
an offset (for INDEX_ALT_TID_MASK index tuples) for future use by leaf
index tuples. Perhaps call it BT_ALT_TID_NONPIVOT. I guess you could
say that I want us to reserve one of our 4 reserve bits.
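Concretely, I am picturing something like the following. Treat it as a
sketch only: the header placement and especially the
BT_ALT_TID_NONPIVOT value are illustrative, not something the patch has
to use verbatim:

    /* itup.h: name the bit instead of "reserved for index-AM specific usage" */
    #define INDEX_AM_RESERVED_BIT       0x2000

    /* nbtree.h: alternative TID meaning, defined in terms of the named bit */
    #define INDEX_ALT_TID_MASK          INDEX_AM_RESERVED_BIT

    /* hash.h: same treatment for the hash AM's use of the bit */
    #define INDEX_MOVED_BY_SPLIT_MASK   INDEX_AM_RESERVED_BIT

    /*
     * nbtree.h: reserve one of the 4 status bits available in the offset
     * number of an INDEX_ALT_TID_MASK tuple for future use by leaf
     * (non-pivot) index tuples.  0x1000 is a placeholder value.
     */
    #define BT_ALT_TID_NONPIVOT         0x1000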
* I think that you could add to this:

> +++ b/src/backend/access/nbtree/README
> @@ -590,6 +590,10 @@ original search scankey is consulted as each index entry is sequentially
>  scanned to decide whether to return the entry and whether the scan can
>  stop (see _bt_checkkeys()).
>
> +We use term "pivot" index tuples to distinguish tuples which don't point
> +to heap tuples, but rather used for tree navigation. Pivot tuples includes
> +all tuples on non-leaf pages and high keys on leaf pages.

I like what you came up with, and where you put it, but I would add
another few sentences:

"Note that pivot index tuples are only used to represent which part of
the key space belongs on each page, and can have attribute values
copied from non-pivot tuples that were deleted and killed by VACUUM
some time ago. In principle, we could truncate away attributes that
are not needed for a page high key during a leaf page split, provided
that the remaining attributes distinguish the last index tuple on the
post-split left page as belonging on the left page, and the first
index tuple on the post-split right page as belonging on the right
page. This optimization is sometimes called suffix truncation, and may
appear in a future release. Since the high key is subsequently reused
as the downlink in the parent page for the new right page, suffix
truncation can increase index fan-out considerably by keeping pivot
tuples short. INCLUDE indexes similarly truncate away non-key
attributes at the time of a leaf page split, increasing fan-out."

> Good point. Tests with "heapallindexed" were added. I also find that
> it's useful to check both index built by sorting, and index built by
> insertions, because there are different ways of forming tuples.

Right. It's a good cross-check for things like that. We'll have to
teach bt_tuple_present_callback() to normalize the representation in
some way for the BT_ALT_TID_NONPIVOT case in the future. But it
already talks about normalizing for reasons like this, so that's okay.

* I think you should add a note about BT_ALT_TID_NONPIVOT to
bt_tuple_present_callback(), though. If it cannot be sure that every
non-pivot tuple will have the same representation, amcheck will have
to normalize to the most flexible representation before hashing.

> Ok. I've tried to remove both assertions and "+1" hack. That works
> for me. However, I've to touch a lot of places, not sure if that's a
> problem.

Looks good to me. If it makes an assertion fail, that's probably a
good thing, because it would have been broken before anyway.

* You missed this comment, which is now not accurate:

> + * It's possible that index tuple has zero attributes (leftmost item of
> + * iternal page). And we have assertion that offset number is greater or equal
> + * to 1. This is why we store (number_of_attributes + 1) in offset number.
> + */

I can see that it is actually 0 for a minus infinity item, which is
good.

> I wrote some comment there. Please, check it.

The nbtsort.c comments could maybe do with some tweaks from a native
speaker, but look correct.

> Regarding !P_ISLEAF(), I think we should check every item on both
> leaf and non-leaf pages. I think that is how the code now works,
> unless I'm missing something.

It does, and should. Thanks.

> Thanks for pointing. Since there are now three cases including
> handling of "minus infinity" items, comment is now split to three.

That looks good. Thanks.

Right now, it looks like every B-Tree index could use
INDEX_ALT_TID_MASK, regardless of whether or not it's an INCLUDE
index. I think that that's fine, but let's say so in the paragraph
that introduces INDEX_ALT_TID_MASK. This patch establishes that any
nbtree pivot tuple could have INDEX_ALT_TID_MASK set, and that's
something that can be expected. It's also something that might not be
set when pg_upgrade was used, but that's fine too.

> I don't see relations between this patchset and SSI. We just
> change representation of some index tuples in pages. However,
> we didn't change the order of page modification, the order
> of page lookup and so on. Yes, we change size of some tuples,
> but B-tree already worked with tuples of variable sizes. So, the fact
> that tuples now have different size shouldn't affect SSI. Right now,
> I'm not sure if CheckForSerializableConflictIn() just above the
> _bt_doinsert() is good idea. But even if so, I think that should be
> a subject of separate patch.

My point was that nothing changes, because we already use what
_bt_doinsert() calls the "first valid" page. Maybe just add: "(This
reasoning also applies to INCLUDE indexes, whose extra attributes are
not considered part of the key space.)".

That's it for today.

-- 
Peter Geoghegan