On Tue, Apr 3, 2018 at 7:02 AM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote: > Great, I'm looking forward your feedback.
I took a look at V11 (0001-Covering-core-v11.patch, 0002-Covering-btree-v11.patch, 0003-Covering-amcheck-v11.patch, 0004-Covering-natts-v11.patch) today. * What's a pivot tuple? This is the same thing as what I call a "separator key", I think -- you're talking about the set of IndexTuples including all high keys (including leaf level high keys), as well as internal items (downlinks). I think that it's a good idea to have a standard word that describes this set of keys, to formalize the two categories (pivot tuples vs. tuples that point to the heap itself). Your word is just as good as mine, so we can go with that. Let's put this somewhere central. Maybe in the nbtree README, and/or nbtree.h. Also, verify_nbtree.c should probably get some small explanation of pivot tuples. offset_is_negative_infinity() is a nice place to mention pivot tuples, since that already has a bit of high-level commentary about them. * Compiler warning: /home/pg/postgresql/root/build/../source/src/backend/catalog/index.c: In function ‘index_create’: /home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:476:45: warning: ‘opclassTup’ may be used uninitialized in this function [-Wmaybe-uninitialized] if (keyType == ANYELEMENTOID && opclassTup->opcintype == ANYARRAYOID) ^ /home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:332:19: note: ‘opclassTup’ was declared here Form_pg_opclass opclassTup; ^ * Your new amcheck tests should definitely use the new "heapallindexed" option. There were a number of bugs I can remember seeing in earlier versions of this patch that that would catch (probably not during regression tests, but let's at least do that much). * The modified amcheck contrib regression tests don't actually pass. I see these unexpected errors: 10037/2018-04-03 16:31:12 PDT ERROR: wrong number of index tuple attributes for index "bttest_multi_idx" 10037/2018-04-03 16:31:12 PDT DETAIL: Index tid=(290,2) points to index tid=(289,2) page lsn=0/162407A8. 10037/2018-04-03 16:31:12 PDT ERROR: wrong number of index tuple attributes for index "bttest_multi_idx" 10037/2018-04-03 16:31:12 PDT DETAIL: Index tid=(290,2) points to index tid=(289,2) page lsn=0/162407A8. * I see that we use "- 1" with attribute number, like this: > +/* Get number of attributes in B-tree index tuple */ > +#define BtreeTupGetNAtts(itup, index) \ > + ( \ > + (itup)->t_info & INDEX_ALT_TID_MASK ? \ > + ( \ > + AssertMacro((ItemPointerGetOffsetNumber(&(itup)->t_tid) & > BT_RESERVED_OFFSET_MASK) == 0), \ > + ItemPointerGetOffsetNumber(&(itup)->t_tid) & > BT_N_KEYS_OFFSET_MASK - 1 \ > + ) \ > + : \ > + IndexRelationGetNumberOfAttributes(index) \ > + ) Is this left behind from before you decided to adopt INDEX_ALT_TID_MASK? Is it your intention here to encode InvalidOffsetNumber() without tripping up assertions? Or is it something else? Maybe we should follow the example of GinItemPointerGetOffsetNumber(), and use ItemPointerGetOffsetNumberNoCheck() instead of ItemPointerGetOffsetNumber(). What do you think? That would allow us to get rid of the -1 thing, which might be nice. Just because we use ItemPointerGetOffsetNumberNoCheck() in places that use an alternative offset representation does not mean we need to use it in existing places. If existing places had a regression tests failure because of this, that would probably be due to a real bug. No? * ISTM that the "points to index tid=(289,2)" part of the message just shown would be a bit clearer if I didn't have to know that 2 actually means 1 when we talk about the pointed-to offset (yeah, it will probably become unclear in the future when we start using the reserved offset status bits, but why not make the low bits of offset simple/logical way?). Your new amcheck error message should spell it out (it should say the number of attributes indicated by the offset, if any) -- regardless of what we do about the "must apply - 1 to offset" question. * "Minus infinity" items do not have the new status bit INDEX_ALT_TID_MASK set in at least some cases. They should. * _bt_sortaddtup() should not do "trunctuple.t_info = sizeof(IndexTupleData)", since that destroys useful information. Maybe that's the reason for the last bug? * Ditto for _bt_pgaddtup(). * Why expose _bt_pgaddtup() so that nbtsort.c/_bt_buildadd() can call it? The only reason we have _bt_sortaddtup() is because we cannot trust P_RIGHTMOST() within _bt_pgaddtup() when called in the context of CREATE INDEX (from nbtsort.c/_bt_buildadd()). There is no real change needed, because _bt_sortaddtup() knows that it's inserting on a non-rightmost page both without this patch, and when this patch needs to truncate and then add the high key back. It's clear that you can just use _bt_sortaddtup() (and leave _bt_pgaddtup() private) because _bt_sortaddtup() is only different to _bt_pgaddtup() when !P_ISLEAF(), but we only call _bt_pgaddtup() when P_ISLEAF(). Or have I missed something? * For inserts, this patch performs an extra truncation step on the same high key that we'd use with a plain (non-covering/include) index. That's pretty clean. But it seems more complicated for nbtsort.c/_bt_buildadd(). I think that a comment should say that we cannot just rearrange item pointers for high key on the old page when we also truncate, because overwriting the P_HIKEY position ItemId with the old page's former final ItemId (whose tuple ended up becoming the first tuple on new/right page) fails to actually save any space. We need to truly shift around IndexTuples on the page in order to save space (both PageIndexTupleDelete() and PageAddItem() end up shifting both the ItemId array and some IndexTuple space). Also, maybe say that the performance here really isn't so bad, because we reclaim IndexTuple space close to the middle of the hole in the page with our PageIndexTupleDelete(), and then use almost the *same* space within PageAddItem(). There is not actually that much physical shifting around for IndexTuples. It turns out that it's not that different. (You can probably find a better, more succinct way of putting this -- I'm tired now.) * I suggest that you teach _bt_check_natts() to expect zero attributes for "minus infinity" items. It looks like amcheck contrib regression tests don't pass because you don't look for that (P_FIRSTDATAKEY() is the "minus infinity" item on internal pages). * bt_target_page_check() should also have a !P_ISLEAF() check, since with a covering index every tuple will have INDEX_ALT_TID_MASK. This should call _bt_check_natts() for each item, including the "minus infinity" items. * "minus infinity" items don't have the right number of attributes set, in at least some cases that I saw. The number matched other internal items, and wasn't 0 or whatever. Maybe the ItemPointerGetOffsetNumberNoCheck() idea would leave things so that it actually could be 0 safely, rather than natts + 1 as you said, which would be nice.) * I would reorder the comment to match the order of the code: > + /* > + * Pivot tuples stored in non-leaf pages and hikeys of leaf pages should > + * have nkeyatts number of attributes. While regular tuples of leaf pages > + * should have natts number of attributes. > + */ > + if (P_ISLEAF(opaque) && offnum >= P_FIRSTDATAKEY(opaque)) > + return (BtreeTupGetNAtts(itup, index) == natts); > + else > + return (BtreeTupGetNAtts(itup, index) == nkeyatts); * Please add BT_N_KEYS_OFFSET_MASK + INDEX_MAX_KEYS static assertion. Maybe add it to _bt_check_natts(). * README-SSI says: * The effects of page splits, overflows, consolidations, and removals must be carefully reviewed to ensure that predicate locks aren't "lost" during those operations, or kept with pages which could get re-used for different parts of the index. Do we need to worry about that here? I guess not, because this is just like having many duplicates. But a note just above the _bt_doinsert() call to CheckForSerializableConflictIn() might be a good idea. That's all I have for today. -- Peter Geoghegan