Re: WIP: Covering + unique indexes.

Peter Geoghegan Tue, 03 Apr 2018 20:10:10 -0700

On Tue, Apr 3, 2018 at 7:02 AM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> Great, I'm looking forward your feedback.


I took a look at V11 (0001-Covering-core-v11.patch,
0002-Covering-btree-v11.patch, 0003-Covering-amcheck-v11.patch,
0004-Covering-natts-v11.patch) today.

* What's a pivot tuple?

This is the same thing as what I call a "separator key", I think --
you're talking about the set of IndexTuples including all high keys
(including leaf level high keys), as well as internal items
(downlinks). I think that it's a good idea to have a standard word
that describes this set of keys, to formalize the two categories
(pivot tuples vs. tuples that point to the heap itself). Your word is
just as good as mine, so we can go with that.

Let's put this somewhere central. Maybe in the nbtree README, and/or
nbtree.h. Also, verify_nbtree.c should probably get some small
explanation of pivot tuples. offset_is_negative_infinity() is a nice
place to mention pivot tuples, since that already has a bit of
high-level commentary about them.

* Compiler warning:

/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:
In function ‘index_create’:
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:476:45:
warning: ‘opclassTup’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
   if (keyType == ANYELEMENTOID && opclassTup->opcintype == ANYARRAYOID)
                                             ^
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:332:19:
note: ‘opclassTup’ was declared here
   Form_pg_opclass opclassTup;
                   ^

* Your new amcheck tests should definitely use the new
"heapallindexed" option. There were a number of bugs I can remember
seeing in earlier versions of this patch that that would catch
(probably not during regression tests, but let's at least do that
much).

* The modified amcheck contrib regression tests don't actually pass. I
see these unexpected errors:

10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.
10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.

* I see that we use "- 1" with attribute number, like this:

> +/* Get number of attributes in B-tree index tuple */
> +#define BtreeTupGetNAtts(itup, index)  \
> +   ( \
> +       (itup)->t_info & INDEX_ALT_TID_MASK ? \
> +       ( \
> +           AssertMacro((ItemPointerGetOffsetNumber(&(itup)->t_tid) & 
> BT_RESERVED_OFFSET_MASK) == 0), \
> +           ItemPointerGetOffsetNumber(&(itup)->t_tid) & 
> BT_N_KEYS_OFFSET_MASK - 1 \
> +       ) \
> +       : \
> +       IndexRelationGetNumberOfAttributes(index) \
> +   )

Is this left behind from before you decided to adopt
INDEX_ALT_TID_MASK? Is it your intention here to encode
InvalidOffsetNumber() without tripping up assertions? Or is it
something else?

Maybe we should follow the example of GinItemPointerGetOffsetNumber(),
and use ItemPointerGetOffsetNumberNoCheck() instead of
ItemPointerGetOffsetNumber(). What do you think? That would allow us
to get rid of the -1 thing, which might be nice. Just because we use
ItemPointerGetOffsetNumberNoCheck() in places that use an alternative
offset representation does not mean we need to use it in existing
places. If existing places had a regression tests failure because of
this, that would probably be due to a real bug. No?

* ISTM that the "points to index tid=(289,2)" part of the message just
shown would be a bit clearer if I didn't have to know that 2 actually
means 1 when we talk about the pointed-to offset (yeah, it will
probably become unclear in the future when we start using the reserved
offset status bits, but why not make the low bits of offset
simple/logical way?). Your new amcheck error message should spell it
out (it should say the number of attributes indicated by the offset,
if any) -- regardless of what we do about the "must apply - 1 to
offset" question.

* "Minus infinity" items do not have the new status bit
INDEX_ALT_TID_MASK set in at least some cases. They should.

* _bt_sortaddtup() should not do "trunctuple.t_info =
sizeof(IndexTupleData)", since that destroys useful information. Maybe
that's the reason for the last bug?

* Ditto for _bt_pgaddtup().

* Why expose _bt_pgaddtup() so that nbtsort.c/_bt_buildadd() can call
it? The only reason we have _bt_sortaddtup() is because we cannot
trust P_RIGHTMOST() within _bt_pgaddtup() when called in the context
of CREATE INDEX (from nbtsort.c/_bt_buildadd()). There is no real
change needed, because _bt_sortaddtup() knows that it's inserting on a
non-rightmost page both without this patch, and when this patch needs
to truncate and then add the high key back.

It's clear that you can just use _bt_sortaddtup() (and leave
_bt_pgaddtup() private) because _bt_sortaddtup() is only different to
_bt_pgaddtup() when !P_ISLEAF(), but we only call _bt_pgaddtup() when
P_ISLEAF(). Or have I missed something?

* For inserts, this patch performs an extra truncation step on the
same high key that we'd use with a plain (non-covering/include) index.
That's pretty clean. But it seems more complicated for
nbtsort.c/_bt_buildadd(). I think that a comment should say that we
cannot just rearrange item pointers for high key on the old page when
we also truncate, because overwriting the P_HIKEY position ItemId with
the old page's former final ItemId (whose tuple ended up becoming the
first tuple on new/right page) fails to actually save any space. We
need to truly shift around IndexTuples on the page in order to save
space (both PageIndexTupleDelete() and PageAddItem() end up shifting
both the ItemId array and some IndexTuple space).

Also, maybe say that the performance here really isn't so bad, because
we reclaim IndexTuple space close to the middle of the hole in the
page with our PageIndexTupleDelete(), and then use almost the *same*
space within PageAddItem(). There is not actually that much physical
shifting around for IndexTuples. It turns out that it's not that
different. (You can probably find a better, more succinct way of
putting this -- I'm tired now.)

* I suggest that you teach _bt_check_natts() to expect zero attributes
for "minus infinity" items. It looks like amcheck contrib regression
tests don't pass because you don't look for that (P_FIRSTDATAKEY() is
the "minus infinity" item on internal pages).

* bt_target_page_check() should also have a !P_ISLEAF() check, since
with a covering index every tuple will have INDEX_ALT_TID_MASK. This
should call _bt_check_natts() for each item, including the "minus
infinity" items.

* "minus infinity" items don't have the right number of attributes
set, in at least some cases that I saw. The number matched other
internal items, and wasn't 0 or whatever. Maybe the
ItemPointerGetOffsetNumberNoCheck() idea would leave things so that it
actually could be 0 safely, rather than natts + 1 as you said, which
would be nice.)

* I would reorder the comment to match the order of the code:

> +   /*
> +    * Pivot tuples stored in non-leaf pages and hikeys of leaf pages should
> +    * have nkeyatts number of attributes.  While regular tuples of leaf pages
> +    * should have natts number of attributes.
> +    */
> +   if (P_ISLEAF(opaque) && offnum >= P_FIRSTDATAKEY(opaque))
> +       return (BtreeTupGetNAtts(itup, index) == natts);
> +   else
> +       return (BtreeTupGetNAtts(itup, index) == nkeyatts);

* Please add BT_N_KEYS_OFFSET_MASK + INDEX_MAX_KEYS static assertion.
Maybe add it to _bt_check_natts().

* README-SSI says:

    * The effects of page splits, overflows, consolidations, and
removals must be carefully reviewed to ensure that predicate locks
aren't "lost" during those operations, or kept with pages which could
get re-used for different parts of the index.

Do we need to worry about that here? I guess not, because this is just
like having many duplicates. But a note just above the _bt_doinsert()
call to CheckForSerializableConflictIn() might be a good idea.

That's all I have for today.
-- 
Peter Geoghegan

Re: WIP: Covering + unique indexes.

Reply via email to