On Wed, Apr 4, 2018 at 3:09 PM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> Thank you for review! Revised patchset is attached.

Cool.

* btree_xlog_split() still has this code:

    /*
     * On leaf level, the high key of the left page is equal to the first key
     * on the right page.
     */
    if (isleaf)
    {
        ItemId      hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

        left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
        left_hikeysz = ItemIdGetLength(hiItemId);
    }

However, we never fail to store the high key now, even at the leaf
level, because of this change to the corresponding point in
_bt_split():

> -    /* Log left page */
> -    if (!isleaf)
> -    {
> -        /*
> -         * We must also log the left page's high key, because the right
> -         * page's leftmost key is suppressed on non-leaf levels. Show it
> -         * as belonging to the left page buffer, so that it is not stored
> -         * if XLogInsert decides it needs a full-page image of the left
> -         * page.
> -         */
> -        itemid = PageGetItemId(origpage, P_HIKEY);
> -        item = (IndexTuple) PageGetItem(origpage, itemid);
> -        XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
> -    }
> +    /*
> +     * We must also log the left page's high key. There are two reasons
> +     * for that: right page's leftmost key is suppressed on non-leaf levels,
> +     * in covering indexes, included columns are truncated from high keys.
> +     * For simplicity, we don't distinguish these cases, but log the high
> +     * key every time. Show it as belonging to the left page buffer, so
> +     * that it is not stored if XLogInsert decides it needs a full-page
> +     * image of the left page.
> +     */
> +    itemid = PageGetItemId(origpage, P_HIKEY);
> +    item = (IndexTuple) PageGetItem(origpage, itemid);
> +    XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));

So should we remove the first block of code?

Note also that this existing comment has been made obsolete:

    /* don't release the buffer yet; we touch right page's first item below */

    /* Now reconstruct left (original) sibling page */
    if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)

Maybe we *should* release the right sibling buffer at the point of the
comment now?
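To be concrete, here is a rough sketch of the direction I have in mind
for the redo side. It is only an illustration, not tested: it assumes
the high key can always be taken from the data registered for block 0,
it leaves out skipping over a new item that may have been registered
ahead of the high key, and it is wrapped in a hypothetical helper just
to keep the sketch self-contained:

    #include "postgres.h"
    #include "access/itup.h"
    #include "access/xlogreader.h"

    /*
     * Hypothetical helper, for illustration only: with _bt_split() always
     * registering the left page's high key, btree_xlog_split() could take
     * it from the block 0 data in every case, instead of peeking at the
     * right page when isleaf.
     */
    static void
    get_left_hikey_from_record(XLogReaderState *record,
                               IndexTuple *left_hikey, Size *left_hikeysz)
    {
        char       *datapos;
        Size        datalen;

        datapos = XLogRecGetBlockData(record, 0, &datalen);

        /* ... skip the new item here, if one was logged for the left page ... */

        *left_hikey = (IndexTuple) datapos;
        *left_hikeysz = MAXALIGN(IndexTupleSize(*left_hikey));
        Assert(*left_hikeysz <= datalen);
    }

If something along those lines works out, the right sibling buffer
really could be released at the point of the now-obsolete comment.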
* _bt_mkscankey() should assert that the IndexTuple has the correct
number of attributes. I don't expect you to change routines like
_bt_mkscankey() so they actually respect the number of attributes from
BTreeTupGetNAtts(), rather than just relying on
IndexRelationGetNumberOfKeyAttributes(). However, an assertion seems
well worthwhile. It's a big reason for having BTreeTupGetNAtts(). This
also lets you get rid of at least one assertion from _bt_doinsert(), I
think.

* _bt_isequal() should assert that the IndexTuple was not truncated.

* The order could be switched here:

> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
>      if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
>          return 1;
>
> +    /*
> +     * Check tuple has correct number of attributes.
> +     */
> +    if (unlikely(!_bt_check_natts(rel, page, offnum)))
> +    {
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INTERNAL_ERROR),
> +                 errmsg("tuple has wrong number of attributes in index \"%s\"",
> +                        RelationGetRelationName(rel))));
> +    }

In principle, we should also check _bt_check_natts() for "minus
infinity" items, just like you did within verify_nbtree.c. Also, there
is no need for parentheses here.
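For clarity, here is roughly how I picture the reordered check -- just
my sketch, not the patch, reusing the names already in scope in
_bt_compare() and keeping your error message unchanged:

    /*
     * Verify the attribute count before the "minus infinity" fast path,
     * so that the first data item on an internal page is covered too.
     * The braces around the single ereport() statement aren't needed.
     */
    if (unlikely(!_bt_check_natts(rel, page, offnum)))
        ereport(ERROR,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("tuple has wrong number of attributes in index \"%s\"",
                        RelationGetRelationName(rel))));

    /* first data item on an internal page is treated as "minus infinity" */
    if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
        return 1;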
* Maybe _bt_truncate_tuple() should assert that the caller has not
tried to truncate a tuple that has already been truncated. I'm not
sure if our assertion should be quite that strong, but I think that
might be good, because in general we only need to truncate on the leaf
level -- truncating at any other level of the tree (e.g. doing
traditional suffix truncation) is always subtly wrong. What we
definitely should do, at a minimum, is make sure that attempting to
truncate a tuple to 2 attributes when it already has 0 attributes
fails with an assertion failure. Can you try adding the strong
assertion (truncate only once) to _bt_truncate_tuple()? Maybe that's
not possible, but it seems worth a try.

* I suggest we invent a new flag for 0x2000 within itup.h, to replace
"/* bit 0x2000 is reserved for index-AM specific usage */". We can
call it INDEX_AM_RESERVED_BIT. Then, we can change INDEX_ALT_TID_MASK
to use this rather than a raw 0x2000. We can do the same for
INDEX_MOVED_BY_SPLIT_MASK within hash.h, too. I find this neater.

* We should "use" one of the 4 new status bits that are available from
an offset (for INDEX_ALT_TID_MASK index tuples) for future use by leaf
index tuples. Perhaps call it BT_ALT_TID_NONPIVOT. I guess you could
say that I want us to reserve one of our 4 reserve bits.
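Concretely, I am picturing something like the following. Treat it as a
sketch only: the header placement and especially the
BT_ALT_TID_NONPIVOT value are illustrative, not something the patch has
to use verbatim:

    /* itup.h: name the bit instead of "reserved for index-AM specific usage" */
    #define INDEX_AM_RESERVED_BIT       0x2000

    /* nbtree.h: alternative TID meaning, defined in terms of the named bit */
    #define INDEX_ALT_TID_MASK          INDEX_AM_RESERVED_BIT

    /* hash.h: same treatment for the hash AM's use of the bit */
    #define INDEX_MOVED_BY_SPLIT_MASK   INDEX_AM_RESERVED_BIT

    /*
     * nbtree.h: reserve one of the 4 status bits available in the offset
     * number of an INDEX_ALT_TID_MASK tuple for future use by leaf
     * (non-pivot) index tuples.  0x1000 is a placeholder value.
     */
    #define BT_ALT_TID_NONPIVOT         0x1000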
* I think that you could add to this:

> +++ b/src/backend/access/nbtree/README
> @@ -590,6 +590,10 @@ original search scankey is consulted as each index entry is sequentially
>  scanned to decide whether to return the entry and whether the scan can
>  stop (see _bt_checkkeys()).
>
> +We use term "pivot" index tuples to distinguish tuples which don't point
> +to heap tuples, but rather used for tree navigation. Pivot tuples includes
> +all tuples on non-leaf pages and high keys on leaf pages.

I like what you came up with, and where you put it, but I would add
another few sentences:

"Note that pivot index tuples are only used to represent which part of
the key space belongs on each page, and can have attribute values
copied from non-pivot tuples that were deleted and killed by VACUUM
some time ago. In principle, we could truncate away attributes that
are not needed for a page high key during a leaf page split, provided
that the remaining attributes distinguish the last index tuple on the
post-split left page as belonging on the left page, and the first
index tuple on the post-split right page as belonging on the right
page. This optimization is sometimes called suffix truncation, and may
appear in a future release. Since the high key is subsequently reused
as the downlink in the parent page for the new right page, suffix
truncation can increase index fan-out considerably by keeping pivot
tuples short. INCLUDE indexes similarly truncate away non-key
attributes at the time of a leaf page split, increasing fan-out."

> Good point. Tests with "heapallindexed" were added. I also find that
> it's useful to check both index built by sorting, and index built by
> insertions, because there are different ways of forming tuples.

Right. It's a good cross-check for things like that. We'll have to
teach bt_tuple_present_callback() to normalize the representation in
some way for the BT_ALT_TID_NONPIVOT case in the future. But it
already talks about normalizing for reasons like this, so that's okay.

* I think you should add a note about BT_ALT_TID_NONPIVOT to
bt_tuple_present_callback(), though. If it cannot be sure that every
non-pivot tuple will have the same representation, amcheck will have
to normalize to the most flexible representation before hashing.

> Ok. I've tried to remove both assertions and "+1" hack. That works
> for me. However, I've to touch a lot of places, not sure if that's a
> problem.

Looks good to me. If it makes an assertion fail, that's probably a
good thing, because it would have been broken before anyway.

* You missed this comment, which is now not accurate:

> + * It's possible that index tuple has zero attributes (leftmost item of
> + * iternal page). And we have assertion that offset number is greater or equal
> + * to 1. This is why we store (number_of_attributes + 1) in offset number.
> + */

I can see that it is actually 0 for a minus infinity item, which is
good.

> I wrote some comment there. Please, check it.

The nbtsort.c comments could maybe do with some tweaks from a native
speaker, but look correct.

> Regarding !P_ISLEAF(), I think we should check every item on both
> leaf and non-leaf pages. I think that is how the code now works,
> unless I'm missing something.

It does, and should. Thanks.

> Thanks for pointing. Since there are now three cases including
> handling of "minus infinity" items, comment is now split to three.

That looks good. Thanks.

Right now, it looks like every B-Tree index could use
INDEX_ALT_TID_MASK, regardless of whether or not it's an INCLUDE
index. I think that that's fine, but let's say so in the paragraph
that introduces INDEX_ALT_TID_MASK. This patch establishes that any
nbtree pivot tuple could have INDEX_ALT_TID_MASK set, and that's
something that can be expected. It's also something that might not be
set when pg_upgrade was used, but that's fine too.

> I don't see relations between this patchset and SSI. We just
> change representation of some index tuples in pages. However,
> we didn't change the order of page modification, the order
> of page lookup and so on. Yes, we change size of some tuples,
> but B-tree already worked with tuples of variable sizes. So, the fact
> that tuples now have different size shouldn't affect SSI. Right now,
> I'm not sure if CheckForSerializableConflictIn() just above the
> _bt_doinsert() is good idea. But even if so, I think that should be
> a subject of separate patch.

My point was that nothing changes, because we already use what
_bt_doinsert() calls the "first valid" page. Maybe just add: "(This
reasoning also applies to INCLUDE indexes, whose extra attributes are
not considered part of the key space.)".

That's it for today.

-- 
Peter Geoghegan