Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

Peter Geoghegan Sun, 21 Apr 2019 17:47:16 -0700

Andres has suggested that I work on teaching nbtree to accommodate
variable-width, logical table identifiers, such as those required for
indirect indexes, or clustered indexes, where secondary indexes must
use a logical primary key value instead of a heap TID. I'm not
currently committed to working on this as a project, but I definitely
don't want to make it any harder. This has caused me to think about
the problem as it relates to the new on-disk representation for v4
nbtree indexes in Postgres 12. I do have a minor misgiving about one
particular aspect of what I came up with: The precise details of how
we represent heap TID in pivot tuples seem like they might make things
harder than they need to be for a future logical/varwidth table
identifier project. This probably isn't worth doing anything about
now, but it seems worth discussing now, just in case.


The macro BTreeTupleGetHeapTID() can be used to get a pointer to an
ItemPointerData (an ItemPointer) for the heap TID column if any is
available, regardless of whether the tuple is a non-pivot tuple
(points to the heap) or a pivot tuple (belongs in internal/branch
pages, and points to a block in the index, but needs to store heap TID
as well). In the non-pivot case the ItemPointer points to the start of
the tuple (raw IndexTuple field), while in the pivot case it points to
itup + IndexTupleSize() - sizeof(ItemPointerData). This interface
seems like the right thing to me; it's simple, low-context, works just
as well with INCLUDE indexes, and makes it fast to determine if there
are any truncated suffix attributes. However, I don't like the way the
alignment padding works -- there is often "extra" padding *between*
the last untruncated suffix attribute and the heap TID.
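
To make the pointer arithmetic concrete, here is a minimal sketch in
plain C. It is not the nbtree code: the SKETCH_* constants and both
function names are made up for illustration, the pivot/non-pivot
distinction is passed in as a flag rather than read from the tuple,
and it assumes an 8-byte index tuple header and a 6-byte heap TID. It
computes byte offsets instead of returning a pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes; stand-ins for sizeof(IndexTupleData) and
 * sizeof(ItemPointerData), not taken from the real headers. */
#define SKETCH_HEADER_SIZE 8
#define SKETCH_TID_SIZE    6

/*
 * Byte offset of the heap TID from the start of the tuple.
 * Non-pivot: the TID is the first field of the tuple (offset 0).
 * Pivot: the TID occupies the last SKETCH_TID_SIZE bytes of the tuple.
 */
static size_t
sketch_heap_tid_offset(size_t tuple_size, int is_pivot)
{
    if (!is_pivot)
        return 0;
    assert(tuple_size >= SKETCH_HEADER_SIZE + SKETCH_TID_SIZE);
    return tuple_size - SKETCH_TID_SIZE;
}

/*
 * Bytes of padding that sit between the last key byte and the heap TID
 * in a pivot tuple whose key attributes occupy key_bytes bytes.
 */
static size_t
sketch_padding_before_tid(size_t tuple_size, size_t key_bytes)
{
    size_t keys_end = SKETCH_HEADER_SIZE + key_bytes;
    size_t tid_off = sketch_heap_tid_offset(tuple_size, 1);

    assert(tid_off >= keys_end);
    return tid_off - keys_end;
}
```

With a single boolean key and a 24-byte pivot tuple,
sketch_padding_before_tid(24, 1) gives 9 bytes of padding ahead of the
TID; with a 15-byte pivot tuple it gives 0.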

It seems like any MAXALIGN() padding should all be at the end -- the
only padding between tuples should be based on the *general*
requirement for the underlying data types, regardless of whether or
not we're dealing with the special heap TID representation in pivot
tuples. We should eliminate what could be viewed as a special case.
This approach is probably going to be easier to generalize later.
There can be a design where the logical/varwidth attribute can be
accessed either by using the usual index_getattr() stuff, or using an
interface like BTreeTupleGetHeapTID() to get to it quickly. We'd have
to store an offset to the final/identifier attribute in the header to
make that work, because we couldn't simply assume a particular width
(like 6 bytes), but that seems straightforward. (I imagine that
there'd be less difference between pivot and non-pivot tuples with
varwidth identifiers than there is currently with heap TID, since we
won't have to worry about pg_upgrade.)
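
A rough sketch of what "an offset in the header" could look like
follows. Everything in it is hypothetical: SketchVarHeader, its
fields, and sketch_get_identifier() do not exist anywhere, and it
assumes the identifier is the final attribute and runs to the end of
the (unpadded) tuple length. It only illustrates that the fast-path
lookup stays cheap once the width is no longer fixed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuple header; not an existing PostgreSQL struct. */
typedef struct SketchVarHeader
{
    uint16_t total_len;   /* whole tuple length in bytes, header included */
    uint16_t ident_off;   /* offset of the identifier from the tuple start */
} SketchVarHeader;

/*
 * Return a pointer to the variable-width identifier and store its length
 * in *ident_len. The header is copied out with memcpy so the raw buffer
 * does not need any particular alignment.
 */
static const unsigned char *
sketch_get_identifier(const unsigned char *tuple, size_t *ident_len)
{
    SketchVarHeader hdr;

    memcpy(&hdr, tuple, sizeof(hdr));
    assert(hdr.ident_off >= sizeof(hdr));
    assert(hdr.ident_off <= hdr.total_len);

    *ident_len = (size_t) hdr.total_len - hdr.ident_off;
    return tuple + hdr.ident_off;
}
```

The same identifier would still be reachable through the ordinary
attribute-by-attribute path; the stored offset only saves walking the
preceding attributes.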

nbtinsert.c is very MAXALIGN()-heavy, and currently always treats
index tuples as having a MAXALIGN()'d size, but that doesn't seem
necessary or desirable to me. After all, we don't do that within
heapam -- we can just rely on the bufpage.c routines to allocate a
MAXALIGN()'d space for the whole tuple, while still making the lp_len
field in the line pointer use the original size (i.e. size with
un-MAXALIGN()'ed tuple data area). I've found that it's quite possible
to get the nbtree code to store the tuple size (lp_len and redundant
IndexTupleSize() representation) this way, just like heapam always
has. This has some useful consequences: BTreeTupleGetHeapTID()
continues to work with the special pivot tuple representation, while
_bt_truncate() never "puts padding in the wrong place" when it must
add a heap TID because there are many duplicates and no split point
that avoids doing so (none that "truncates the heap TID attribute"). I
could make this work without breaking the regression tests in about 10
minutes, which is at least encouraging (it was a bit tricky, though).

This also results in an immediate though small benefit for v4 nbtree
indexes: _bt_truncate() produces smaller pivot tuples in a few cases.
For example, indexes with one or two boolean fields will have pivot
tuples that are 15 bytes and 16 bytes in length respectively,
occupying 16 bytes of tuple space on internal pages. The saving comes
from reusing the alignment padding hole that was empty in the original
non-pivot index tuple from which the new pivot tuple is formed.
Currently, the size of these pivot tuples would be 24
bytes, so we're occasionally saving a MAXALIGN() quantum in space this
way. It is unlikely that anyone would actually care very much about
these kinds of space savings, but at the same time it feels more
elegant to me. The heap TID may not have a pg_attribute entry, but
ISTM that the on-disk representation should not have padding "in the
wrong place", on general principle.
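
The byte counts above can be reproduced with a small sketch. It
assumes an 8-byte MAXALIGN() quantum, an 8-byte index tuple header, and
a 6-byte heap TID; the SKETCH_* macros and the function names are
stand-ins invented here, not the real definitions. It does not model
any alignment requirement of the heap TID itself.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for a typical 64-bit build; not the real headers. */
#define SKETCH_MAXIMUM_ALIGNOF 8
#define SKETCH_MAXALIGN(len) \
    (((size_t) (len) + (SKETCH_MAXIMUM_ALIGNOF - 1)) & \
     ~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1)))

#define SKETCH_HEADER_SIZE 8    /* index tuple header */
#define SKETCH_TID_SIZE    6    /* heap TID */

/* Current layout: the key area is padded first, then a padded TID slot
 * is appended after it. */
static size_t
sketch_pivot_size_current(size_t key_bytes)
{
    size_t keys_end = SKETCH_MAXALIGN(SKETCH_HEADER_SIZE + key_bytes);

    return keys_end + SKETCH_MAXALIGN(SKETCH_TID_SIZE);
}

/* Proposed layout: the TID follows the last key byte directly, and the
 * stored length is left un-MAXALIGN()'ed. */
static size_t
sketch_pivot_size_proposed(size_t key_bytes)
{
    return SKETCH_HEADER_SIZE + key_bytes + SKETCH_TID_SIZE;
}

/* Space the page actually gives the tuple: padding goes at the end. */
static size_t
sketch_pivot_space_proposed(size_t key_bytes)
{
    size_t len = sketch_pivot_size_proposed(key_bytes);

    assert(len >= SKETCH_HEADER_SIZE + SKETCH_TID_SIZE);
    return SKETCH_MAXALIGN(len);
}
```

For one and two boolean keys (1 and 2 key bytes) this yields proposed
lengths of 15 and 16 bytes, 16 bytes of page space in both cases, and
24 bytes under the current layout.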

Thoughts?
--
Peter Geoghegan
