Andres has suggested that I work on teaching nbtree to accommodate variable-width, logical table identifiers, such as those required for indirect indexes, or clustered indexes, where secondary indexes must use a logical primary key value instead of a heap TID. I'm not currently committed to working on this as a project, but I definitely don't want to make it any harder. This has caused me to think about the problem as it relates to the new on-disk representation for v4 nbtree indexes in Postgres 12. I do have a minor misgiving about one particular aspect of what I came up with: the precise details of how we represent heap TID in pivot tuples seem like they might make things harder than they need to be for a future logical/varwidth table identifier project. This probably isn't worth doing anything about now, but it seems worth discussing now, just in case.
The macro BTreeTupleGetHeapTID() can be used to get a pointer to an ItemPointerData (an ItemPointer) for the heap TID column if any is available, regardless of whether the tuple is a non-pivot tuple (points to the heap) or a pivot tuple (belongs in internal/branch pages, and points to a block in the index, but needs to store heap TID as well). In the non-pivot case the ItemPointer points to the start of the tuple (the raw t_tid field of the IndexTuple), while in the pivot case it points to itup + IndexTupleSize() - sizeof(ItemPointerData). This interface seems like the right thing to me; it's simple, low-context, works just as well with INCLUDE indexes, and makes it fast to determine if there are any truncated suffix attributes.
However, I don't like the way the alignment padding works -- there is often "extra" padding *between* the last untruncated suffix attribute and the heap TID. It seems like any MAXALIGN() padding should all be at the end -- the only padding between tuples should be based on the *general* requirement for the underlying data types, regardless of whether or not we're dealing with the special heap TID representation in pivot tuples. We should eliminate what could be viewed as a special case.
This approach is probably going to be easier to generalize later. There could be a design where the logical/varwidth attribute can be accessed either by using the usual index_getattr() stuff, or by using an interface like BTreeTupleGetHeapTID() to get to it quickly. We'd have to store an offset to the final/identifier attribute in the header to make that work, because we couldn't simply assume a particular width (like 6 bytes), but that seems straightforward. (I imagine that there'd be less difference between pivot and non-pivot tuples with varwidth identifiers than there is currently with heap TID, since we won't have to worry about pg_upgrade.)
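To make the pointer arithmetic above concrete, here is a minimal sketch in plain C. It is not the real macro: demo_get_heap_tid(), DEMO_TID_SIZE, and the two boolean flags are hypothetical stand-ins for illustration, and the tuple is treated as a raw byte buffer rather than a real IndexTuple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: a heap TID occupies 6 bytes, like sizeof(ItemPointerData). */
#define DEMO_TID_SIZE 6

/*
 * Hypothetical stand-in for BTreeTupleGetHeapTID(), operating on raw bytes.
 *
 * Non-pivot tuple: the heap TID is the first field of the tuple.
 * Pivot tuple:     the heap TID, if it was not truncated away, occupies the
 *                  last DEMO_TID_SIZE bytes of the tuple.
 * Returns NULL when a pivot tuple carries no heap TID.
 */
static const unsigned char *
demo_get_heap_tid(const unsigned char *itup, size_t tuple_size,
                  bool is_pivot, bool pivot_has_heap_tid)
{
    if (!is_pivot)
        return itup;

    if (!pivot_has_heap_tid)
        return NULL;

    assert(tuple_size >= DEMO_TID_SIZE);
    return itup + tuple_size - DEMO_TID_SIZE;
}
```

Note that the pivot case depends only on the tuple's recorded size, which is why the question of where the padding goes matters for this interface.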
nbtinsert.c is very MAXALIGN()-heavy, and currently always treats index tuples as having a MAXALIGN()'d size, but that doesn't seem necessary or desirable to me. After all, we don't do that within heapam -- we can just rely on the bufpage.c routines to allocate a MAXALIGN()'d space for the whole tuple, while still making the lp_len field in the line pointer use the original size (i.e. size with un-MAXALIGN()'ed tuple data area). I've found that it's quite possible to get the nbtree code to store the tuple size (lp_len and redundant IndexTupleSize() representation) this way, just like heapam always has. This has some useful consequences: BTreeTupleGetHeapTID() continues to work with the special pivot tuple representation, while _bt_truncate() never "puts padding in the wrong place" when it must add a heap TID because there are many duplicates and no split point that avoids doing that (that "truncates the heap TID attribute"). I could make this work without breaking the regression tests in about 10 minutes, which is at least encouraging (it was a bit tricky, though).
This also results in an immediate though small benefit for v4 nbtree indexes: _bt_truncate() produces smaller pivot tuples in a few cases. For example, indexes with one or two boolean fields will have pivot tuples that are 15 bytes and 16 bytes in length respectively, occupying 16 bytes of tuple space on internal pages. The saving comes because we can use the alignment padding hole that was empty in the original non-pivot index tuple that the new pivot tuple is to be formed from. Currently, the size of these pivot tuples would be 24 bytes, so we're occasionally saving a MAXALIGN() quantum in space this way. It is unlikely that anyone would actually care very much about these kinds of space savings, but at the same time it feels more elegant to me.
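The 15/16/24 byte figures above can be reproduced with a small model of the size arithmetic. This is only a sketch under stated assumptions: MAXALIGN is taken to be 8 bytes, the index tuple header is taken to be 8 bytes, a boolean key is 1 byte, and in the proposed layout the heap TID is assumed to follow the last key attribute directly. All demo_* names are hypothetical and are not nbtree functions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for this model, not taken from any header file. */
#define DEMO_MAXALIGN_OF  8   /* typical 64-bit platform */
#define DEMO_HEADER_SIZE  8   /* 6-byte TID field + 2-byte info field */
#define DEMO_TID_SIZE     6   /* sizeof(ItemPointerData) */

/* Round len up to the next multiple of DEMO_MAXALIGN_OF. */
#define DEMO_MAXALIGN(len) \
    (((size_t) (len) + (DEMO_MAXALIGN_OF - 1)) & \
     ~((size_t) (DEMO_MAXALIGN_OF - 1)))

/* Model of the current layout: pad the key data, then append a padded TID. */
static size_t
demo_pivot_size_current(size_t key_bytes)
{
    return DEMO_MAXALIGN(DEMO_HEADER_SIZE + key_bytes) +
           DEMO_MAXALIGN(DEMO_TID_SIZE);
}

/* Model of the proposed layout: the TID reuses the padding hole, and the
 * recorded size (lp_len) is left unpadded. */
static size_t
demo_pivot_size_proposed(size_t key_bytes)
{
    return DEMO_HEADER_SIZE + key_bytes + DEMO_TID_SIZE;
}

/* Space actually consumed on the page: padding goes at the end only. */
static size_t
demo_page_space(size_t tuple_size)
{
    assert(tuple_size > 0);
    return DEMO_MAXALIGN(tuple_size);
}
```

With one boolean key (key_bytes = 1) the model gives 24 bytes for the current layout versus a 15-byte tuple occupying 16 bytes in the proposed one; with two booleans it gives 24 versus 16.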
The heap TID may not have a pg_attribute entry, but ISTM that the on-disk representation should not have padding "in the wrong place", on general principle.
Thoughts?
--
Peter Geoghegan