On Wed, May 5, 2021 at 3:18 PM Matthias van de Meent
<boekewurm+postg...@gmail.com> wrote:
> I believe that the TID is the unique identifier of that tuple, within context.
>
> For normal indexes, the TID as supplied directly by the TableAM is
> sufficient, as the context is that table.
> For global indexes, this TID must include enough information to relate
> it to the table the tuple originated from.
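Right. To make that pairing concrete, here is a minimal sketch of what
a fixed-width combination could look like -- the struct and function
names below are hypothetical, invented purely for illustration; they
don't correspond to anything that exists in Postgres today:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical illustration only -- not real Postgres definitions.
 * A "global TID" that pairs a fixed-width partition identifier with
 * an ordinary heap-TID-style (block number, offset number) pair.
 * Together they uniquely identify a tuple across the whole
 * partitioned table, just as a plain TID does within one table.
 */
typedef struct GlobalTid
{
    uint32_t    partid;     /* identifies the partition/table */
    uint32_t    block;      /* block number within that table */
    uint16_t    offnum;     /* item offset within the block */
} GlobalTid;

/*
 * Compare two global TIDs as if partid were just another (leading)
 * tiebreaker key column -- analogous to how nbtree treats the heap
 * TID itself as a trailing key column since Postgres 12.
 */
static int
global_tid_cmp(const GlobalTid *a, const GlobalTid *b)
{
    if (a->partid != b->partid)
        return a->partid < b->partid ? -1 : 1;
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offnum != b->offnum)
        return a->offnum < b->offnum ? -1 : 1;
    return 0;
}
```

Note that nothing here is variable-width: resolving a version is just
"find the table from partid, then use the TID as usual".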
Clearly something like a partition identifier column is sometimes just
like a regular user-visible column, though occasionally not like one --
whichever is useful to the implementation in each context. For example,
we probably want to do predicate pushdown, maybe with real cataloged
operators that access the column like any other user-created column
(the optimizer knows about the column, which even has a pg_attribute
entry). Note that we only ever access the TID column using an insertion
scankey today -- so there are several ways in which the partition
identifier really would be much more like a user column than
tid/scantid ever was.

The TID is a key column for most purposes as of Postgres 12 (at least
internally). That didn't break all unique indexes due to the existence
of non-unique TIDs across duplicates! Insertions that must call
_bt_check_unique() can deal with the issue directly, by temporarily
unsetting scantid. We can easily do roughly the same thing here: be
slightly creative about how we interpret whether or not the partition
identifier is "just another key column" in each context. This is also
similar to the way the implementation is slightly creative about NULL
values, which are not equal to any other value to the user, but are
nevertheless just another value from the domain of indexed values to
the nbtree implementation. Cleverly defining the semantics of keys to
get better performance and to avoid the need for special case code is
more or less a standard technique.

> In the whole database, that would be the OID of the table + the TID as
> supplied by the table.
>
> As such, the identifier of the logical row (which can be called the
> TID), as stored in index tuples in global indexes, would need to
> consist of the TableAM supplied TID + the (local) id of the table
> containing the tuple.

2 points:
1. Clearly you need to use the partition identifier with the TID in
order to look up the version in the table -- you need to use both
together in global indexes. But it can still work in much the same way
as it would in a standard index -- it's just that you handle that extra
detail as well. That's what I meant by additive.

2. If a TID points to a version of a row (or whatever you want to call
the generalized version of a HOT chain -- almost the same thing), then
of course you can always map it back to the logical row. That must
always be true. It is equally true within a global index.

Points 1 and 2 above seem obvious to me...so I think we agree on that
much. I just don't know how you get from there to "we need
variable-width TIDs". In all sincerity, I am confused, because to me it
seems as if you're simply asserting, again and again, that
variable-width TIDs must be necessary, without ever getting around to
justifying it -- or even trying to.

> Assuming we're in agreement on that part, I
> would think it would be consistent to put this in TID infrastructure,
> such that all indexes that use such new TID infrastructure can be
> defined to be global with only minimal effort.

Abstract definitions can be very useful, but ultimately they're just
tools. They're seldom useful as a starting point in my experience. I
try to start with the reality on the ground, and perhaps arrive at some
kind of abstract model or idea much later.

> ZHeap states that it can implement stable TIDs within limits, as IIRC
> it requires retail index deletion support for all indexes on the
> updated columns of that table.

Whether or not that's true is not at all clear. What is true is that
the prototype version of zheap that we have as of today is notable in
that it more or less allows the moral equivalent of a HOT chain to be
arbitrarily long (or much longer, at least).
To the best of my knowledge there is nothing about retail index tuple
deletion in the design, except perhaps something vague and
aspirational.

> I fail to see why this same
> infrastructure could not be used for supporting clustered tables,
> while enforcing these limits only soft enforced in ZHeap (that is,
> only allowing index AMs that support retail index tuple deletion).

You're ignoring an ocean of complexity here. Principally the need to
implement something like two-phase locking (key value locking) in
indexes to make this work, but also the need to account for how
fundamentally redefining TID breaks things. To say nothing of how this
might affect crash recovery.

> > If it was very clear that there would be *some*
> > significant benefit then the costs might start to look reasonable.
> > But there isn't. "Build it and they will come" is not at all
> > convincing to me.
>
> Clustered tables / Index-oriented Tables are very useful for tables of
> which most columns are contained in the PK, or otherwise are often
> ordered by their PK.

I'm well aware of the fact that clustered index based tables are
sometimes more useful than heap-based tables.

--
Peter Geoghegan