Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

Peter Geoghegan Wed, 24 Apr 2019 10:44:43 -0700

On Wed, Apr 24, 2019 at 5:22 AM Robert Haas <robertmh...@gmail.com> wrote:
> If you drop or detach a partition, you can either (a) perform, as part
> of that operation, a scan of every global index to remove all
> references to the former partition, or (b) tell each global index
> that all references to that partition number ought to be regarded as
> dead index tuples.  (b) makes detaching partitions faster and (a)
> seems hard to make rollback-safe, so I'm guessing we'll end up with
> (b).


I agree that (b) is the way to go.
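
To make (b) concrete, here is a minimal sketch of how a scan or VACUUM
might treat references to detached partitions as dead. Everything in
it is hypothetical: the DetachedPartitions struct, the function name,
and the assumption that detached partition numbers are kept as a
sorted array are mine for illustration, not existing PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: sorted (ascending) list of detached partition numbers. */
typedef struct DetachedPartitions
{
    const uint32_t *partnums;
    size_t      count;
} DetachedPartitions;

/* Three-way comparison of two partition numbers, for bsearch(). */
static int
cmp_partnum(const void *a, const void *b)
{
    uint32_t    x = *(const uint32_t *) a;
    uint32_t    y = *(const uint32_t *) b;

    return (x > y) - (x < y);
}

/*
 * Returns true when an index tuple that points into partition "partnum"
 * should be regarded as dead because that partition was dropped or
 * detached.  No index pages are touched at detach time; the check is
 * made lazily, whenever the tuple is next visited.
 */
static bool
index_tuple_dead_by_partition(const DetachedPartitions *detached,
                              uint32_t partnum)
{
    assert(detached != NULL);
    if (detached->count == 0)
        return false;
    return bsearch(&partnum, detached->partnums, detached->count,
                   sizeof(uint32_t), cmp_partnum) != NULL;
}
```

The point of the sketch is only that detaching becomes an update to a
small list, with the actual cleanup of index tuples deferred.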

> We don't want people to be able to exhaust the supply of partition
> numbers the way they can exhaust the supply of attribute numbers by
> adding and dropping columns repeatedly.

I agree that a partition numbering system needs to be able to
accommodate arbitrarily-many partitions over time. It wouldn't have
occurred to me to do it any other way. It is far, far easier to make
this work than it would be to retrofit varwidth attribute numbers. We
won't have to worry about the HeapTupleHeaderGetNatts()
representation. At the same time, nothing stops us from representing
partition numbers in a simpler, though less space-efficient, way in
system catalogs.

The main point of having global indexes is to be able to push down the
partition number and use it during index scans. We can store the
partition number at the end of the tuple on leaf pages, so that it's
easily accessible (important for VACUUM), while continuing to use the
IndexTuple fields for heap TID. On internal pages, the IndexTuple
fields must be used for the downlink (block number of child), so both
partition number and heap TID have to go towards the end of the tuples
(on Postgres 12, this happens with heap TID alone). Of course, suffix
truncation will manage to consistently get rid of both in most cases,
especially when the global index is a unique index.

The hard part is how to do varwidth encoding for space-efficient
partition numbers while continuing to use IndexTuple fields for heap
TID on the leaf level, *and* also having a
BTreeTupleGetHeapTID()-style macro to get partition number without
walking the entire index tuple. I suppose you could make the byte at
the end of the tuple indicate that there are in fact 31 bits total
when its high bit is set -- otherwise it's a 7-bit integer. Something
like that may be the way to go. The alignment rules seem to make it
worthwhile to keep the heap TID in the tuple header; it seems
inherently necessary to have a MAXALIGN()'d tuple header, so finding a
way to consistently put the first MAXALIGN() quantum to good use seems
wise.
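
A rough sketch of the "flag in the last byte" idea, purely for
illustration. The function names, the byte order of the 4-byte form,
and the choice to pass a pointer just past the end of the tuple are
all assumptions of mine; none of this is existing nbtree code. The
property it demonstrates is that the partition number can be read by
looking at the final byte first, without walking the tuple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limits for the two encodings. */
#define PARTNUM_MAX_SHORT 0x7Fu         /* fits in 7 bits */
#define PARTNUM_MAX_LONG  0x7FFFFFFFu   /* fits in 31 bits */

/*
 * Store partnum in the bytes immediately before tuple_end.
 * Returns the number of bytes used: 1 (7-bit form) or 4 (31-bit form).
 */
static size_t
partnum_encode_at_end(uint8_t *tuple_end, uint32_t partnum)
{
    assert(partnum <= PARTNUM_MAX_LONG);

    if (partnum <= PARTNUM_MAX_SHORT)
    {
        tuple_end[-1] = (uint8_t) partnum;  /* high bit is clear */
        return 1;
    }

    /* 4-byte form: the final byte holds the flag plus the top 7 bits. */
    tuple_end[-4] = (uint8_t) (partnum & 0xFF);
    tuple_end[-3] = (uint8_t) ((partnum >> 8) & 0xFF);
    tuple_end[-2] = (uint8_t) ((partnum >> 16) & 0xFF);
    tuple_end[-1] = (uint8_t) (0x80 | (partnum >> 24));
    return 4;
}

/*
 * Read the partition number that ends at tuple_end.  Only the final
 * byte is needed to decide how many bytes the value occupies.
 */
static uint32_t
partnum_decode_from_end(const uint8_t *tuple_end, size_t *nbytes)
{
    uint8_t     last = tuple_end[-1];

    if ((last & 0x80) == 0)
    {
        *nbytes = 1;
        return last;
    }

    *nbytes = 4;
    return ((uint32_t) (last & 0x7F) << 24) |
           ((uint32_t) tuple_end[-2] << 16) |
           ((uint32_t) tuple_end[-3] << 8) |
           (uint32_t) tuple_end[-4];
}
```

With this layout, a table with fewer than 128 partitions over its
lifetime pays one byte per leaf tuple, before any alignment padding.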

-- 
Peter Geoghegan
