On Sun, Mar 25, 2018 at 1:47 AM, Peter Geoghegan <p...@bowt.ie> wrote:
> I was going to say that you could just treat the low bit in the t_tid > offset as representing "see catalog entry". My first idea was that > nothing would have to change about the existing format, since internal > page items already have only the low bit set within their offset. > However, I now see that that won't really work, because we don't > change the offset in high keys when they're copied from a real item > during a page split. Whatever we do, it has to work equally well for > all "separator keys" -- that is, it must work for both downlinks in > internal pages, and all high keys (including high keys at the leaf > level). > OK. > A good solution is to use the unused 13th t_bit. If hash can have a > INDEX_MOVED_BY_SPLIT_MASK, then nbtree can have a INDEX_ALT_TID_MASK. > This avoids a BTREE_VERSION bump, and allows us to deal with the > highkey offset issue. Actually, it's even more flexible than that -- > it can work with ordinary leaf tuples in the future, too. That is, we > can eventually implement prefix truncation and deduplication at the > leaf level using this representation, since there is nothing that > limits INDEX_ALT_TID_MASK IndexTuples to "separator keys". > > The main difference between this approach to leaf prefix > truncation/compression/deduplication, and the GIN entry tree's posting > list representation would be that it wouldn't have to be > super-optimized for duplicates, at the expense of more common case for > regular nbtree indexes -- having few or no duplicates. A btree_gin > index on pgbench_accounts(aid) looks very similar to an equivalent > nbtree index if you just compare internal pages from each, but they > look quite different at the leaf level, where GIN has 24 byte > IndexTuples instead of 16 bytes IndexTuples. Of course, this is > because the leaf pages have posting lists that can never be simple > heap pointer TIDs. > Right, btree_gin is much smaller than regular btree when there are a lot of duplicates. When there is no duplicates then btree_gin becomes larger than regular btree, because gin stores single item pointer less compact than btree. A secondary goal of this INDEX_ALT_TID_MASK representation should be > that it won't even be necessary to know that an IndexTuple is > contained within a leaf page rather than an index page (again, unlike > GIN). I'm pretty confident that we can have a truly universal > IndexTuple representation for nbtree, while supporting all of these > standard optimizations. > > Sorry for going off in a tangent, but I think it's somewhat necessary > to have a strategy here. Of course, we don't have to get everything > right now, but we should be looking in this direction whenever we talk > about on-disk nbtree changes. > So, as I get you're proposing to introduce INDEX_ALT_TID_MASK flag which would indicate that we're storing something special in the t_tid offset. And that should help us not only for covering indexes, but also for further btree enhancements including suffix truncation. What exactly do you propose to store into t_tid offset when INDEX_ALT_TID_MASK flag is set? Is it number of attributes in this particular index tuple? ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company