On Sat, Feb 11, 2023 at 2:33 PM John Naylor <john.nay...@enterprisedb.com> wrote:
>
> I didn't get any closer to radix-tree regression,

Me neither. It seems that in v26, inserting chunks into node-32 is slow,
but it needs more analysis. I'll share if I find something interesting.

> but I did find some inefficiencies in tidstore_add_tids() that are worth
> talking about first, addressed in a rough fashion in the attached .txt
> addendums that I can clean up and incorporate later.
>
> To start, I can reproduce the regression with this test as well:
>
> select * from bench_tidstore_load(0, 10 * 1000 * 1000);
>
> v15 + v26 store + adjustments:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202152 |    1676
>
> v26 0001-0008
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1826
>
> ...and reverting to the alternate way to update the parent didn't help:
>
> v26 0001-6, 0008, insert_inner w/ null parent
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1825
>
> ...and I'm kind of glad that wasn't the problem, because going back to that
> would be a pain for the shmem case.
>
> Running perf doesn't show anything much different in the proportions (note
> that rt_set must have been inlined when declared locally in v26):
>
> v15 + v26 store + adjustments:
>   65.88%  postgres  postgres  [.] tidstore_add_tids
>   10.74%  postgres  postgres  [.] rt_set
>    9.20%  postgres  postgres  [.] palloc0
>    6.49%  postgres  postgres  [.] rt_node_insert_leaf
>
> v26 0001-0008
>   78.50%  postgres  postgres  [.] tidstore_add_tids
>    8.88%  postgres  postgres  [.] palloc0
>    6.24%  postgres  postgres  [.] local_rt_node_insert_leaf
>
> v2699-0001: The first thing I noticed is that palloc0 is taking way more
> time than it should, and it's because the compiler doesn't know the
> values[] array is small. One reason we need to zero the array is to make
> the algorithm agnostic about what order the offsets come in, as I
> requested in a previous review. Thinking some more, I was way too paranoid
> about that. As long as access methods scan the line pointer array in the
> usual way, maybe we can just assert that the keys we create are in order,
> and zero any unused array entries as we find them. (I admit I can't
> actually think of a reason we would ever encounter offsets out of order.)

I can think of one: something like traversing a HOT chain could visit
offsets out of order. But fortunately, in the heap case we prune such
collected TIDs before heap vacuum.
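As for zeroing unused entries as we find them, something like the
following is how I imagine it (a rough sketch just to confirm my
understanding; collect_offsets() and the variable names are made up for
illustration, not taken from the patch):

static void
collect_offsets(uint64 *values, const OffsetNumber *offsets, int noffsets)
{
    int         highest_idx = -1;   /* highest values[] entry written so far */

    for (int i = 0; i < noffsets; i++)
    {
        int         idx = offsets[i] / 64;  /* which 64-bit chunk */
        int         bit = offsets[i] % 64;  /* bit within the chunk */

        /* assert that offsets arrive in order, rather than tolerating any order */
        Assert(i == 0 || offsets[i] > offsets[i - 1]);

        /* zero any unused entries only as we reach them */
        while (highest_idx < idx)
            values[++highest_idx] = 0;

        values[idx] |= UINT64CONST(1) << bit;
    }
}

After the loop, only values[0] .. values[highest_idx] need to be looked
at, so the entries above highest_idx can simply be ignored without ever
zeroing them.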
> Also, we can keep track of the last key we need to consider for insertion
> into the radix tree, and ignore the rest. That might shave a few cycles
> during the exclusive lock when the max offset of an LP_DEAD item < 64 on a
> given page, which I think would be common in the wild. I also got rid of
> the special case for non-encoding, since shifting by zero should work the
> same way. These together led to a nice speedup on the v26 branch:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1386
>
> v2699-0002: The next thing I noticed is forming a full ItemPointer to pass
> to tid_to_key_off(). That's bad for tidstore_add_tids() because
> ItemPointerSetBlockNumber() must do this in order to allow the struct to
> be SHORTALIGN'd:
>
> static inline void
> BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
> {
>     blockId->bi_hi = blockNumber >> 16;
>     blockId->bi_lo = blockNumber & 0xffff;
> }
>
> Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must
> reverse the above process:
>
> static inline BlockNumber
> BlockIdGetBlockNumber(const BlockIdData *blockId)
> {
>     return (((BlockNumber) blockId->bi_hi) << 16) |
>            ((BlockNumber) blockId->bi_lo);
> }
>
> There is no reason to do any of this if we're not reading/writing directly
> to/from an on-disk tid etc. To avoid this, I created a new function
> encode_key_off() [name could be better], which deals with the raw block
> number that we already have. Then turn tid_to_key_off() into a wrapper
> around that, since we still need the full conversion for
> tidstore_lookup_tid().
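That makes sense. To confirm my understanding, I imagine encode_key_off()
looking something like this (just a sketch; ENCODE_BITS and the exact
signature here are illustrative, not the actual patch):

#define ENCODE_BITS 6   /* illustrative: wide enough for MaxOffsetNumber / 64 chunks */

/* Work directly with the block number we already have. */
static inline uint64
encode_key_off(BlockNumber block, OffsetNumber offset, uint32 *off)
{
    *off = offset % 64;     /* bit position within the uint64 value */
    return ((uint64) block << ENCODE_BITS) | (offset / 64);
}

/* The full conversion is still needed for tidstore_lookup_tid(). */
static inline uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
{
    return encode_key_off(ItemPointerGetBlockNumber(tid),
                          ItemPointerGetOffsetNumber(tid),
                          off);
}

This way, the bi_hi/bi_lo round trip is paid only on paths that really
start from an on-disk TID.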
> v2699-0003: Get rid of all the remaining special cases for encoding or
> not. I am unaware of the need to optimize that case or treat it in any
> way differently. I haven't tested this on an installation with
> non-default blocksize and didn't measure this separately, but 0002+0003
> gives:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1259
>
> If these are acceptable, I can incorporate them into a later patchset.

These are nice improvements! I agree with all the changes.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com