On Sat, Feb 11, 2023 at 2:33 PM John Naylor
<john.nay...@enterprisedb.com> wrote:
>
> I didn't get any closer to radix-tree regression,

Me neither. It seems that in v26, inserting chunks into node-32 is
slow, but it needs more analysis. I'll share if I find anything
interesting.

> but I did find some inefficiencies in tidstore_add_tids() that are worth 
> talking about first, addressed in a rough fashion in the attached .txt 
> addendums that I can clean up and incorporate later.
>
> To start, I can reproduce the regression with this test as well:
>
> select * from bench_tidstore_load(0, 10 * 1000 * 1000);
>
> v15 + v26 store + adjustments:
>  mem_allocated | load_ms
> ---------------+---------
>       98202152 |    1676
>
> v26 0001-0008
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1826
>
> ...and reverting to the alternate way to update the parent didn't help:
>
> v26 0001-6, 0008, insert_inner w/ null parent
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1825
>
> ...and I'm kind of glad that wasn't the problem, because going back to that 
> would be a pain for the shmem case.
>
> Running perf doesn't show anything much different in the proportions (note 
> that rt_set must have been inlined when declared locally in v26):
>
> v15 + v26 store + adjustments:
>   65.88%  postgres  postgres             [.] tidstore_add_tids
>   10.74%  postgres  postgres             [.] rt_set
>    9.20%  postgres  postgres             [.] palloc0
>    6.49%  postgres  postgres             [.] rt_node_insert_leaf
>
> v26 0001-0008
>   78.50%  postgres  postgres             [.] tidstore_add_tids
>    8.88%  postgres  postgres             [.] palloc0
>    6.24%  postgres  postgres             [.] local_rt_node_insert_leaf
>
> v2699-0001: The first thing I noticed is that palloc0 is taking way more time 
> than it should, and it's because the compiler doesn't know the values[] array 
> is small. One reason we need to zero the array is to make the algorithm 
> agnostic about what order the offsets come in, as I requested in a previous 
> review. Thinking some more, I was way too paranoid about that. As long as 
> access methods scan the line pointer array in the usual way, maybe we can 
> just assert that the keys we create are in order, and zero any unused array 
> entries as we find them. (I admit I can't actually think of a reason we would 
> ever encounter offsets out of order.)
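>
> To make that concrete, here is a rough sketch of the idea (hypothetical
> names, not the actual patch; it assumes ascending offsets and that each
> values[] entry is a 64-bit bitmap covering 64 consecutive offsets):
>
> /* returns the index of the last values[] slot actually used */
> static int
> collect_offsets(uint64 *values, const OffsetNumber *offsets, int num_offsets)
> {
> 	int		highest_idx = -1;
>
> 	for (int i = 0; i < num_offsets; i++)
> 	{
> 		int		idx = (offsets[i] - 1) / 64;	/* which bitmap word */
> 		int		bit = (offsets[i] - 1) % 64;	/* bit within that word */
>
> 		/* rely on ordered input instead of zeroing the whole array */
> 		Assert(i == 0 || offsets[i] > offsets[i - 1]);
>
> 		/* zero unused entries only as we find them */
> 		while (highest_idx < idx)
> 			values[++highest_idx] = 0;
>
> 		values[idx] |= UINT64CONST(1) << bit;
> 	}
>
> 	return highest_idx;
> }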

One case I can think of is traversing a HOT chain, which could visit
offsets out of order. But fortunately, in the heap case, we prune such
collected TIDs before heap vacuum.

> Also, we can keep track of the last key we need to consider for insertion 
> into the radix tree, and ignore the rest. That might shave a few cycles 
> during the exclusive lock when the max offset of an LP_DEAD item is < 64 on 
> a given page, which I think would be common in the wild. I also got rid of 
> the special case for non-encoding, since shifting by zero should work the 
> same way. These together led to a nice speedup on the v26 branch:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1386
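>
> For illustration, the "last key" bound in the sketch above turns the
> insertion step into a loop over just the slots that were touched, so a page
> whose max LP_DEAD offset is under 64 costs a single rt_set() call (key_base,
> the radix tree key for the block's first bitmap word, is a hypothetical
> name):
>
> 	int		highest_idx = collect_offsets(values, offsets, num_offsets);
>
> 	/* insert only the values[] slots that were actually used */
> 	for (int i = 0; i <= highest_idx; i++)
> 		rt_set(tree, key_base + i, values[i]);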
>
> v2699-0002: The next thing I noticed is that we form a full ItemPointer to 
> pass to tid_to_key_off(). That's bad for tidstore_add_tids() because 
> ItemPointerSetBlockNumber() must do this in order to allow the struct to be 
> SHORTALIGN'd:
>
> static inline void
> BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
> {
> 	blockId->bi_hi = blockNumber >> 16;
> 	blockId->bi_lo = blockNumber & 0xffff;
> }
>
> Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse 
> the above process:
>
> static inline BlockNumber
> BlockIdGetBlockNumber(const BlockIdData *blockId)
> {
> 	return (((BlockNumber) blockId->bi_hi) << 16) |
> 		((BlockNumber) blockId->bi_lo);
> }
>
> There is no reason to do any of this if we're not reading/writing directly 
> to/from an on-disk tid etc. To avoid this, I created a new function 
> encode_key_off() [name could be better], which deals with the raw block 
> number that we already have. Then turn tid_to_key_off() into a wrapper around 
> that, since we still need the full conversion for tidstore_lookup_tid().
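>
> Sketched out, the split looks something like this (the encoding constants
> here are assumed for illustration; the real patch derives them from the
> max offsets per page):
>
> static inline uint64
> encode_key_off(BlockNumber block, OffsetNumber offset, uint32 *off)
> {
> 	uint64	tid_i = ((uint64) block << 11) | offset;	/* 11-bit offset field assumed */
>
> 	*off = tid_i & 63;		/* bit position within the 64-bit value */
> 	return tid_i >> 6;		/* radix tree key: 64 offsets per value */
> }
>
> static inline uint64
> tid_to_key_off(ItemPointer tid, uint32 *off)
> {
> 	/* pay the bi_hi/bi_lo reassembly cost only when starting from a TID */
> 	return encode_key_off(ItemPointerGetBlockNumber(tid),
> 						  ItemPointerGetOffsetNumber(tid), off);
> }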
>
> v2699-0003: Get rid of all the remaining special cases for encoding (or 
> not). I am unaware of any need to optimize that case or treat it 
> differently. I haven't tested this on an installation with a non-default 
> block size and didn't measure this separately, but 0002+0003 gives:
>
>  mem_allocated | load_ms
> ---------------+---------
>       98202032 |    1259
>
> If these are acceptable, I can incorporate them into a later patchset.

These are nice improvements! I agree with all changes.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

