Thank you for the comment! First off, I thought that I had managed to
eliminate the degradation observed with the previous versions, but a
significant degradation (1.1% slower) is still seen in one case.
Anyway, before sending the new patch, let me just answer the comments.

At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas <hlinn...@iki.fi> wrote in
> On 19/11/2019 12:48, Kyotaro Horiguchi wrote:
> > 1. Inserting a branch in SearchCatCacheInternal. (CatCache_Pattern_1.patch)
> >
> > This is the most straightforward way to add an alternative feature.
> >
> > pattern 1 | 8459.73 | 28.15   # 9% (>> 1%) slower than 7757.58
> > pattern 1 | 8504.83 | 55.61
> > pattern 1 | 8541.81 | 41.56
> > pattern 1 | 8552.20 | 27.99
> > master    | 7757.58 | 22.65
> > master    | 7801.32 | 20.64
> > master    | 7839.57 | 25.28
> > master    | 7925.30 | 38.84
> >
> > It's so slow that it cannot be used.
>
> This is very surprising. A branch that's never taken ought to be
> predicted by the CPU's branch-predictor, and be very cheap.

(A) original test patch

I naively thought that the code path is too short to hide the cost of
a few additional instructions.  Actually I measured performance again
with the same patch set on the current master and got more or less the
same result.

master 8195.58 ms, patched 8817.40 ms: +10.75%

However, I noticed that the additional call was a recursive call, and
the jmp inserted for that recursive call seems to take significant
time.

(B) the same, avoiding the recursive call

After avoiding the recursive call, the difference was reduced to
+0.96% (master 8268.71 ms, patched 8348.30 ms).

Just the two instructions below are inserted in this case, which looks
reasonable.

  8720ff <+31>: cmpl   $0xffffffff,0x4ba942(%rip)  # 0xd2ca48 <catalog_cache_prune_min_age>
  872106 <+38>: jl     0x872240 <SearchCatCache1+352>  (call to a function)

(C) inserting bare counter-update code without a branch

> Do we actually need a branch there? If I understand correctly, the
> point is to bump up a usage counter on the catcache entry. You could
> increment the counter unconditionally, even if the feature is not
> used, and avoid the branch that way.

That change causes a 4.9% degradation, which is worse than having a
branch.

master 8364.54 ms, patched 8666.86 ms (+4.9%)

The additional instructions follow (a C-level sketch of this counter
update is shown after (D) below).

+ 8721ab <+203>: mov    0x30(%rbx),%eax       # %eax = ct->naccess
+ 8721ae <+206>: mov    $0x2,%edx
+ 8721b3 <+211>: add    $0x1,%eax             # %eax++
+ 8721b6 <+214>: cmove  %edx,%eax             # if %eax == 0 then %eax = 2
  <original code>
+ 8721bf <+223>: mov    %eax,0x30(%rbx)       # ct->naccess = %eax
+ 8721c2 <+226>: mov    0x4cfe9f(%rip),%rax   # 0xd42068 <catcacheclock>
+ 8721c9 <+233>: mov    %rax,0x38(%rbx)       # ct->lastaccess = %rax

(D) naively branching then updating, again

Come to think of it, I measured the same thing with a branch again,
specifically the following.  (In my memory, it showed significant
degradation before.)

      dlist_move_head(bucket, &ct->cache_elem);
+     if (catalog_cache_prune_min_age < -1)   /* never true */
+     {
+         (counter update)
+     }

And I got effectively the same numbers from master and patched.

master 8066.93 ms, patched 8052.37 ms (-0.18%)

The above branch inserts the same two instructions as (B), just at a
different place, but the result differs, for a reason unclear to me.

+ 8721bb <+203>: cmpl   $0xffffffff,0x4bb886(%rip)  # <catalog_cache_prune_min_age>
+ 8721c2 <+210>: jl     0x872208 <SearchCatCache1+280>

I'm not sure why, but the patched version even beats master by a small
margin.  Anyway, this new result suggests that the compiler might have
got smarter than before?
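By the way, to make the discussion concrete, the counter update shown
in (C) corresponds to roughly the following C code (my reconstruction
from the disassembly above, using the field and variable names that
appear in its annotations; the actual patch may differ slightly in
form):

    /* bump the per-entry access counter and remember the access time */
    if (++ct->naccess == 0)
        ct->naccess = 2;            /* the cmove: skip 0 on wraparound */
    ct->lastaccess = catcacheclock;

The wraparound guard is what the cmove implements, so the whole update
is only an increment, a conditional move and two stores in the hot
path.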
(E) bumping up in ReleaseCatCache() (won't work)

> Another thought is to bump up the usage counter in ReleaseCatCache(),
> and only when the refcount reaches zero. That might be somewhat
> cheaper, if it's a common pattern to acquire additional leases on an
> entry that's already referenced.
>
> Yet another thought is to replace 'refcount' with an 'acquirecount'
> and 'releasecount'. In SearchCatCacheInternal(), increment
> acquirecount, and in ReleaseCatCache, increment releasecount. When
> they are equal, the entry is not in use. Now you have a counter that
> gets incremented on every access, with the same number of CPU
> instructions in the hot paths as we have today.

These don't work for negative cache entries, since the corresponding
tuples are never released.

(F) removing less-significant code

> Or maybe there are some other ways we could micro-optimize
> SearchCatCacheInternal(), to buy back the slowdown that this feature
> would add?

Yeah, I thought of that in the beginning. (I removed dlist_move_head()
at the time.)  But the most difficult aspect of this approach is that
I cannot tell whether such a modification never causes degradation.

> For example, you could remove the "if (cl->dead) continue;"
> check, if dead entries were kept out of the hash buckets. Or maybe the
> catctup struct could be made slightly smaller somehow, so that it
> would fit more comfortably in a single cache line.

As a trial, I removed that code and added the ct->naccess code.

master 8187.44 ms, patched 8266.74 ms (+1.0%)

So the removal bought back about 3.9% of the total time, compared with
the +4.9% of (C).

> My point is that I don't think we want to complicate the code much for
> this. All the indirection stuff seems over-engineered for this. Let's
> find a way to keep it simple.

Yes, agreed from the bottom of my heart. I aspire to find a simple way
to avoid the degradation.

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
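P.S. Just to make sure we are reading (E) the same way, here is a
rough sketch of the acquirecount/releasecount idea as I understand it
(the field names are hypothetical, not actual patch code), and where
it falls down for negative entries:

    /* in CatCTup, hypothetically replacing refcount */
    int   acquirecount;   /* bumped in SearchCatCacheInternal() */
    int   releasecount;   /* bumped in ReleaseCatCache() */

    /* SearchCatCacheInternal(), positive hit: same cost as today */
    ct->acquirecount++;

    /* ReleaseCatCache() */
    ct->releasecount++;
    /* the entry is unused when acquirecount == releasecount */

A negative entry returns NULL to the caller without handing out a
reference, so ReleaseCatCache() is never called for it and
releasecount never advances; that is why this scheme, like the
ReleaseCatCache()-only bump, cannot tell whether a negative entry has
been recently used.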