At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinn...@iki.fi> wrote in
> On 06/11/2020 10:24, Kyotaro Horiguchi wrote:
> > Thank you for the comment!
> >
> > First off, I thought that I managed to eliminate the degradation
> > observed on the previous versions, but significant degradation (1.1%
> > slower) is still seen in one case.
>
> One thing to keep in mind with micro-benchmarks like this is that even
> completely unrelated code changes can change the layout of the code in
> memory, which in turn can affect CPU caching effects in surprising
> ways. If you're lucky, you can see 1-5% differences just by adding a
> function that's never called, for example, if it happens to move other
> code in memory so that some hot codepath or struct gets split across
> CPU cache lines. It can be infuriating when benchmarking.
True. I sometimes had to run make distclean to stabilize such benchmarks..

> > At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas
> > <hlinn...@iki.fi> wrote in
> >
> > (A) original test patch
> >
> > I naively thought that the code path is too short to bury the
> > degradation of an additional few instructions. Actually I measured
> > performance again with the same patch set on the current master and
> > had more or less the same result.
> >
> >   master 8195.58 ms, patched 8817.40 ms: +10.75%
> >
> > However, I noticed that the additional call was a recursive call and a
> > jmp inserted for the recursive call seems to take significant
> > time. After avoiding the recursive call, the difference was reduced to
> > +0.96% (master 8268.71 ms : patched 8348.30 ms)
> >
> > Just the two instructions below are inserted in this case, which looks
> > reasonable.
> >
> >   8720ff <+31>:  cmpl   $0xffffffff,0x4ba942(%rip)   # 0xd2ca48
> >                  <catalog_cache_prune_min_age>
> >   872106 <+38>:  jl     0x872240 <SearchCatCache1+352>  (call to a function)
>
> That's interesting. I think a 1% degradation would be acceptable.
>
> I think we'd like to enable this feature by default though, so the
> performance when it's enabled is also very important.
>
> > (C) inserting bare counter-update code without a branch
> >
> >> Do we actually need a branch there? If I understand correctly, the
> >> point is to bump up a usage counter on the catcache entry. You could
> >> increment the counter unconditionally, even if the feature is not
> >> used, and avoid the branch that way.
> >
> > That change causes 4.9% degradation, which is worse than having a
> > branch.
> >
> >   master 8364.54 ms, patched 8666.86 ms (+4.9%)
> >
> > The additional instructions follow.
> >
> > + 8721ab <+203>: mov    0x30(%rbx),%eax      # %eax = ct->naccess
> > + 8721ae <+206>: mov    $0x2,%edx
> > + 8721b3 <+211>: add    $0x1,%eax            # %eax++
> > + 8721b6 <+214>: cmove  %edx,%eax            # if %eax == 0 then %eax = 2
> >   <original code>
> > + 8721bf <+223>: mov    %eax,0x30(%rbx)      # ct->naccess = %eax
> > + 8721c2 <+226>: mov    0x4cfe9f(%rip),%rax  # 0xd42068 <catcacheclock>
> > + 8721c9 <+233>: mov    %rax,0x38(%rbx)      # ct->lastaccess = %rax
>
> Do you need the "ntaccess == 2" test? You could always increment the
> counter, and in the code that uses ntaccess to decide what to evict,
> treat all values >= 2 the same.
>
> Need to handle integer overflow somehow. Or maybe not: integer
> overflow is so infrequent that even if a hot syscache entry gets
> evicted prematurely because its ntaccess count wrapped around to 0, it
> will happen so rarely that it won't make any difference in practice.

Agreed. I had prioritized completely avoiding any degradation on the
normal path, but relaxing that restriction to about 1% makes the code far
simpler and makes the expiration path significantly faster.

Now the branch for the counter increment is removed. For the similar
branches on the counter-decrement side in CatCacheCleanupOldEntries(),
Min() is compiled into cmovbe and a branch was removed.
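To make that concrete, below is a minimal, self-contained sketch of the two
branchless paths. The struct and function names here are illustrative
stand-ins (not the code in the attached patch), and the exact decrement
arithmetic in CatCacheCleanupOldEntries() is an assumption; only the naccess
and lastaccess fields, catcacheclock, and the Min() macro correspond to the
names discussed above.

#include <stdint.h>
#include <stdio.h>

/* Same shape as PostgreSQL's Min() macro in c.h */
#define Min(x, y) ((x) < (y) ? (x) : (y))

/* Illustrative stand-in for the relevant CatCTup fields */
typedef struct CatCTupSketch
{
	uint32_t	naccess;	/* access counter; wraparound is tolerated */
	uint64_t	lastaccess;	/* clock value at the last access */
} CatCTupSketch;

static uint64_t catcacheclock = 0;	/* stand-in for the cached clock */

/*
 * Hot path (SearchCatCache*): bump the counter unconditionally.
 * No branch and no special-casing of particular values; if the counter
 * ever wraps around to 0, the entry merely looks cold once.
 */
static inline void
bump_access(CatCTupSketch *ct)
{
	ct->naccess++;
	ct->lastaccess = catcacheclock;
}

/*
 * Expiration path (CatCacheCleanupOldEntries): saturating decrement.
 * Writing it with Min() instead of an if-statement lets the compiler
 * emit a conditional move (cmovbe) rather than a branch.
 */
static inline void
decay_access(CatCTupSketch *ct)
{
	ct->naccess -= Min(ct->naccess, 1);
}

int
main(void)
{
	CatCTupSketch ct = {0, 0};

	catcacheclock = 100;
	bump_access(&ct);
	bump_access(&ct);
	decay_access(&ct);
	printf("naccess=%u lastaccess=%llu\n",
		   ct.naccess, (unsigned long long) ct.lastaccess);
	return 0;
}

The point of the sketch is only the shape of the two update sites: an
unconditional increment on the lookup path, and a Min()-style saturating
update on the eviction path so that neither side needs a conditional jump.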