On 13/10/2025 2:06 pm, Jan Beulich wrote:
> Along with Zen2 (which doesn't expose ERMS), both families reportedly
> suffer from sub-optimal aliasing detection when deciding whether REP MOVSB
> can actually be carried out the accelerated way. Therefore we want to
> avoid its use in the common case of memcpy(); copy_page_hot() is fine, as
> its two pointers are always going to be having the same low 5 bits.

I think this could be a bit clearer.  How about this:

---8<---
Zen2 (which doesn't expose ERMS) through Zen4 have sub-optimal aliasing
detection for REP MOVS, and fall back to a unit-at-a-time loop when the two
pointers have differing bottom 5 bits.  While both forms are affected, this
makes REP MOVSB 8 times slower than REP MOVSQ.

memcpy() has a high likelihood of encountering this slowpath, so avoid using
REP MOVSB.  This undoes the ERMS optimisation added in commit d6397bd0e11c
which turns out to be an anti-optimisation on these microarchitectures.

However, retain the use of ERMS-based REP MOVSB in other cases such as
copy_page_hot() where the parameter alignment is known to avoid the slowpath.
---8<---

?

This at least gets us back to the 4.20 behaviour.

~Andrew
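
For illustration only, the alignment condition described above (both
pointers sharing the same bottom 5 bits) can be sketched in C as below.
The helper name same_low5() is hypothetical and is not part of Xen or of
the patch under discussion; it only restates the condition, and says
nothing about how the hardware actually performs its aliasing check.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a Xen function): returns true when dst and src
 * have identical bottom 5 bits, i.e. the same offset within a 32-byte block.
 * Per the description above, this is the case that avoids the slowpath.
 */
static bool same_low5(const void *dst, const void *src)
{
    /* XOR leaves a bit set wherever the two addresses differ. */
    return (((uintptr_t)dst ^ (uintptr_t)src) & 0x1f) == 0;
}
```

Two page-aligned pointers, as passed to copy_page_hot(), always satisfy
this check, whereas arbitrary memcpy() arguments generally will not.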
