On Thu, Jan 9, 2025 at 10:12 PM Kees Cook <k...@kernel.org> wrote:
>
> On Thu, Jan 09, 2025 at 09:52:31PM +0100, Mateusz Guzik wrote:
> > On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> > > On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > > > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > > > kernel test robot noticed a 17.3% improvement of
> > > > > vm-scalability.throughput on:
> > > > >
> > > > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide
> > > > > run-time copy size from value range tracking")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > >
> > > > Well that is unexpected. There should be no binary output difference
> > > > with that patch. I will investigate...
> > >
> > > It looks like hiding the size value from GCC has the side-effect of
> > > breaking memcpy inlining in many places. I would expect this to make
> > > things _slower_, though. O_o
>
> I think it's disabling value-range-based inlining, I'm trying to
> construct some tests...
>
> > This depends on what was emitted in place and what CPU is executing it.
> >
> > Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
> > not have FSRM and the size is low enough, then such code can indeed be
> > slower than suffering a call to memcpy (which does not issue rep mov).
> >
> > I had seen gcc go to great pains to align a buffer for rep movsq even
> > when it was guaranteed to not be necessary for example.
> >
> > Can you disasm an example affected spot?
>
> I tried to find the most self-contained example I could, and I ended up
> with:
>
> static void ipv6_rpl_addr_decompress(struct in6_addr *dst,
>                                      const struct in6_addr *daddr,
>                                      const void *post, unsigned char pfx)
> {
>         memcpy(dst, daddr, pfx);
>         memcpy(&dst->s6_addr[pfx], post, IPV6_PFXTAIL_LEN(pfx));
> }
>
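As an aside before the numbers below: I have not stared at the actual
diff, but presumably "hiding the size" amounts to laundering the
non-constant length through an empty asm so that gcc's value range
tracking loses sight of it, i.e. the usual OPTIMIZER_HIDE_VAR() idiom.
A minimal standalone sketch of that idiom (my guess at the mechanism,
not the literal fortify change, made-up function name):

#include <string.h>

static inline void copy_with_hidden_size(void *dst, const void *src,
                                         size_t len)
{
        /*
         * Same trick as the kernel's OPTIMIZER_HIDE_VAR(): the empty asm
         * makes gcc forget whatever range it had derived for len.
         */
        __asm__ ("" : "+r" (len));
        memcpy(dst, src, len);
}

With the derived range gone, gcc can no longer prove the copy is small
and stops expanding it inline, which lines up with what you are seeing.
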
Well I did what I should have from the get go and took the liberty of
looking at the profile:

      %stddev     %change         %stddev
          \          |                \
[snip]
      0.00            +6.5        6.54 ± 66%  perf-profile.calltrace.cycles-pp.memcpy_orig.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev

Disassembling copy_page_from_iter_atomic *prior* to the change indeed
reveals rep movsq as I suspected (second to last instruction):

   <+919>:  mov    (%rax),%rdx
   <+922>:  lea    0x8(%rsi),%rdi
   <+926>:  and    $0xfffffffffffffff8,%rdi
   <+930>:  mov    %rdx,(%rsi)
   <+933>:  mov    %r8d,%edx
   <+936>:  mov    -0x8(%rax,%rdx,1),%rcx
   <+941>:  mov    %rcx,-0x8(%rsi,%rdx,1)
   <+946>:  sub    %rdi,%rsi
   <+949>:  mov    %rsi,%rdx
   <+952>:  sub    %rsi,%rax
   <+955>:  lea    (%r8,%rdx,1),%ecx
   <+959>:  mov    %rax,%rsi
   <+962>:  shr    $0x3,%ecx
   <+965>:  rep movsq %ds:(%rsi),%es:(%rdi)
   <+968>:  jmp    0xffffffff819157c5 <copy_page_from_iter_atomic+869>

With the reported patch this is a call to memcpy instead.

This is the guy:

static __always_inline
size_t memcpy_from_iter(void *iter_from, size_t progress,
                        size_t len, void *to, void *priv2)
{
        memcpy(to + progress, iter_from, len);
        return 0;
}

I don't know what the specific bench is doing; I'm assuming the passed
sizes were low enough that the overhead of spinning up rep movsq took
over.

gcc should retain the ability to optimize this, except it needs to be
convinced not to emit rep movsq for variable sizes (and to call memcpy
instead).

For user memory access there is a bunch of hackery to inline rep mov for
CPUs where it does not suck for small sizes (see rep_movs_alternative).
Someone(tm) should port it over to memcpy handling as well. The expected
state would be that for sizes known at compilation time gcc rolls with
movs as needed (no rep), and otherwise emits the magic rep movs/memcpy
invocation, except for when it would be tail-called.

In your ipv6_rpl_addr_decompress example gcc went a little crazy, which,
as I mentioned, does happen. However, most of the time it does a good
job, and a newly generated call to memcpy should make those spots
slower. I presume they are merely not being exercised by this benchmark.

Note that going from inline movs (no rep) to a call to memcpy which does
movs (again no rep) comes with a "mere" function call overhead, which is
a different beast than spinning up rep movs on CPUs without FSRM.

That is to say, contrary to the report above, I believe the change is in
fact a regression which just so happened to make things faster for this
specific case. The unintended speedup can be achieved without regressing
anything else by taming the craziness.

Reading the commit log I don't know what the way out is; perhaps you
could rope in some gcc folks to ask? Screwing with optimization just to
avoid a warning is definitely not the best option.

-- 
Mateusz Guzik <mjguzik gmail.com>
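PS: to make the "expected state" above more concrete, below is a rough
standalone sketch (x86-64 only, made-up names, not the kernel's actual
code) of the dispatch I have in mind, in the spirit of the
rep_movs_alternative hackery: compile-time-constant sizes keep the
compiler's inline expansion, while variable sizes funnel into a single
out-of-line helper that picks rep movsb or a plain memcpy call depending
on whether the CPU has FSRM.

#include <stddef.h>
#include <string.h>

/*
 * Stand-in for a CPU feature check; the kernel would patch the branch
 * out with an ALTERNATIVE() keyed on FSRM.
 */
static int cpu_has_fsrm;

/* Out-of-line path for sizes not known at compile time. */
static void *memcpy_variable(void *dst, const void *src, size_t len)
{
        void *ret = dst;

        if (cpu_has_fsrm) {
                /* rep movsb is cheap for any size on FSRM parts. */
                asm volatile("rep movsb"
                             : "+D" (dst), "+S" (src), "+c" (len)
                             : : "memory");
                return ret;
        }
        /* No FSRM: take the function call, movs without rep. */
        return memcpy(ret, src, len);
}

/* Only compile-time-constant sizes are allowed to expand inline. */
#define memcpy_sized(dst, src, len)                             \
        (__builtin_constant_p(len) ?                            \
                __builtin_memcpy((dst), (src), (len)) :         \
                memcpy_variable((dst), (src), (len)))

With something like this, spots like memcpy_from_iter stop gambling on
rep movs being cheap on non-FSRM parts, while fixed-size copies keep
their inline expansion.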