https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596

--- Comment #3 from Mateusz Guzik <mjguzik at gmail dot com> ---
Normally inlined memset and memcpy ops use SIMD.

However, kernels are built with -mno-sse for performance reasons.

For buffers up to 40 bytes gcc emits regular stores, which is fine. For sizes
above that it resorts to rep stosq/movsq, which constitutes a performance
problem.

Both rep stosq and rep movsq are known to be slow to start, and it is
recommended that they be avoided for small sizes, although the specific
cut-off point varies widely by uarch.

Benchmarking is based on the Linux kernel running on a Sapphire Rapids CPU:

The stock kernel suffers the inlined rep movsq/stosq, while the rebuilt kernel
has unrolled loops in place and punts to the out-of-line routine as needed,
built with:

-mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
-mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
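
For reference, outside the kernel build system these can be passed directly to
gcc; a minimal sketch of an invocation (illustrative only, the kernel wires
the flags through its own build machinery):

gcc -O2 -mno-sse \
    -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign \
    -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign \
    -c foo.c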

A microbenchmark issuing fstat(): the codepath encounters a memset of 168
bytes. Replacing the rep stosq with a regular loop takes it from 8.5 mln to
9 mln ops/s.
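
For reference, a minimal sketch of such a microbenchmark (a hypothetical
reconstruction, not the actual harness; the fd choice and iteration count are
arbitrary):

#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int main(void)
{
        struct stat st;
        struct timespec a, b;
        long i, n = 10 * 1000 * 1000;   /* arbitrary iteration count */

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (i = 0; i < n; i++)
                fstat(0, &st);          /* stat stdin; any open fd works */
        clock_gettime(CLOCK_MONOTONIC, &b);

        double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        printf("%.2f mln ops/s\n", n / secs / 1e6);
        return 0;
}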

Unrolling can raise a concern about i-cache footprint, which the more involved
test below addresses.

A more involved test compiles a hello world in a loop. Part of the perf
problem there is a routine which once more has an inline memcpy of a size
similar to the above. Rebuilding with the above strategies boosts the
compilation rate (files per second) by about 1.7% while also reducing time
spent in the kernel versus time spent in userspace. That is, this is not
merely a change that looks good in a microbenchmark; rep stosq/movsq remains a
problem in real workloads. Note Sapphire Rapids ships with FSRM and it did not
help with this one.

More details on that one:
https://lore.kernel.org/all/CAGudoHEV-PFSr-xKUx5GkTf4KasJc=annzqbkotnfvliskt...@mail.gmail.com/T/#m41038aef33289a47112b9a213b1a904a3f13eded

Here is a testcase:

struct foo {
        char buf[41];   /* one byte past the 40-byte cutoff */
};

#define memset __builtin_memset

void zero(struct foo *f)
{
        memset(f->buf, 0, sizeof(f->buf));
}

When compiled with -O2 -mno-sse, gcc 14.2 emits the following rep stos
sequence:
   0:   f3 0f 1e fa             endbr64
   4:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
   b:   48 89 f9                mov    %rdi,%rcx
   e:   48 8d 7f 08             lea    0x8(%rdi),%rdi
  12:   31 c0                   xor    %eax,%eax
  14:   48 c7 47 19 00 00 00    movq   $0x0,0x19(%rdi)
  1b:   00
  1c:   48 83 e7 f8             and    $0xfffffffffffffff8,%rdi
  20:   48 29 f9                sub    %rdi,%rcx
  23:   83 c1 29                add    $0x29,%ecx
  26:   c1 e9 03                shr    $0x3,%ecx
  29:   f3 48 ab                rep stos %rax,%es:(%rdi)
  2c:   c3                      ret

Looks like your cut-off from regular ops to rep is at 40 bytes.
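
For comparison, a hand-unrolled equivalent of the same zeroing, roughly the
shape the unrolled_loop strategy generates (an illustrative sketch, not
verbatim compiler output):

#include <string.h>

struct foo {
        char buf[41];
};

void zero_unrolled(struct foo *f)
{
        unsigned long long z = 0;
        char *p = f->buf;

        memcpy(p,      &z, 8);  /* bytes  0..7  */
        memcpy(p + 8,  &z, 8);  /* bytes  8..15 */
        memcpy(p + 16, &z, 8);  /* bytes 16..23 */
        memcpy(p + 24, &z, 8);  /* bytes 24..31 */
        memcpy(p + 32, &z, 8);  /* bytes 32..39 */
        memcpy(p + 33, &z, 8);  /* bytes 33..40, overlap covers the tail */
}

Each 8-byte memcpy compiles to a single movq store even with -mno-sse, so
there is no rep and no loop at a size this small.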

Suggested action: raise the cutoff to above 256 bytes.

rep was always a problem and it seems it remains one.
