https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #3 from Mateusz Guzik <mjguzik at gmail dot com> ---
Normally inlined memset and memcpy ops use SIMD. However, kernels are built
with -mno-sse for performance reasons. For buffers up to 40 bytes gcc emits
regular stores, which is fine. For sizes above that it resorts to rep
stosq/movsq, which constitutes a performance problem. Both rep stosq and rep
movsq are known to be slow to start and it is recommended they be avoided for
small sizes, although the specific cutoff point differs wildly based on uarch.

Benchmarks below are based on the Linux kernel running on a Sapphire Rapids
CPU. The stock kernel suffers the rep movsq/stosq sequences, while the rebuilt
kernel has unrolled loops in place and a punt to the library routine as
needed, via:

-mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
-mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign

A microbenchmark issuing fstat(): the codepath encounters a memset of 168
bytes. Replacing rep stosq with a loop makes it go from 8.5 mln to 9 mln
ops/s.

This can raise a concern about i-cache footprint. A more involved test
compiles a hello world in a loop; part of the perf problem there is a routine
which once more has an inline memcpy of a size similar to the above.
Rebuilding boosts the compilation rate (files per second) by about 1.7% while
also reducing time spent in the kernel vs time spent in userspace. That is,
this is not a change which only helps a microbenchmark; rather, rep
stosq/movsq remains a problem across workloads.

Note Sapphire Rapids ships with FSRM and it did not help with this one. More
details on that one:
https://lore.kernel.org/all/CAGudoHEV-PFSr-xKUx5GkTf4KasJc=annzqbkotnfvliskt...@mail.gmail.com/T/#m41038aef33289a47112b9a213b1a904a3f13eded

Here is a testcase:

struct foo {
        char buf[41];
};

#define memset __builtin_memset

void zero(struct foo *f)
{
        memset(f->buf, 0, sizeof(f->buf));
}

When compiled with -O2 -mno-sse, gcc 14.2 emits the following rep stos:

   0:   f3 0f 1e fa             endbr64
   4:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
   b:   48 89 f9                mov    %rdi,%rcx
   e:   48 8d 7f 08             lea    0x8(%rdi),%rdi
  12:   31 c0                   xor    %eax,%eax
  14:   48 c7 47 19 00 00 00    movq   $0x0,0x19(%rdi)
  1b:   00
  1c:   48 83 e7 f8             and    $0xfffffffffffffff8,%rdi
  20:   48 29 f9                sub    %rdi,%rcx
  23:   83 c1 29                add    $0x29,%ecx
  26:   c1 e9 03                shr    $0x3,%ecx
  29:   f3 48 ab                rep stos %rax,%es:(%rdi)
  2c:   c3                      ret

It looks like the cutoff from regular stores to rep is at 40 bytes.

Suggested action: raise the cutoff to above 256. rep was always a problem and
it seems it remains one.
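For reference, the same testcase can be rebuilt with the strategy switch
quoted earlier to sidestep the rep sequence; a possible invocation (the file
name is made up here, and the 256-byte threshold simply mirrors the one used
for the kernel rebuild above):

gcc -O2 -mno-sse -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign -c zero.c

With that, a memset of up to 256 bytes should come out as an unrolled store
loop, and anything larger should punt to the library call, matching what the
rebuilt kernel was doing.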
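For completeness, here is a minimal sketch of the kind of fstat()
microbenchmark described above: it hammers fstat() on a single fd for a fixed
interval and reports ops/s. This is only illustrative; the numbers quoted
earlier came from a different harness, and the file path, batch size and
duration here are made up.

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        struct stat st;
        struct timespec start, now;
        unsigned long long ops = 0;
        int fd = open("/tmp", O_RDONLY); /* arbitrary target */

        if (fd < 0)
                return 1;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                /* batch the syscalls so the clock reads stay off the hot path */
                for (int i = 0; i < 4096; i++)
                        if (fstat(fd, &st) != 0)
                                return 1;
                ops += 4096;
                clock_gettime(CLOCK_MONOTONIC, &now);
        } while (now.tv_sec - start.tv_sec < 5);

        printf("%llu ops/s\n", ops / 5);
        close(fd);
        return 0;
}

Each fstat() call goes through the kernel codepath with the 168-byte memset,
so the difference between a rep stosq kernel and an unrolled-loop kernel shows
up directly in the reported rate.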