[Bug target/102294] memset expansion is sometimes slow for small sizes

2025-06-06 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 --- Comment #32 from Mateusz Guzik --- For non-SIMD asm you can do at most 8 bytes per mov instruction. Stock gcc resorts to rep movsq for sizes bigger than 40 bytes. Telling it not to use rep movsq results in loops of 4 movsq instructions
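[Editor's sketch, not GCC output: the "loops of 4 movsq" shape above corresponds roughly to zeroing with four scalar 8-byte stores per loop iteration. Names and the multiple-of-4 assumption are illustrative only.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C shape of the non-SIMD expansion discussed above:
 * zero a buffer with scalar 8-byte stores, unrolled 4 per loop
 * iteration, instead of a single "rep stosq"/"rep movsq".
 * nwords is the number of 8-byte words; assumed a multiple of 4. */
static void zero4(uint64_t *p, size_t nwords)
{
    for (size_t i = 0; i + 4 <= nwords; i += 4) {
        p[i + 0] = 0;   /* each store compiles to one 8-byte mov */
        p[i + 1] = 0;
        p[i + 2] = 0;
        p[i + 3] = 0;
    }
}
```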

[Bug target/102294] memset expansion is sometimes slow for small sizes

2025-06-06 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 --- Comment #30 from Mateusz Guzik --- (In reply to H.J. Lu from comment #29) > (In reply to Mateusz Guzik from comment #28) > > (In reply to H.J. Lu from comment #27) > > > (In reply to Mateusz Guzik from comment #26) > > > > 4 stores per loop

[Bug target/102294] memset expansion is sometimes slow for small sizes

2025-06-06 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 --- Comment #28 from Mateusz Guzik --- (In reply to H.J. Lu from comment #27) > (In reply to Mateusz Guzik from comment #26) > > 4 stores per loop is best > > Do you have data to show it? I used to, but I'm out of this game. However, this is

[Bug target/102294] memset expansion is sometimes slow for small sizes

2025-06-05 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 --- Comment #26 from Mateusz Guzik --- 4 stores per loop is best. It is libcalls after 256, which is fine.

[Bug target/102294] memset expansion is sometimes slow for small sizes

2025-06-05 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 --- Comment #24 from Mateusz Guzik --- I got the thing compiled against top of git. with this as a testcase: void zero(char *buf) { __builtin_memset(buf, 0, SIZE); } compiled like so: ./xgcc -O2 -DSIZE=128 -mno-sse -c ~/zero.c && objdu
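[Editor's note: the testcase quoted above, made self-contained so it can be run directly. SIZE is fixed at 128 here instead of being passed via -DSIZE; __builtin_memset is a GCC/Clang builtin.]

```c
#include <assert.h>
#include <string.h>

#define SIZE 128

/* The testcase from the comment: the compiler expands this
 * __builtin_memset inline; with -mno-sse the choice is between
 * plain scalar stores, unrolled loops, and "rep stosq". */
void zero(char *buf)
{
    __builtin_memset(buf, 0, SIZE);
}
```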

[Bug target/102294] memset expansion is sometimes slow for small sizes

2025-06-05 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 --- Comment #21 from Mateusz Guzik --- I presume H.J. Lu can readily compile gcc or even has one with the patch around. I don't. On the other hand I provided a trivial testcase not requiring any setup.

[Bug target/102294] memset expansion is sometimes slow for small sizes

2025-06-05 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 --- Comment #19 from Mateusz Guzik --- Can you show me what this disassembles to? Note that kernel builds disable SIMD. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596#c3 for sample code.

[Bug target/102294] memset expansion is sometimes slow for small sizes

2025-06-05 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 --- Comment #17 from Mateusz Guzik --- Any plans to push this forward? This is affecting the Linux kernel; see 119596.

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-10 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #21 from Mateusz Guzik --- Given the issues outlined in 119703 and 119704 I decided to microbench 2 older uarchs with select sizes. Note a better-quality test which does not merely microbenchmark memset or memcpy is above for one rea

[Bug c/119703] New: x86: spurious branches for inlined memset in ranges (40; 64) when requesting unrolled loops without simd

2025-04-09 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119703 Bug ID: 119703 Summary: x86: spurious branches for inlined memset in ranges (40; 64) when requesting unrolled loops without simd Product: gcc Version: 15.0 Status: UNCON

[Bug c/119704] New: x86: partially disobeyed strategy rep-based request for inlined memset

2025-04-09 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119704 Bug ID: 119704 Summary: x86: partially disobeyed strategy rep-based request for inlined memset Product: gcc Version: 15.0 Status: UNCONFIRMED Severity: normal

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-04 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #19 from Mateusz Guzik --- The results in PR 95435 look suspicious to me, so I had a better look at the bench script and I'm confident it is bogus. The compiler emits ops sized 0..2 * n - 1, where n is the reported block size. For
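[Editor's sketch of the arithmetic behind the objection above, under the comment's stated assumption: if the bench reports block size n but actually emits ops sized 0..2n-1 uniformly, the mean op size is (2n-1)/2, so roughly half the ops exceed the reported size. The function name is hypothetical.]

```c
/* Mean of the op sizes 0 .. 2n-1 that the bench actually emits,
 * versus the single block size n it reports. */
static double mean_emitted_size(unsigned n)
{
    unsigned long sum = 0;
    for (unsigned s = 0; s < 2 * n; s++)
        sum += s;
    return (double)sum / (2 * n);   /* equals (2n - 1) / 2 */
}
```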

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #18 from Mateusz Guzik --- Ok, I see. I think I also see the discrepancy here. When you bench "libcall", you are going to glibc with SIMD-enabled routines. In contrast, the kernel avoids SIMD for performance reasons and instead wi

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #15 from Mateusz Guzik --- so tl;dr Suggested action: don't use rep for sizes <= 256 by default

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #14 from Mateusz Guzik --- So I reran the bench on AMD EPYC 9R14 and also experienced a win. To recap gcc emits rep movsq/stosq for sizes > 40. I'm replacing that with unrolled loops for sizes up to 256 and punting to actual funcs p
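[Editor's sketch of the replacement strategy described above: unrolled 8-byte store loops for sizes up to 256, a real memset call past that. The cutoffs are the ones named in the comment; the function, its name, and the multiple-of-8 length assumption are illustrative, not GCC internals.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Dispatch sketch: sizes <= 256 get an unrolled 8-byte store loop
 * (4 stores per iteration); larger sizes punt to the real memset.
 * Alignment and tail handling are simplified: len is assumed to be
 * a multiple of 8 bytes. */
static void zero_dispatch(uint64_t *p, size_t len)
{
    if (len > 256) {
        memset(p, 0, len);          /* libcall past the cutoff */
        return;
    }
    size_t n = len / 8;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {    /* 4 stores per iteration */
        p[i + 0] = 0;
        p[i + 1] = 0;
        p[i + 2] = 0;
        p[i + 3] = 0;
    }
    for (; i < n; i++)              /* remaining 0..3 words */
        p[i] = 0;
}
```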

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #13 from Mateusz Guzik --- I see there is a significant disconnect here between what I meant with this problem report and your perspective, so I'm going to be more explicit. Of course for best performance on a given uarch you would

[Bug c/119596] New: x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 Bug ID: 119596 Summary: x86: too eager use of rep movsq/rep stosq for inlined ops Product: gcc Version: 14.2.0 Status: UNCONFIRMED Severity: normal P

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #8 from Mateusz Guzik --- (In reply to Andrew Pinski from comment #6) > (In reply to Mateusz Guzik from comment #4) > > The gcc default for the generic target is poor. rep is known to be a problem > > on most uarchs. > > Is it though

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #4 from Mateusz Guzik --- Sorry guys, I must have pressed something by accident and the bug got submitted before I typed it out. Anyhow the crux is: (In reply to Andrew Pinski from comment #1) > This is 100% a tuning issue. The generic

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #3 from Mateusz Guzik --- Normally inlined memset and memcpy ops use SIMD. However, kernels are built with -mno-sse for performance reasons. For buffers up to 40 bytes gcc emits regular stores, which is fine. For sizes above tha