[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-06-23 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #22 from GCC Commits --- The master branch has been updated by H.J. Lu : https://gcc.gnu.org/g:d073bb6cfc219d4b6c283a0b527ee88b42e640e0 commit r16-1643-gd073bb6cfc219d4b6c283a0b527ee88b42e640e0 Author: H.J. Lu Date: Thu Mar 18 1

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-10 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #21 from Mateusz Guzik --- Given the issues outlined in 119703 and 119704 I decided to microbench 2 older uarchs with select sizes. Note a better quality test which does not merely microbenchmark memset or memcpy is above for one rea

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-06 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-04 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #19 from Mateusz Guzik --- The results in PR 95435 look suspicious to me, so I had a better look at the bench script and I'm confident it is bogus. The compiler emits ops sized 0..2 * n - 1, where n is the reported block size. For

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #18 from Mateusz Guzik --- Ok, I see. I think I also see the discrepancy here. When you bench "libcall", you are going to glibc with SIMD-enabled routines. In contrast, the kernel avoids SIMD for performance reasons and instead wi

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #17 from Uroš Bizjak --- (In reply to Alexander Monakov from comment #16) > Mateusz, please have a look at PR 95435 for the previous round of tuning for > AMD, there's a benchmarking script linked from there in PR 43052. FYI, this b

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Com

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 Andrew Pinski changed: What|Removed |Added Status|NEW |RESOLVED Resolution|---

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #15 from Mateusz Guzik --- so tl;dr Suggested action: don't use rep for sizes <= 256 by default

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #14 from Mateusz Guzik --- So I reran the bench on AMD EPYC 9R14 and also experienced a win. To recap gcc emits rep movsq/stosq for sizes > 40. I'm replacing that with unrolled loops for sizes up to 256 and punting to actual funcs p
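
The size-based split described in comment #14 can be sketched roughly as follows. This is only an illustration, not the patch discussed in the bug: `clear_bytes`, `clear_bytes_ok`, `self_check` and `INLINE_MAX` are hypothetical names, the 4x unroll factor is an assumption, and sending sizes above 256 to the regular `memset` is an assumption based on the truncated "punting to actual funcs" wording. An optimizing compiler is free to recognise the loop and rewrite it, so the sketch shows the shape of the idea rather than guaranteed code generation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Threshold taken from the comment; everything else is illustrative. */
#define INLINE_MAX 256

/* Hypothetical helper: clear `len` bytes at `dst`.
   Sizes up to INLINE_MAX use a 4x unrolled loop of 8-byte stores,
   larger sizes are handed to the regular memset (an assumption). */
void clear_bytes(void *dst, size_t len)
{
    unsigned char *p = dst;
    const uint64_t z = 0;

    if (len > INLINE_MAX) {
        memset(p, 0, len);
        return;
    }
    /* Main unrolled loop: four 8-byte stores per iteration. */
    while (len >= 32) {
        memcpy(p, &z, 8);
        memcpy(p + 8, &z, 8);
        memcpy(p + 16, &z, 8);
        memcpy(p + 24, &z, 8);
        p += 32;
        len -= 32;
    }
    /* Remaining whole 8-byte words. */
    while (len >= 8) {
        memcpy(p, &z, 8);
        p += 8;
        len -= 8;
    }
    /* Byte tail. */
    while (len > 0) {
        *p++ = 0;
        len--;
    }
}

/* Returns 1 if clear_bytes zeroes exactly `len` bytes and leaves the
   guard byte after them untouched, 0 otherwise (or if len > 512). */
int clear_bytes_ok(size_t len)
{
    unsigned char buf[512 + 1];

    if (len > 512)
        return 0;
    memset(buf, 0xff, sizeof(buf));
    clear_bytes(buf, len);
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return 0;
    return buf[len] == 0xff;
}

/* Exercise sizes on both sides of the 40-byte and 256-byte marks. */
void self_check(void)
{
    assert(clear_bytes_ok(0));
    assert(clear_bytes_ok(40));
    assert(clear_bytes_ok(41));
    assert(clear_bytes_ok(256));
    assert(clear_bytes_ok(257));
}
```

Calling `self_check()` from a test harness confirms the helper behaves like a plain clear for the sizes of interest; whether it is faster than `rep stosq` on a given uarch is exactly what the benchmarks in this thread are about and is not something the sketch demonstrates.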

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #13 from Mateusz Guzik --- I see there is a significant disconnect here between what I meant with this problem report and your perspective, so I'm going to be more explicit. Of course for best performance on a given uarch you would

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread ak at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 ak at gcc dot gnu.org changed: What|Removed |Added Status|RESOLVED|NEW Resolution|DUPLICATE

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread ak at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #11 from ak at gcc dot gnu.org --- #define m_CORE_AVX512 (m_SKYLAKE_AVX512 | m_CANNONLAKE \ | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_CASCADELAKE \ | m_TIGERLAKE | m_COOPERLAKE | m_SAPPHIR

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-03 Thread ak at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 ak at gcc dot gnu.org changed: What|Removed |Added CC||ak at gcc dot gnu.org --- Commen

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #8 from Mateusz Guzik --- (In reply to Andrew Pinski from comment #6) > (In reply to Mateusz Guzik from comment #4) > > The gcc default for the generic target is poor. rep is known to be a problem > > on most uarchs. > > Is it though

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #6 from Andrew Pinski --- (In reply to Mateusz Guzik from comment #4) > The gcc default for the generic target is poor. rep is known to be a problem > on most uarchs. Is it though? Or is it only poor on Intel ones? With -mtune=inte

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 Andrew Pinski changed: What|Removed |Added Resolution|--- |DUPLICATE Status|WAITING

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #5 from Andrew Pinski --- >Benching based on the Linux kernel and the Sapphire Rapids CPU: With -mtune=sapphirerapids , GCC produces: ``` _Z4zeroP3foo: .LFB0: .cfi_startproc mov QWORD PTR [rdi], 0 mov

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #4 from Mateusz Guzik --- Sorry guys, I must have pressed something by accident and the bug submitted before I typed it out. Anyhow the crux is: (In reply to Andrew Pinski from comment #1) > This is 100% a tuning issue. The generic

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread mjguzik at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #3 from Mateusz Guzik --- Normally inlined memset and memcpy ops use SIMD. However, kernels are built with -mno-sse for performance reasons. For buffers up to 40 bytes gcc emits regular stores, which is fine. For sizes above tha
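
A minimal sketch of the kind of code comment #3 is about, for anyone who wants to inspect the generated assembly. The names `zero` and `struct foo` echo the mangled symbol `_Z4zeroP3foo` quoted in comment #5; the 64-byte size, `copy`, `foo_is_zero` and `self_check` are assumptions made for illustration. Compiling it with something like `gcc -O2 -mno-sse -S` and reading the output shows whether plain stores or `rep stosq`/`rep movsq` were chosen; the result depends on the GCC version and the `-mtune` setting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical payload size, chosen to sit above the 40-byte mark
   mentioned in the comment. */
#define FOO_SIZE 64

struct foo {
    unsigned char buf[FOO_SIZE];
};

/* Fixed-size clear: the compiler may expand this memset inline. */
void zero(struct foo *f)
{
    memset(f, 0, sizeof(*f));
}

/* Fixed-size copy: the compiler may expand this memcpy inline. */
void copy(struct foo *dst, const struct foo *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/* Returns 1 if every byte of the object is zero, 0 otherwise. */
int foo_is_zero(const struct foo *f)
{
    for (size_t i = 0; i < sizeof(f->buf); i++)
        if (f->buf[i] != 0)
            return 0;
    return 1;
}

/* Small self-check so the sketch can be run as well as disassembled. */
void self_check(void)
{
    struct foo a, b;

    memset(&a, 0xab, sizeof(a));
    zero(&b);
    assert(!foo_is_zero(&a));
    assert(foo_is_zero(&b));
    copy(&a, &b);
    assert(foo_is_zero(&a));
}
```

Changing `FOO_SIZE` to values at or below 40 and above it is a simple way to see where the compiler switches strategy for a given set of flags.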

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 Andrew Pinski changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed|

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

2025-04-02 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596 --- Comment #1 from Andrew Pinski --- This is 100% a tuning issue. The generic tuning is tuned for a generic target. You could use -mtune= to get a better tuning for the processor you are using.