https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #22 from GCC Commits ---
The master branch has been updated by H.J. Lu :
https://gcc.gnu.org/g:d073bb6cfc219d4b6c283a0b527ee88b42e640e0
commit r16-1643-gd073bb6cfc219d4b6c283a0b527ee88b42e640e0
Author: H.J. Lu
Date: Thu Mar 18 1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #21 from Mateusz Guzik ---
Given the issues outlined in 119703 and 119704 I decided to microbench 2 older
uarchs with select sizes. Note a better quality test which does not merely
microbenchmark memset or memcpy is above for one rea
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment #
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #19 from Mateusz Guzik ---
The results in PR 95435 look suspicious to me, so I had a better look at the
bench script and I'm confident it is bogus.
The compiler emits ops sized 0..2 * n - 1, where n is the reported block size.
For
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #18 from Mateusz Guzik ---
Ok, I see.
I think I also see the discrepancy here.
When you bench "libcall", you are going to glibc with SIMD-enabled routines.
In contrast, the kernel avoids SIMD for performance reasons and instead wi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #17 from Uroš Bizjak ---
(In reply to Alexander Monakov from comment #16)
> Mateusz, please have a look at PR 95435 for the previous round of tuning for
> AMD, there's a benchmarking script linked from there in PR 43052.
FYI, this b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
--- Com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
Andrew Pinski changed:
What|Removed |Added
Status|NEW |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #15 from Mateusz Guzik ---
so tl;dr
Suggested action: don't use rep for sizes <= 256 by default
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #14 from Mateusz Guzik ---
So I reran the bench on AMD EPYC 9R14 and also experienced a win.
To recap, gcc emits rep movsq/stosq for sizes > 40. I'm replacing that with
unrolled loops for sizes up to 256 and punting to actual funcs p
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #13 from Mateusz Guzik ---
I see there is a significant disconnect here between what I meant with this
problem report and your perspective, so I'm going to be more explicit.
Of course for best performance on a given uarch you would
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
ak at gcc dot gnu.org changed:
What|Removed |Added
Status|RESOLVED|NEW
Resolution|DUPLICATE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #11 from ak at gcc dot gnu.org ---
#define m_CORE_AVX512 (m_SKYLAKE_AVX512 | m_CANNONLAKE \
| m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_CASCADELAKE \
| m_TIGERLAKE | m_COOPERLAKE | m_SAPPHIR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
ak at gcc dot gnu.org changed:
What|Removed |Added
CC||ak at gcc dot gnu.org
--- Commen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #8 from Mateusz Guzik ---
(In reply to Andrew Pinski from comment #6)
> (In reply to Mateusz Guzik from comment #4)
> > The gcc default for the generic target is poor. rep is known to be a problem
> > on most uarchs.
>
> Is it though
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #6 from Andrew Pinski ---
(In reply to Mateusz Guzik from comment #4)
> The gcc default for the generic target is poor. rep is known to be a problem
> on most uarchs.
Is it though? Or is it only poor on Intel ones?
With -mtune=inte
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
Andrew Pinski changed:
What|Removed |Added
Resolution|--- |DUPLICATE
Status|WAITING
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #5 from Andrew Pinski ---
>Benching based on the Linux kernel and the Sapphire Rapids CPU:
With -mtune=sapphirerapids , GCC produces:
```
_Z4zeroP3foo:
.LFB0:
.cfi_startproc
mov QWORD PTR [rdi], 0
mov
```
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #4 from Mateusz Guzik ---
Sorry guys, I must have pressed something by accident and the bug submitted
before I typed it out.
Anyhow the crux is:
(In reply to Andrew Pinski from comment #1)
> This is 100% a tuning issue. The generic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #3 from Mateusz Guzik ---
Normally inlined memset and memcpy ops use SIMD.
However, kernels are built with -mno-sse for performance reasons.
For buffers up to 40 bytes gcc emits regular stores, which is fine. For sizes
above tha
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
Andrew Pinski changed:
What|Removed |Added
Ever confirmed|0 |1
Last reconfirmed|
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #1 from Andrew Pinski ---
This is 100% a tuning issue. The generic tuning is tuned for a generic target.
You could use -mtune= to get a better tuning for the processor you are using.
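
For readers who want to try the tuning comparison themselves, here is a minimal sketch of a test case. It is an illustration only and is not taken from the bug report: the struct layout, the 200-byte size (chosen to sit between the 40-byte and 256-byte figures mentioned in the thread), and the helper names `copy` and `is_zeroed` are all assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical size: larger than 40 bytes, no larger than 256 bytes. */
#define FOO_SIZE 200

struct foo {
    unsigned char buf[FOO_SIZE];
};

/* Fixed-size clear. Because the size is a compile-time constant, the
 * compiler is free to expand this inline rather than call memset. */
void zero(struct foo *f)
{
    memset(f, 0, sizeof(*f));
}

/* Fixed-size copy, the memcpy counterpart of zero(). */
void copy(struct foo *dst, const struct foo *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/* Returns 1 if every byte of the buffer is zero, 0 otherwise. */
int is_zeroed(const struct foo *f)
{
    for (size_t i = 0; i < sizeof(f->buf); i++) {
        if (f->buf[i] != 0)
            return 0;
    }
    return 1;
}
```

Compiling this with something like `gcc -O2 -mno-sse -S` and then repeating with different `-mtune=` values (for example `-mtune=generic` and `-mtune=sapphirerapids`) should let you compare the code generated for `zero` and `copy`. The exact output depends on the GCC version and target, so treat any single result as a data point rather than a rule.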