https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #32 from Mateusz Guzik ---
For non-SIMD asm you can do at most 8 bytes per mov instruction.
Stock gcc resorts to rep movsq for sizes bigger than 40 bytes. Telling it
not to use rep movsq results in loops of 4 movsq instructions.
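For anyone wanting to compare the two expansions, GCC exposes this through the
-mstringop-strategy flag; a minimal sketch (the testcase is mine, and the exact
codegen varies by GCC version and tuning):

    /* copy128.c */
    void copy128(char *dst, const char *src)
    {
        __builtin_memcpy(dst, src, 128);
    }

    $ gcc -O2 -mno-sse -mstringop-strategy=rep_8byte     -S copy128.c   # rep movsq
    $ gcc -O2 -mno-sse -mstringop-strategy=unrolled_loop -S copy128.c   # unrolled movq loop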
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #30 from Mateusz Guzik ---
(In reply to H.J. Lu from comment #29)
> (In reply to Mateusz Guzik from comment #28)
> > (In reply to H.J. Lu from comment #27)
> > > (In reply to Mateusz Guzik from comment #26)
> > > > 4 stores per loop is best
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #28 from Mateusz Guzik ---
(In reply to H.J. Lu from comment #27)
> (In reply to Mateusz Guzik from comment #26)
> > 4 stores per loop is best
>
> Do you have data to show it?
I used to, but I'm out of this game.
However, this is …
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #26 from Mateusz Guzik ---
4 stores per loop is best
it is libcalls after 256, which is fine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #24 from Mateusz Guzik ---
I got the thing compiled against top of git.
with this as a testcase:

    void zero(char *buf)
    {
        __builtin_memset(buf, 0, SIZE);
    }

compiled like so:

    ./xgcc -O2 -DSIZE=128 -mno-sse -c ~/zero.c && objdump …
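The tail of that command line is cut off in the excerpt; a plausible completion
(the -B flag and the objdump arguments are my assumption, not from the original
comment) would be:

    ./xgcc -B. -O2 -DSIZE=128 -mno-sse -c ~/zero.c && objdump -d zero.o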
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #21 from Mateusz Guzik ---
I presume H.J. Lu can readily compile gcc or even has one with the patch
around. I don't. On the other hand, I provided a trivial testcase not requiring
any setup.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #19 from Mateusz Guzik ---
Can you show me what this disassembles to?
Note that kernel builds disable SIMD.
See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596#c3 for sample code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #17 from Mateusz Guzik ---
Any plans to push this forward? This is affecting the Linux kernel; see 119596.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #21 from Mateusz Guzik ---
Given the issues outlined in 119703 and 119704, I decided to microbench 2 older
uarchs with select sizes. Note that a better-quality test, one which does not
merely microbenchmark memset or memcpy, is above, for one rea…
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119703
Bug ID: 119703
Summary: x86: spurious branches for inlined memset in ranges
(40; 64) when requesting unrolled loops without simd
Product: gcc
Version: 15.0
Status: UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119704
Bug ID: 119704
Summary: x86: partially disobeyed strategy rep-based request
for inlined memset
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #19 from Mateusz Guzik ---
The results in PR 95435 look suspicious to me, so I took a closer look at the
bench script and I'm confident it is bogus.
The compiler emits ops sized 0..2*n - 1, where n is the reported block size.
For …
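The excerpt cuts off there, but the arithmetic of the claim is easy to
illustrate. Below is a hypothetical loop with the flaw being described; it is a
reconstruction for illustration only, not the actual bench script:

    #include <string.h>

    /* Reports block size n, but actually issues stores of every size
     * in [0, 2*n), e.g. 0..127 bytes when n = 64. */
    void bench_iter(char *buf, unsigned n, unsigned i)
    {
        memset(buf, 0, i % (2 * n));
    }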
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #18 from Mateusz Guzik ---
Ok, I see.
I think I also see the discrepancy here.
When you bench "libcall", you are going to glibc with SIMD-enabled routines.
In contrast, the kernel avoids SIMD for performance reasons and instead will …
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #15 from Mateusz Guzik ---
so tl;dr:
Suggested action: don't use rep for sizes <= 256 by default.
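For experimentation, something close to this policy can already be requested
with GCC's per-size strategy knobs; a sketch (the flag syntax follows the GCC
manual, but the specific triplets are my choice, not a committed default):

    # unrolled loops up to 256 bytes, library call beyond that
    gcc -O2 -mno-sse \
        -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign \
        -c zero.c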
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #14 from Mateusz Guzik ---
So I reran the bench on an AMD EPYC 9R14 and also saw a win.
To recap: gcc emits rep movsq/stosq for sizes > 40. I'm replacing that with
unrolled loops for sizes up to 256 and punting to the actual funcs p…
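As a rough C-level illustration of the replacement shape (a minimal sketch, not
the actual patch; it assumes len is a multiple of 32, skips tail handling, and
keeps the pointer cast simple for illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* Four 8-byte stores per iteration -- with -mno-sse this is the
     * kind of unrolled movq loop discussed above. */
    static void zero_unrolled(char *buf, size_t len)
    {
        uint64_t *p = (uint64_t *)buf;
        for (size_t i = 0; i < len / 32; i++) {
            p[0] = 0;
            p[1] = 0;
            p[2] = 0;
            p[3] = 0;
            p += 4;
        }
    }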
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #13 from Mateusz Guzik ---
I see there is a significant disconnect between what I meant by this problem
report and your perspective, so I'm going to be more explicit.
Of course, for best performance on a given uarch you would …
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
Bug ID: 119596
Summary: x86: too eager use of rep movsq/rep stosq for inlined
ops
Product: gcc
Version: 14.2.0
Status: UNCONFIRMED
Severity: normal
Priority: …
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #8 from Mateusz Guzik ---
(In reply to Andrew Pinski from comment #6)
> (In reply to Mateusz Guzik from comment #4)
> > The gcc default for the generic target is poor. rep is known to be a problem
> > on most uarchs.
>
> Is it though …
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #4 from Mateusz Guzik ---
Sorry guys, I must have pressed something by accident and the bug got submitted
before I typed it out.
Anyhow, the crux is:
(In reply to Andrew Pinski from comment #1)
> This is 100% a tuning issue. The generic …
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #3 from Mateusz Guzik ---
Normally inlined memset and memcpy ops use SIMD.
However, kernels are built with -mno-sse for performance reasons.
For buffers up to 40 bytes gcc emits regular stores, which is fine. For sizes
above that …
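For the small-size case, the "regular stores" look roughly like this (the
assembly is an illustration of the expected shape, not captured output; exact
registers and ordering vary by GCC version):

    /* gcc -O2 -mno-sse -S zero32.c */
    void zero32(char *buf)
    {
        __builtin_memset(buf, 0, 32);   /* <= 40 bytes: plain stores */
    }

    /* expected shape:
     *     movq $0, (%rdi)
     *     movq $0, 8(%rdi)
     *     movq $0, 16(%rdi)
     *     movq $0, 24(%rdi)
     *     ret
     */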