> -----Original Message-----
> From: Jan Hubicka <hubi...@ucw.cz>
> Sent: Monday, April 21, 2025 6:35 PM
> To: H.J. Lu <hjl.to...@gmail.com>
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao <hongtao....@intel.com>;
> ubiz...@gmail.com
> Subject: Re: [PATCH v2] x86: Update memcpy/memset inline strategies for
> -mtune=generic
> 
> > On Mon, Apr 21, 2025 at 7:24 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >
> > > On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka <hubi...@ucw.cz> wrote:
> > > >
> > > > >       PR target/102294
> > > > >       PR target/119596
> > > > >       * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> > > > >       (generic_memset): Likewise.
> > > > >       (generic_cost): Change CLEAR_RATIO to 17.
> > > > >       * config/i386/x86-tune.def
> > > > >       (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): Add m_GENERIC.
> > > >
> > > > Looking through the PRs, they are primarily about CLEAR_RATIO
> > > > being lower than clang's, which makes us produce a slower (but
> > > > smaller) initialization sequence for blocks of certain sizes.
> > > > It seems the kernel is discussed there too (-mno-sse).
> > > >
> > > > Bumping it up for SSE makes sense provided that SSE codegen does
> > > > not suffer from the long $0 immediates. I would say it is OK also
> > > > for -mno-sse provided speedups are quite noticeable, but it would
> > > > be really nice to solve this incrementally.
> > > >
> > > > Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my
> > > > understanding is that Intel chips like stosb for small blocks,
> > > > since they are not optimized for stosw/q.  Zen seems to prefer
> > > > stosq over stosb for blocks up to 128 bytes.
> > > >
> > > > How does the loop version compare to stosb for blocks in the
> > > > range of 1...128 bytes on Intel hardware?
> > > >
> > > > For the case where we prove the block size to be small but do
> > > > not know the exact size, I think using a loop or an unrolled
> > > > sequence for blocks up to, say, 128 bytes may work well for both.
> > > >
> > > > Honza
> > >
> > > My patch has a 256-byte threshold.  Are you suggesting changing it
> > > to 128 bytes?
> > >
> >
> > The 256-byte threshold was selected since MOVE_RATIO and CLEAR_RATIO
> > are 17, which allows up to 16 moves of 16 bytes (256 bytes).  To lower
> > the threshold to 128 bytes, MOVE_RATIO/CLEAR_RATIO would have to be
> > changed to 9.  Do we want to do that?
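> > 
> > As a small illustrative sketch of that arithmetic (not part of the
> > patch): with CLEAR_RATIO 17 the expander may emit up to 16 stores,
> > and with 16-byte SSE stores that covers exactly 256 bytes:
> > 
> >   struct buf { char data[256]; };
> > 
> >   void
> >   clear_buf (struct buf *b)
> >   {
> >     /* 256 bytes == 16 stores of 16 bytes -> still expanded inline.  */
> >     __builtin_memset (b, 0, sizeof *b);
> >   }
> > 
> > A 272-byte clear would already need 17 stores and would fall back to
> > the selected stringop algorithm.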
> 
> Your patch does 3 things:
>  1) increases CLEAR_RATIO to 17, so we use up to 16 moves (SSE, or
>     integer if -mno-sse is used)
>  2) changes the algorithm choice tables for both memset and memcpy
>  3) enables X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for generic
> 
> My understanding is that it is primarily motivated by testcases where
> the block size is known and we use rep movsq, while a sequence of move
> instructions executes faster in micro-benchmarks on Intel hardware.
> As Andy mentioned, rep movsq is problematic because there is no small
> block optimization (which I did not know and is indeed something to
> work with).
> 
> About 1:
> 
> CLEAR_RATIO is a bit of a compromise between code size and speed,
> i.e. 16 times mov $0, mem is approx 128 bytes, while rep stosq is
> 3 bytes plus some setup to load data into specific registers.
> A call and a loop are also shorter.
> 
> We originally put CLEAR_RATIO < MOVE_RATIO based on the observation that
>   mov $0, mem
> is longer in encoding than
>   mov mem, mem
> and there was a plan to implement an optimization to avoid long
> immediates in moves, but it did not materialize (yet).  With SSE this
> problem disappears since SSE stores do not have immediates anyway.
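> 
> A rough sketch of the encoding difference (illustrative instruction
> forms only, not the exact codegen of any particular function):
> 
>   movq   $0x0,0x8(%rdi)     # integer store carries a 32-bit $0 immediate
> 
>   pxor   %xmm0,%xmm0        # zero the register once
>   movaps %xmm0,0x10(%rdi)   # the SSE store itself has no immediate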
> 
> I think we can increase CLEAR_RATIO if there is a good reason, but if
> we consider micro-benchmarks alone we will likely see a win for quite
> large MOVE/CLEAR_RATIO values, while on real code we want to take the
> code size aspect into account too (i.e. do we want to increase the
> code size of the memset sequence 10-fold to get a speedup, and how
> large is that speedup?).
> 

Hi Honza,

I just created a patch to change CLEAR_RATIO to 17 for -mtune=generic.
There is nearly no performance or code size impact on CPU2017, and the
code size of the latest Linux kernel increased by 0.06% with the patch
(built with defconfig).
I built spec-femflow (git clone
https://github.com/kronbichler/spec-femflow.git) with and without the
patch: the speed improved by 36% on ZNVER5 and 20% on ADL (P-core), and
the code size increased by 0.12%.

Option: -march=x86-64-v3 -mtune=generic -O2

The binary changes look like the following:
----------------------------------------------
spec-femflow with base:
xor    %eax,%eax
mov    $0x18,%ecx
rep stos %rax,%es:(%rdi)

spec-femflow with patch:
vpxor  %xmm0,%xmm0,%xmm0
vmovdqa %ymm0,(%rax)
vmovdqa %ymm0,-0xa0(%rax)
vmovdqa %ymm0,-0x80(%rax)
vmovdqa %ymm0,-0x60(%rax)
vmovdqa %ymm0,-0x40(%rax)
vmovdqa %ymm0,-0x20(%rax)
----------------------------------------------
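
For reference, a hypothetical reduction of the pattern above (not the
actual spec-femflow source): both versions zero a 192-byte block, which
matches the rep stos count of 0x18 (24) quadwords and the six 32-byte
vmovdqa stores.

  void
  zero_block (double *p)
  {
    __builtin_memset (p, 0, 24 * sizeof (double));  /* 192 bytes */
  }
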
Can we check in this patch?


Thanks,
Lili.

> PR119596 has benchmarks comparing the sequence of moves to rep8:
> 
> AMD Ryzen Threadripper 2990WX
> 
> testcase:256_rep8
> min:27762317 max:27762317 total:27762317
> min:27739493 max:27739493 total:27739493
> min:27727869 max:27727869 total:27727869
> 
> testcase:256_unrolled
> min:28374940 max:28374940 total:28374940
> min:28371060 max:28371060 total:28371060
> min:28358297 max:28358297 total:28358297
> 
> Here the rep8 sequence wins a little.
> 
> Haswell:
> 
> testcase:256_rep8
> min:14209786 max:14209786 total:14209786
> min:14192041 max:14192041 total:14192041
> min:14282288 max:14282288 total:14282288
> 
> testcase:256_unrolled
> min:57857624 max:57857624 total:57857624
> min:58826876 max:58826876 total:58826876
> min:57539739 max:57539739 total:57539739
> 
> Here rep8 loses a lot, due to the missing short string optimization.
> 
> memset 256 bytes:
> 
> AMD Ryzen Threadripper 2990WX
> 
> testcase:256_rep8
> min:32776195 max:32776195 total:32776195
> min:32784246 max:32784246 total:32784246
> min:32838932 max:32838932 total:32838932
> 
> testcase:256_unrolled
> min:34131140 max:34131140 total:34131140
> min:34088875 max:34088875 total:34088875
> min:34076293 max:34076293 total:34076293
> 
> testcase:256_rep8
> min:24953563 max:24953563 total:24953563
> min:24905210 max:24905210 total:24905210
> min:24877085 max:24877085 total:24877085
> 
> testcase:256_unrolled
> min:58712755 max:58712755 total:58712755
> min:58853415 max:58853415 total:58853415
> min:58626856 max:58626856 total:58626856
> 
> Same story here.
> 
> memset 56 bytes:
> 
> AMD Ryzen Threadripper 2990WX
> 
> testcase:56_rep8
> min:115632478 max:115632478 total:115632478
> min:115848126 max:115848126 total:115848126
> min:115762251 max:115762251 total:115762251
> 
> testcase:56_unrolled
> min:152329392 max:152329392 total:152329392
> min:152526437 max:152526437 total:152526437
> min:152496941 max:152496941 total:152496941
> 
> I think it shows that rep8 should not be used on Haswell (and likely other
> reasonably recent Intel chips).
> 
> About 2 and 3:
> 
> Your original description said that you use a loop if the size is
> known, which is not what the algorithm choice tables do.  They will
> simply pick rep movsb for every block smaller than 256 bytes and a
> library call otherwise, unless the user declared register variables
> making movsb inaccessible.
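> 
> In x86-tune-costs.h terms that behaviour corresponds to a stringop_algs
> table roughly of this shape (illustrative values only, not the actual
> contents of the patch; the first element is for 32-bit, the second for
> 64-bit, and the leading libcall is used when nothing is known about
> the size):
> 
>   static stringop_algs example_memset[2] = {
>     {libcall, {{256, rep_prefix_1_byte, false},
>                {-1, libcall, false}}},
>     {libcall, {{256, rep_prefix_1_byte, false},
>                {-1, libcall, false}}}};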
> 
> The current tables choose rep movsq for everything up to 8k.  It seems
> that Intel and Zen hardware are a bit in disagreement here.  Zen's rep
> movsb seems noticeably slower for smaller blocks (<128 bytes), while
> Intel's rep movsq is slow for those blocks too.
> 
> I think first we need to pick algorithm tables that work well for both
> recent Intel and Zen hardware (dropping consideration of Bulldozers,
> since they are old enough these days), and once we have them, we can
> figure out reasonable MOVE_RATIO and CLEAR_RATIO values.
> 
> I will re-run the stringop benchmarks on Zens, but from what I got on
> Zen 5 I would try using a loop or an unrolled loop for blocks in the
> range of 1...128 bytes (to avoid regressions on Zens, where rep movsb
> seems slow for small blocks) and see how rep movsb runs for bigger
> blocks?
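> 
> One way to encode that experiment (again just an illustrative sketch,
> reusing the existing 8k cutoff mentioned above) would be a loop for
> blocks up to 128 bytes, rep movsb up to 8k, and a library call beyond
> that:
> 
>   static stringop_algs experiment_memcpy[2] = {
>     {libcall, {{128, loop, false}, {8192, rep_prefix_1_byte, false},
>                {-1, libcall, false}}},
>     {libcall, {{128, loop, false}, {8192, rep_prefix_1_byte, false},
>                {-1, libcall, false}}}};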
> 
> Honza
