On Mon, Apr 21, 2025 at 6:34 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > On Mon, Apr 21, 2025 at 7:24 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >
> > > On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka <hubi...@ucw.cz> wrote:
> > > >
> > > > >       PR target/102294
> > > > >       PR target/119596
> > > > >       * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> > > > >       (generic_memset): Likewise.
> > > > >       (generic_cost): Change CLEAR_RATIO to 17.
> > > > >       * config/i386/x86-tune.def 
> > > > > (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> > > > >       Add m_GENERIC.
> > > >
> > > > Looking through the PRs, they are primarily about CLEAR_RATIO
> > > > being lower than clang's, which makes us produce a slower (but
> > > > smaller) initialization sequence for blocks of certain sizes.
> > > > It seems the kernel is discussed there too (-mno-sse).
> > > >
> > > > Bumping it up for SSE makes sense provided that SSE codegen does not
> > > > suffer from the long $0 immediates.  I would say it is OK also for
> > > > -mno-sse provided the speedups are quite noticeable, but it would be
> > > > really nice to solve this incrementally.
> > > >
> > > > Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my understanding is
> > > > that Intel chips like stosb for small blocks, since they are not
> > > > optimized for stosw/q.  Zen seems to prefer stosq over stosb for
> > > > blocks up to 128 bytes.
> > > >
> > > > How does the loop version compare to stosb for blocks in the range
> > > > 1...128 bytes on Intel hardware?
> > > >
> > > > For the case where we prove the block size to be small but do not
> > > > know the exact size, I think using a loop or an unrolled loop for
> > > > blocks up to, say, 128 bytes may work well for both.
> > > >
> > > > Honza
> > >
> > > My patch has a 256-byte threshold.  Are you suggesting changing it
> > > to 128 bytes?
> > >
> >
> > 256 bytes was selected since MOVE_RATIO and CLEAR_RATIO are 17,
> > which allows 16 moves of 16 bytes (256 bytes).  To lower the
> > threshold to 128 bytes, MOVE_RATIO/CLEAR_RATIO would have to be
> > changed to 9.  Do we want to do that?
>
> Your patch does 3 things:
>  1) increases CLEAR_RATIO to 17, so we use up to 16 moves (SSE or
>  integer if -mno-sse is used)
>  2) changes the algorithm choice tables for both memset/memcpy
>  3) enables X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for generic
>
> My understanding is that it is primarily motivated by testcases where
> the block size is known and we use rep movsq, while a sequence of move
> instructions executes faster in micro-benchmarks on Intel hardware.  As
> Andy mentioned, rep movsq is problematic because there is no small block
> optimization (which I did not know and is indeed something to work
> with).
>
> About 1:
>
> CLEAR_RATIO is a bit of a compromise between code size and speed:
> 16 times mov $0, mem is approx. 128 bytes, while rep stosq is
> 3 bytes plus some setup to load data into specific registers.
> A call and a loop are also shorter.
>
> We originally put CLEAR_RATIO < MOVE_RATIO based on the observation that
>   mov $0, mem
> has a longer encoding than
>   mov mem, mem
> and there was a plan to implement an optimization to avoid long immediates
> in moves, but it did not materialize (yet).  With SSE this problem
> disappears, since SSE stores do not have immediates anyway.
>
> I think we can increase CLEAR_RATIO if there is a good reason, but if
> we consider micro-benchmarks alone we will likely get a win for a quite
> large MOVE/CLEAR_RATIO, while on real code we want to take the code
> size aspect into account too (i.e. do we want to increase the code
> size of the memset sequence 10-fold to get a speedup, and how large
> is the speedup?).

As you mentioned above, it isn't an issue with SSE.  How about making
CLEAR_RATIO SSE-dependent?  If SSE is enabled, use 17; otherwise
use 6.

> PR119596 has benchmarks comparing the sequence of moves to rep8:
>
> AMD Ryzen Threadripper 2990WX
>
> testcase:256_rep8
> min:27762317 max:27762317 total:27762317
> min:27739493 max:27739493 total:27739493
> min:27727869 max:27727869 total:27727869
>
> testcase:256_unrolled
> min:28374940 max:28374940 total:28374940
> min:28371060 max:28371060 total:28371060
> min:28358297 max:28358297 total:28358297
>
> Here the rep8 sequence wins a little.
>
> Haswell:
>
> testcase:256_rep8
> min:14209786 max:14209786 total:14209786
> min:14192041 max:14192041 total:14192041
> min:14282288 max:14282288 total:14282288
>
> testcase:256_unrolled
> min:57857624 max:57857624 total:57857624
> min:58826876 max:58826876 total:58826876
> min:57539739 max:57539739 total:57539739
>
> Here rep8 loses a lot, due to the missing short string optimization.
>
> memset 256 bytes:
>
> AMD Ryzen Threadripper 2990WX
>
> testcase:256_rep8
> min:32776195 max:32776195 total:32776195
> min:32784246 max:32784246 total:32784246
> min:32838932 max:32838932 total:32838932
>
> testcase:256_unrolled
> min:34131140 max:34131140 total:34131140
> min:34088875 max:34088875 total:34088875
> min:34076293 max:34076293 total:34076293
>
> testcase:256_rep8
> min:24953563 max:24953563 total:24953563
> min:24905210 max:24905210 total:24905210
> min:24877085 max:24877085 total:24877085
>
> testcase:256_unrolled
> min:58712755 max:58712755 total:58712755
> min:58853415 max:58853415 total:58853415
> min:58626856 max:58626856 total:58626856
>
> Same story here.
>
> memset 56 bytes:
>
> AMD Ryzen Threadripper 2990WX
>
> testcase:56_rep8
> min:115632478 max:115632478 total:115632478
> min:115848126 max:115848126 total:115848126
> min:115762251 max:115762251 total:115762251
>
> testcase:56_unrolled
> min:152329392 max:152329392 total:152329392
> min:152526437 max:152526437 total:152526437
> min:152496941 max:152496941 total:152496941
>
> I think it shows that rep8 should not be used on Haswell (and likely
> other reasonably recent Intel chips).
>
> About 2 and 3:
>
> Your original description said that you use a loop if the size is known,
> which is not what the algorithm choice tables do.  They will simply pick
> rep movsb for every block < 256 and a call otherwise, unless the user
> declared register variables making movsb inaccessible.
>

What I said was:

2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with simple code sequence if the data size
      is a constant.
   b. Use loop if data size is not a constant.

My patch does generate a loop for

char *a;
char *b;
void t (int n)
{
  if (n <=  256)
    __builtin_memcpy (a, b, n);
}

> Current tables choose rep movsq for everything up to 8k.  It seems that
> Intel and Zen hardware are a bit in disagreement here.  Zen's rep movsb
> seems noticeably slower for smaller blocks < 128, while Intel's rep
> movsq is slow for those blocks too.

I think we should avoid movsq.

> I think first we need to pick algorithm tables that work well for both
> recent Intel and Zen hardware (dropping consideration of Bulldozers,
> since they are old enough these days), and once we have it, we can
> figure out reasonable MOVE and CLEAR_RATIOs.

Agreed.
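For readers following along, the tables in question live in
config/i386/x86-tune-costs.h and have roughly the shape below (an
illustrative sketch, not a verbatim quote of the file; the 8192-byte
rep_prefix_8_byte entry in the 64-bit half is what produces the
"rep movsq up to 8k" behavior described above):

```c
/* Sketch of the generic memcpy table shape.  First initializer is for
   32-bit, second for 64-bit; each {max, alg, noalign} entry applies to
   blocks up to "max" bytes, with -1 meaning "everything larger".  */
static stringop_algs generic_memcpy[2] = {
  {libcall, {{32, loop, false},                  /* small blocks: loop */
             {8192, rep_prefix_4_byte, false},   /* up to 8k: rep movsd */
             {-1, libcall, false}}},             /* else: call memcpy */
  {libcall, {{32, loop, false},
             {8192, rep_prefix_8_byte, false},   /* up to 8k: rep movsq */
             {-1, libcall, false}}}};
```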

> I will re-run the stringop benchmarks on Zens, but from what I got on
> Zen 5 I would try using a loop or unrolled loop for blocks in the range
> 1...128 (to avoid regressions on Zens where it seems slow)
> and see how rep movsb runs for bigger blocks?
>
> Honza



-- 
H.J.
