Re: [PATCH v2] x86: Update memcpy/memset inline strategies for -mtune=generic

Jan Hubicka Mon, 21 Apr 2025 03:35:19 -0700

> On Mon, Apr 21, 2025 at 7:24 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka <hubi...@ucw.cz> wrote:
> > >
> > > >       PR target/102294
> > > >       PR target/119596
> > > >       * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> > > >       (generic_memset): Likewise.
> > > >       (generic_cost): Change CLEAR_RATIO to 17.
> > > >       * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> > > >       Add m_GENERIC.
> > >
> > > Looking through the PRs, they are primarily about CLEAR_RATIO
> > > being lower than in clang, which makes us produce a slower (but smaller)
> > > initialization sequence for blocks of certain sizes.
> > > It seems the kernel is discussed there too (-mno-sse).
> > >
> > > Bumping it up for SSE makes sense provided that SSE codegen does not
> > > suffer from the long $0 immediates. I would say it is OK also for
> > > -mno-sse provided speedups are quite noticeable, but it would be really
> > > nice to solve this incrementally.
> > >
> > > Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my understanding is
> > > that Intel chips like stosb for small blocks, since they are not
> > > optimized for stosw/q.  Zen seems to prefer stosq over stosb for
> > > blocks up to 128 bytes.
> > >
> > > How does the loop version compare to stosb for blocks in the range
> > > 1...128 bytes on Intel hardware?
> > >
> > > For the case where we can prove the block size to be small but do not
> > > know the exact size, I think using a loop or unrolled loop for blocks
> > > up to, say, 128 bytes may work well for both.
> > >
> > > Honza
> >
> > My patch has a 256 byte threshold.  Are you suggesting changing it
> > to 128 bytes?
> >
> 
> 256 bytes was selected since MOVE_RATIO and CLEAR_RATIO are
> 17, which allows 16 moves of 16 bytes each (256 bytes).  To lower the
> threshold to 128 bytes, MOVE_RATIO/CLEAR_RATIO would have to be
> changed to 9.  Do we want to do that?


Your patch does 3 things:
 1) increases CLEAR_RATIO to 17, so we use up to 16 moves (SSE, or
 integer if -mno-sse is used)
 2) changes the algorithm choice tables for both memset and memcpy
 3) enables X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for generic
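
To make the arithmetic behind 1) concrete: a ratio of N allows at most
N - 1 move instructions, so the largest block expanded inline is
(N - 1) times the move width.  A minimal sketch in C, where
max_inline_bytes is a hypothetical helper (not a GCC function) and the
16-byte and 8-byte move widths are assumptions for SSE and 64-bit integer
stores respectively:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC: largest block (in bytes) that is
   still expanded into straight-line moves for a given MOVE_RATIO or
   CLEAR_RATIO.  A ratio of N permits at most N - 1 move instructions.  */
static size_t
max_inline_bytes (unsigned ratio, size_t move_width)
{
  assert (ratio >= 1);
  return (size_t) (ratio - 1) * move_width;
}
```

Under these assumptions max_inline_bytes (17, 16) is 256 and
max_inline_bytes (9, 16) is 128, which are the two thresholds discussed
above; with 8-byte integer moves a ratio of 17 covers only 128 bytes.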

My understanding is that it is primarily motivated by testcases where
the block size is known and we use rep movsq, while a sequence of move
instructions executes faster in micro-benchmarks on Intel hardware.  As
Andy mentioned, rep movsq is problematic because it has no small-block
optimization (which I did not know, and which is indeed something we
need to work with).

About 1:

CLEAR_RATIO is a bit of a compromise between code size and speed,
i.e. 16 times mov $0, mem is approx. 128 bytes, while rep stosq is
3 bytes plus some setup to load data into specific registers.
A call and a loop are also shorter.

We originally put CLEAR_RATIO < MOVE_RATIO based on the observation that
  mov $0, mem
is longer in encoding than
  mov mem, mem
and there was a plan to implement an optimization to avoid long immediates
in moves, but it did not materialize (yet).  With SSE this problem
disappears, since SSE stores do not have immediates anyway.
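
The size figures above can be checked against the actual encodings.  The
sketch below is only an illustration: the addressing form (base register
%rdi plus an 8-bit displacement) is an assumption, and zeroing a register
once and storing it is just one way an "avoid long immediates"
optimization could look; the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* movq $0, 8(%rdi): REX.W, opcode C7 /0, ModRM, disp8, imm32.  */
static const unsigned char store_imm0[] =
  { 0x48, 0xC7, 0x47, 0x08, 0x00, 0x00, 0x00, 0x00 };
/* movq %rax, 8(%rdi): REX.W, opcode 89 /r, ModRM, disp8.  */
static const unsigned char store_reg[] = { 0x48, 0x89, 0x47, 0x08 };
/* xorl %eax, %eax: zero the register once before the stores.  */
static const unsigned char zero_reg[] = { 0x31, 0xC0 };
/* rep stosq: REP prefix, REX.W, opcode AB (register setup not counted).  */
static const unsigned char rep_stosq[] = { 0xF3, 0x48, 0xAB };

/* Code bytes for N stores of an immediate zero.  */
static size_t
imm_sequence_bytes (unsigned n)
{
  assert (n > 0);
  return n * sizeof store_imm0;
}

/* Code bytes for zeroing a register once and storing it N times.  */
static size_t
reg_sequence_bytes (unsigned n)
{
  assert (n > 0);
  return sizeof zero_reg + n * sizeof store_reg;
}
```

For 16 stores this gives 128 bytes with immediates and 66 bytes with a
zeroed register, against 3 bytes for rep stosq itself.  Exact numbers vary
with the addressing mode, e.g. a store without displacement is one byte
shorter.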

I think we can increase CLEAR_RATIO if there is a good reason, but if
we consider micro-benchmarks alone we will likely get a win for quite
large MOVE/CLEAR_RATIO values, while on real code we want to take the
code size aspect into account too (i.e. do we want to increase the code
size of the memset sequence 10-fold to get a speedup, and how large is it?).

PR119596 has benchmarks comparing the sequence of moves to rep8:

AMD Ryzen Threadripper 2990WX

testcase:256_rep8
min:27762317 max:27762317 total:27762317
min:27739493 max:27739493 total:27739493
min:27727869 max:27727869 total:27727869

testcase:256_unrolled
min:28374940 max:28374940 total:28374940
min:28371060 max:28371060 total:28371060
min:28358297 max:28358297 total:28358297

Here the rep8 sequence wins a little.

Haswell:

testcase:256_rep8
min:14209786 max:14209786 total:14209786
min:14192041 max:14192041 total:14192041
min:14282288 max:14282288 total:14282288

testcase:256_unrolled
min:57857624 max:57857624 total:57857624
min:58826876 max:58826876 total:58826876
min:57539739 max:57539739 total:57539739

Here rep8 loses a lot, due to the missing short-string optimization.

memset 256 bytes:

AMD Ryzen Threadripper 2990WX

testcase:256_rep8
min:32776195 max:32776195 total:32776195
min:32784246 max:32784246 total:32784246
min:32838932 max:32838932 total:32838932

testcase:256_unrolled
min:34131140 max:34131140 total:34131140
min:34088875 max:34088875 total:34088875
min:34076293 max:34076293 total:34076293

Haswell:

testcase:256_rep8
min:24953563 max:24953563 total:24953563
min:24905210 max:24905210 total:24905210
min:24877085 max:24877085 total:24877085

testcase:256_unrolled
min:58712755 max:58712755 total:58712755
min:58853415 max:58853415 total:58853415
min:58626856 max:58626856 total:58626856

Same story here.

memset 56 bytes:

AMD Ryzen Threadripper 2990WX

testcase:56_rep8
min:115632478 max:115632478 total:115632478
min:115848126 max:115848126 total:115848126
min:115762251 max:115762251 total:115762251

testcase:56_unrolled
min:152329392 max:152329392 total:152329392
min:152526437 max:152526437 total:152526437
min:152496941 max:152496941 total:152496941

I think it shows that rep8 should not be used on Haswell (and likely
other reasonably recent Intel chips).

About 2 and 3:

Your original description said that you use a loop if the size is known,
which is not what the algorithm choice tables do.  They will simply pick
rep movsb for every size < 256 and a call otherwise, unless the user
declared register variables making movsb inaccessible.

The current tables choose rep movsq for everything up to 8k.  It seems
that Intel and Zen hardware are a bit in disagreement here.  Zen's rep
movsb seems to be noticeably slower for smaller blocks (<128 bytes), while
Intel's rep movsq is slow for those blocks too.
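
As a rough model of the two table variants discussed here, the sketch
below picks an algorithm by walking a list of size ranges.  It is a
simplified illustration, not GCC's data structure: the type and function
names are hypothetical, treating 8k as an inclusive bound is an
assumption, and so is the library call beyond 8k in the current table.

```c
#include <assert.h>
#include <stddef.h>

enum alg { ALG_REP_MOVSB, ALG_REP_MOVSQ, ALG_LIBCALL };

/* One table entry: use ALG for sizes strictly below BELOW.
   BELOW == 0 marks the catch-all entry that ends the table.  */
struct range
{
  size_t below;
  enum alg alg;
};

/* Return the algorithm of the first entry whose range covers SIZE.  */
static enum alg
pick (const struct range *table, size_t n, size_t size)
{
  assert (n > 0 && table[n - 1].below == 0);
  for (size_t i = 0; i < n; i++)
    if (table[i].below == 0 || size < table[i].below)
      return table[i].alg;
  return ALG_LIBCALL; /* Not reached: the last entry is a catch-all.  */
}

/* Patch as described above: rep movsb below 256 bytes, call otherwise.  */
static const struct range patched_table[] =
  { { 256, ALG_REP_MOVSB }, { 0, ALG_LIBCALL } };

/* Current behaviour as described above: rep movsq up to 8k.  */
static const struct range current_table[] =
  { { 8192 + 1, ALG_REP_MOVSQ }, { 0, ALG_LIBCALL } };
```

In this model a 56-byte block gets rep movsb with the patched table and
rep movsq with the current one, which are exactly the two cases where Zen
and Intel respectively appear to be slow.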

I think first we need to pick algorithm tables that work well for both
recent Intel and Zen hardware (dropping consideration of Bulldozers,
since they are old enough these days), and once we have them, we can
figure out a reasonable MOVE_RATIO and CLEAR_RATIO.

I will re-run the stringop benchmarks on Zens, but from what I got on
Zen 5 I would try using a loop or unrolled loop for blocks in the range
1...128 (to avoid a regression on Zens, where rep movsb seems slow)
and see how rep movsb runs for bigger blocks.
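
In portable C, the shape of that strategy for a zeroing memset could look
roughly like the sketch below.  It only illustrates the control flow: the
128-byte cut-off is the tentative value from the text, clear_block is a
hypothetical name, the 4x unroll factor is an assumption, and the memset
call merely stands in for whatever is chosen for bigger blocks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SMALL_BLOCK_MAX = 128 }; /* Tentative cut-off, not a measured value.  */

/* Zero N bytes at DST: unrolled 8-byte stores for small blocks,
   a memset call (standing in for rep stosb) for bigger ones.  */
static void
clear_block (void *dst, size_t n)
{
  unsigned char *p = dst;
  const uint64_t zero = 0;

  assert (dst != NULL || n == 0);
  if (n > SMALL_BLOCK_MAX)
    {
      memset (p, 0, n);
      return;
    }
  /* Unrolled part: four 8-byte stores (32 bytes) per iteration.  */
  while (n >= 32)
    {
      memcpy (p, &zero, 8);
      memcpy (p + 8, &zero, 8);
      memcpy (p + 16, &zero, 8);
      memcpy (p + 24, &zero, 8);
      p += 32;
      n -= 32;
    }
  /* Remaining 8-byte words, then the byte tail.  */
  while (n >= 8)
    {
      memcpy (p, &zero, 8);
      p += 8;
      n -= 8;
    }
  while (n > 0)
    {
      *p++ = 0;
      n--;
    }
}
```

An optimizing compiler is free to turn these loops back into a memset
call or other code, so this is not suitable for benchmarking as written.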

Honza
