On Mon, Apr 21, 2025 at 6:34 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > On Mon, Apr 21, 2025 at 7:24 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >
> > > On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka <hubi...@ucw.cz> wrote:
> > > >
> > > > > PR target/102294
> > > > > PR target/119596
> > > > > * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> > > > > (generic_memset): Likewise.
> > > > > (generic_cost): Change CLEAR_RATIO to 17.
> > > > > * config/i386/x86-tune.def
> > > > > (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> > > > > Add m_GENERIC.
> > > >
> > > > Looking through the PRs, they are primarily about CLEAR_RATIO
> > > > being lower than on clang, which makes us produce a slower (but
> > > > smaller) initialization sequence for blocks of certain sizes.
> > > > It seems the kernel is discussed there too (-mno-sse).
> > > >
> > > > Bumping it up for SSE makes sense provided that SSE codegen does not
> > > > suffer from the long $0 immediates. I would say it is OK also for
> > > > -mno-sse provided the speedups are quite noticeable, but it would be
> > > > really nice to solve this incrementally.
> > > >
> > > > Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my understanding is
> > > > that Intel chips like stosb for small blocks, since they are not
> > > > optimized for stosw/q. Zen seems to prefer stosq over stosb for
> > > > blocks up to 128 bytes.
> > > >
> > > > How does the loop version compare to stosb for blocks in the range
> > > > 1...128 bytes on Intel hardware?
> > > >
> > > > Since in this case we prove the block size to be small but we do not
> > > > know the exact size, I think using a loop or unrolled code for blocks
> > > > up to say 128 bytes may work well for both.
> > > >
> > > > Honza
> > >
> > > My patch has a 256 byte threshold. Are you suggesting changing it
> > > to 128 bytes?
> > >
> >
> > 256 bytes were selected since MOVE_RATIO and CLEAR_RATIO are
> > 17, which allows 16 moves of 16 bytes (256 bytes).
> > To lower the threshold to 128 bytes,
> > MOVE_RATIO/CLEAR_RATIO will be changed to 9. Do we want to
> > do that?
>
> Your patch does 3 things:
> 1) increases CLEAR_RATIO to 17, so we use up to 16 moves (SSE, or
> integer if -mno-sse is used)
> 2) changes the algorithm choice tables for both memset/memcpy
> 3) enables X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for generic
>
> My understanding is that it is primarily motivated by testcases where
> the block size is known and we use rep movsq, while a sequence of move
> instructions executes faster in micro-benchmarks on Intel hardware. As
> Andy mentioned, rep movsq is problematic because there is no small block
> optimization (which I did not know and is indeed something to work
> with).
>
> About 1:
>
> CLEAR_RATIO is a bit of a compromise between code size and speed,
> i.e. 16 times mov $0, mem is approx 128 bytes, while rep stosq is
> 3 bytes plus some setup to load data into specific registers.
> Call and loop are also shorter.
>
> We originally put CLEAR_RATIO < MOVE_RATIO based on the observation that
> mov $0, mem
> is longer in encoding than
> mov mem, mem
> and there was a plan to implement an optimization to avoid long immediates
> in moves, but it did not materialize (yet). With SSE this problem
> disappears since SSE stores do not have immediates anyway.
>
> I think we can increase CLEAR_RATIO if there is a good reason, but if
> we consider micro-benchmarks alone we will likely get a win for quite
> large MOVE/CLEAR_RATIO, while on real code we want to take into account
> the code size aspect too (i.e. do we want to increase the code size of
> the memset sequence 10-fold to get a speedup, and how large is it?).
As you mentioned above, it isn't an issue with SSE. How about making
CLEAR_RATIO SSE dependent? If SSE is enabled, use 17; otherwise use 6.

> PR119596 has benchmarks comparing the sequence of moves to rep8:
>
> AMD Ryzen Threadripper 2990WX
>
> testcase:256_rep8
> min:27762317 max:27762317 total:27762317
> min:27739493 max:27739493 total:27739493
> min:27727869 max:27727869 total:27727869
>
> testcase:256_unrolled
> min:28374940 max:28374940 total:28374940
> min:28371060 max:28371060 total:28371060
> min:28358297 max:28358297 total:28358297
>
> Here the rep8 sequence wins a little.
>
> Haswell:
>
> testcase:256_rep8
> min:14209786 max:14209786 total:14209786
> min:14192041 max:14192041 total:14192041
> min:14282288 max:14282288 total:14282288
>
> testcase:256_unrolled
> min:57857624 max:57857624 total:57857624
> min:58826876 max:58826876 total:58826876
> min:57539739 max:57539739 total:57539739
>
> Here rep8 loses a lot, due to the missing short string optimization.
>
> memset 256 bytes:
>
> AMD Ryzen Threadripper 2990WX
>
> testcase:256_rep8
> min:32776195 max:32776195 total:32776195
> min:32784246 max:32784246 total:32784246
> min:32838932 max:32838932 total:32838932
>
> testcase:256_unrolled
> min:34131140 max:34131140 total:34131140
> min:34088875 max:34088875 total:34088875
> min:34076293 max:34076293 total:34076293
>
> testcase:256_rep8
> min:24953563 max:24953563 total:24953563
> min:24905210 max:24905210 total:24905210
> min:24877085 max:24877085 total:24877085
>
> testcase:256_unrolled
> min:58712755 max:58712755 total:58712755
> min:58853415 max:58853415 total:58853415
> min:58626856 max:58626856 total:58626856
>
> Same story here.
> memset 56 bytes:
>
> AMD Ryzen Threadripper 2990WX
>
> testcase:56_rep8
> min:115632478 max:115632478 total:115632478
> min:115848126 max:115848126 total:115848126
> min:115762251 max:115762251 total:115762251
>
> testcase:56_unrolled
> min:152329392 max:152329392 total:152329392
> min:152526437 max:152526437 total:152526437
> min:152496941 max:152496941 total:152496941
>
> I think it shows that rep8 should not be used on Haswell (and likely
> other reasonably recent Intel chips).
>
> About 2 and 3:
>
> Your original description said that you use a loop if the size is known,
> which is not what the algorithm choice tables do. They will simply pick
> rep movsb for everything < 256 and a call otherwise, unless the user
> declared register variables making movsb inaccessible.

What I said was

2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with a simple code sequence if the data
      size is a constant.
   b. Use a loop if the data size is not a constant.

My patch does generate a loop for

char *a;
char *b;

void
t (int n)
{
  if (n <= 256)
    __builtin_memcpy (a, b, n);
}

> Current tables choose rep movsq for everything up to 8k. It seems that
> Intel and Zen hardware are a bit in disagreement here. Zen's rep movsb
> seems noticeably slower for smaller blocks <128, while Intel's rep
> movsq is slow for those blocks too.

I think we should avoid movsq.

> I think first we need to pick algorithm tables that work well for both
> recent Intel and Zen hardware (dropping consideration of Bulldozers,
> since they are old enough these days) and once we have it, we can figure
> out reasonable MOVE and CLEAR_RATIO values.

Agreed.

> I will re-run the stringop benchmarks on Zens, but from what I got on
> Zen 5 I would try using a loop or unrolled loop for blocks in the range
> 1...128 (to avoid regressions on Zens where it seems slow)
> and see how rep movsb runs for bigger blocks?
>
> Honza

--
H.J.