> On Mon, Apr 21, 2025 at 7:24 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka <hubi...@ucw.cz> wrote: > > > > > > > PR target/102294 > > > > PR target/119596 > > > > * config/i386/x86-tune-costs.h (generic_memcpy): Updated. > > > > (generic_memset): Likewise. > > > > (generic_cost): Change CLEAR_RATIO to 17. > > > > * config/i386/x86-tune.def > > > > (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): > > > > Add m_GENERIC. > > > > > > Looking through the PRs, there they are primarily about CLEAR_RATIO > > > being lower than on clang which makes us to produce slower (but smaller) > > > initialization sequence for blocks of certain size. > > > It seems Kenrel is discussed there too (-mno-sse). > > > > > > Bumping it up for SSE makes sense provided that SSE codegen does not > > > suffer from the long $0 immediates. I would say it is OK also for > > > -mno-sse provided speedups are quite noticeable, but it would be really > > > nice to solve this incrementally. > > > > > > concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB my understanding is > > > that Intel chips likes stosb for small blocks, since they are not > > > optimized for stosw/q. Zen seems to preffer stopsq over stosb for > > > blocks up to 128 bytes. > > > > > > How does the loop version compare to stopsb for blocks in rage > > > 1...128 bytes in Intel hardware? > > > > > > Since the case we prove block size to be small but we do not know a > > > size, I think using loop or unrolled for blocks up to say 128 bytes > > > may work well for both. > > > > > > Honza > > > > My patch has a 256 byte threshold. Are you suggesting changing it > > to 128 bytes? > > > > 256 bytes were selected since MOVE_RATIO and CLEAR_RATIO are > 17 which is 16 * 16 (256) bytes. To lower the threshold to 128 bytes, > MOVE_RATIO/CLEAR_RATIO will be changed to 9. Do we want to > do that?
Your patch does 3 things 1) increases CLEAR_RATIO to 17, so we use up to 16 moves (SSE or integer if -mno-sse is used) 2) changes the algorithm choice tables for both memset/memcpy 3) enables X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for generic My understanding is that it is primarily motivated by a testcases where block size is known and we use rep movsq while sequence of move instructions executes faster in micro-benchmark on Intel hardware. As Andy mentioned, rep movsq is problematic because there is no small block optimization (which I did not know and is indeed something to work with). About 1: CLEAR_RATIO is bit of a compromise between code size and speed. i.e. 16 times mov $0, mem is approx 128 bytes, while rep stosq 3 bytes + some setup to load data to specific registers. Call and loop are also shorter. We originally put CLEAR_RATIO < MOVE_RATIO based on observation that mov $0, mem is longer in encoding than mov mem, mem and there was a plan to implement optimization to avoid long immediates in moves, but it did not materialize (yet). With SSE this problem disappears since SSE stores does not have immediates anyway. I think we can increase CLEAR_RATIO if there is good reason, but if we consider micro-benchmarks alone we will likely get a win for quite large MOVE/CLEAR RATIO while on real code we want to take into account the code size asspect too (i.e. do we want to increase code size of the memset sequence 10 fold to get a speedup and how large it is?). PR119596 has benchmarks comparing the sequence of moves to rep8: AMD Ryzen Threadripper 2990WX testcase:256_rep8 min:27762317 max:27762317 total:27762317 min:27739493 max:27739493 total:27739493 min:27727869 max:27727869 total:27727869 testcase:256_unrolled min:28374940 max:28374940 total:28374940 min:28371060 max:28371060 total:28371060 min:28358297 max:28358297 total:28358297 Here rep8 sequence wins a little Haswell: testcase:256_rep8 min:14209786 max:14209786 total:14209786 min:14192041 max:14192041 total:14192041 min:14282288 max:14282288 total:14282288 testcase:256_unrolled min:57857624 max:57857624 total:57857624 min:58826876 max:58826876 total:58826876 min:57539739 max:57539739 total:57539739 Here rep8 losses a lot, due to missing short string optimization. memset 256 bytes: AMD Ryzen Threadripper 2990WX testcase:256_rep8 min:32776195 max:32776195 total:32776195 min:32784246 max:32784246 total:32784246 min:32838932 max:32838932 total:32838932 testcase:256_unrolled min:34131140 max:34131140 total:34131140 min:34088875 max:34088875 total:34088875 min:34076293 max:34076293 total:34076293 testcase:256_rep8 min:24953563 max:24953563 total:24953563 min:24905210 max:24905210 total:24905210 min:24877085 max:24877085 total:24877085 testcase:256_unrolled min:58712755 max:58712755 total:58712755 min:58853415 max:58853415 total:58853415 min:58626856 max:58626856 total:58626856 Same story here. memset 56 bytes: AMD Ryzen Threadripper 2990WX testcase:56_rep8 min:115632478 max:115632478 total:115632478 min:115848126 max:115848126 total:115848126 min:115762251 max:115762251 total:115762251 testcase:56_unrolled min:152329392 max:152329392 total:152329392 min:152526437 max:152526437 total:152526437 min:152496941 max:152496941 total:152496941 I think it shows that rep8 should not be used on Haswell (and likely other reasonably recent Intel chips). About 2 and 3: Your original description said that you use loop if size is known, which is not what the algorithm choice tables does. They will simply pick rep movsb for every < 256 and call otherwise unless user declared register variables making movsb inaccessible. Current tables chooses rep mosq for everything up to 8k. It seems that Intel and Zen hardware is bit in disagreement here. Zen's rep movsb seems is noticeably slower for smaller blocks <128 while Intel's rep movsq is slow for those blocks too. I think first we need to pick algorithm tables that works well for both recent Intel and Zen hardware (dropping consideration about buldozers, since they are old enough these days) and once we have it, we can figure out a reasonable MOVE and CLEAR_RATIO. I will re-run the stringop benchmarks on zens, but from I got on zen 5 I would try using loop or unrolled loop for blocks in range 1...128 (to avoid regression on zens where it seems slow) and see how rep movsb runs for bigger blocks? Honza