> -----Original Message-----
> From: Jan Hubicka <hubi...@ucw.cz>
> Sent: Monday, April 21, 2025 6:35 PM
> To: H.J. Lu <hjl.to...@gmail.com>
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao <hongtao....@intel.com>;
> ubiz...@gmail.com
> Subject: Re: [PATCH v2] x86: Update memcpy/memset inline strategies for
> -mtune=generic
>
> > On Mon, Apr 21, 2025 at 7:24 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >
> > > On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka <hubi...@ucw.cz> wrote:
> > > >
> > > > > PR target/102294
> > > > > PR target/119596
> > > > > * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> > > > > (generic_memset): Likewise.
> > > > > (generic_cost): Change CLEAR_RATIO to 17.
> > > > > * config/i386/x86-tune.def
> > > > > (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): Add m_GENERIC.
> > > >
> > > > Looking through the PRs, they are primarily about CLEAR_RATIO being
> > > > lower than on clang, which makes us produce a slower (but smaller)
> > > > initialization sequence for blocks of a certain size.  It seems the
> > > > kernel is discussed there too (-mno-sse).
> > > >
> > > > Bumping it up for SSE makes sense provided that SSE codegen does not
> > > > suffer from the long $0 immediates.  I would say it is OK also for
> > > > -mno-sse provided the speedups are quite noticeable, but it would be
> > > > really nice to solve this incrementally.
> > > >
> > > > Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my understanding
> > > > is that Intel chips like stosb for small blocks, since they are not
> > > > optimized for stosw/q.  Zen seems to prefer stosq over stosb for
> > > > blocks up to 128 bytes.
> > > >
> > > > How does the loop version compare to stosb for blocks in the range
> > > > 1...128 bytes on Intel hardware?
> > > >
> > > > Since in this case we prove the block size to be small but we do not
> > > > know the exact size, I think using a loop or unrolled sequence for
> > > > blocks up to say 128 bytes may work well for both.
> > > >
> > > > Honza
> > >
> > > My patch has a 256 byte threshold.  Are you suggesting changing it
> > > to 128 bytes?
> > >
> >
> > 256 bytes were selected since MOVE_RATIO and CLEAR_RATIO are 17,
> > which is 16 * 16 (256) bytes.  To lower the threshold to 128 bytes,
> > MOVE_RATIO/CLEAR_RATIO would have to be changed to 9.  Do we want to
> > do that?
>
> Your patch does 3 things:
> 1) increases CLEAR_RATIO to 17, so we use up to 16 moves (SSE, or
>    integer if -mno-sse is used)
> 2) changes the algorithm choice tables for both memset/memcpy
> 3) enables X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for generic
>
> My understanding is that it is primarily motivated by testcases where
> the block size is known and we use rep movsq, while a sequence of move
> instructions executes faster in a micro-benchmark on Intel hardware.
> As Andy mentioned, rep movsq is problematic because there is no small
> block optimization (which I did not know and is indeed something to
> work with).
>
> About 1:
>
> CLEAR_RATIO is a bit of a compromise between code size and speed,
> i.e. 16 times mov $0, mem is approx 128 bytes, while rep stosq is
> 3 bytes + some setup to load data into specific registers.
> Call and loop are also shorter.
>
> We originally put CLEAR_RATIO < MOVE_RATIO based on the observation
> that
>   mov $0, mem
> is longer in encoding than
>   mov mem, mem
> and there was a plan to implement an optimization to avoid long
> immediates in moves, but it did not materialize (yet).  With SSE this
> problem disappears, since SSE stores do not have immediates anyway.
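For concreteness, the kind of code this ratio governs is a clear of a
known, small size, along the lines of the sketch below (illustrative
only; the struct name and its 256-byte size are made up and are not the
testcase from the PRs):

/* Illustrative only: a known-size clear.  The struct and its 256-byte
   size are made up; 256 just matches the 16 * 16 byte limit that a
   CLEAR_RATIO of 17 allows with 16-byte stores.  */
#include <string.h>

struct state { char buf[256]; };

void
reset_state (struct state *s)
{
  /* Known size, so the expander picks between an inline store
     sequence, rep stos, or a library call based on CLEAR_RATIO and
     the stringop algorithm tables.  */
  memset (s, 0, sizeof *s);
}

With CLEAR_RATIO at 17 such a clear can stay inline as up to 16 stores;
with a lower ratio it falls back to rep stos or a library call.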
>
> I think we can increase CLEAR_RATIO if there is a good reason, but if
> we consider micro-benchmarks alone we will likely get a win for quite
> large MOVE/CLEAR_RATIO, while on real code we want to take the code
> size aspect into account too (i.e. do we want to increase the code
> size of the memset sequence 10-fold to get a speedup, and how large is
> that speedup?).
>
Hi Honza,

I just created a patch to change CLEAR_RATIO to 17 for -mtune=generic.
There is nearly no performance or code size impact on CPU2017, and the
code size of the latest Linux kernel increases by 0.06% with the patch
(built with defconfig).

I built spec-femflow (git clone
https://github.com/kronbichler/spec-femflow.git) with and without the
patch: the speed improved by 36% on ZNVER5 and by 20% on ADL (P-core),
while the code size increased by 0.12%.

Options: -march=x86-64-v3 -mtune=generic -O2

The binary changes are similar to the following:
----------------------------------------------
spec-femflow with base:
        xor    %eax,%eax
        mov    $0x18,%ecx
        rep stos %rax,%es:(%rdi)

spec-femflow with patch:
        vpxor  %xmm0,%xmm0,%xmm0
        vmovdqa %ymm0,(%rax)
        vmovdqa %ymm0,-0xa0(%rax)
        vmovdqa %ymm0,-0x80(%rax)
        vmovdqa %ymm0,-0x60(%rax)
        vmovdqa %ymm0,-0x40(%rax)
        vmovdqa %ymm0,-0x20(%rax)
----------------------------------------------

Can we check in this patch?

Thanks,
Lili.

> PR119596 has benchmarks comparing the sequence of moves to rep8:
>
> AMD Ryzen Threadripper 2990WX
>
> testcase:256_rep8
> min:27762317 max:27762317 total:27762317
> min:27739493 max:27739493 total:27739493
> min:27727869 max:27727869 total:27727869
>
> testcase:256_unrolled
> min:28374940 max:28374940 total:28374940
> min:28371060 max:28371060 total:28371060
> min:28358297 max:28358297 total:28358297
>
> Here the rep8 sequence wins a little.
>
> Haswell:
>
> testcase:256_rep8
> min:14209786 max:14209786 total:14209786
> min:14192041 max:14192041 total:14192041
> min:14282288 max:14282288 total:14282288
>
> testcase:256_unrolled
> min:57857624 max:57857624 total:57857624
> min:58826876 max:58826876 total:58826876
> min:57539739 max:57539739 total:57539739
>
> Here rep8 loses a lot, due to the missing short string optimization.
>
> memset 256 bytes:
>
> AMD Ryzen Threadripper 2990WX
>
> testcase:256_rep8
> min:32776195 max:32776195 total:32776195
> min:32784246 max:32784246 total:32784246
> min:32838932 max:32838932 total:32838932
>
> testcase:256_unrolled
> min:34131140 max:34131140 total:34131140
> min:34088875 max:34088875 total:34088875
> min:34076293 max:34076293 total:34076293
>
> testcase:256_rep8
> min:24953563 max:24953563 total:24953563
> min:24905210 max:24905210 total:24905210
> min:24877085 max:24877085 total:24877085
>
> testcase:256_unrolled
> min:58712755 max:58712755 total:58712755
> min:58853415 max:58853415 total:58853415
> min:58626856 max:58626856 total:58626856
>
> Same story here.
>
> memset 56 bytes:
>
> AMD Ryzen Threadripper 2990WX
>
> testcase:56_rep8
> min:115632478 max:115632478 total:115632478
> min:115848126 max:115848126 total:115848126
> min:115762251 max:115762251 total:115762251
>
> testcase:56_unrolled
> min:152329392 max:152329392 total:152329392
> min:152526437 max:152526437 total:152526437
> min:152496941 max:152496941 total:152496941
>
> I think it shows that rep8 should not be used on Haswell (and likely
> other reasonably recent Intel chips).
>
> About 2 and 3:
>
> Your original description said that you use a loop if the size is
> known, which is not what the algorithm choice tables do.  They simply
> pick rep movsb for everything < 256 and a call otherwise, unless the
> user declared register variables making movsb inaccessible.
>
> The current tables choose rep movsq for everything up to 8k.  It seems
> that Intel and Zen hardware are a bit in disagreement here.  Zen's
> rep movsb seems noticeably slower for smaller blocks < 128, while
> Intel's rep movsq is slow for those blocks too.
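For reference, a rough sketch of the kind of micro-benchmark behind
numbers like 256_rep8 vs 256_unrolled (this is not the harness from
PR119596; BLOCK, ITERS and the function names are made up for
illustration):

/* Sketch only: time an explicit rep movsq copy against a constant-size
   memcpy of the same block.  */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

#define BLOCK 256
#define ITERS (1 << 22)

char src[BLOCK], dst[BLOCK];	/* external linkage so the stores stay */

/* Explicit rep movsq ("rep8") copy of BLOCK bytes.  */
static void
copy_rep8 (char *d, const char *s)
{
  size_t n = BLOCK / 8;
  asm volatile ("rep movsq" : "+D" (d), "+S" (s), "+c" (n) : : "memory");
}

/* Constant-size memcpy: whether this becomes an inline store sequence,
   rep movsq/movsb or a libcall is decided by exactly the MOVE_RATIO and
   stringop tables under discussion, so compare builds with and without
   the patch.  */
static void
copy_builtin (char *d, const char *s)
{
  memcpy (d, s, BLOCK);
}

int
main (void)
{
  uint64_t t0, t1;

  t0 = __rdtsc ();
  for (long i = 0; i < ITERS; i++)
    copy_rep8 (dst, src);
  t1 = __rdtsc ();
  printf ("rep8:    %llu cycles\n", (unsigned long long) (t1 - t0));

  t0 = __rdtsc ();
  for (long i = 0; i < ITERS; i++)
    copy_builtin (dst, src);
  t1 = __rdtsc ();
  printf ("builtin: %llu cycles\n", (unsigned long long) (t1 - t0));
  return 0;
}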
>
> I think first we need to pick algorithm tables that work well for both
> recent Intel and Zen hardware (dropping considerations about Bulldozers,
> since they are old enough these days), and once we have that, we can
> figure out reasonable MOVE and CLEAR_RATIO values.
>
> I will re-run the stringop benchmarks on Zens, but from what I got on
> Zen 5 I would try using a loop or unrolled loop for blocks in the range
> 1...128 (to avoid regressions on Zens where it seems slow) and see how
> rep movsb runs for bigger blocks?
>
> Honza
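To make the table side of that suggestion concrete, a sketch of what a
memset entry along those lines could look like, following the layout of
the existing stringop_algs tables in config/i386/x86-tune-costs.h.  The
128 and 8192 cut-offs and the _sketch name are assumptions for
discussion, not proposed or committed values:

/* Sketch only, not the committed tuning: unrolled loop for blocks up
   to 128 bytes, rep stosb from there to 8k, libcall above that.  */
static stringop_algs generic_memset_sketch[2] = {
  /* 32-bit tuning.  */
  {libcall, {{128, unrolled_loop, false},
	     {8192, rep_prefix_1_byte, false},
	     {-1, libcall, false}}},
  /* 64-bit tuning.  */
  {libcall, {{128, unrolled_loop, false},
	     {8192, rep_prefix_1_byte, false},
	     {-1, libcall, false}}}};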