Thanks for the feedback! I updated the patch based on your comments, more detailed comments inline below. The updated version was bootstrapped and tested again, no regression. Best, Jennifer
0001-AArch64-Fuse-CMP-CSEL-and-CMP-CSET-for-mcpu-neoverse.patch
Description: Binary data
> On 25 Jul 2024, at 14:49, Kyrylo Tkachov <ktkac...@nvidia.com> wrote: > > > >> On 25 Jul 2024, at 13:58, Richard Sandiford <richard.sandif...@arm.com> >> wrote: >> >> External email: Use caution opening links or attachments >> >> >> Jennifer Schmitz <jschm...@nvidia.com> writes: >>> Thank you for the feedback. I added checks for SCALAR_INT_MODE_P for the >>> reg operands of the compare and if-then-else expressions. As it is not >>> legal to have different modes in the operand registers, I only added one >>> check for each of the expressions. >>> The updated patch was bootstrapped and tested again. >>> Best, >>> Jennifer >>> >>> From 8da609be99fece8130cf1429bd938b2a26c6672b Mon Sep 17 00:00:00 2001 >>> From: Jennifer Schmitz <jschm...@nvidia.com> >>> Date: Wed, 24 Jul 2024 06:13:59 -0700 >>> Subject: [PATCH] aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2 >>> >>> According to the Neoverse V2 Software Optimization Guide (section 4.14), the >>> instruction pairs CMP+CSEL and CMP+CSET can be fused, which had not been >>> implemented so far. This patch implements and tests the two fusion pairs. >>> >>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no >>> regression. >>> There was also no non-noise impact on SPEC CPU2017 benchmark. >>> OK for mainline? >>> >>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> >>> >>> gcc/ >>> >>> * config/aarch64/aarch64.cc (aarch_macro_fusion_pair_p): Implement >>> fusion logic. >>> * config/aarch64/aarch64-fusion-pairs.def (cmp+csel): New entry. >>> (cmp+cset): Likewise. >>> * config/aarch64/tuning_models/neoversev2.h: Enable logic in >>> field fusible_ops. >>> >>> gcc/testsuite/ >>> >>> * gcc.target/aarch64/fuse_cmp_csel.c: New test. >>> * gcc.target/aarch64/fuse_cmp_cset.c: Likewise. >> >> Thanks for the update. >> >> It looks from a quick scan like the main three instructions associated >> with single-set integer COMPAREs are CMP, CMN and TST. TST could be >> distinguished from CMP and CMN based on get_attr_type (), although it >> looks like: >> >> (define_insn "*and<mode>_compare0" >> [(set (reg:CC_Z CC_REGNUM) >> (compare:CC_Z >> (match_operand:SHORT 0 "register_operand" "r") >> (const_int 0)))] >> "" >> "tst\\t%<w>0, <short_mask>" >> [(set_attr "type" "alus_imm")] >> ) > > We can change that independently. I submitted a small patch to fix that. > >> >> should use logics_imm instead of alus_imm. >> >> Alternatively, we could add a new attribute for "compare_type" and use >> that. That would make the test in aarch_macro_fusion_pair_p slightly >> simpler, since it could use get_attr_compare_type without having to >> look at the pattern of prev_set. But there's a danger that we'd >> forget to add the new attribute for new comparison instructions. >> >> I did wonder whether we could simply punt on CC_Zmode, but that's >> not a reliable test. >> >> But I suppose the counter-argument to my questions above is: how bad >> would it be if we fused CMN and TST? They are at least plausible >> fusions, so it probably doesn't matter if we include them too. > > CMN and TST can be fused with conditional branches, but not with CSEL > according to my reading of the SWOG so I guess we’d want to keep them > separate in principle. In practice, I can’t imagine the performance > difference will be measurable in real workloads if they are kept together. > Jennifer’s benchmarking of this patch didn’t show any negative performance > consequences of the more aggressive fusion, and even a slight improvement. As suggested, I used get_attr_type to distinguish TST from CMP/CMN. > > >> >> So: >> >>> --- >>> gcc/config/aarch64/aarch64-fusion-pairs.def | 2 ++ >>> gcc/config/aarch64/aarch64.cc | 22 +++++++++++++ >>> gcc/config/aarch64/tuning_models/neoversev2.h | 5 ++- >>> .../gcc.target/aarch64/fuse_cmp_csel.c | 33 +++++++++++++++++++ >>> .../gcc.target/aarch64/fuse_cmp_cset.c | 31 +++++++++++++++++ >>> 5 files changed, 92 insertions(+), 1 deletion(-) >>> create mode 100644 gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c >>> create mode 100644 gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c >>> >>> diff --git a/gcc/config/aarch64/aarch64-fusion-pairs.def >>> b/gcc/config/aarch64/aarch64-fusion-pairs.def >>> index 9a43b0c8065..bf5e85ba8fe 100644 >>> --- a/gcc/config/aarch64/aarch64-fusion-pairs.def >>> +++ b/gcc/config/aarch64/aarch64-fusion-pairs.def >>> @@ -37,5 +37,7 @@ AARCH64_FUSION_PAIR ("aes+aesmc", AES_AESMC) >>> AARCH64_FUSION_PAIR ("alu+branch", ALU_BRANCH) >>> AARCH64_FUSION_PAIR ("alu+cbz", ALU_CBZ) >>> AARCH64_FUSION_PAIR ("addsub_2reg_const1", ADDSUB_2REG_CONST1) >>> +AARCH64_FUSION_PAIR ("cmp+csel", CMP_CSEL) >>> +AARCH64_FUSION_PAIR ("cmp+cset", CMP_CSET) >>> >>> #undef AARCH64_FUSION_PAIR >>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >>> index 9e51236ce9f..7a0351f7dac 100644 >>> --- a/gcc/config/aarch64/aarch64.cc >>> +++ b/gcc/config/aarch64/aarch64.cc >>> @@ -27348,6 +27348,28 @@ aarch_macro_fusion_pair_p (rtx_insn *prev, >>> rtx_insn *curr) >>> && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr))) >>> return true; >>> >>> + /* Fuse CMP and CSEL. */ >> >> How about adding something like: >> >> The test for CMP is not exact, and includes things like CMN and TST >> as well. However, fusing with those instructions doesn't seem >> inherently bad. >> >>> + if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSEL) >>> + && prev_set && curr_set >>> + && GET_CODE (SET_SRC (prev_set)) == COMPARE >>> + && GET_CODE (SET_SRC (curr_set)) == IF_THEN_ELSE >>> + && REG_P (XEXP (SET_SRC (curr_set), 1)) >>> + && REG_P (XEXP (SET_SRC (curr_set), 2)) >> >> I think we could use aarch64_reg_or_zero instead of REG_P for the last >> two lines, which would allow selects involving W/XZR as well. >> >> Looks good to me otherwise, but I'd be interested to hear what others >> think about the CMP match. > > Looks like we may want to make this similar to the AARCH64_FUSE_ALU_BRANCH > case above, maybe share some of the code even? > Thanks, > Kyrill > I was able to share code between FUSE_CMP_CSEL and FUSE_CMP_CSET. > >> >> Thanks, >> Richard >> >>> + && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (prev_set), 0))) >>> + && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (curr_set), 1))) >>> + && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr))) >>> + return true; >>> + >>> + /* Fuse CMP and CSET. */ >>> + if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSET) >>> + && prev_set && curr_set >>> + && GET_CODE (SET_SRC (prev_set)) == COMPARE >>> + && GET_RTX_CLASS (GET_CODE (SET_SRC (curr_set))) == RTX_COMPARE >>> + && REG_P (SET_DEST (curr_set)) >>> + && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (prev_set), 0))) >>> + && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr))) >>> + return true; >>> + >>> /* Fuse flag-setting ALU instructions and conditional branch. */ >>> if (aarch64_fusion_enabled_p (AARCH64_FUSE_ALU_BRANCH) >>> && any_condjump_p (curr)) >>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h >>> b/gcc/config/aarch64/tuning_models/neoversev2.h >>> index f76e4ef358f..ae99fab22d8 100644 >>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h >>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h >>> @@ -221,7 +221,10 @@ static const struct tune_params neoversev2_tunings = >>> 2 /* store_pred. */ >>> }, /* memmov_cost. */ >>> 5, /* issue_rate */ >>> - (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops */ >>> + (AARCH64_FUSE_AES_AESMC >>> + | AARCH64_FUSE_CMP_BRANCH >>> + | AARCH64_FUSE_CMP_CSEL >>> + | AARCH64_FUSE_CMP_CSET), /* fusible_ops */ >>> "32:16", /* function_align. */ >>> "4", /* jump_align. */ >>> "32:16", /* loop_align. */ >>> diff --git a/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c >>> b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c >>> new file mode 100644 >>> index 00000000000..85f302bab98 >>> --- /dev/null >>> +++ b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c >>> @@ -0,0 +1,33 @@ >>> +/* { dg-do compile } */ >>> +/* { dg-options "-O2 -mcpu=neoverse-v2" } */ >>> +/* { dg-final { check-function-bodies "**" "" } } */ >>> + >>> +/* >>> +** f1: >>> +** ... >>> +** cmp w[0-9]+, w[0-9]+ >>> +** csel w[0-9]+, w[0-9]+, w[0-9]+, le >>> +** ret >>> +*/ >>> +int f1 (int a, int b, int c) >>> +{ >>> + int cmp = a > b; >>> + int add1 = c + 3; >>> + int add2 = c + 8; >>> + return cmp ? add1 : add2; >>> +} >>> + >>> +/* >>> +** f2: >>> +** ... >>> +** cmp x[0-9]+, x[0-9]+ >>> +** csel x[0-9]+, x[0-9]+, x[0-9]+, le >>> +** ret >>> +*/ >>> +long long f2 (long long a, long long b, long long c) >>> +{ >>> + long long cmp = a > b; >>> + long long add1 = c + 3; >>> + long long add2 = c + 8; >>> + return cmp ? add1 : add2; >>> +} >>> diff --git a/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c >>> b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c >>> new file mode 100644 >>> index 00000000000..04f1ce2773b >>> --- /dev/null >>> +++ b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c >>> @@ -0,0 +1,31 @@ >>> +/* { dg-do compile } */ >>> +/* { dg-options "-O2 -mcpu=neoverse-v2" } */ >>> +/* { dg-final { check-function-bodies "**" "" } } */ >>> + >>> +/* >>> +** f1: >>> +** cmp w[0-9]+, w[0-9]+ >>> +** cset w[0-9]+, gt >>> +** ... >>> +*/ >>> +int g; >>> +int f1 (int a, int b) >>> +{ >>> + int cmp = a > b; >>> + g = cmp + 1; >>> + return cmp; >>> +} >>> + >>> +/* >>> +** f2: >>> +** cmp x[0-9]+, x[0-9]+ >>> +** cset x[0-9]+, gt >>> +** ... >>> +*/ >>> +long long h; >>> +long long f2 (long long a, long long b) >>> +{ >>> + long long cmp = a > b; >>> + h = cmp + 1; >>> + return cmp; >>> +}
smime.p7s
Description: S/MIME cryptographic signature