Re: [PATCH] aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2

Kyrylo Tkachov Thu, 25 Jul 2024 05:49:53 -0700


> On 25 Jul 2024, at 13:58, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Jennifer Schmitz <jschm...@nvidia.com> writes:
>> Thank you for the feedback. I added checks for SCALAR_INT_MODE_P for the reg 
>> operands of the compare and if-then-else expressions. As it is not legal to 
>> have different modes in the operand registers, I only added one check for 
>> each of the expressions.
>> The updated patch was bootstrapped and tested again.
>> Best,
>> Jennifer
>> 
>> From 8da609be99fece8130cf1429bd938b2a26c6672b Mon Sep 17 00:00:00 2001
>> From: Jennifer Schmitz <jschm...@nvidia.com>
>> Date: Wed, 24 Jul 2024 06:13:59 -0700
>> Subject: [PATCH] aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2
>> 
>> According to the Neoverse V2 Software Optimization Guide (section 4.14), the
>> instruction pairs CMP+CSEL and CMP+CSET can be fused, which had not been
>> implemented so far. This patch implements and tests the two fusion pairs.
>> 
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> There was also no non-noise impact on SPEC CPU2017 benchmark.
>> OK for mainline?
>> 
>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>> 
>> gcc/
>> 
>>      * config/aarch64/aarch64.cc (aarch_macro_fusion_pair_p): Implement
>>      fusion logic.
>>      * config/aarch64/aarch64-fusion-pairs.def (cmp+csel): New entry.
>>      (cmp+cset): Likewise.
>>      * config/aarch64/tuning_models/neoversev2.h: Enable logic in
>>      field fusible_ops.
>> 
>> gcc/testsuite/
>> 
>>      * gcc.target/aarch64/fuse_cmp_csel.c: New test.
>>      * gcc.target/aarch64/fuse_cmp_cset.c: Likewise.
> 
> Thanks for the update.
> 
> It looks from a quick scan like the main three instructions associated
> with single-set integer COMPAREs are CMP, CMN and TST.  TST could be
> distinguished from CMP and CMN based on get_attr_type (), although it
> looks like:
> 
> (define_insn "*and<mode>_compare0"
>  [(set (reg:CC_Z CC_REGNUM)
>        (compare:CC_Z
>         (match_operand:SHORT 0 "register_operand" "r")
>         (const_int 0)))]
>  ""
>  "tst\\t%<w>0, <short_mask>"
>  [(set_attr "type" "alus_imm")]
> )


We can change that independently.

> 
> should use logics_imm instead of alus_imm.
> 
> Alternatively, we could add a new attribute for "compare_type" and use
> that.  That would make the test in aarch_macro_fusion_pair_p slightly
> simpler, since it could use get_attr_compare_type without having to
> look at the pattern of prev_set.  But there's a danger that we'd
> forget to add the new attribute for new comparison instructions.
> 
> I did wonder whether we could simply punt on CC_Zmode, but that's
> not a reliable test.
> 
> But I suppose the counter-argument to my questions above is: how bad
> would it be if we fused CMN and TST?  They are at least plausible
> fusions, so it probably doesn't matter if we include them too.

CMN and TST can be fused with conditional branches, but not with CSEL according 
to my reading of the SWOG so I guess we’d want to keep them separate in 
principle. In practice, I can’t imagine the performance difference will be 
measurable in real workloads if they are kept together.
Jennifer’s benchmarking of this patch didn’t show any negative performance 
consequences of the more aggressive fusion, and even a slight improvement.


> 
> So:
> 
>> ---
>> gcc/config/aarch64/aarch64-fusion-pairs.def   |  2 ++
>> gcc/config/aarch64/aarch64.cc                 | 22 +++++++++++++
>> gcc/config/aarch64/tuning_models/neoversev2.h |  5 ++-
>> .../gcc.target/aarch64/fuse_cmp_csel.c        | 33 +++++++++++++++++++
>> .../gcc.target/aarch64/fuse_cmp_cset.c        | 31 +++++++++++++++++
>> 5 files changed, 92 insertions(+), 1 deletion(-)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c
>> 
>> diff --git a/gcc/config/aarch64/aarch64-fusion-pairs.def 
>> b/gcc/config/aarch64/aarch64-fusion-pairs.def
>> index 9a43b0c8065..bf5e85ba8fe 100644
>> --- a/gcc/config/aarch64/aarch64-fusion-pairs.def
>> +++ b/gcc/config/aarch64/aarch64-fusion-pairs.def
>> @@ -37,5 +37,7 @@ AARCH64_FUSION_PAIR ("aes+aesmc", AES_AESMC)
>> AARCH64_FUSION_PAIR ("alu+branch", ALU_BRANCH)
>> AARCH64_FUSION_PAIR ("alu+cbz", ALU_CBZ)
>> AARCH64_FUSION_PAIR ("addsub_2reg_const1", ADDSUB_2REG_CONST1)
>> +AARCH64_FUSION_PAIR ("cmp+csel", CMP_CSEL)
>> +AARCH64_FUSION_PAIR ("cmp+cset", CMP_CSET)
>> 
>> #undef AARCH64_FUSION_PAIR
>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> index 9e51236ce9f..7a0351f7dac 100644
>> --- a/gcc/config/aarch64/aarch64.cc
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -27348,6 +27348,28 @@ aarch_macro_fusion_pair_p (rtx_insn *prev, rtx_insn 
>> *curr)
>>       && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
>>     return true;
>> 
>> +  /* Fuse CMP and CSEL.  */
> 
> How about adding something like:
> 
>  The test for CMP is not exact, and includes things like CMN and TST
>  as well.  However, fusing with those instructions doesn't seem
>  inherently bad.
> 
>> +  if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSEL)
>> +      && prev_set && curr_set
>> +      && GET_CODE (SET_SRC (prev_set)) == COMPARE
>> +      && GET_CODE (SET_SRC (curr_set)) == IF_THEN_ELSE
>> +      && REG_P (XEXP (SET_SRC (curr_set), 1))
>> +      && REG_P (XEXP (SET_SRC (curr_set), 2))
> 
> I think we could use aarch64_reg_or_zero instead of REG_P for the last
> two lines, which would allow selects involving W/XZR as well.
> 
> Looks good to me otherwise, but I'd be interested to hear what others
> think about the CMP match.

Looks like we may want to make this similar to the AARCH64_FUSE_ALU_BRANCH case 
above, maybe share some of the code even?
Thanks,
Kyrill


> 
> Thanks,
> Richard
> 
>> +      && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (prev_set), 0)))
>> +      && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (curr_set), 1)))
>> +      && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
>> +    return true;
>> +
>> +  /* Fuse CMP and CSET.  */
>> +  if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSET)
>> +      && prev_set && curr_set
>> +      && GET_CODE (SET_SRC (prev_set)) == COMPARE
>> +      && GET_RTX_CLASS (GET_CODE (SET_SRC (curr_set))) == RTX_COMPARE
>> +      && REG_P (SET_DEST (curr_set))
>> +      && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (prev_set), 0)))
>> +      && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
>> +    return true;
>> +
>>   /* Fuse flag-setting ALU instructions and conditional branch.  */
>>   if (aarch64_fusion_enabled_p (AARCH64_FUSE_ALU_BRANCH)
>>       && any_condjump_p (curr))
>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>> index f76e4ef358f..ae99fab22d8 100644
>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>> @@ -221,7 +221,10 @@ static const struct tune_params neoversev2_tunings =
>>     2 /* store_pred.  */
>>   }, /* memmov_cost.  */
>>   5, /* issue_rate  */
>> -  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
>> +  (AARCH64_FUSE_AES_AESMC
>> +   | AARCH64_FUSE_CMP_BRANCH
>> +   | AARCH64_FUSE_CMP_CSEL
>> +   | AARCH64_FUSE_CMP_CSET), /* fusible_ops  */
>>   "32:16",   /* function_align.  */
>>   "4",               /* jump_align.  */
>>   "32:16",   /* loop_align.  */
>> diff --git a/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c 
>> b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c
>> new file mode 100644
>> index 00000000000..85f302bab98
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c
>> @@ -0,0 +1,33 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -mcpu=neoverse-v2" } */
>> +/* { dg-final { check-function-bodies "**" "" } } */
>> +
>> +/*
>> +** f1:
>> +**   ...
>> +**   cmp     w[0-9]+, w[0-9]+
>> +**   csel    w[0-9]+, w[0-9]+, w[0-9]+, le
>> +**   ret
>> +*/
>> +int f1 (int a, int b, int c)
>> +{
>> +  int cmp = a > b;
>> +  int add1 = c + 3;
>> +  int add2 = c + 8;
>> +  return cmp ? add1 : add2;
>> +}
>> +
>> +/*
>> +** f2:
>> +**   ...
>> +**   cmp     x[0-9]+, x[0-9]+
>> +**   csel    x[0-9]+, x[0-9]+, x[0-9]+, le
>> +**   ret
>> +*/
>> +long long f2 (long long a, long long b, long long c)
>> +{
>> + long long cmp = a > b;
>> +  long long add1 = c + 3;
>> +  long long add2 = c + 8;
>> +  return cmp ? add1 : add2;
>> +}
>> diff --git a/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c 
>> b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c
>> new file mode 100644
>> index 00000000000..04f1ce2773b
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c
>> @@ -0,0 +1,31 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -mcpu=neoverse-v2" } */
>> +/* { dg-final { check-function-bodies "**" "" } } */
>> +
>> +/*
>> +** f1:
>> +**   cmp     w[0-9]+, w[0-9]+
>> +**   cset    w[0-9]+, gt
>> +**   ...
>> +*/
>> +int g;
>> +int f1 (int a, int b)
>> +{
>> +  int cmp = a > b;
>> +  g = cmp + 1;
>> +  return cmp;
>> +}
>> +
>> +/*
>> +** f2:
>> +**   cmp     x[0-9]+, x[0-9]+
>> +**   cset    x[0-9]+, gt
>> +**   ...
>> +*/
>> +long long h;
>> +long long f2 (long long a, long long b)
>> +{
>> +  long long cmp = a > b;
>> +  h = cmp + 1;
>> +  return cmp;
>> +}

Re: [PATCH] aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2

Reply via email to