Re: [PATCH] aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2

Jennifer Schmitz Wed, 31 Jul 2024 05:08:58 -0700

Thanks for the feedback! I updated the patch based on your comments, more 
detailed comments inline below. The updated version was bootstrapped and tested 
again, no regression.
Best,
Jennifer

0001-AArch64-Fuse-CMP-CSEL-and-CMP-CSET-for-mcpu-neoverse.patch
Description: Binary data

> On 25 Jul 2024, at 14:49, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
> 
> 
> 
>> On 25 Jul 2024, at 13:58, Richard Sandiford <richard.sandif...@arm.com> 
>> wrote:
>> 
>> External email: Use caution opening links or attachments
>> 
>> 
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> Thank you for the feedback. I added checks for SCALAR_INT_MODE_P for the 
>>> reg operands of the compare and if-then-else expressions. As it is not 
>>> legal to have different modes in the operand registers, I only added one 
>>> check for each of the expressions.
>>> The updated patch was bootstrapped and tested again.
>>> Best,
>>> Jennifer
>>> 
>>> From 8da609be99fece8130cf1429bd938b2a26c6672b Mon Sep 17 00:00:00 2001
>>> From: Jennifer Schmitz <jschm...@nvidia.com>
>>> Date: Wed, 24 Jul 2024 06:13:59 -0700
>>> Subject: [PATCH] aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2
>>> 
>>> According to the Neoverse V2 Software Optimization Guide (section 4.14), the
>>> instruction pairs CMP+CSEL and CMP+CSET can be fused, which had not been
>>> implemented so far. This patch implements and tests the two fusion pairs.
>>> 
>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>> regression.
>>> There was also no non-noise impact on SPEC CPU2017 benchmark.
>>> OK for mainline?
>>> 
>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>> 
>>> gcc/
>>> 
>>>     * config/aarch64/aarch64.cc (aarch_macro_fusion_pair_p): Implement
>>>     fusion logic.
>>>     * config/aarch64/aarch64-fusion-pairs.def (cmp+csel): New entry.
>>>     (cmp+cset): Likewise.
>>>     * config/aarch64/tuning_models/neoversev2.h: Enable logic in
>>>     field fusible_ops.
>>> 
>>> gcc/testsuite/
>>> 
>>>     * gcc.target/aarch64/fuse_cmp_csel.c: New test.
>>>     * gcc.target/aarch64/fuse_cmp_cset.c: Likewise.
>> 
>> Thanks for the update.
>> 
>> It looks from a quick scan like the main three instructions associated
>> with single-set integer COMPAREs are CMP, CMN and TST.  TST could be
>> distinguished from CMP and CMN based on get_attr_type (), although it
>> looks like:
>> 
>> (define_insn "*and<mode>_compare0"
>> [(set (reg:CC_Z CC_REGNUM)
>>       (compare:CC_Z
>>        (match_operand:SHORT 0 "register_operand" "r")
>>        (const_int 0)))]
>> ""
>> "tst\\t%<w>0, <short_mask>"
>> [(set_attr "type" "alus_imm")]
>> )
> 
> We can change that independently.
I submitted a small patch to fix that.
> 
>> 
>> should use logics_imm instead of alus_imm.
>> 
>> Alternatively, we could add a new attribute for "compare_type" and use
>> that.  That would make the test in aarch_macro_fusion_pair_p slightly
>> simpler, since it could use get_attr_compare_type without having to
>> look at the pattern of prev_set.  But there's a danger that we'd
>> forget to add the new attribute for new comparison instructions.
>> 
>> I did wonder whether we could simply punt on CC_Zmode, but that's
>> not a reliable test.
>> 
>> But I suppose the counter-argument to my questions above is: how bad
>> would it be if we fused CMN and TST?  They are at least plausible
>> fusions, so it probably doesn't matter if we include them too.
> 
> CMN and TST can be fused with conditional branches, but not with CSEL 
> according to my reading of the SWOG so I guess we’d want to keep them 
> separate in principle. In practice, I can’t imagine the performance 
> difference will be measurable in real workloads if they are kept together.
> Jennifer’s benchmarking of this patch didn’t show any negative performance 
> consequences of the more aggressive fusion, and even a slight improvement.
As suggested, I used get_attr_type to distinguish TST from CMP/CMN.
> 
> 
>> 
>> So:
>> 
>>> ---
>>> gcc/config/aarch64/aarch64-fusion-pairs.def   |  2 ++
>>> gcc/config/aarch64/aarch64.cc                 | 22 +++++++++++++
>>> gcc/config/aarch64/tuning_models/neoversev2.h |  5 ++-
>>> .../gcc.target/aarch64/fuse_cmp_csel.c        | 33 +++++++++++++++++++
>>> .../gcc.target/aarch64/fuse_cmp_cset.c        | 31 +++++++++++++++++
>>> 5 files changed, 92 insertions(+), 1 deletion(-)
>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c
>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c
>>> 
>>> diff --git a/gcc/config/aarch64/aarch64-fusion-pairs.def 
>>> b/gcc/config/aarch64/aarch64-fusion-pairs.def
>>> index 9a43b0c8065..bf5e85ba8fe 100644
>>> --- a/gcc/config/aarch64/aarch64-fusion-pairs.def
>>> +++ b/gcc/config/aarch64/aarch64-fusion-pairs.def
>>> @@ -37,5 +37,7 @@ AARCH64_FUSION_PAIR ("aes+aesmc", AES_AESMC)
>>> AARCH64_FUSION_PAIR ("alu+branch", ALU_BRANCH)
>>> AARCH64_FUSION_PAIR ("alu+cbz", ALU_CBZ)
>>> AARCH64_FUSION_PAIR ("addsub_2reg_const1", ADDSUB_2REG_CONST1)
>>> +AARCH64_FUSION_PAIR ("cmp+csel", CMP_CSEL)
>>> +AARCH64_FUSION_PAIR ("cmp+cset", CMP_CSET)
>>> 
>>> #undef AARCH64_FUSION_PAIR
>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>> index 9e51236ce9f..7a0351f7dac 100644
>>> --- a/gcc/config/aarch64/aarch64.cc
>>> +++ b/gcc/config/aarch64/aarch64.cc
>>> @@ -27348,6 +27348,28 @@ aarch_macro_fusion_pair_p (rtx_insn *prev, 
>>> rtx_insn *curr)
>>>      && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
>>>    return true;
>>> 
>>> +  /* Fuse CMP and CSEL.  */
>> 
>> How about adding something like:
>> 
>> The test for CMP is not exact, and includes things like CMN and TST
>> as well.  However, fusing with those instructions doesn't seem
>> inherently bad.
>> 
>>> +  if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSEL)
>>> +      && prev_set && curr_set
>>> +      && GET_CODE (SET_SRC (prev_set)) == COMPARE
>>> +      && GET_CODE (SET_SRC (curr_set)) == IF_THEN_ELSE
>>> +      && REG_P (XEXP (SET_SRC (curr_set), 1))
>>> +      && REG_P (XEXP (SET_SRC (curr_set), 2))
>> 
>> I think we could use aarch64_reg_or_zero instead of REG_P for the last
>> two lines, which would allow selects involving W/XZR as well.
>> 
>> Looks good to me otherwise, but I'd be interested to hear what others
>> think about the CMP match.
> 
> Looks like we may want to make this similar to the AARCH64_FUSE_ALU_BRANCH 
> case above, maybe share some of the code even?
> Thanks,
> Kyrill
> 
I was able to share code between FUSE_CMP_CSEL and FUSE_CMP_CSET.
> 
>> 
>> Thanks,
>> Richard
>> 
>>> +      && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (prev_set), 0)))
>>> +      && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (curr_set), 1)))
>>> +      && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
>>> +    return true;
>>> +
>>> +  /* Fuse CMP and CSET.  */
>>> +  if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSET)
>>> +      && prev_set && curr_set
>>> +      && GET_CODE (SET_SRC (prev_set)) == COMPARE
>>> +      && GET_RTX_CLASS (GET_CODE (SET_SRC (curr_set))) == RTX_COMPARE
>>> +      && REG_P (SET_DEST (curr_set))
>>> +      && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (prev_set), 0)))
>>> +      && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
>>> +    return true;
>>> +
>>>  /* Fuse flag-setting ALU instructions and conditional branch.  */
>>>  if (aarch64_fusion_enabled_p (AARCH64_FUSE_ALU_BRANCH)
>>>      && any_condjump_p (curr))
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>> index f76e4ef358f..ae99fab22d8 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>> @@ -221,7 +221,10 @@ static const struct tune_params neoversev2_tunings =
>>>    2 /* store_pred.  */
>>>  }, /* memmov_cost.  */
>>>  5, /* issue_rate  */
>>> -  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
>>> +  (AARCH64_FUSE_AES_AESMC
>>> +   | AARCH64_FUSE_CMP_BRANCH
>>> +   | AARCH64_FUSE_CMP_CSEL
>>> +   | AARCH64_FUSE_CMP_CSET), /* fusible_ops  */
>>>  "32:16",   /* function_align.  */
>>>  "4",               /* jump_align.  */
>>>  "32:16",   /* loop_align.  */
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c 
>>> b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c
>>> new file mode 100644
>>> index 00000000000..85f302bab98
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c
>>> @@ -0,0 +1,33 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2 -mcpu=neoverse-v2" } */
>>> +/* { dg-final { check-function-bodies "**" "" } } */
>>> +
>>> +/*
>>> +** f1:
>>> +**   ...
>>> +**   cmp     w[0-9]+, w[0-9]+
>>> +**   csel    w[0-9]+, w[0-9]+, w[0-9]+, le
>>> +**   ret
>>> +*/
>>> +int f1 (int a, int b, int c)
>>> +{
>>> +  int cmp = a > b;
>>> +  int add1 = c + 3;
>>> +  int add2 = c + 8;
>>> +  return cmp ? add1 : add2;
>>> +}
>>> +
>>> +/*
>>> +** f2:
>>> +**   ...
>>> +**   cmp     x[0-9]+, x[0-9]+
>>> +**   csel    x[0-9]+, x[0-9]+, x[0-9]+, le
>>> +**   ret
>>> +*/
>>> +long long f2 (long long a, long long b, long long c)
>>> +{
>>> + long long cmp = a > b;
>>> +  long long add1 = c + 3;
>>> +  long long add2 = c + 8;
>>> +  return cmp ? add1 : add2;
>>> +}
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c 
>>> b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c
>>> new file mode 100644
>>> index 00000000000..04f1ce2773b
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c
>>> @@ -0,0 +1,31 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O2 -mcpu=neoverse-v2" } */
>>> +/* { dg-final { check-function-bodies "**" "" } } */
>>> +
>>> +/*
>>> +** f1:
>>> +**   cmp     w[0-9]+, w[0-9]+
>>> +**   cset    w[0-9]+, gt
>>> +**   ...
>>> +*/
>>> +int g;
>>> +int f1 (int a, int b)
>>> +{
>>> +  int cmp = a > b;
>>> +  g = cmp + 1;
>>> +  return cmp;
>>> +}
>>> +
>>> +/*
>>> +** f2:
>>> +**   cmp     x[0-9]+, x[0-9]+
>>> +**   cset    x[0-9]+, gt
>>> +**   ...
>>> +*/
>>> +long long h;
>>> +long long f2 (long long a, long long b)
>>> +{
>>> +  long long cmp = a > b;
>>> +  h = cmp + 1;
>>> +  return cmp;
>>> +}

smime.p7s
Description: S/MIME cryptographic signature

Re: [PATCH] aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2

Reply via email to