Kugan Vivekanandarajah <kvivekana...@nvidia.com> writes:
> Hi Richard,
>
> I want to follow up on this and see if you have a fix for this.

Sorry for the slow response.

>> On 29 Oct 2024, at 9:41 pm, Richard Sandiford <richard.sandif...@arm.com> 
>> wr>> Kugan Vivekanandarajah <kvivekana...@nvidia.com> writes:
>>> Hi,
>>> 
>>> Fix for PR115258 cases a performance regression in some of the TSVC kernels 
>>> by adding additional mov instructions.
>>> This patch fixes this.
>>> 
>>> i.e., When operands are equal, it is likely that all of them get the same 
>>> register similar to:
>>> (insn 19 15 20 3 (set (reg:V2x16QI 62 v30 [117])
>>>        (unspec:V2x16QI [
>>>                (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102])
>>>                (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102])
>>>            ] UNSPEC_CONCAT)) "tsvc.c":11:12 4871 {aarch64_combinev16qi}
>>>     (nil))
>>> 
>>> In this case, aarch64_split_combinev16qi would split it with one insm. 
>>> Hence, when the operands are equal, split after reload.
>>> 
>>> Bootstrapped and recession tested on aarch64-linux-gnu, Is this ok for 
>>> trunk?
>> 
>> Thanks for the patch.  I'm not sure this is the right fix though.
>> I'm planning to have a look at the PR once stage 1 closes.
>> 
>> Richard

I'll look at the RA aspect next week if I have time.  But I think the
main problem here is that we're using a two-input TBL when a single-input
TBL would be enough.  Even the two MOVs we had before were unnecessary.

So I think we should go with the patch below to fix the TSVC regression.
Tested on aarch64-linux-gnu.  I'll push on Monday if there are no
objections.

Thanks,
Richard


>From effb6065bec30c3cab8cbd0bb49f1d0248bca737 Mon Sep 17 00:00:00 2001
From: Richard Sandiford <richard.sandif...@arm.com>
Date: Fri, 7 Mar 2025 16:05:11 +0000
Subject: [PATCH] aarch64: Avoid unnecessary use of 2-input TBLs [PR115258]
To: gcc-patches@gcc.gnu.org

When using TBL for (say) a V4SI permutation, the aarch64 port first
asks target-independent code to lower to a V16QI permutation.
Then, during code generation, an input like:

  (reg:V4SI R)

gets converted to:

  (subreg:V16QI (reg:V4SI R) 0)

aarch64_vectorize_vec_perm_const had:

  d.op0 = op0 ? force_reg (op_mode, op0) : NULL_RTX;
  if (op0 == op1)
    d.op1 = d.op0;
  else
    d.op1 = op1 ? force_reg (op_mode, op1) : NULL_RTX;

But subregs (unlike regs) are not shared, so the op0 == op1 check
always failed for this case.  We'd then force each subreg into a
fresh register, meaning that during the later:

  aarch64_expand_vec_perm_1 (d->target, d->op0, d->op1, sel);

there is no way for aarch64_expand_vec_perm_1 to realise that
d->op0 and d->op1 are the same value.  It would therefore generate
a two-input TBL in the testcase, even though a single-input TBL
is enough.

I'm not sure forcing subregs to a fresh regiter is a good idea --
it caused problems for copysign & co. -- but that's not something
to fiddle with during stage 4.  Using op0 == op1 for rtx equality
is independently wrong, so we might as well just fix that for now.

The patch gets rid of extra MOVs that are a regression from GCC 14.

The testcase is based on one from Kugan, itself based on TSVC.

gcc/
        PR target/115258
        * config/aarch64/aarch64.cc (aarch64_vectorize_vec_perm_const): Use
        d.one_vector_p to decide whether op1 should be a copy of op0.

gcc/testsuite/
        PR target/115258
        * gcc.target/aarch64/pr115258_2.c: New test.

Co-authored-by: Kugan Vivekanandarajah <kvivekana...@nvidia.com>
---
 gcc/config/aarch64/aarch64.cc                 |  4 ++--
 gcc/testsuite/gcc.target/aarch64/pr115258_2.c | 17 +++++++++++++++++
 2 files changed, 19 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr115258_2.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9196b8d906c..2e17c76f32a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26739,8 +26739,8 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, 
machine_mode op_mode,
   d.op_vec_flags = aarch64_classify_vector_mode (d.op_mode);
   d.target = target;
   d.op0 = op0 ? force_reg (op_mode, op0) : NULL_RTX;
-  if (op0 == op1)
-    d.op1 = d.op0;
+  if (op0 && d.one_vector_p)
+    d.op1 = copy_rtx (d.op0);
   else
     d.op1 = op1 ? force_reg (op_mode, op1) : NULL_RTX;
   d.testing_p = !target;
diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258_2.c 
b/gcc/testsuite/gcc.target/aarch64/pr115258_2.c
new file mode 100644
index 00000000000..065e1bb9e7a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr115258_2.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-v2" } */
+
+extern __attribute__((aligned(64))) float a[32000], b[32000];
+int dummy(float[32000], float[32000], float);
+
+void s1112() {
+
+  for (int nl = 0; nl < 100000 * 3; nl++) {
+    for (int i = 32000 - 1; i >= 0; i--) {
+      a[i] = b[i] + (float)1.;
+    }
+    dummy(a, b, 0.);
+  }
+}
+
+/* { dg-final { scan-assembler-not {\tmov\tv[0-9]+\.16b,} } } */
-- 
2.25.1

Reply via email to