Kugan Vivekanandarajah <kvivekana...@nvidia.com> writes: > Hi Richard, > > I want to follow up on this and see if you have a fix for this.
Sorry for the slow response. >> On 29 Oct 2024, at 9:41 pm, Richard Sandiford <richard.sandif...@arm.com> >> wr>> Kugan Vivekanandarajah <kvivekana...@nvidia.com> writes: >>> Hi, >>> >>> Fix for PR115258 cases a performance regression in some of the TSVC kernels >>> by adding additional mov instructions. >>> This patch fixes this. >>> >>> i.e., When operands are equal, it is likely that all of them get the same >>> register similar to: >>> (insn 19 15 20 3 (set (reg:V2x16QI 62 v30 [117]) >>> (unspec:V2x16QI [ >>> (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102]) >>> (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102]) >>> ] UNSPEC_CONCAT)) "tsvc.c":11:12 4871 {aarch64_combinev16qi} >>> (nil)) >>> >>> In this case, aarch64_split_combinev16qi would split it with one insm. >>> Hence, when the operands are equal, split after reload. >>> >>> Bootstrapped and recession tested on aarch64-linux-gnu, Is this ok for >>> trunk? >> >> Thanks for the patch. I'm not sure this is the right fix though. >> I'm planning to have a look at the PR once stage 1 closes. >> >> Richard I'll look at the RA aspect next week if I have time. But I think the main problem here is that we're using a two-input TBL when a single-input TBL would be enough. Even the two MOVs we had before were unnecessary. So I think we should go with the patch below to fix the TSVC regression. Tested on aarch64-linux-gnu. I'll push on Monday if there are no objections. Thanks, Richard >From effb6065bec30c3cab8cbd0bb49f1d0248bca737 Mon Sep 17 00:00:00 2001 From: Richard Sandiford <richard.sandif...@arm.com> Date: Fri, 7 Mar 2025 16:05:11 +0000 Subject: [PATCH] aarch64: Avoid unnecessary use of 2-input TBLs [PR115258] To: gcc-patches@gcc.gnu.org When using TBL for (say) a V4SI permutation, the aarch64 port first asks target-independent code to lower to a V16QI permutation. Then, during code generation, an input like: (reg:V4SI R) gets converted to: (subreg:V16QI (reg:V4SI R) 0) aarch64_vectorize_vec_perm_const had: d.op0 = op0 ? force_reg (op_mode, op0) : NULL_RTX; if (op0 == op1) d.op1 = d.op0; else d.op1 = op1 ? force_reg (op_mode, op1) : NULL_RTX; But subregs (unlike regs) are not shared, so the op0 == op1 check always failed for this case. We'd then force each subreg into a fresh register, meaning that during the later: aarch64_expand_vec_perm_1 (d->target, d->op0, d->op1, sel); there is no way for aarch64_expand_vec_perm_1 to realise that d->op0 and d->op1 are the same value. It would therefore generate a two-input TBL in the testcase, even though a single-input TBL is enough. I'm not sure forcing subregs to a fresh regiter is a good idea -- it caused problems for copysign & co. -- but that's not something to fiddle with during stage 4. Using op0 == op1 for rtx equality is independently wrong, so we might as well just fix that for now. The patch gets rid of extra MOVs that are a regression from GCC 14. The testcase is based on one from Kugan, itself based on TSVC. gcc/ PR target/115258 * config/aarch64/aarch64.cc (aarch64_vectorize_vec_perm_const): Use d.one_vector_p to decide whether op1 should be a copy of op0. gcc/testsuite/ PR target/115258 * gcc.target/aarch64/pr115258_2.c: New test. Co-authored-by: Kugan Vivekanandarajah <kvivekana...@nvidia.com> --- gcc/config/aarch64/aarch64.cc | 4 ++-- gcc/testsuite/gcc.target/aarch64/pr115258_2.c | 17 +++++++++++++++++ 2 files changed, 19 insertions(+), 2 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/pr115258_2.c diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 9196b8d906c..2e17c76f32a 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -26739,8 +26739,8 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode, d.op_vec_flags = aarch64_classify_vector_mode (d.op_mode); d.target = target; d.op0 = op0 ? force_reg (op_mode, op0) : NULL_RTX; - if (op0 == op1) - d.op1 = d.op0; + if (op0 && d.one_vector_p) + d.op1 = copy_rtx (d.op0); else d.op1 = op1 ? force_reg (op_mode, op1) : NULL_RTX; d.testing_p = !target; diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258_2.c b/gcc/testsuite/gcc.target/aarch64/pr115258_2.c new file mode 100644 index 00000000000..065e1bb9e7a --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr115258_2.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -mcpu=neoverse-v2" } */ + +extern __attribute__((aligned(64))) float a[32000], b[32000]; +int dummy(float[32000], float[32000], float); + +void s1112() { + + for (int nl = 0; nl < 100000 * 3; nl++) { + for (int i = 32000 - 1; i >= 0; i--) { + a[i] = b[i] + (float)1.; + } + dummy(a, b, 0.); + } +} + +/* { dg-final { scan-assembler-not {\tmov\tv[0-9]+\.16b,} } } */ -- 2.25.1