[gcc r15-3477] docs: double mention of armv9-a.
https://gcc.gnu.org/g:240be78237c6d70e0b30ed187c559e359ce81557 commit r15-3477-g240be78237c6d70e0b30ed187c559e359ce81557 Author: Tamar Christina Date: Thu Sep 5 10:35:18 2024 +0100 docs: double mention of armv9-a. The list of available architecture for Arm is incorrectly listing armv9-a twice. This removes the duplicate armv9-a enumeration from the part of the list having M-profile targets. gcc/ChangeLog: * doc/invoke.texi: Remove duplicate armv9-a mention. Diff: --- gcc/doc/invoke.texi | 1 - 1 file changed, 1 deletion(-) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 43afb0984e5..193db761d64 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -23025,7 +23025,6 @@ Permissible names are: @samp{armv7-m}, @samp{armv7e-m}, @samp{armv8-m.base}, @samp{armv8-m.main}, @samp{armv8.1-m.main}, -@samp{armv9-a}, @samp{iwmmxt} and @samp{iwmmxt2}. Additionally, the following architectures, which lack support for the
[gcc r15-3478] testsuite: remove -fwrapv from signbit-5.c
https://gcc.gnu.org/g:67eaf67360e434dd5969e1c66f043e3c751f9f52 commit r15-3478-g67eaf67360e434dd5969e1c66f043e3c751f9f52 Author: Tamar Christina Date: Thu Sep 5 10:36:02 2024 +0100 testsuite: remove -fwrapv from signbit-5.c The meaning of the testcase was changed by passing it -fwrapv. The reason for the test failures on some platform was because the test was testing some implementation defined behavior wrt INT_MIN in generic code. Instead of using -fwrapv this just removes the border case from the test so all the values now have a defined semantic. It still relies on the handling of shifting a negative value right, but that wasn't changed with -fwrapv anyway. The -fwrapv case is being handled already by other testcases. gcc/testsuite/ChangeLog: * gcc.dg/signbit-5.c: Remove -fwrapv and change INT_MIN to INT_MIN+1. Diff: --- gcc/testsuite/gcc.dg/signbit-5.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/gcc/testsuite/gcc.dg/signbit-5.c b/gcc/testsuite/gcc.dg/signbit-5.c index 57e29e3ca63..2601582ed4e 100644 --- a/gcc/testsuite/gcc.dg/signbit-5.c +++ b/gcc/testsuite/gcc.dg/signbit-5.c @@ -1,5 +1,5 @@ /* { dg-do run } */ -/* { dg-options "-O3 -fwrapv" } */ +/* { dg-options "-O3" } */ /* This test does not work when the truth type does not match vector type. */ /* { dg-additional-options "-march=armv8-a" { target aarch64_sve } } */ @@ -42,8 +42,8 @@ int main () TYPE a[N]; TYPE b[N]; - a[0] = INT_MIN; - b[0] = INT_MIN; + a[0] = INT_MIN+1; + b[0] = INT_MIN+1; for (int i = 1; i < N; ++i) {
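For context, a minimal sketch of the sign-bit idiom the test exercises (the function name sign_mask is illustrative; this is not the actual signbit-5.c body). The arithmetic right shift of a negative value is implementation-defined but unaffected by -fwrapv, which is why the commit can drop -fwrapv and simply move the boundary value from INT_MIN to INT_MIN+1 so every remaining input has defined semantics.

#include <limits.h>

/* Illustrative only: -1 for negative x, 0 otherwise, via an arithmetic
   right shift of the sign bit.  This part of the idiom did not depend
   on -fwrapv.  */
int sign_mask (int x)
{
  return x >> (sizeof (int) * CHAR_BIT - 1);
}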
[gcc r15-3479] middle-end: have vect_recog_cond_store_pattern use pattern statement for cond if available
https://gcc.gnu.org/g:a50f54c0d06139d791b875e09471f2fc03af5b04 commit r15-3479-ga50f54c0d06139d791b875e09471f2fc03af5b04 Author: Tamar Christina Date: Thu Sep 5 10:36:55 2024 +0100 middle-end: have vect_recog_cond_store_pattern use pattern statement for cond if available When vectorizing a conditional operation we rely on the bool_recog pattern to hit and convert the bool of the operand to a valid mask. However we are currently not using the converted operand as this is in a pattern statement. This change updates it to look at the actual statement to be vectorized so we pick up the pattern. Note that there are no tests here since vectorization will fail until we correctly lower all boolean conditionals early. Tests for these are in the next patch, namely vect-conditional_store_5.c and vect-conditional_store_6.c. And the existing vect-conditional_store_[1-4].c checks that the other cases are still handled correctly. gcc/ChangeLog: * tree-vect-patterns.cc (vect_recog_cond_store_pattern): Use pattern statement. Diff: --- gcc/tree-vect-patterns.cc | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index 3162250bbdd..f7c3c623ea4 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -6670,7 +6670,15 @@ vect_recog_cond_store_pattern (vec_info *vinfo, if (TREE_CODE (st_rhs) != SSA_NAME) return NULL; - gassign *cond_stmt = dyn_cast (SSA_NAME_DEF_STMT (st_rhs)); + auto cond_vinfo = vinfo->lookup_def (st_rhs); + + /* If the condition isn't part of the loop then bool recog wouldn't have seen + it and so this transformation may not be valid. */ + if (!cond_vinfo) +return NULL; + + cond_vinfo = vect_stmt_to_vectorize (cond_vinfo); + gassign *cond_stmt = dyn_cast (STMT_VINFO_STMT (cond_vinfo)); if (!cond_stmt || gimple_assign_rhs_code (cond_stmt) != COND_EXPR) return NULL;
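A hypothetical loop of the shape this pattern handles (not one of the vect-conditional_store_5/6.c tests referenced above): the store is guarded by a boolean comparison, which vect_recog_bool_pattern rewrites into a proper vector mask in a pattern statement; the fix makes vect_recog_cond_store_pattern consult that pattern statement so it sees the converted operand.

/* Sketch only: the COND_EXPR feeding the store uses a boolean produced
   from a comparison, so the mask conversion lives in a pattern
   statement rather than on the original definition.  */
void foo (int *restrict a, int *restrict b, int *restrict c, int n)
{
  for (int i = 0; i < n; i++)
    {
      int res = c[i];
      if (a[i] > b[i])          /* boolean condition needing mask conversion */
        res = b[i];
      c[i] = res;               /* conditional store, MASK_STORE when vectorized */
    }
}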
[gcc r15-3518] middle-end: check that the lhs of a COND_EXPR is an SSA_NAME in cond_store recognition [PR116628]
https://gcc.gnu.org/g:2c4438d39156493b5b382eb48b1f884ca5ab7ed4 commit r15-3518-g2c4438d39156493b5b382eb48b1f884ca5ab7ed4 Author: Tamar Christina Date: Fri Sep 6 14:05:43 2024 +0100 middle-end: check that the lhs of a COND_EXPR is an SSA_NAME in cond_store recognition [PR116628] Because the vect_recog_bool_pattern can at the moment still transition out of GIMPLE and back into GENERIC the vect_recog_cond_store_pattern can end up using an expression as a mask rather than an SSA_NAME. This adds an explicit check that we have a mask and not an expression. gcc/ChangeLog: PR tree-optimization/116628 * tree-vect-patterns.cc (vect_recog_cond_store_pattern): Add SSA_NAME check on expression. gcc/testsuite/ChangeLog: PR tree-optimization/116628 * gcc.dg/vect/pr116628.c: New test. Diff: --- gcc/testsuite/gcc.dg/vect/pr116628.c | 14 ++ gcc/tree-vect-patterns.cc| 3 +++ 2 files changed, 17 insertions(+) diff --git a/gcc/testsuite/gcc.dg/vect/pr116628.c b/gcc/testsuite/gcc.dg/vect/pr116628.c new file mode 100644 index 000..4068c657ac5 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr116628.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_float } */ +/* { dg-require-effective-target vect_masked_store } */ +/* { dg-additional-options "-Ofast -march=armv9-a" { target aarch64-*-* } } */ + +typedef float c; +c a[2000], b[0]; +void d() { + for (int e = 0; e < 2000; e++) +if (b[e]) + a[e] = b[e]; +} + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index f7c3c623ea4..3a0d4cb7092 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -6685,6 +6685,9 @@ vect_recog_cond_store_pattern (vec_info *vinfo, /* Check if the else value matches the original loaded one. */ bool invert = false; tree cmp_ls = gimple_arg (cond_stmt, 0); + if (TREE_CODE (cmp_ls) != SSA_NAME) +return NULL; + tree cond_arg1 = gimple_arg (cond_stmt, 1); tree cond_arg2 = gimple_arg (cond_stmt, 2);
[gcc r15-1808] ivopts: fix wide_int_constant_multiple_p when VAL and DIV are 0. [PR114932]
https://gcc.gnu.org/g:25127123100f04c2d5d70c6933a5f5aedcd69c40 commit r15-1808-g25127123100f04c2d5d70c6933a5f5aedcd69c40 Author: Tamar Christina Date: Wed Jul 3 09:30:28 2024 +0100 ivopts: fix wide_int_constant_multiple_p when VAL and DIV are 0. [PR114932] wide_int_constant_multiple_p tries to check if for two tree expressions a and b that there is a multiplier which makes a == b * c. This code however seems to think that there's no c where a=0 and b=0 are equal which is of course wrong. This fixes it and also fixes the comment. gcc/ChangeLog: PR tree-optimization/114932 * tree-affine.cc (wide_int_constant_multiple_p): Support 0 and 0 being multiples. Diff: --- gcc/tree-affine.cc | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/gcc/tree-affine.cc b/gcc/tree-affine.cc index d6309c43903..76117aa4fd6 100644 --- a/gcc/tree-affine.cc +++ b/gcc/tree-affine.cc @@ -880,11 +880,11 @@ free_affine_expand_cache (hash_map **cache) *cache = NULL; } -/* If VAL != CST * DIV for any constant CST, returns false. - Otherwise, if *MULT_SET is true, additionally compares CST and MULT, - and if they are different, returns false. Finally, if neither of these - two cases occur, true is returned, and CST is stored to MULT and MULT_SET - is set to true. */ +/* If VAL == CST * DIV for any constant CST, returns true. + and if *MULT_SET is true, additionally compares CST and MULT + and if they are different, returns false. If true is returned, CST is + stored to MULT and MULT_SET is set to true unless VAL and DIV are both zero + in which case neither MULT nor MULT_SET are updated. */ static bool wide_int_constant_multiple_p (const poly_widest_int &val, @@ -895,6 +895,9 @@ wide_int_constant_multiple_p (const poly_widest_int &val, if (known_eq (val, 0)) { + if (known_eq (div, 0)) + return true; + if (*mult_set && maybe_ne (*mult, 0)) return false; *mult_set = true;
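A toy scalar model of the contract described in the new comment (toy_constant_multiple_p is an illustrative name, not GCC code): VAL == CST * DIV for some constant CST, with the 0/0 case now answering true and leaving MULT untouched, as the patch documents.

#include <stdbool.h>
#include <stdint.h>

/* Toy version of the documented contract: true iff there exists a
   constant cst with val == cst * div.  For val == 0 and div == 0 any
   cst works, so return true without updating *mult.  */
static bool toy_constant_multiple_p (int64_t val, int64_t div, int64_t *mult)
{
  if (val == 0)
    {
      if (div == 0)
        return true;            /* the case fixed by this patch */
      *mult = 0;                /* 0 == 0 * div */
      return true;
    }
  if (div == 0 || val % div != 0)
    return false;
  *mult = val / div;
  return true;
}

int main (void)
{
  int64_t m = 42;
  return !(toy_constant_multiple_p (0, 0, &m) && m == 42   /* 0/0: true, m untouched */
           && toy_constant_multiple_p (12, 3, &m) && m == 4);
}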
[gcc r15-1809] ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]
https://gcc.gnu.org/g:735edbf1e2479fa2323a2b4a9714fae1a0925f74 commit r15-1809-g735edbf1e2479fa2323a2b4a9714fae1a0925f74 Author: Tamar Christina Date: Wed Jul 3 09:31:09 2024 +0100 ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932] The current implementation of constant_multiple_of is doing a more limited version of aff_combination_constant_multiple_p. The only non-debug usage of constant_multiple_of will proceed with the values as affine trees. There is scope for further optimization here, namely I believe that if constant_multiple_of returns the aff_tree after the conversion then get_computation_aff_1 can use it instead of manually creating the aff_tree. However I think it makes sense to first commit this smaller change and then incrementally change things. gcc/ChangeLog: PR tree-optimization/114932 * tree-ssa-loop-ivopts.cc (constant_multiple_of): Use aff_combination_constant_multiple_p instead. Diff: --- gcc/tree-ssa-loop-ivopts.cc | 66 ++--- 1 file changed, 8 insertions(+), 58 deletions(-) diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc index 7cae5bdefea..c3218a3e8ee 100644 --- a/gcc/tree-ssa-loop-ivopts.cc +++ b/gcc/tree-ssa-loop-ivopts.cc @@ -2146,65 +2146,15 @@ idx_record_use (tree base, tree *idx, static bool constant_multiple_of (tree top, tree bot, widest_int *mul) { - tree mby; - enum tree_code code; - unsigned precision = TYPE_PRECISION (TREE_TYPE (top)); - widest_int res, p0, p1; - - STRIP_NOPS (top); - STRIP_NOPS (bot); - - if (operand_equal_p (top, bot, 0)) -{ - *mul = 1; - return true; -} - - code = TREE_CODE (top); - switch (code) -{ -case MULT_EXPR: - mby = TREE_OPERAND (top, 1); - if (TREE_CODE (mby) != INTEGER_CST) - return false; - - if (!constant_multiple_of (TREE_OPERAND (top, 0), bot, &res)) - return false; - - *mul = wi::sext (res * wi::to_widest (mby), precision); - return true; - -case PLUS_EXPR: -case MINUS_EXPR: - if (!constant_multiple_of (TREE_OPERAND (top, 0), bot, &p0) - || !constant_multiple_of (TREE_OPERAND (top, 1), bot, &p1)) - return false; - - if (code == MINUS_EXPR) - p1 = -p1; - *mul = wi::sext (p0 + p1, precision); - return true; - -case INTEGER_CST: - if (TREE_CODE (bot) != INTEGER_CST) - return false; - - p0 = widest_int::from (wi::to_wide (top), SIGNED); - p1 = widest_int::from (wi::to_wide (bot), SIGNED); - if (p1 == 0) - return false; - *mul = wi::sext (wi::divmod_trunc (p0, p1, SIGNED, &res), precision); - return res == 0; - -default: - if (POLY_INT_CST_P (top) - && POLY_INT_CST_P (bot) - && constant_multiple_p (wi::to_poly_widest (top), - wi::to_poly_widest (bot), mul)) - return true; + aff_tree aff_top, aff_bot; + tree_to_aff_combination (top, TREE_TYPE (top), &aff_top); + tree_to_aff_combination (bot, TREE_TYPE (bot), &aff_bot); + poly_widest_int poly_mul; + if (aff_combination_constant_multiple_p (&aff_top, &aff_bot, &poly_mul) + && poly_mul.is_constant (mul)) +return true; - return false; -} + return false; } /* Return true if memory reference REF with step STEP may be unaligned. */
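A worked example of the relation the affine check answers (the values are illustrative, not from the commit): both TOP and BOT become affine combinations, and aff_combination_constant_multiple_p succeeds when one constant scales every coefficient of BOT onto the matching coefficient of TOP.

#include <stdio.h>

/* Illustrative only: TOP = 8*i + 12, BOT = 2*i + 3.  Every coefficient
   of TOP is 4 times the matching coefficient of BOT, so MUL = 4.  */
int main (void)
{
  int top[2] = { 8, 12 }, bot[2] = { 2, 3 };
  int mul = top[0] / bot[0];
  int ok = (top[0] == mul * bot[0] && top[1] == mul * bot[1]);
  printf ("mul=%d ok=%d\n", mul, ok);   /* prints mul=4 ok=1 */
  return 0;
}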
[gcc r15-1841] c++ frontend: check for missing condition for novector [PR115623]
https://gcc.gnu.org/g:84acbfbecbdbc3fb2a395bd97e338b2b26fad374 commit r15-1841-g84acbfbecbdbc3fb2a395bd97e338b2b26fad374 Author: Tamar Christina Date: Thu Jul 4 11:01:55 2024 +0100 c++ frontend: check for missing condition for novector [PR115623] It looks like I forgot to check in the C++ frontend if a condition exist for the loop being adorned with novector. This causes a segfault because cond isn't expected to be null. This fixes it by issuing ignoring the pragma when there's no loop condition the same way we do in the C frontend. gcc/cp/ChangeLog: PR c++/115623 * semantics.cc (finish_for_cond): Add check for C++ cond. gcc/testsuite/ChangeLog: PR c++/115623 * g++.dg/vect/vect-novector-pragma_2.cc: New test. Diff: --- gcc/cp/semantics.cc | 2 +- gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc | 10 ++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc index 12d79bdbb3f..cd3df13772d 100644 --- a/gcc/cp/semantics.cc +++ b/gcc/cp/semantics.cc @@ -1510,7 +1510,7 @@ finish_for_cond (tree cond, tree for_stmt, bool ivdep, tree unroll, build_int_cst (integer_type_node, annot_expr_unroll_kind), unroll); - if (novector && cond != error_mark_node) + if (novector && cond && cond != error_mark_node) FOR_COND (for_stmt) = build3 (ANNOTATE_EXPR, TREE_TYPE (FOR_COND (for_stmt)), FOR_COND (for_stmt), diff --git a/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc b/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc new file mode 100644 index 000..d2a8eee8d71 --- /dev/null +++ b/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc @@ -0,0 +1,10 @@ +/* { dg-do compile } */ + +void f (char *a, int i) +{ +#pragma GCC novector + for (;;i++) +a[i] *= 2; +} + +
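For contrast with the new test, a sketch of the ordinary case (assumed typical usage, not part of the commit): when the for loop does have a condition, finish_for_cond wraps it in an ANNOTATE_EXPR carrying the no-vector request; the fix simply skips that wrapping when there is no condition to annotate, matching the C front end.

/* Hypothetical well-formed counterpart of the test above: a loop
   condition exists, so the novector annotation is attached to it.  */
void g (char *a, int n)
{
#pragma GCC novector
  for (int i = 0; i < n; i++)
    a[i] *= 2;
}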
[gcc r14-10378] c++ frontend: check for missing condition for novector [PR115623]
https://gcc.gnu.org/g:1742b699c31e3ac4dadbedb6036ee2498b569259 commit r14-10378-g1742b699c31e3ac4dadbedb6036ee2498b569259 Author: Tamar Christina Date: Thu Jul 4 11:01:55 2024 +0100 c++ frontend: check for missing condition for novector [PR115623] It looks like I forgot to check in the C++ frontend if a condition exist for the loop being adorned with novector. This causes a segfault because cond isn't expected to be null. This fixes it by issuing ignoring the pragma when there's no loop condition the same way we do in the C frontend. gcc/cp/ChangeLog: PR c++/115623 * semantics.cc (finish_for_cond): Add check for C++ cond. gcc/testsuite/ChangeLog: PR c++/115623 * g++.dg/vect/vect-novector-pragma_2.cc: New test. (cherry picked from commit 84acbfbecbdbc3fb2a395bd97e338b2b26fad374) Diff: --- gcc/cp/semantics.cc | 2 +- gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc | 10 ++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc index b18fc7c61be..ec741c0b203 100644 --- a/gcc/cp/semantics.cc +++ b/gcc/cp/semantics.cc @@ -1501,7 +1501,7 @@ finish_for_cond (tree cond, tree for_stmt, bool ivdep, tree unroll, build_int_cst (integer_type_node, annot_expr_unroll_kind), unroll); - if (novector && cond != error_mark_node) + if (novector && cond && cond != error_mark_node) FOR_COND (for_stmt) = build3 (ANNOTATE_EXPR, TREE_TYPE (FOR_COND (for_stmt)), FOR_COND (for_stmt), diff --git a/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc b/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc new file mode 100644 index 000..d2a8eee8d71 --- /dev/null +++ b/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc @@ -0,0 +1,10 @@ +/* { dg-do compile } */ + +void f (char *a, int i) +{ +#pragma GCC novector + for (;;i++) +a[i] *= 2; +} + +
[gcc r15-1842] testsuite: Update test for PR115537 to use SVE.
https://gcc.gnu.org/g:adcfb4fb8fb20a911c795312ff5f5284dba05275 commit r15-1842-gadcfb4fb8fb20a911c795312ff5f5284dba05275 Author: Tamar Christina Date: Thu Jul 4 11:19:20 2024 +0100 testsuite: Update test for PR115537 to use SVE . The PR was about SVE codegen, the testcase accidentally used neoverse-n1 instead of neoverse-v1 as was the original report. This updates the tool options. gcc/testsuite/ChangeLog: PR tree-optimization/115537 * gcc.dg/vect/pr115537.c: Update flag from neoverse-n1 to neoverse-v1. Diff: --- gcc/testsuite/gcc.dg/vect/pr115537.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr115537.c b/gcc/testsuite/gcc.dg/vect/pr115537.c index 99ed467feb8..9f7347a5f2a 100644 --- a/gcc/testsuite/gcc.dg/vect/pr115537.c +++ b/gcc/testsuite/gcc.dg/vect/pr115537.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-additional-options "-mcpu=neoverse-n1" { target aarch64*-*-* } } */ +/* { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } } */ char *a; int b;
[gcc r15-1855] AArch64: remove aarch64_simd_vec_unpack_lo_
https://gcc.gnu.org/g:6ff698106644af39da9e0eda51974fdcd111280d commit r15-1855-g6ff698106644af39da9e0eda51974fdcd111280d Author: Tamar Christina Date: Fri Jul 5 12:09:21 2024 +0100 AArch64: remove aarch64_simd_vec_unpack_lo_ The fix for PR18127 reworked the uxtl to zip optimization. In doing so it undid the changes in aarch64_simd_vec_unpack_lo_ and this now no longer matches aarch64_simd_vec_unpack_hi_. It still works because the RTL generated by aarch64_simd_vec_unpack_lo_ overlaps with the general zero extend RTL and so because that one is listed before the lo pattern recog picks it instead. This removes aarch64_simd_vec_unpack_lo_. gcc/ChangeLog: * config/aarch64/aarch64-simd.md (aarch64_simd_vec_unpack_lo_): Remove. (vec_unpack_lo__lo_" - [(set (match_operand: 0 "register_operand" "=w") -(ANY_EXTEND: (vec_select: - (match_operand:VQW 1 "register_operand" "w") - (match_operand:VQW 2 "vect_par_cnst_lo_half" "") - )))] - "TARGET_SIMD" - "xtl\t%0., %1." - [(set_attr "type" "neon_shift_imm_long")] -) - (define_insn_and_split "aarch64_simd_vec_unpack_hi_" [(set (match_operand: 0 "register_operand" "=w") (ANY_EXTEND: (vec_select: @@ -1952,14 +1941,11 @@ ) (define_expand "vec_unpack_lo_" - [(match_operand: 0 "register_operand") - (ANY_EXTEND: (match_operand:VQW 1 "register_operand"))] + [(set (match_operand: 0 "register_operand") + (ANY_EXTEND: (match_operand:VQW 1 "register_operand")))] "TARGET_SIMD" { -rtx p = aarch64_simd_vect_par_cnst_half (mode, , false); -emit_insn (gen_aarch64_simd_vec_unpack_lo_ (operands[0], - operands[1], p)); -DONE; +operands[1] = lowpart_subreg (mode, operands[1], mode); } ) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 6b106a72e49..469eb938953 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -23188,7 +23188,8 @@ aarch64_gen_shareable_zero (machine_mode mode) to split without that restriction and instead recombine shared zeros if they turn out not to be worthwhile. This would allow splits in single-block functions and would also cope more naturally with - rematerialization. */ + rematerialization. The downside of not doing this is that we lose the + optimizations for vector epilogues as well. */ bool aarch64_split_simd_shift_p (rtx_insn *insn)
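A generic loop shape (illustrative, not from the commit; the function name widen is assumed) that goes through the vec_unpack_lo/vec_unpack_hi expanders touched above: each narrow vector half is zero-extended to the wider element type, which after the PR18127 rework is emitted via the generic zero-extend/zip RTL rather than the removed lo pattern.

/* Sketch only: widening unsigned char to unsigned short makes the
   vectorizer emit VEC_UNPACK_LO_EXPR / VEC_UNPACK_HI_EXPR.  */
void widen (unsigned short *restrict dst, const unsigned char *restrict src,
            int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
}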
[gcc r15-1856] AArch64: lower 2 reg TBL permutes with one zero register to 1 reg TBL.
https://gcc.gnu.org/g:97fcfeac3dcc433b792711fd840b92fa3e860733 commit r15-1856-g97fcfeac3dcc433b792711fd840b92fa3e860733 Author: Tamar Christina Date: Fri Jul 5 12:10:39 2024 +0100 AArch64: lower 2 reg TBL permutes with one zero register to 1 reg TBL. When a two reg TBL is performed with one operand being a zero vector we can instead use a single reg TBL and map the indices for accessing the zero vector to an out of range constant. On AArch64 out of range indices into a TBL have a defined semantics of setting the element to zero. Many uArches have a slower 2-reg TBL than 1-reg TBL. Before this change we had: typedef unsigned int v4si __attribute__ ((vector_size (16))); v4si f1 (v4si a) { v4si zeros = {0,0,0,0}; return __builtin_shufflevector (a, zeros, 0, 5, 1, 6); } which generates: f1: mov v30.16b, v0.16b moviv31.4s, 0 adrpx0, .LC0 ldr q0, [x0, #:lo12:.LC0] tbl v0.16b, {v30.16b - v31.16b}, v0.16b ret .LC0: .byte 0 .byte 1 .byte 2 .byte 3 .byte 20 .byte 21 .byte 22 .byte 23 .byte 4 .byte 5 .byte 6 .byte 7 .byte 24 .byte 25 .byte 26 .byte 27 and with the patch: f1: adrpx0, .LC0 ldr q31, [x0, #:lo12:.LC0] tbl v0.16b, {v0.16b}, v31.16b ret .LC0: .byte 0 .byte 1 .byte 2 .byte 3 .byte -1 .byte -1 .byte -1 .byte -1 .byte 4 .byte 5 .byte 6 .byte 7 .byte -1 .byte -1 .byte -1 .byte -1 This sequence is generated often by openmp and aside from the strict performance impact of this change, it also gives better register allocation as we no longer have the consecutive register limitation. gcc/ChangeLog: * config/aarch64/aarch64.cc (struct expand_vec_perm_d): Add zero_op0_p and zero_op_p1. (aarch64_evpc_tbl): Implement register value remapping. (aarch64_vectorize_vec_perm_const): Detect if operand is a zero dup before it's forced to a reg. gcc/testsuite/ChangeLog: * gcc.target/aarch64/tbl_with_zero_1.c: New test. * gcc.target/aarch64/tbl_with_zero_2.c: New test. Diff: --- gcc/config/aarch64/aarch64.cc | 40 ++ gcc/testsuite/gcc.target/aarch64/tbl_with_zero_1.c | 40 ++ gcc/testsuite/gcc.target/aarch64/tbl_with_zero_2.c | 20 +++ 3 files changed, 94 insertions(+), 6 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 469eb938953..7f0cc47d0f0 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -25413,6 +25413,7 @@ struct expand_vec_perm_d unsigned int vec_flags; unsigned int op_vec_flags; bool one_vector_p; + bool zero_op0_p, zero_op1_p; bool testing_p; }; @@ -25909,13 +25910,38 @@ aarch64_evpc_tbl (struct expand_vec_perm_d *d) /* to_constant is safe since this routine is specific to Advanced SIMD vectors. */ unsigned int nelt = d->perm.length ().to_constant (); + + /* If one register is the constant vector of 0 then we only need + a one reg TBL and we map any accesses to the vector of 0 to -1. We can't + do this earlier since vec_perm_indices clamps elements to within range so + we can only do it during codegen. */ + if (d->zero_op0_p) +d->op0 = d->op1; + else if (d->zero_op1_p) +d->op1 = d->op0; + for (unsigned int i = 0; i < nelt; ++i) -/* If big-endian and two vectors we end up with a weird mixed-endian - mode on NEON. Reverse the index within each word but not the word - itself. to_constant is safe because we checked is_constant above. */ -rperm[i] = GEN_INT (BYTES_BIG_ENDIAN - ? d->perm[i].to_constant () ^ (nelt - 1) - : d->perm[i].to_constant ()); +{ + auto val = d->perm[i].to_constant (); + + /* If we're selecting from a 0 vector, we can just use an out of range +index instead. 
*/ + if ((d->zero_op0_p && val < nelt) || (d->zero_op1_p && val >= nelt)) + rperm[i] = constm1_rtx; + else + { + /* If we are remapping a zero register as the first parameter we need +to adjust the indices of the non-zero register. */ + if (d->zero_op0_p) + val = val % nelt; + + /* If big-endian and two vectors we end up with a
[gcc r15-2099] middle-end: fix 0 offset creation and folding [PR115936]
https://gcc.gnu.org/g:0135a90de5a99b51001b6152d8b548151ebfa1c3 commit r15-2099-g0135a90de5a99b51001b6152d8b548151ebfa1c3 Author: Tamar Christina Date: Wed Jul 17 16:22:14 2024 +0100 middle-end: fix 0 offset creation and folding [PR115936] As shown in PR115936 SCEV and IVOPTS create an invalidate IV when the IV is a pointer type: ivtmp.39_65 = ivtmp.39_59 + 0B; where the IVs are DI mode and the offset is a pointer. This comes from this weird candidate: Candidate 8: Var befor: ivtmp.39_59 Var after: ivtmp.39_65 Incr POS: before exit test IV struct: Type: sizetype Base: 0 Step: 0B Biv:N Overflowness wrto loop niter: No-overflow This IV was always created just ended up not being used. This is created by SCEV. simple_iv_with_niters in the case where no CHREC is found creates an IV with base == ev, offset == 0; however in this case EV is a POINTER_PLUS_EXPR and so the type is a pointer. it ends up creating an unusable expression. gcc/ChangeLog: PR tree-optimization/115936 * tree-scalar-evolution.cc (simple_iv_with_niters): Use sizetype for pointers. Diff: --- gcc/tree-scalar-evolution.cc | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/gcc/tree-scalar-evolution.cc b/gcc/tree-scalar-evolution.cc index 5aa95a2497a3..abb2bad77737 100644 --- a/gcc/tree-scalar-evolution.cc +++ b/gcc/tree-scalar-evolution.cc @@ -3243,7 +3243,11 @@ simple_iv_with_niters (class loop *wrto_loop, class loop *use_loop, if (tree_does_not_contain_chrecs (ev)) { iv->base = ev; - iv->step = build_int_cst (TREE_TYPE (ev), 0); + tree ev_type = TREE_TYPE (ev); + if (POINTER_TYPE_P (ev_type)) + ev_type = sizetype; + + iv->step = build_int_cst (ev_type, 0); iv->no_overflow = true; return true; }
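A generic pointer-walking loop (illustrative; the PR testcase itself is not quoted in the commit) of the kind where an evolution is pointer-typed: when simple_iv_with_niters falls back to base == ev, step == 0, building the zero step in the pointer type produces the invalid "pointer + 0B" candidate shown above, so the step is now built in sizetype instead.

/* Illustrative only: p's evolution is a pointer expression, so any zero
   step attached to it must be sizetype for POINTER_PLUS_EXPR folding to
   stay well-formed.  */
void copy_until (int *restrict dst, const int *restrict src, const int *end)
{
  for (const int *p = src; p != end; p++)
    *dst++ = *p;
}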
[gcc r15-2191] middle-end: Implement conditional store vectorizer pattern [PR115531]
https://gcc.gnu.org/g:af792f0226e479b165a49de5e8f9e1d16a4b26c0 commit r15-2191-gaf792f0226e479b165a49de5e8f9e1d16a4b26c0 Author: Tamar Christina Date: Mon Jul 22 10:26:14 2024 +0100 middle-end: Implement conditonal store vectorizer pattern [PR115531] This adds a conditional store optimization for the vectorizer as a pattern. The vectorizer already supports modifying memory accesses because of the pattern based gather/scatter recognition. Doing it in the vectorizer allows us to still keep the ability to vectorize such loops for architectures that don't have MASK_STORE support, whereas doing this in ifcvt makes us commit to MASK_STORE. Concretely for this loop: void foo1 (char *restrict a, int *restrict b, int *restrict c, int n, int stride) { if (stride <= 1) return; for (int i = 0; i < n; i++) { int res = c[i]; int t = b[i+stride]; if (a[i] != 0) res = t; c[i] = res; } } today we generate: .L3: ld1bz29.s, p7/z, [x0, x5] ld1wz31.s, p7/z, [x2, x5, lsl 2] ld1wz30.s, p7/z, [x1, x5, lsl 2] cmpne p15.b, p6/z, z29.b, #0 sel z30.s, p15, z30.s, z31.s st1wz30.s, p7, [x2, x5, lsl 2] add x5, x5, x4 whilelo p7.s, w5, w3 b.any .L3 which in gimple is: vect_res_18.9_68 = .MASK_LOAD (vectp_c.7_65, 32B, loop_mask_67); vect_t_20.12_74 = .MASK_LOAD (vectp.10_72, 32B, loop_mask_67); vect__9.15_77 = .MASK_LOAD (vectp_a.13_75, 8B, loop_mask_67); mask__34.16_79 = vect__9.15_77 != { 0, ... }; vect_res_11.17_80 = VEC_COND_EXPR ; .MASK_STORE (vectp_c.18_81, 32B, loop_mask_67, vect_res_11.17_80); A MASK_STORE is already conditional, so there's no need to perform the load of the old values and the VEC_COND_EXPR. This patch makes it so we generate: vect_res_18.9_68 = .MASK_LOAD (vectp_c.7_65, 32B, loop_mask_67); vect__9.15_77 = .MASK_LOAD (vectp_a.13_75, 8B, loop_mask_67); mask__34.16_79 = vect__9.15_77 != { 0, ... }; .MASK_STORE (vectp_c.18_81, 32B, mask__34.16_79, vect_res_18.9_68); which generates: .L3: ld1bz30.s, p7/z, [x0, x5] ld1wz31.s, p7/z, [x1, x5, lsl 2] cmpne p7.b, p7/z, z30.b, #0 st1wz31.s, p7, [x2, x5, lsl 2] add x5, x5, x4 whilelo p7.s, w5, w3 b.any .L3 gcc/ChangeLog: PR tree-optimization/115531 * tree-vect-patterns.cc (vect_cond_store_pattern_same_ref): New. (vect_recog_cond_store_pattern): New. (vect_vect_recog_func_ptrs): Use it. * target.def (conditional_operation_is_expensive): New. * doc/tm.texi: Regenerate. * doc/tm.texi.in: Document it. * targhooks.cc (default_conditional_operation_is_expensive): New. * targhooks.h (default_conditional_operation_is_expensive): New. Diff: --- gcc/doc/tm.texi | 7 ++ gcc/doc/tm.texi.in| 2 + gcc/target.def| 12 gcc/targhooks.cc | 8 +++ gcc/targhooks.h | 1 + gcc/tree-vect-patterns.cc | 159 ++ 6 files changed, 189 insertions(+) diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index f10d9a59c667..c7535d07f4dd 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -6449,6 +6449,13 @@ The default implementation returns a @code{MODE_VECTOR_INT} with the same size and number of elements as @var{mode}, if such a mode exists. @end deftypefn +@deftypefn {Target Hook} bool TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE (unsigned @var{ifn}) +This hook returns true if masked operation @var{ifn} (really of +type @code{internal_fn}) should be considered more expensive to use than +implementing the same operation without masking. GCC can then try to use +unconditional operations instead with extra selects. 
+@end deftypefn + @deftypefn {Target Hook} bool TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE (unsigned @var{ifn}) This hook returns true if masked internal function @var{ifn} (really of type @code{internal_fn}) should be considered expensive when the mask is diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 24596eb2f6b4..64cea3b1edaf 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4290,6 +4290,8 @@ address; but often a machine-dependent strategy can generate better code. @hook TARGET_VECTORIZE_GET_MASK_MODE +@hook TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE + @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE @hook TARGET_VECTORIZE_CREATE_COSTS diff --git a/gcc/target.def b/gcc/target.def index ce4d1ecd58be..3de1aad4c84d 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -2033,6 +2033,18 @@ sam
[gcc r15-2192] AArch64: implement TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE [PR115531].
https://gcc.gnu.org/g:0c5c0c959c2e592b84739f19ca771fa69eb8dfee commit r15-2192-g0c5c0c959c2e592b84739f19ca771fa69eb8dfee Author: Tamar Christina Date: Mon Jul 22 10:28:19 2024 +0100 AArch64: implement TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE [PR115531]. This implements the new target hook indicating that for AArch64 when possible we prefer masked operations for any type vs doing LOAD + SELECT or SELECT + STORE. Thanks, Tamar gcc/ChangeLog: PR tree-optimization/115531 * config/aarch64/aarch64.cc (aarch64_conditional_operation_is_expensive): New. (TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE): New. gcc/testsuite/ChangeLog: PR tree-optimization/115531 * gcc.dg/vect/vect-conditional_store_1.c: New test. * gcc.dg/vect/vect-conditional_store_2.c: New test. * gcc.dg/vect/vect-conditional_store_3.c: New test. * gcc.dg/vect/vect-conditional_store_4.c: New test. Diff: --- gcc/config/aarch64/aarch64.cc | 12 ++ .../gcc.dg/vect/vect-conditional_store_1.c | 24 +++ .../gcc.dg/vect/vect-conditional_store_2.c | 24 +++ .../gcc.dg/vect/vect-conditional_store_3.c | 24 +++ .../gcc.dg/vect/vect-conditional_store_4.c | 28 ++ 5 files changed, 112 insertions(+) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 0d41a193ec18..89eb66348f77 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -28211,6 +28211,15 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load, return true; } +/* Implement TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE. Assume that + predicated operations when available are beneficial. */ + +static bool +aarch64_conditional_operation_is_expensive (unsigned) +{ + return false; +} + /* Implement TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE. Assume for now that it isn't worth branching around empty masked ops (including masked stores). 
*/ @@ -30898,6 +30907,9 @@ aarch64_libgcc_floating_mode_supported_p #define TARGET_VECTORIZE_RELATED_MODE aarch64_vectorize_related_mode #undef TARGET_VECTORIZE_GET_MASK_MODE #define TARGET_VECTORIZE_GET_MASK_MODE aarch64_get_mask_mode +#undef TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE +#define TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE \ + aarch64_conditional_operation_is_expensive #undef TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE #define TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE \ aarch64_empty_mask_is_expensive diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c new file mode 100644 index ..03128b1f19b2 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_masked_store } */ + +/* { dg-additional-options "-mavx2" { target avx2 } } */ +/* { dg-additional-options "-march=armv9-a" { target aarch64-*-* } } */ + +void foo1 (char *restrict a, int *restrict b, int *restrict c, int n, int stride) +{ + if (stride <= 1) +return; + + for (int i = 0; i < n; i++) +{ + int res = c[i]; + int t = b[i+stride]; + if (a[i] != 0) +res = t; + c[i] = res; +} +} + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ +/* { dg-final { scan-tree-dump-not "VEC_COND_EXPR " "vect" { target aarch64-*-* } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_2.c b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_2.c new file mode 100644 index ..a03898793c0b --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_2.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_masked_store } */ + +/* { dg-additional-options "-mavx2" { target avx2 } } */ +/* { dg-additional-options "-march=armv9-a" { target aarch64-*-* } } */ + +void foo2 (char *restrict a, int *restrict b, int *restrict c, int n, int stride) +{ + if (stride <= 1) +return; + + for (int i = 0; i < n; i++) +{ + int res = c[i]; + int t = b[i+stride]; + if (a[i] != 0) +t = res; + c[i] = t; +} +} + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ +/* { dg-final { scan-tree-dump-not "VEC_COND_EXPR " "vect" { target aarch64-*-* } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_3.c b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_3.c new file mode 100644 index ..8a898755c1ca --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_3.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg
[gcc r14-9493] match.pd: Only merge truncation with conversion for -fno-signed-zeros
https://gcc.gnu.org/g:7dd3b2b09cbeb6712ec680a0445cb0ad41070423 commit r14-9493-g7dd3b2b09cbeb6712ec680a0445cb0ad41070423 Author: Joe Ramsay Date: Fri Mar 15 09:20:45 2024 + match.pd: Only merge truncation with conversion for -fno-signed-zeros This optimisation does not honour signed zeros, so should not be enabled except with -fno-signed-zeros. gcc/ChangeLog: * match.pd: Fix truncation pattern for -fno-signed-zeroes gcc/testsuite/ChangeLog: * gcc.target/aarch64/no_merge_trunc_signed_zero.c: New test. Diff: --- gcc/match.pd | 1 + .../aarch64/no_merge_trunc_signed_zero.c | 24 ++ 2 files changed, 25 insertions(+) diff --git a/gcc/match.pd b/gcc/match.pd index 9ce313323a3..15a1e7350d4 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -4858,6 +4858,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (simplify (float (fix_trunc @0)) (if (!flag_trapping_math + && !HONOR_SIGNED_ZEROS (type) && types_match (type, TREE_TYPE (@0)) && direct_internal_fn_supported_p (IFN_TRUNC, type, OPTIMIZE_FOR_BOTH)) diff --git a/gcc/testsuite/gcc.target/aarch64/no_merge_trunc_signed_zero.c b/gcc/testsuite/gcc.target/aarch64/no_merge_trunc_signed_zero.c new file mode 100644 index 000..b2c93e55567 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/no_merge_trunc_signed_zero.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fno-trapping-math -fsigned-zeros" } */ + +#include + +float +f1 (float x) +{ + return (int) rintf(x); +} + +double +f2 (double x) +{ + return (long) rint(x); +} + +/* { dg-final { scan-assembler "frintx\\ts\[0-9\]+, s\[0-9\]+" } } */ +/* { dg-final { scan-assembler "cvtzs\\ts\[0-9\]+, s\[0-9\]+" } } */ +/* { dg-final { scan-assembler "scvtf\\ts\[0-9\]+, s\[0-9\]+" } } */ +/* { dg-final { scan-assembler "frintx\\td\[0-9\]+, d\[0-9\]+" } } */ +/* { dg-final { scan-assembler "cvtzs\\td\[0-9\]+, d\[0-9\]+" } } */ +/* { dg-final { scan-assembler "scvtf\\td\[0-9\]+, d\[0-9\]+" } } */ +
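A small demonstrator of why the fold is unsafe with signed zeros (an assumed example, not part of the commit): for negative inputs that truncate to zero, the integer round-trip yields +0.0 while trunc yields -0.0, so rewriting (float)(fix_trunc x) into IFN_TRUNC would change the sign of zero.

/* Compile without -ffast-math / -fno-signed-zeros to see the difference. */
#include <math.h>
#include <stdio.h>

int main (void)
{
  double x = -0.25;
  double via_int = (double) (long) x;   /* (long) -0.25 == 0, so +0.0 */
  double via_trunc = trunc (x);         /* IEEE trunc of -0.25 is -0.0 */
  printf ("%+g %+g\n", copysign (1.0, via_int), copysign (1.0, via_trunc));
  return 0;
}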
[gcc r14-9969] middle-end: adjust loop upper bounds when peeling for gaps and early break [PR114403].
https://gcc.gnu.org/g:85002f8085c25bb3e74ab013581a74e7c7ae006b commit r14-9969-g85002f8085c25bb3e74ab013581a74e7c7ae006b Author: Tamar Christina Date: Mon Apr 15 12:06:21 2024 +0100 middle-end: adjust loop upper bounds when peeling for gaps and early break [PR114403]. This fixes a bug with the interaction between peeling for gaps and early break. Before I go further, I'll first explain how I understand this to work for loops with a single exit. When peeling for gaps we peel N < VF iterations to scalar. This happens by removing N iterations from the calculation of niters such that vect_iters * VF == niters is always false. In other words, when we exit the vector loop we always fall to the scalar loop. The loop bounds adjustment guarantees this. Because of this we potentially execute a vector loop iteration less. That is, if you're at the boundary condition where niters % VF by peeling one or more scalar iterations the vector loop executes one less. This is accounted for by the adjustments in vect_transform_loops. This adjustment happens differently based on whether the the vector loop can be partial or not: Peeling for gaps sets the bias to 0 and then: when not partial: we take the floor of (scalar_upper_bound / VF) - 1 to get the vector latch iteration count. when loop is partial: For a single exit this means the loop is masked, we take the ceil to account for the fact that the loop can handle the final partial iteration using masking. Note that there's no difference between ceil an floor on the boundary condition. There is a difference however when you're slightly above it. i.e. if scalar iterates 14 times and VF = 4 and we peel 1 iteration for gaps. The partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations. and in effect the partial iteration is ignored and it's done as scalar. This is fine because the niters modification has capped the vector iteration at 2. So that when we reduce the induction values you end up entering the scalar code with ind_var.2 = ind_var.1 + 2 * VF. Now lets look at early breaks. To make it esier I'll focus on the specific testcase: char buffer[64]; __attribute__ ((noipa)) buff_t *copy (buff_t *first, buff_t *last) { char *buffer_ptr = buffer; char *const buffer_end = &buffer[SZ-1]; int store_size = sizeof(first->Val); while (first != last && (buffer_ptr + store_size) <= buffer_end) { const char *value_data = (const char *)(&first->Val); __builtin_memcpy(buffer_ptr, value_data, store_size); buffer_ptr += store_size; ++first; } if (first == last) return 0; return first; } Here the first, early exit is on the condition: (buffer_ptr + store_size) <= buffer_end and the main exit is on condition: first != last This is important, as this bug only manifests itself when the first exit has a known constant iteration count that's lower than the latch exit count. because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up processing 16 bytes per iteration. So the exit has a known bounds of 8 + 1. The vectorizer correctly analizes this: Statement (exit)if (ivtmp_21 != 0) is executed at most 8 (bounded by 8) + 1 times in loop 1. and as a consequence the IV is bound by 9: # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)> ... 
vect_ivtmp_21.16_124 = vect_vec_iv_.14_117 + { 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615 }; mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 }; if (mask_patt_22.17_126 == { -1, -1, -1, -1 }) goto ; [88.89%] else goto ; [11.11%] The imporant bits are this: In this example the value of last - first = 416. the calculated vector iteration count, is: x = (((ptr2 - ptr1) - 16) / 16) + 1 = 27 the bounds generated, adjusting for gaps: x == (((x - 1) >> 2) << 2) which means we'll always fall through to the scalar code. as intended. Here are two key things to note: 1. In this loop, the early exit will always be the one taken. When it's taken we enter the scalar loop with the correct induction value to apply the gap peeling. 2. If the main exit is taken, the induction values assumes you've finished all vector iterations. i.e. it assumes you have completed 24 iterations, as we treat the main exit the same for normal loop vect and early break when not PEELED. This means the induction value is adjusted to ind_
[gcc r13-8604] AArch64: Do not allow SIMD clones with simdlen 1 [PR113552]
https://gcc.gnu.org/g:1e08e39c743692afdd5d3546b2223474beac1dbc commit r13-8604-g1e08e39c743692afdd5d3546b2223474beac1dbc Author: Tamar Christina Date: Mon Apr 15 12:11:48 2024 +0100 AArch64: Do not allow SIMD clones with simdlen 1 [PR113552] This is a backport of g:306713c953d509720dc394c43c0890548bb0ae07. The AArch64 vector PCS does not allow simd calls with simdlen 1, however due to a bug we currently do allow it for num == 0. This causes us to emit a symbol that doesn't exist and we fail to link. gcc/ChangeLog: PR tree-optimization/113552 * config/aarch64/aarch64.cc (aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1. gcc/testsuite/ChangeLog: PR tree-optimization/113552 * gcc.target/aarch64/pr113552.c: New test. * gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check. Diff: --- gcc/config/aarch64/aarch64.cc | 16 +--- gcc/testsuite/gcc.target/aarch64/pr113552.c | 17 + gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c | 4 ++-- 3 files changed, 32 insertions(+), 5 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index f6d14cd791a..b8a4ab1b980 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -27029,7 +27029,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, bool explicit_p) { tree t, ret_type; - unsigned int elt_bits, count; + unsigned int elt_bits, count = 0; unsigned HOST_WIDE_INT const_simdlen; poly_uint64 vec_bits; @@ -27102,8 +27102,17 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, elt_bits = GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type)); if (known_eq (clonei->simdlen, 0U)) { - count = 2; - vec_bits = (num == 0 ? 64 : 128); + /* We don't support simdlen == 1. */ + if (known_eq (elt_bits, 64)) + { + count = 1; + vec_bits = 128; + } + else + { + count = 2; + vec_bits = (num == 0 ? 
64 : 128); + } clonei->simdlen = exact_div (vec_bits, elt_bits); } else @@ -27123,6 +27132,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, return 0; } } + clonei->vecsize_int = vec_bits; clonei->vecsize_float = vec_bits; return count; diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c b/gcc/testsuite/gcc.target/aarch64/pr113552.c new file mode 100644 index 000..9c96b061ed2 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -march=armv8-a" } */ + +__attribute__ ((__simd__ ("notinbranch"), const)) +double cos (double); + +void foo (float *a, double *b) +{ +for (int i = 0; i < 12; i+=3) + { +b[i] = cos (5.0 * a[i]); +b[i+1] = cos (5.0 * a[i+1]); +b[i+2] = cos (5.0 * a[i+2]); + } +} + +/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c index 95f6a6803e8..c6dac6b104c 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c +++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c @@ -18,7 +18,7 @@ double foo(double x) } /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnM1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnN1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */
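For readers decoding the symbols in the tests above, the names follow the AArch64 vector function ABI mangling (this decoding comes from that ABI, not from the commit itself); the simdlen-1 variants are exactly the clones the vector PCS does not allow, which is why the scan-assembler checks now reject them.

/* Name layout (AArch64 vector function ABI, stated here as background):
     _ZGV <isa> <mask> <simdlen> <params> _ <scalar-name>
   so _ZGVnN2v_cos means: 'n' Advanced SIMD, 'N' unmasked (notinbranch),
   simdlen 2, one vector parameter 'v', of the scalar function cos.
   A simdlen-1 clone such as _ZGVnN1v_foo has no defined vector-PCS
   calling convention, which is why the tests assert it is absent.  */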
[gcc r12-10329] AArch64: Do not allow SIMD clones with simdlen 1 [PR113552]
https://gcc.gnu.org/g:642cfd049780f03335da9fe0a51415f130232334 commit r12-10329-g642cfd049780f03335da9fe0a51415f130232334 Author: Tamar Christina Date: Mon Apr 15 12:16:53 2024 +0100 AArch64: Do not allow SIMD clones with simdlen 1 [PR113552] This is a backport of g:306713c953d509720dc394c43c0890548bb0ae07. The AArch64 vector PCS does not allow simd calls with simdlen 1, however due to a bug we currently do allow it for num == 0. This causes us to emit a symbol that doesn't exist and we fail to link. gcc/ChangeLog: PR tree-optimization/113552 * config/aarch64/aarch64.cc (aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1. gcc/testsuite/ChangeLog: PR tree-optimization/113552 * gcc.target/aarch64/pr113552.c: New test. * gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check. Diff: --- gcc/config/aarch64/aarch64.cc | 16 +--- gcc/testsuite/gcc.target/aarch64/pr113552.c | 17 + gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c | 4 ++-- 3 files changed, 32 insertions(+), 5 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 2bbba323770..96976abdbf4 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -26898,7 +26898,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, tree base_type, int num) { tree t, ret_type; - unsigned int elt_bits, count; + unsigned int elt_bits, count = 0; unsigned HOST_WIDE_INT const_simdlen; poly_uint64 vec_bits; @@ -26966,8 +26966,17 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, elt_bits = GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type)); if (known_eq (clonei->simdlen, 0U)) { - count = 2; - vec_bits = (num == 0 ? 64 : 128); + /* We don't support simdlen == 1. */ + if (known_eq (elt_bits, 64)) + { + count = 1; + vec_bits = 128; + } + else + { + count = 2; + vec_bits = (num == 0 ? 
64 : 128); + } clonei->simdlen = exact_div (vec_bits, elt_bits); } else @@ -26985,6 +26994,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, return 0; } } + clonei->vecsize_int = vec_bits; clonei->vecsize_float = vec_bits; return count; diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c b/gcc/testsuite/gcc.target/aarch64/pr113552.c new file mode 100644 index 000..9c96b061ed2 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -march=armv8-a" } */ + +__attribute__ ((__simd__ ("notinbranch"), const)) +double cos (double); + +void foo (float *a, double *b) +{ +for (int i = 0; i < 12; i+=3) + { +b[i] = cos (5.0 * a[i]); +b[i+1] = cos (5.0 * a[i+1]); +b[i+2] = cos (5.0 * a[i+2]); + } +} + +/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c index 95f6a6803e8..c6dac6b104c 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c +++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c @@ -18,7 +18,7 @@ double foo(double x) } /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnM1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnN1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */
[gcc r11-11323] [AArch64]: Do not allow SIMD clones with simdlen 1 [PR113552]
https://gcc.gnu.org/g:0c2fcf3ddfe93d1f403962c4bacbb5d55ab7d19d commit r11-11323-g0c2fcf3ddfe93d1f403962c4bacbb5d55ab7d19d Author: Tamar Christina Date: Mon Apr 15 12:32:24 2024 +0100 [AArch64]: Do not allow SIMD clones with simdlen 1 [PR113552] This is a backport of g:306713c953d509720dc394c43c0890548bb0ae07. The AArch64 vector PCS does not allow simd calls with simdlen 1, however due to a bug we currently do allow it for num == 0. This causes us to emit a symbol that doesn't exist and we fail to link. gcc/ChangeLog: PR tree-optimization/113552 * config/aarch64/aarch64.c (aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1. gcc/testsuite/ChangeLog: PR tree-optimization/113552 * gcc.target/aarch64/pr113552.c: New test. * gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check. Diff: --- gcc/config/aarch64/aarch64.c | 18 ++ gcc/testsuite/gcc.target/aarch64/pr113552.c| 17 + .../gcc.target/aarch64/simd_pcs_attribute-3.c | 4 ++-- 3 files changed, 33 insertions(+), 6 deletions(-) diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 9bbbc5043af..4df72339952 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -25556,7 +25556,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, tree base_type, int num) { tree t, ret_type; - unsigned int elt_bits, count; + unsigned int elt_bits, count = 0; unsigned HOST_WIDE_INT const_simdlen; poly_uint64 vec_bits; @@ -25624,11 +25624,20 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, elt_bits = GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type)); if (known_eq (clonei->simdlen, 0U)) { - count = 2; - vec_bits = (num == 0 ? 64 : 128); + /* We don't support simdlen == 1. */ + if (known_eq (elt_bits, 64)) + { + count = 1; + vec_bits = 128; + } + else + { + count = 2; + vec_bits = (num == 0 ? 
64 : 128); + } clonei->simdlen = exact_div (vec_bits, elt_bits); } - else + else if (maybe_ne (clonei->simdlen, 1U)) { count = 1; vec_bits = clonei->simdlen * elt_bits; @@ -25643,6 +25652,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, return 0; } } + clonei->vecsize_int = vec_bits; clonei->vecsize_float = vec_bits; return count; diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c b/gcc/testsuite/gcc.target/aarch64/pr113552.c new file mode 100644 index 000..9c96b061ed2 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -march=armv8-a" } */ + +__attribute__ ((__simd__ ("notinbranch"), const)) +double cos (double); + +void foo (float *a, double *b) +{ +for (int i = 0; i < 12; i+=3) + { +b[i] = cos (5.0 * a[i]); +b[i+1] = cos (5.0 * a[i+1]); +b[i+2] = cos (5.0 * a[i+2]); + } +} + +/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c index 95f6a6803e8..c6dac6b104c 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c +++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c @@ -18,7 +18,7 @@ double foo(double x) } /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnM1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnN1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */
[gcc r14-9997] testsuite: Fix data check loop on vect-early-break_124-pr114403.c
https://gcc.gnu.org/g:f438acf7ce2e6cb862cf62f2543c36639e2af233 commit r14-9997-gf438acf7ce2e6cb862cf62f2543c36639e2af233 Author: Tamar Christina Date: Tue Apr 16 20:56:26 2024 +0100 testsuite: Fix data check loop on vect-early-break_124-pr114403.c The testcase had the wrong indices in the buffer check loop. gcc/testsuite/ChangeLog: PR tree-optimization/114403 * gcc.dg/vect/vect-early-break_124-pr114403.c: Fix check loop. Diff: --- gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c index 1751296ab81..51abf245ccb 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c @@ -68,8 +68,8 @@ int main () int store_size = sizeof(PV); #pragma GCC novector - for (int i = 0; i < NUM - 1; i+=store_size) -if (0 != __builtin_memcmp (buffer+i, (char*)&tmp[i].Val, store_size)) + for (int i = 0; i < NUM - 1; i++) +if (0 != __builtin_memcmp (buffer+(i*store_size), (char*)&tmp[i].Val, store_size)) __builtin_abort (); return 0;
[gcc r14-10014] AArch64: remove reliance on register allocator for simd/gpreg costing. [PR114741]
https://gcc.gnu.org/g:a2f4be3dae04fa8606d1cc8451f0b9d450f7e6e6 commit r14-10014-ga2f4be3dae04fa8606d1cc8451f0b9d450f7e6e6 Author: Tamar Christina Date: Thu Apr 18 11:47:42 2024 +0100 AArch64: remove reliance on register allocator for simd/gpreg costing. [PR114741] In PR114741 we see that we have a regression in codegen when SVE is enable where the simple testcase: void foo(unsigned v, unsigned *p) { *p = v & 1; } generates foo: fmovs31, w0 and z31.s, z31.s, #1 str s31, [x1] ret instead of: foo: and w0, w0, 1 str w0, [x1] ret This causes an impact it not just codesize but also performance. This is caused by the use of the ^ constraint modifier in the pattern 3. The documentation states that this modifier should only have an effect on the alternative costing in that a particular alternative is to be preferred unless a non-psuedo reload is needed. The pattern was trying to convey that whenever both r and w are required, that it should prefer r unless a reload is needed. This is because if a reload is needed then we can construct the constants more flexibly on the SIMD side. We were using this so simplify the implementation and to get generic cases such as: double negabs (double x) { unsigned long long y; memcpy (&y, &x, sizeof(double)); y = y | (1UL << 63); memcpy (&x, &y, sizeof(double)); return x; } which don't go through an expander. However the implementation of ^ in the register allocator is not according to the documentation in that it also has an effect during coloring. During initial register class selection it applies a penalty to a class, similar to how ? does. In this example the penalty makes the use of GP regs expensive enough that it no longer considers them: r106: preferred FP_REGS, alternative NO_REGS, allocno FP_REGS ;;3--> b 0: i 9 r106=r105&0x1 :cortex_a53_slot_any:GENERAL_REGS+0(-1)FP_REGS+1(1)PR_LO_REGS+0(0) PR_HI_REGS+0(0):model 4 which is not the expected behavior. For GCC 14 this is a conservative fix. 1. we remove the ^ modifier from the logical optabs. 2. In order not to regress copysign we then move the copysign expansion to directly use the SIMD variant. Since copysign only supports floating point modes this is fine and no longer relies on the register allocator to select the right alternative. It once again regresses the general case, but this case wasn't optimized in earlier GCCs either so it's not a regression in GCC 14. This change gives strict better codegen than earlier GCCs and still optimizes the important cases. gcc/ChangeLog: PR target/114741 * config/aarch64/aarch64.md (3): Remove ^ from alt 2. (copysign3): Use SIMD version of IOR directly. gcc/testsuite/ChangeLog: PR target/114741 * gcc.target/aarch64/fneg-abs_2.c: Update codegen. * gcc.target/aarch64/fneg-abs_4.c: xfail for now. * gcc.target/aarch64/pr114741.c: New test. Diff: --- gcc/config/aarch64/aarch64.md | 23 + gcc/testsuite/gcc.target/aarch64/fneg-abs_2.c | 5 ++--- gcc/testsuite/gcc.target/aarch64/fneg-abs_4.c | 4 ++-- gcc/testsuite/gcc.target/aarch64/pr114741.c | 29 +++ 4 files changed, 48 insertions(+), 13 deletions(-) diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 385a669b9b3..dbde066f747 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -4811,7 +4811,7 @@ "" {@ [ cons: =0 , 1 , 2; attrs: type , arch ] [ r, %r , r; logic_reg , * ] \t%0, %1, %2 - [ rk , ^r , ; logic_imm , * ] \t%0, %1, %2 + [ rk , r , ; logic_imm , * ] \t%0, %1, %2 [ w, 0 , ; * , sve ] \t%Z0., %Z0., #%2 [ w, w , w; neon_logic , simd ] \t%0., %1., %2. 
} @@ -7192,22 +7192,29 @@ (match_operand:GPF 2 "nonmemory_operand")] "TARGET_SIMD" { - machine_mode int_mode = mode; - rtx bitmask = gen_reg_rtx (int_mode); - emit_move_insn (bitmask, GEN_INT (HOST_WIDE_INT_M1U - << (GET_MODE_BITSIZE (mode) - 1))); + rtx signbit_const = GEN_INT (HOST_WIDE_INT_M1U + << (GET_MODE_BITSIZE (mode) - 1)); /* copysign (x, -1) should instead be expanded as orr with the sign bit. */ rtx op2_elt = unwrap_const_vec_duplicate (operands[2]); if (GET_CODE (op2_elt) == CONST_DOUBLE && real_isneg (CONST_DOUBLE_REAL_VALUE (op2_e
[gcc r14-10040] middle-end: refactor vect_recog_absolute_difference to simplify flow [PR114769]
https://gcc.gnu.org/g:1216460e7023cd8ec49933866107417c70e933c9 commit r14-10040-g1216460e7023cd8ec49933866107417c70e933c9 Author: Tamar Christina Date: Fri Apr 19 15:22:13 2024 +0100 middle-end: refactory vect_recog_absolute_difference to simplify flow [PR114769] Hi All, As the reporter in PR114769 points out the control flow for the abd detection is hard to follow. This is because vect_recog_absolute_difference has two different ways it can return true. 1. It can return true when the widening operation is matched, in which case unprom is set, half_type is not NULL and diff_stmt is not set. 2. It can return true when the widening operation is not matched, but the stmt being checked is a minus. In this case unprom is not set, half_type is set to NULL and diff_stmt is set. This because to get to diff_stmt you have to dig through the abs statement and any possible promotions. This however leads to complicated uses of the function at the call sites as the exact semantic needs to be known to use it safely. vect_recog_absolute_difference has two callers: 1. vect_recog_sad_pattern where if you return true with unprom not set, then *half_type will be NULL. The call to vect_supportable_direct_optab_p will always reject it since there's no vector mode for NULL. Note that if looking at the dump files, the convention in the dump files have always been that we first indicate that a pattern could possibly be recognize and then check that it's supported. This change somewhat incorrectly makes the diagnostic message get printed for "invalid" patterns. 2. vect_recog_abd_pattern, where if half_type is NULL, it then uses diff_stmt to set them. This refactors the code, it now only has 1 success condition, and diff_stmt is always set to the minus statement in the abs if there is one. The function now only returns success if the widening minus is found, in which case unprom and half_type set. This then leaves it up to the caller to decide if they want to do anything with diff_stmt. Thanks, Tamar gcc/ChangeLog: PR tree-optimization/114769 * tree-vect-patterns.cc: (vect_recog_absolute_difference): Have only one success condition. (vect_recog_abd_pattern): Handle further checks if vect_recog_absolute_difference fails. Diff: --- gcc/tree-vect-patterns.cc | 43 --- 1 file changed, 16 insertions(+), 27 deletions(-) diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index 4f491c6b833..87c2acff386 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -797,8 +797,7 @@ vect_split_statement (vec_info *vinfo, stmt_vec_info stmt2_info, tree new_rhs, HALF_TYPE and UNPROM will be set should the statement be found to be a widened operation. DIFF_STMT will be set to the MINUS_EXPR - statement that precedes the ABS_STMT unless vect_widened_op_tree - succeeds. + statement that precedes the ABS_STMT if it is a MINUS_EXPR.. */ static bool vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt, @@ -843,6 +842,12 @@ vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt, if (!diff_stmt_vinfo) return false; + gassign *diff = dyn_cast (STMT_VINFO_STMT (diff_stmt_vinfo)); + if (diff_stmt && diff + && gimple_assign_rhs_code (diff) == MINUS_EXPR + && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (abs_oprnd))) +*diff_stmt = diff; + /* FORNOW. Can continue analyzing the def-use chain when this stmt in a phi inside the loop (in case we are analyzing an outer-loop). 
*/ if (vect_widened_op_tree (vinfo, diff_stmt_vinfo, @@ -850,17 +855,6 @@ vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt, false, 2, unprom, half_type)) return true; - /* Failed to find a widen operation so we check for a regular MINUS_EXPR. */ - gassign *diff = dyn_cast (STMT_VINFO_STMT (diff_stmt_vinfo)); - if (diff_stmt && diff - && gimple_assign_rhs_code (diff) == MINUS_EXPR - && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (abs_oprnd))) -{ - *diff_stmt = diff; - *half_type = NULL_TREE; - return true; -} - return false; } @@ -1499,27 +1493,22 @@ vect_recog_abd_pattern (vec_info *vinfo, tree out_type = TREE_TYPE (gimple_assign_lhs (last_stmt)); vect_unpromoted_value unprom[2]; - gassign *diff_stmt; - tree half_type; - if (!vect_recog_absolute_difference (vinfo, last_stmt, &half_type, + gassign *diff_stmt = NULL; + tree abd_in_type; + if (!vect_recog_absolute_difference (vinfo, last_stmt, &abd_in_type, unprom, &diff_stmt)) -return NULL; - - tree abd_in_type, abd_out_type; - -
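For orientation, a hedged example (mine, not taken from the patch or its tests) of the kind of source loop these recognisers target; the widening subtraction feeding the ABS_EXPR is the diff_stmt that the refactored function now always reports, leaving the caller to decide whether to use it.

// Hypothetical input loop for the ABD/SAD pattern recognisers.
#include <cstdint>
#include <cstdlib>

void
abd_loop (std::uint8_t *__restrict out, const std::uint8_t *__restrict a,
          const std::uint8_t *__restrict b, int n)
{
  for (int i = 0; i < n; ++i)
    {
      int diff = a[i] - b[i];      // widened MINUS_EXPR: the diff_stmt
      out[i] = std::abs (diff);    // ABS_EXPR wrapping the difference
    }
}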
[gcc r15-2336] middle-end: check for vector mode before calling get_mask_mode [PR116074]
https://gcc.gnu.org/g:29e4e4bdb674118b898d50ce7751c183aa0a44ee commit r15-2336-g29e4e4bdb674118b898d50ce7751c183aa0a44ee Author: Tamar Christina Date: Fri Jul 26 13:02:53 2024 +0100 middle-end: check for vector mode before calling get_mask_mode [PR116074] For historical reasons AArch64 has TI mode vector types but does not consider TImode a vector mode. What's happening in the PR is that get_vectype_for_scalar_type is returning vector(1) TImode for a TImode scalar. This then fails when we call targetm.vectorize.get_mask_mode (vecmode).exists (&) on the TYPE_MODE. This checks for vector mode before using the results of get_vectype_for_scalar_type. gcc/ChangeLog: PR target/116074 * tree-vect-patterns.cc (vect_recog_cond_store_pattern): Check vector mode. gcc/testsuite/ChangeLog: PR target/116074 * g++.target/aarch64/pr116074.C: New test. Diff: --- gcc/testsuite/g++.target/aarch64/pr116074.C | 24 gcc/tree-vect-patterns.cc | 3 ++- 2 files changed, 26 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/g++.target/aarch64/pr116074.C b/gcc/testsuite/g++.target/aarch64/pr116074.C new file mode 100644 index ..54cf561510c4 --- /dev/null +++ b/gcc/testsuite/g++.target/aarch64/pr116074.C @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O3" } */ + +int m[40]; + +template struct j { + int length; + k *e; + void operator[](int) { +if (length) + __builtin___memcpy_chk(m, m+3, sizeof (k), -1); + } +}; + +j> o; + +int *q; + +void ao(int i) { + for (; i > 0; i--) { +o[1]; +*q = 1; + } +} diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index b0821c74c1d8..5fbd1a4fa6b4 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -6624,7 +6624,8 @@ vect_recog_cond_store_pattern (vec_info *vinfo, machine_mode mask_mode; machine_mode vecmode = TYPE_MODE (vectype); - if (targetm.vectorize.conditional_operation_is_expensive (IFN_MASK_STORE) + if (!VECTOR_MODE_P (vecmode) + || targetm.vectorize.conditional_operation_is_expensive (IFN_MASK_STORE) || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode) || !can_vec_mask_load_store_p (vecmode, mask_mode, false)) return NULL;
[gcc r15-2638] AArch64: Update Neoverse V2 cost model to release costs
https://gcc.gnu.org/g:7e7c1e38829d45667748db68f15584bdd16fcad6 commit r15-2638-g7e7c1e38829d45667748db68f15584bdd16fcad6 Author: Tamar Christina Date: Thu Aug 1 16:53:22 2024 +0100 AArch64: Update Neoverse V2 cost model to release costs This updates the cost for Neoverse V2 to reflect the updated Software Optimization Guide. It also makes Cortex-X3 use the Neoverse V2 cost model. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (cortex-x3): Use Neoverse-V2 costs. * config/aarch64/tuning_models/neoversev2.h: Update costs. Diff: --- gcc/config/aarch64/aarch64-cores.def | 2 +- gcc/config/aarch64/tuning_models/neoversev2.h | 38 +-- 2 files changed, 20 insertions(+), 20 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index e58bc0f27de3..34307fe0c172 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -186,7 +186,7 @@ AARCH64_CORE("cortex-a720", cortexa720, cortexa57, V9_2A, (SVE2_BITPERM, MEMTA AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd48, -1) -AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd4e, -1) +AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversev2, 0x41, 0xd4e, -1) AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h index f76e4ef358f7..c9c3019dd01a 100644 --- a/gcc/config/aarch64/tuning_models/neoversev2.h +++ b/gcc/config/aarch64/tuning_models/neoversev2.h @@ -57,13 +57,13 @@ static const advsimd_vec_cost neoversev2_advsimd_vector_cost = 2, /* ld2_st2_permute_cost */ 2, /* ld3_st3_permute_cost */ 3, /* ld4_st4_permute_cost */ - 3, /* permute_cost */ + 2, /* permute_cost */ 4, /* reduc_i8_cost */ 4, /* reduc_i16_cost */ 2, /* reduc_i32_cost */ 2, /* reduc_i64_cost */ 6, /* reduc_f16_cost */ - 3, /* reduc_f32_cost */ + 4, /* reduc_f32_cost */ 2, /* reduc_f64_cost */ 2, /* store_elt_extra_cost */ /* This value is just inherited from the Cortex-A57 table. */ @@ -86,22 +86,22 @@ static const sve_vec_cost neoversev2_sve_vector_cost = { 2, /* int_stmt_cost */ 2, /* fp_stmt_cost */ -3, /* ld2_st2_permute_cost */ +2, /* ld2_st2_permute_cost */ 3, /* ld3_st3_permute_cost */ -4, /* ld4_st4_permute_cost */ -3, /* permute_cost */ +3, /* ld4_st4_permute_cost */ +2, /* permute_cost */ /* Theoretically, a reduction involving 15 scalar ADDs could - complete in ~3 cycles and would have a cost of 15. [SU]ADDV - completes in 11 cycles, so give it a cost of 15 + 8. */ -21, /* reduc_i8_cost */ -/* Likewise for 7 scalar ADDs (~2 cycles) vs. 9: 7 + 7. */ -14, /* reduc_i16_cost */ -/* Likewise for 3 scalar ADDs (~2 cycles) vs. 8: 3 + 4. */ + complete in ~5 cycles and would have a cost of 15. [SU]ADDV + completes in 9 cycles, so give it a cost of 15 + 4. */ +19, /* reduc_i8_cost */ +/* Likewise for 7 scalar ADDs (~3 cycles) vs. 8: 7 + 5. */ +12, /* reduc_i16_cost */ +/* Likewise for 3 scalar ADDs (~2 cycles) vs. 6: 3 + 4. */ 7, /* reduc_i32_cost */ -/* Likewise for 1 scalar ADD (~1 cycles) vs. 2: 1 + 1. */ -2, /* reduc_i64_cost */ +/* Likewise for 1 scalar ADDs (~1 cycles) vs. 4: 1 + 3. */ +4, /* reduc_i64_cost */ /* Theoretically, a reduction involving 7 scalar FADDs could - complete in ~6 cycles and would have a cost of 14. 
FADDV + complete in ~6 cycles and would have a cost of 14. FADDV completes in 8 cycles, so give it a cost of 14 + 2. */ 16, /* reduc_f16_cost */ /* Likewise for 3 scalar FADDs (~4 cycles) vs. 6: 6 + 2. */ @@ -127,7 +127,7 @@ static const sve_vec_cost neoversev2_sve_vector_cost = /* A strided Advanced SIMD x64 load would take two parallel FP loads (8 cycles) plus an insertion (2 cycles). Assume a 64-bit SVE gather is 1 cycle more. The Advanced SIMD version is costed as 2 scalar loads - (cost 8) and a vec_construct (cost 2). Add a full vector operation + (cost 8) and a vec_construct (cost 4). Add a full vector operation (cost 2) to that, to avoid the difference being lost in rounding. There is no easy comparison between a strided Advanced SIMD x32 load @@ -165,14 +165,14 @@ static const aarch64_sve_vec_issue_info neoversev2_sve_issue_info = { { { - 3, /* loads_per_cycle */ + 3, /* loads_stores_per_cycle */ 2, /* stores_per_cycle */ 4, /* general_ops_per_cycle */ 0, /* fp_simd_load_general_ops */ 1 /* fp_simd_sto
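All of the reduction-cost comments above follow the same arithmetic, which is easy to lose in the diff noise: the vector cost is the cost of the equivalent scalar sequence plus the difference between the vector instruction's latency and the scalar sequence's estimated latency. A small self-checking restatement (the helper is mine; the numbers are the new Neoverse V2 SVE figures quoted above):

// vec_cost = scalar_cost + (vector_latency - estimated_scalar_latency).
constexpr int reduc_cost (int scalar_cost, int scalar_cycles, int vec_cycles)
{
  return scalar_cost + (vec_cycles - scalar_cycles);
}

static_assert (reduc_cost (15, 5, 9) == 19, "reduc_i8_cost: [SU]ADDV");
static_assert (reduc_cost (7, 3, 8) == 12, "reduc_i16_cost");
static_assert (reduc_cost (3, 2, 6) == 7, "reduc_i32_cost");
static_assert (reduc_cost (1, 1, 4) == 4, "reduc_i64_cost");
static_assert (reduc_cost (14, 6, 8) == 16, "reduc_f16_cost: FADDV");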
[gcc r15-2640] AArch64: Add Neoverse V3AE core definition and cost model
https://gcc.gnu.org/g:7ca2a803c4a0d8e894f0b36625a2c838c54fb4cd commit r15-2640-g7ca2a803c4a0d8e894f0b36625a2c838c54fb4cd Author: Tamar Christina Date: Thu Aug 1 16:53:59 2024 +0100 AArch64: Add Neoverse V3AE core definition and cost model This adds a cost model and core definition for Neoverse V3AE. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (neoverse-v3ae): New. * config/aarch64/aarch64-tune.md: Regenerate. * config/aarch64/tuning_models/neoversev3ae.h: New file. * config/aarch64/aarch64.cc: Use it. * doc/invoke.texi: Document it. Diff: --- gcc/config/aarch64/aarch64-cores.def| 1 + gcc/config/aarch64/aarch64-tune.md | 2 +- gcc/config/aarch64/aarch64.cc | 1 + gcc/config/aarch64/tuning_models/neoversev3ae.h | 246 gcc/doc/invoke.texi | 2 +- 5 files changed, 250 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 96c74657a199..092be6eb01e6 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -196,6 +196,7 @@ AARCH64_CORE("cobalt-100", cobalt100, cortexa57, V9A, (I8MM, BF16, SVE2_BITPER AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, -1) AARCH64_CORE("neoverse-v3", neoversev3, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE), neoversev3, 0x41, 0xd84, -1) +AARCH64_CORE("neoverse-v3ae", neoversev3ae, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE), neoversev3ae, 0x41, 0xd83, -1) AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md index 0c3339b53e42..b02e891086cc 100644 --- a/gcc/config/aarch64/aarch64-tune.md +++ b/gcc/config/aarch64/aarch64-tune.md @@ -1,5 +1,5 @@ ;; -*- buffer-read-only: t -*- ;; Generated automatically by gentune.sh from aarch64-cores.def (define_attr "tune" - "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,neoversev3,demeter,generic,generic_armv8_a,generic_armv9_a" + 
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,neoversev3,neoversev3ae,demeter,generic,generic_armv8_a,generic_armv9_a" (const (symbol_ref "((enum attr_tune) aarch64_tune)"))) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index f29dcf7fe173..54b27cdff43b 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -411,6 +411,7 @@ static const struct aarch64_flag_desc aarch64_tuning_flags[] = #include "tuning_models/neoversen2.h" #include "tuning_models/neoversev2.h" #include "tuning_models/neoversev3.h" +#include "tuning_models/neoversev3ae.h" #include "tuning_models/a64fx.h" /* Support for fine-grained override of the tuning structures. */ diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h b/gcc/config/aarch64/tuning_models/neoversev3ae.h new file mode 100644 index ..96d7ccf03cd9 --- /dev/null +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h @@ -0,0 +1,246 @@ +/* T
[gcc r15-2641] AArch64: Add Neoverse N3 and Cortex-A725 core definition and cost model
https://gcc.gnu.org/g:488395f9513233944e488fae59372da4de4324c3 commit r15-2641-g488395f9513233944e488fae59372da4de4324c3 Author: Tamar Christina Date: Thu Aug 1 16:54:15 2024 +0100 AArch64: Add Neoverse N3 and Cortex-A725 core definition and cost model This adds a cost model and core definition for Neoverse N3 and Cortex-A725. It also makes Cortex-A725 use the Neoverse N3 cost model. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (neoverse-n3, cortex-a725): New. * config/aarch64/aarch64-tune.md: Regenerate. * config/aarch64/tuning_models/neoversen3.h: New file. * config/aarch64/aarch64.cc: Use it. * doc/invoke.texi: Document it. Diff: --- gcc/config/aarch64/aarch64-cores.def | 2 + gcc/config/aarch64/aarch64-tune.md| 2 +- gcc/config/aarch64/aarch64.cc | 1 + gcc/config/aarch64/tuning_models/neoversen3.h | 245 ++ gcc/doc/invoke.texi | 3 +- 5 files changed, 251 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 092be6eb01e6..4d6f5a701eee 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -183,6 +183,7 @@ AARCH64_CORE("cortex-a710", cortexa710, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, AARCH64_CORE("cortex-a715", cortexa715, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd4d, -1) AARCH64_CORE("cortex-a720", cortexa720, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-a725", cortexa725, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen3, 0x41, 0xd87, -1) AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd48, -1) @@ -192,6 +193,7 @@ AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, P AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1) AARCH64_CORE("cobalt-100", cobalt100, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1) +AARCH64_CORE("neoverse-n3", neoversen3, cortexa57, V9_2A, (SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen3, 0x41, 0xd8e, -1) AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, -1) diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md index b02e891086cc..d71c631b01c7 100644 --- a/gcc/config/aarch64/aarch64-tune.md +++ b/gcc/config/aarch64/aarch64-tune.md @@ -1,5 +1,5 @@ ;; -*- buffer-read-only: t -*- ;; Generated automatically by gentune.sh from aarch64-cores.def (define_attr "tune" - 
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,neoversev3,neoversev3ae,demeter,generic,generic_armv8_a,generic_armv9_a" + "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexa725,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversen3,neoversev2,grace,neoversev3,neoversev3ae,demeter,generic,generic_armv8_a,generic_armv9_a" (const (symbol_ref "((enum attr_tune) aarch64_tune)"))) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aar
[gcc r15-2642] AArch64: Update Generic Armv9-a cost model to release costs
https://gcc.gnu.org/g:3b0bac451110bf1591ce9085b66857448d099a8c commit r15-2642-g3b0bac451110bf1591ce9085b66857448d099a8c Author: Tamar Christina Date: Thu Aug 1 16:54:31 2024 +0100 AArch64: Update Generic Armv9-a cost model to release costs this updates the costs for gener-armv9-a based on the updated costs for Neoverse V2 and Neoverse N2. gcc/ChangeLog: * config/aarch64/tuning_models/generic_armv9_a.h: Update costs. Diff: --- gcc/config/aarch64/tuning_models/generic_armv9_a.h | 50 +++--- 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h index 0a08c4b43473..7156dbe5787e 100644 --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h @@ -58,7 +58,7 @@ static const advsimd_vec_cost generic_armv9_a_advsimd_vector_cost = 2, /* ld2_st2_permute_cost */ 2, /* ld3_st3_permute_cost */ 3, /* ld4_st4_permute_cost */ - 3, /* permute_cost */ + 2, /* permute_cost */ 4, /* reduc_i8_cost */ 4, /* reduc_i16_cost */ 2, /* reduc_i32_cost */ @@ -87,28 +87,28 @@ static const sve_vec_cost generic_armv9_a_sve_vector_cost = { 2, /* int_stmt_cost */ 2, /* fp_stmt_cost */ -3, /* ld2_st2_permute_cost */ -4, /* ld3_st3_permute_cost */ -4, /* ld4_st4_permute_cost */ -3, /* permute_cost */ +2, /* ld2_st2_permute_cost */ +3, /* ld3_st3_permute_cost */ +3, /* ld4_st4_permute_cost */ +2, /* permute_cost */ /* Theoretically, a reduction involving 15 scalar ADDs could complete in ~5 cycles and would have a cost of 15. [SU]ADDV - completes in 11 cycles, so give it a cost of 15 + 6. */ -21, /* reduc_i8_cost */ -/* Likewise for 7 scalar ADDs (~3 cycles) vs. 9: 7 + 6. */ -13, /* reduc_i16_cost */ -/* Likewise for 3 scalar ADDs (~2 cycles) vs. 8: 3 + 6. */ -9, /* reduc_i32_cost */ -/* Likewise for 1 scalar ADD (~1 cycles) vs. 2: 1 + 1. */ -2, /* reduc_i64_cost */ + completes in 9 cycles, so give it a cost of 15 + 4. */ +19, /* reduc_i8_cost */ +/* Likewise for 7 scalar ADDs (~3 cycles) vs. 8: 7 + 5. */ +12, /* reduc_i16_cost */ +/* Likewise for 3 scalar ADDs (~2 cycles) vs. 6: 3 + 4. */ +7, /* reduc_i32_cost */ +/* Likewise for 1 scalar ADDs (~1 cycles) vs. 4: 1 + 3. */ +4, /* reduc_i64_cost */ /* Theoretically, a reduction involving 7 scalar FADDs could - complete in ~8 cycles and would have a cost of 14. FADDV - completes in 6 cycles, so give it a cost of 14 - 2. */ -12, /* reduc_f16_cost */ -/* Likewise for 3 scalar FADDs (~4 cycles) vs. 4: 6 - 0. */ -6, /* reduc_f32_cost */ -/* Likewise for 1 scalar FADD (~2 cycles) vs. 2: 2 - 0. */ -2, /* reduc_f64_cost */ + complete in ~8 cycles and would have a cost of 14. FADDV + completes in 8 cycles, so give it a cost of 14 + 0. */ +14, /* reduc_f16_cost */ +/* Likewise for 3 scalar FADDs (~4 cycles) vs. 6: 6 + 2. */ +8, /* reduc_f32_cost */ +/* Likewise for 1 scalar FADD (~2 cycles) vs. 4: 2 + 2. */ +4, /* reduc_f64_cost */ 2, /* store_elt_extra_cost */ /* This value is just inherited from the Cortex-A57 table. */ 8, /* vec_to_scalar_cost */ @@ -128,7 +128,7 @@ static const sve_vec_cost generic_armv9_a_sve_vector_cost = /* A strided Advanced SIMD x64 load would take two parallel FP loads (8 cycles) plus an insertion (2 cycles). Assume a 64-bit SVE gather is 1 cycle more. The Advanced SIMD version is costed as 2 scalar loads - (cost 8) and a vec_construct (cost 2). Add a full vector operation + (cost 8) and a vec_construct (cost 4). 
Add a full vector operation (cost 2) to that, to avoid the difference being lost in rounding. There is no easy comparison between a strided Advanced SIMD x32 load @@ -166,14 +166,14 @@ static const aarch64_sve_vec_issue_info generic_armv9_a_sve_issue_info = { { { - 3, /* loads_per_cycle */ + 3, /* loads_stores_per_cycle */ 2, /* stores_per_cycle */ 2, /* general_ops_per_cycle */ 0, /* fp_simd_load_general_ops */ 1 /* fp_simd_store_general_ops */ }, 2, /* ld2_st2_general_ops */ -3, /* ld3_st3_general_ops */ +2, /* ld3_st3_general_ops */ 3 /* ld4_st4_general_ops */ }, 2, /* pred_ops_per_cycle */ @@ -191,7 +191,7 @@ static const aarch64_vec_issue_info generic_armv9_a_vec_issue_info = &generic_armv9_a_sve_issue_info }; -/* Neoverse N2 costs for vector insn classes. */ +/* Generic_armv9_a costs for vector insn classes. */ static const struct cpu_vector_cost generic_armv9_a_vector_cost = { 1, /* scalar_int_stmt_cost */ @@ -228,7 +228,7 @@ static const struct tune_params generic_armv9_a_tunings = "32:16", /* loop_ali
[gcc r15-2643] AArch64: Update Neoverse N2 cost model to release costs
https://gcc.gnu.org/g:f88cb43aed5c7db5676732c755ec4fee960ecbed commit r15-2643-gf88cb43aed5c7db5676732c755ec4fee960ecbed Author: Tamar Christina Date: Thu Aug 1 16:54:49 2024 +0100 AArch64: Update Neoverse N2 cost model to release costs This updates the cost for Neoverse N2 to reflect the updated Software Optimization Guide. gcc/ChangeLog: * config/aarch64/tuning_models/neoversen2.h: Update costs. Diff: --- gcc/config/aarch64/tuning_models/neoversen2.h | 46 +-- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h index be9a48ac3adc..d41e714aa045 100644 --- a/gcc/config/aarch64/tuning_models/neoversen2.h +++ b/gcc/config/aarch64/tuning_models/neoversen2.h @@ -57,7 +57,7 @@ static const advsimd_vec_cost neoversen2_advsimd_vector_cost = 2, /* ld2_st2_permute_cost */ 2, /* ld3_st3_permute_cost */ 3, /* ld4_st4_permute_cost */ - 3, /* permute_cost */ + 2, /* permute_cost */ 4, /* reduc_i8_cost */ 4, /* reduc_i16_cost */ 2, /* reduc_i32_cost */ @@ -86,27 +86,27 @@ static const sve_vec_cost neoversen2_sve_vector_cost = { 2, /* int_stmt_cost */ 2, /* fp_stmt_cost */ -3, /* ld2_st2_permute_cost */ -4, /* ld3_st3_permute_cost */ -4, /* ld4_st4_permute_cost */ -3, /* permute_cost */ +2, /* ld2_st2_permute_cost */ +3, /* ld3_st3_permute_cost */ +3, /* ld4_st4_permute_cost */ +2, /* permute_cost */ /* Theoretically, a reduction involving 15 scalar ADDs could complete in ~5 cycles and would have a cost of 15. [SU]ADDV - completes in 11 cycles, so give it a cost of 15 + 6. */ -21, /* reduc_i8_cost */ -/* Likewise for 7 scalar ADDs (~3 cycles) vs. 9: 7 + 6. */ -13, /* reduc_i16_cost */ -/* Likewise for 3 scalar ADDs (~2 cycles) vs. 8: 3 + 6. */ -9, /* reduc_i32_cost */ -/* Likewise for 1 scalar ADD (~1 cycles) vs. 2: 1 + 1. */ -2, /* reduc_i64_cost */ + completes in 9 cycles, so give it a cost of 15 + 4. */ +19, /* reduc_i8_cost */ +/* Likewise for 7 scalar ADDs (~3 cycles) vs. 8: 7 + 5. */ +12, /* reduc_i16_cost */ +/* Likewise for 3 scalar ADDs (~2 cycles) vs. 6: 3 + 4. */ +7, /* reduc_i32_cost */ +/* Likewise for 1 scalar ADDs (~1 cycles) vs. 4: 1 + 3. */ +4, /* reduc_i64_cost */ /* Theoretically, a reduction involving 7 scalar FADDs could - complete in ~8 cycles and would have a cost of 14. FADDV - completes in 6 cycles, so give it a cost of 14 - 2. */ + complete in ~8 cycles and would have a cost of 14. FADDV + completes in 6 cycles, so give it a cost of 14 + -2. */ 12, /* reduc_f16_cost */ -/* Likewise for 3 scalar FADDs (~4 cycles) vs. 4: 6 - 0. */ +/* Likewise for 3 scalar FADDs (~4 cycles) vs. 4: 6 + 0. */ 6, /* reduc_f32_cost */ -/* Likewise for 1 scalar FADD (~2 cycles) vs. 2: 2 - 0. */ +/* Likewise for 1 scalar FADD (~2 cycles) vs. 2: 2 + 0. */ 2, /* reduc_f64_cost */ 2, /* store_elt_extra_cost */ /* This value is just inherited from the Cortex-A57 table. */ @@ -127,7 +127,7 @@ static const sve_vec_cost neoversen2_sve_vector_cost = /* A strided Advanced SIMD x64 load would take two parallel FP loads (8 cycles) plus an insertion (2 cycles). Assume a 64-bit SVE gather is 1 cycle more. The Advanced SIMD version is costed as 2 scalar loads - (cost 8) and a vec_construct (cost 2). Add a full vector operation + (cost 8) and a vec_construct (cost 4). Add a full vector operation (cost 2) to that, to avoid the difference being lost in rounding. 
There is no easy comparison between a strided Advanced SIMD x32 load @@ -165,14 +165,14 @@ static const aarch64_sve_vec_issue_info neoversen2_sve_issue_info = { { { - 3, /* loads_per_cycle */ + 3, /* loads_stores_per_cycle */ 2, /* stores_per_cycle */ 2, /* general_ops_per_cycle */ 0, /* fp_simd_load_general_ops */ 1 /* fp_simd_store_general_ops */ }, 2, /* ld2_st2_general_ops */ -3, /* ld3_st3_general_ops */ +2, /* ld3_st3_general_ops */ 3 /* ld4_st4_general_ops */ }, 2, /* pred_ops_per_cycle */ @@ -190,7 +190,7 @@ static const aarch64_vec_issue_info neoversen2_vec_issue_info = &neoversen2_sve_issue_info }; -/* Neoverse N2 costs for vector insn classes. */ +/* Neoversen2 costs for vector insn classes. */ static const struct cpu_vector_cost neoversen2_vector_cost = { 1, /* scalar_int_stmt_cost */ @@ -220,7 +220,7 @@ static const struct tune_params neoversen2_tunings = 6, /* load_pred. */ 1 /* store_pred. */ }, /* memmov_cost. */ - 3, /* issue_rate */ + 5, /* issue_rate */ (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops */ "32:16", /* function_align. *
[gcc r15-2639] AArch64: Add Neoverse V3 core definition and cost model
https://gcc.gnu.org/g:729000b90300a31ef9ed405635a0be761c5e168b commit r15-2639-g729000b90300a31ef9ed405635a0be761c5e168b Author: Tamar Christina Date: Thu Aug 1 16:53:41 2024 +0100 AArch64: Add Neoverse V3 core definition and cost model This adds a cost model and core definition for Neoverse V3. It also makes Cortex-X4 use the Neoverse V3 cost model. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (cortex-x4): Update. (neoverse-v3): New. * config/aarch64/aarch64-tune.md: Regenerate. * config/aarch64/tuning_models/neoversev3.h: New file. * config/aarch64/aarch64.cc: Use it. * doc/invoke.texi: Document it. Diff: --- gcc/config/aarch64/aarch64-cores.def | 3 +- gcc/config/aarch64/aarch64-tune.md| 2 +- gcc/config/aarch64/aarch64.cc | 1 + gcc/config/aarch64/tuning_models/neoversev3.h | 246 ++ gcc/doc/invoke.texi | 1 + 5 files changed, 251 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 34307fe0c172..96c74657a199 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -188,13 +188,14 @@ AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8M AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversev2, 0x41, 0xd4e, -1) -AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversev3, 0x41, 0xd81, -1) AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1) AARCH64_CORE("cobalt-100", cobalt100, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1) AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, -1) +AARCH64_CORE("neoverse-v3", neoversev3, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE), neoversev3, 0x41, 0xd84, -1) AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md index 719fd3dc62a5..0c3339b53e42 100644 --- a/gcc/config/aarch64/aarch64-tune.md +++ b/gcc/config/aarch64/aarch64-tune.md @@ -1,5 +1,5 @@ ;; -*- buffer-read-only: t -*- ;; Generated automatically by gentune.sh from aarch64-cores.def (define_attr "tune" - 
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,demeter,generic,generic_armv8_a,generic_armv9_a" + "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,neoversev3,demeter,generic,generic_armv8_a,generic_armv9_a" (const (symbol_ref "((enum attr_tune) aarch64_tune)"))) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 9810f2c03900..f29dcf7fe173 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -410,6 +410,7 @@ static const struct
[gcc r15-2644] AArch64: Add Cortex-X925 core definition and cost model
https://gcc.gnu.org/g:1f53319cae81aea438b6c0ba55f49e5669acf1c8 commit r15-2644-g1f53319cae81aea438b6c0ba55f49e5669acf1c8 Author: Tamar Christina Date: Thu Aug 1 16:55:10 2024 +0100 AArch64: Add Cortex-X925 core definition and cost model This adds a cost model and core definition for Cortex-X925. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (cortex-x925): New. * config/aarch64/aarch64-tune.md: Regenerate. * config/aarch64/tuning_models/cortexx925.h: New file. * config/aarch64/aarch64.cc: Use it. * doc/invoke.texi: Document it. Diff: --- gcc/config/aarch64/aarch64-cores.def | 1 + gcc/config/aarch64/aarch64-tune.md| 2 +- gcc/config/aarch64/aarch64.cc | 1 + gcc/config/aarch64/tuning_models/cortexx925.h | 246 ++ gcc/doc/invoke.texi | 2 +- 5 files changed, 250 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 4d6f5a701eee..cc2260036887 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -190,6 +190,7 @@ AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8M AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversev2, 0x41, 0xd4e, -1) AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversev3, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-x925", cortexx925, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), cortexx925, 0x41, 0xd85, -1) AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1) AARCH64_CORE("cobalt-100", cobalt100, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1) diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md index d71c631b01c7..4fce0c507f6c 100644 --- a/gcc/config/aarch64/aarch64-tune.md +++ b/gcc/config/aarch64/aarch64-tune.md @@ -1,5 +1,5 @@ ;; -*- buffer-read-only: t -*- ;; Generated automatically by gentune.sh from aarch64-cores.def (define_attr "tune" - "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexa725,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversen3,neoversev2,grace,neoversev3,neoversev3ae,demeter,generic,generic_armv8_a,generic_armv9_a" + 
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexa725,cortexx2,cortexx3,cortexx4,cortexx925,neoversen2,cobalt100,neoversen3,neoversev2,grace,neoversev3,neoversev3ae,demeter,generic,generic_armv8_a,generic_armv9_a" (const (symbol_ref "((enum attr_tune) aarch64_tune)"))) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index f1a57159d471..113ebb45cfda 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -392,6 +392,7 @@ static const struct aarch64_flag_desc aarch64_tuning_flags[] = #include "tuning_models/cortexa57.h" #include "tuning_models/cortexa72.h" #include "tuning_models/cortexa73.h" +#include "tuning_models/cortexx925.h" #include "tuning_models/exynosm1.h" #include "tuning_models/thunderxt88.h" #include "tuning_models/thunderx.h" diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h new file mode 100644 index ..6cae5b7de5ca --- /dev/null +++ b/gcc/config/aarch64/tuning_models/cortexx925.h @@ -0,0 +1,246 @@
[gcc r15-2768] AArch64: take gather/scatter decode overhead into account
https://gcc.gnu.org/g:a50916a6c0a6c73c1537d033509d4f7034341f75 commit r15-2768-ga50916a6c0a6c73c1537d033509d4f7034341f75 Author: Tamar Christina Date: Tue Aug 6 22:41:10 2024 +0100 AArch64: take gather/scatter decode overhead into account Gather and scatters are not usually beneficial when the loop count is small. This is because there's not only a cost to their execution within the loop but there is also some cost to enter loops with them. As such this patch models this overhead. For generic tuning we however still prefer gathers/scatters when the loop costs work out. gcc/ChangeLog: * config/aarch64/aarch64-protos.h (struct sve_vec_cost): Add gather_load_x32_init_cost and gather_load_x64_init_cost. * config/aarch64/aarch64.cc (aarch64_vector_costs): Add m_sve_gather_scatter_init_cost. (aarch64_vector_costs::add_stmt_cost): Use them. (aarch64_vector_costs::finish_cost): Likewise. * config/aarch64/tuning_models/a64fx.h: Update. * config/aarch64/tuning_models/cortexx925.h: Update. * config/aarch64/tuning_models/generic.h: Update. * config/aarch64/tuning_models/generic_armv8_a.h: Update. * config/aarch64/tuning_models/generic_armv9_a.h: Update. * config/aarch64/tuning_models/neoverse512tvb.h: Update. * config/aarch64/tuning_models/neoversen2.h: Update. * config/aarch64/tuning_models/neoversen3.h: Update. * config/aarch64/tuning_models/neoversev1.h: Update. * config/aarch64/tuning_models/neoversev2.h: Update. * config/aarch64/tuning_models/neoversev3.h: Update. * config/aarch64/tuning_models/neoversev3ae.h: Update. Diff: --- gcc/config/aarch64/aarch64-protos.h| 10 + gcc/config/aarch64/aarch64.cc | 26 ++ gcc/config/aarch64/tuning_models/a64fx.h | 2 ++ gcc/config/aarch64/tuning_models/cortexx925.h | 2 ++ gcc/config/aarch64/tuning_models/generic.h | 2 ++ gcc/config/aarch64/tuning_models/generic_armv8_a.h | 2 ++ gcc/config/aarch64/tuning_models/generic_armv9_a.h | 2 ++ gcc/config/aarch64/tuning_models/neoverse512tvb.h | 2 ++ gcc/config/aarch64/tuning_models/neoversen2.h | 2 ++ gcc/config/aarch64/tuning_models/neoversen3.h | 2 ++ gcc/config/aarch64/tuning_models/neoversev1.h | 2 ++ gcc/config/aarch64/tuning_models/neoversev2.h | 2 ++ gcc/config/aarch64/tuning_models/neoversev3.h | 2 ++ gcc/config/aarch64/tuning_models/neoversev3ae.h| 2 ++ 14 files changed, 60 insertions(+) diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h index f64afe288901..44b881b5c57a 100644 --- a/gcc/config/aarch64/aarch64-protos.h +++ b/gcc/config/aarch64/aarch64-protos.h @@ -262,6 +262,8 @@ struct sve_vec_cost : simd_vec_cost unsigned int fadda_f64_cost, unsigned int gather_load_x32_cost, unsigned int gather_load_x64_cost, + unsigned int gather_load_x32_init_cost, + unsigned int gather_load_x64_init_cost, unsigned int scatter_store_elt_cost) : simd_vec_cost (base), clast_cost (clast_cost), @@ -270,6 +272,8 @@ struct sve_vec_cost : simd_vec_cost fadda_f64_cost (fadda_f64_cost), gather_load_x32_cost (gather_load_x32_cost), gather_load_x64_cost (gather_load_x64_cost), + gather_load_x32_init_cost (gather_load_x32_init_cost), + gather_load_x64_init_cost (gather_load_x64_init_cost), scatter_store_elt_cost (scatter_store_elt_cost) {} @@ -289,6 +293,12 @@ struct sve_vec_cost : simd_vec_cost const int gather_load_x32_cost; const int gather_load_x64_cost; + /* Additional loop initialization cost of using a gather load instruction. The x32 + value is for loads of 32-bit elements and the x64 value is for loads of + 64-bit elements. 
*/ + const int gather_load_x32_init_cost; + const int gather_load_x64_init_cost; + /* The per-element cost of a scatter store. */ const int scatter_store_elt_cost; }; diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 9e12bd9711cd..2ac5a22c848e 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -16231,6 +16231,10 @@ private: supported by Advanced SIMD and SVE2. */ bool m_has_avg = false; + /* Additional initialization costs for using gather or scatter operation in + the current loop. */ + unsigned int m_sve_gather_scatter_init_cost = 0; + /* True if the vector body contains a store to a decl and if the function is known to have a vld1 from the same decl. @@ -17295,6 +17299,23 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
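A minimal sketch of the modelling, assuming a stripped-down cost record (the two field names match the new sve_vec_cost members in the diff, but the helper below is hypothetical rather than GCC's actual m_sve_gather_scatter_init_cost bookkeeping): the decode overhead is charged once against loop entry instead of per vector iteration, so loops with small trip counts stop looking artificially cheap when they use gathers or scatters.

struct sve_gather_costs
{
  unsigned gather_load_x32_init_cost;  // one-off cost, 32-bit element gathers
  unsigned gather_load_x64_init_cost;  // one-off cost, 64-bit element gathers
};

// Charged against the loop prologue, not multiplied by the iteration count.
unsigned
loop_entry_penalty (const sve_gather_costs &c, bool uses_x32_gather,
                    bool uses_x64_gather)
{
  unsigned penalty = 0;
  if (uses_x32_gather)
    penalty += c.gather_load_x32_init_cost;
  if (uses_x64_gather)
    penalty += c.gather_load_x64_init_cost;
  return penalty;
}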
[gcc r15-2839] AArch64: Fix signbit mask creation after late combine [PR116229]
https://gcc.gnu.org/g:2c24e0568392e51a77ebdaab629d631969ce8966 commit r15-2839-g2c24e0568392e51a77ebdaab629d631969ce8966 Author: Tamar Christina Date: Thu Aug 8 18:51:30 2024 +0100 AArch64: Fix signbit mask creation after late combine [PR116229] The optimization to generate a Di signbit constant by using fneg was relying on nothing being able to push the constant into the negate. It's run quite late for this reason. However late combine now runs after it and triggers RTL simplification based on the neg. When -fno-signed-zeros this ends up dropping the - from the -0.0 and thus producing incorrect code. This change adds a new unspec FNEG on DI mode which prevents this simplication. gcc/ChangeLog: PR target/116229 * config/aarch64/aarch64-simd.md (aarch64_fnegv2di2): New. * config/aarch64/aarch64.cc (aarch64_maybe_generate_simd_constant): Update call to gen_aarch64_fnegv2di2. * config/aarch64/iterators.md: New UNSPEC_FNEG. gcc/testsuite/ChangeLog: PR target/116229 * gcc.target/aarch64/pr116229.c: New test. Diff: --- gcc/config/aarch64/aarch64-simd.md | 9 + gcc/config/aarch64/aarch64.cc | 4 ++-- gcc/config/aarch64/iterators.md | 1 + gcc/testsuite/gcc.target/aarch64/pr116229.c | 20 4 files changed, 32 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 816f499e9634..cc612ec2ca0e 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -2629,6 +2629,15 @@ [(set_attr "type" "neon_fp_neg_")] ) +(define_insn "aarch64_fnegv2di2" + [(set (match_operand:V2DI 0 "register_operand" "=w") + (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "w")] + UNSPEC_FNEG))] + "TARGET_SIMD" + "fneg\\t%0.2d, %1.2d" + [(set_attr "type" "neon_fp_neg_d")] +) + (define_insn "abs2" [(set (match_operand:VHSDF 0 "register_operand" "=w") (abs:VHSDF (match_operand:VHSDF 1 "register_operand" "w")))] diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 2ac5a22c848e..bfd7bcdef7cb 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -11808,8 +11808,8 @@ aarch64_maybe_generate_simd_constant (rtx target, rtx val, machine_mode mode) /* Use the same base type as aarch64_gen_shareable_zero. */ rtx zero = CONST0_RTX (V4SImode); emit_move_insn (lowpart_subreg (V4SImode, target, mode), zero); - rtx neg = lowpart_subreg (V2DFmode, target, mode); - emit_insn (gen_negv2df2 (neg, copy_rtx (neg))); + rtx neg = lowpart_subreg (V2DImode, target, mode); + emit_insn (gen_aarch64_fnegv2di2 (neg, copy_rtx (neg))); return true; } diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md index aaa4afefe2ce..20a318e023b6 100644 --- a/gcc/config/aarch64/iterators.md +++ b/gcc/config/aarch64/iterators.md @@ -689,6 +689,7 @@ UNSPEC_FMINNMV ; Used in aarch64-simd.md. UNSPEC_FMINV ; Used in aarch64-simd.md. UNSPEC_FADDV ; Used in aarch64-simd.md. +UNSPEC_FNEG; Used in aarch64-simd.md. UNSPEC_ADDV; Used in aarch64-simd.md. UNSPEC_SMAXV ; Used in aarch64-simd.md. UNSPEC_SMINV ; Used in aarch64-simd.md. 
diff --git a/gcc/testsuite/gcc.target/aarch64/pr116229.c b/gcc/testsuite/gcc.target/aarch64/pr116229.c new file mode 100644 index ..cc42078478f7 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr116229.c @@ -0,0 +1,20 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -fno-signed-zeros" } */ + +typedef __attribute__((__vector_size__ (8))) unsigned long V; + +V __attribute__((__noipa__)) +foo (void) +{ + return (V){ 0x8000 }; +} + +V ref = (V){ 0x8000 }; + +int +main () +{ + V v = foo (); + if (v[0] != ref[0]) +__builtin_abort(); +}
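A standalone illustration (plain C++, not the GCC expansion) of the bit trick the new UNSPEC pattern protects: negating +0.0 flips only the sign bit, so an FNEG of a zeroed 64-bit lane materialises the 0x8000000000000000 mask. Under -fno-signed-zeros a compiler is allowed to treat -0.0 as 0.0, which is precisely the simplification late combine applied to the old arithmetic neg and that the unspec now blocks.

#include <cstdint>
#include <cstdio>
#include <cstring>

int main ()
{
  double zero = 0.0;
  double neg = -zero;                      // what FNEG computes per lane
  std::uint64_t bits;
  std::memcpy (&bits, &neg, sizeof bits);  // reinterpret the lane as bits
  std::printf ("%#llx\n", (unsigned long long) bits);  // 0x8000000000000000
  return 0;
}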
[gcc r15-1038] AArch64: convert several predicate patterns to new compact syntax
https://gcc.gnu.org/g:fd4898891ae0c73d6b7aa433cd1ef4539aaa2457 commit r15-1038-gfd4898891ae0c73d6b7aa433cd1ef4539aaa2457 Author: Tamar Christina Date: Wed Jun 5 19:30:39 2024 +0100 AArch64: convert several predicate patterns to new compact syntax This converts the single alternative patterns to the new compact syntax such that when I add the new alternatives it's clearer what's being changed. Note that this will spew out a bunch of warnings from geninsn as it'll warn that @ is useless for a single alternative pattern. These are not fatal so won't break the build and are only temporary. No change in functionality is expected with this patch. gcc/ChangeLog: * config/aarch64/aarch64-sve.md (and3, @aarch64_pred__z, *3_cc, *3_ptest, aarch64_pred__z, *3_cc, *3_ptest, aarch64_pred__z, *3_cc, *3_ptest, *cmp_ptest, @aarch64_pred_cmp_wide, *aarch64_pred_cmp_wide_cc, *aarch64_pred_cmp_wide_ptest, *aarch64_brk_cc, *aarch64_brk_ptest, @aarch64_brk, *aarch64_brk_cc, *aarch64_brk_ptest, aarch64_rdffr_z, *aarch64_rdffr_z_ptest, *aarch64_rdffr_ptest, *aarch64_rdffr_z_cc, *aarch64_rdffr_cc): Convert to compact syntax. * config/aarch64/aarch64-sve2.md (@aarch64_pred_): Likewise. Diff: --- gcc/config/aarch64/aarch64-sve.md | 262 ++--- gcc/config/aarch64/aarch64-sve2.md | 12 +- 2 files changed, 161 insertions(+), 113 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index 0434358122d..ca4d435e705 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -1156,76 +1156,86 @@ ;; Likewise with zero predication. (define_insn "aarch64_rdffr_z" - [(set (match_operand:VNx16BI 0 "register_operand" "=Upa") + [(set (match_operand:VNx16BI 0 "register_operand") (and:VNx16BI (reg:VNx16BI FFRT_REGNUM) - (match_operand:VNx16BI 1 "register_operand" "Upa")))] + (match_operand:VNx16BI 1 "register_operand")))] "TARGET_SVE && TARGET_NON_STREAMING" - "rdffr\t%0.b, %1/z" + {@ [ cons: =0, 1 ] + [ Upa , Upa ] rdffr\t%0.b, %1/z + } ) ;; Read the FFR to test for a fault, without using the predicate result. (define_insn "*aarch64_rdffr_z_ptest" [(set (reg:CC_NZC CC_REGNUM) (unspec:CC_NZC - [(match_operand:VNx16BI 1 "register_operand" "Upa") + [(match_operand:VNx16BI 1 "register_operand") (match_dup 1) (match_operand:SI 2 "aarch64_sve_ptrue_flag") (and:VNx16BI (reg:VNx16BI FFRT_REGNUM) (match_dup 1))] UNSPEC_PTEST)) - (clobber (match_scratch:VNx16BI 0 "=Upa"))] + (clobber (match_scratch:VNx16BI 0))] "TARGET_SVE && TARGET_NON_STREAMING" - "rdffrs\t%0.b, %1/z" + {@ [ cons: =0, 1 ] + [ Upa , Upa ] rdffrs\t%0.b, %1/z + } ) ;; Same for unpredicated RDFFR when tested with a known PTRUE. (define_insn "*aarch64_rdffr_ptest" [(set (reg:CC_NZC CC_REGNUM) (unspec:CC_NZC - [(match_operand:VNx16BI 1 "register_operand" "Upa") + [(match_operand:VNx16BI 1 "register_operand") (match_dup 1) (const_int SVE_KNOWN_PTRUE) (reg:VNx16BI FFRT_REGNUM)] UNSPEC_PTEST)) - (clobber (match_scratch:VNx16BI 0 "=Upa"))] + (clobber (match_scratch:VNx16BI 0))] "TARGET_SVE && TARGET_NON_STREAMING" - "rdffrs\t%0.b, %1/z" + {@ [ cons: =0, 1 ] + [ Upa , Upa ] rdffrs\t%0.b, %1/z + } ) ;; Read the FFR with zero predication and test the result. 
(define_insn "*aarch64_rdffr_z_cc" [(set (reg:CC_NZC CC_REGNUM) (unspec:CC_NZC - [(match_operand:VNx16BI 1 "register_operand" "Upa") + [(match_operand:VNx16BI 1 "register_operand") (match_dup 1) (match_operand:SI 2 "aarch64_sve_ptrue_flag") (and:VNx16BI (reg:VNx16BI FFRT_REGNUM) (match_dup 1))] UNSPEC_PTEST)) - (set (match_operand:VNx16BI 0 "register_operand" "=Upa") + (set (match_operand:VNx16BI 0 "register_operand") (and:VNx16BI (reg:VNx16BI FFRT_REGNUM) (match_dup 1)))] "TARGET_SVE && TARGET_NON_STREAMING" - "rdffrs\t%0.b, %1/z" + {@ [ cons: =0, 1 ] + [ Upa , Upa ] rdffrs\t%0.b, %1/z + } ) ;; Same for unpredicated RDFFR when tested with a known PTRUE. (define_insn "*aarch64_rdffr_cc" [(set (reg:CC_NZC CC_REGNUM) (unspec:CC_NZC - [(match_operand:VNx16BI 1 "register_operand" "Upa") + [(match_operand:VNx16BI 1 "register_operand") (match_dup 1) (const_int SVE_KNOWN_PTRUE) (reg:VNx16BI FFRT_REGNUM)] UNSPEC_PTEST)) - (set (match_operand:VNx16BI 0 "register_operand" "=Upa") + (set (match_operand:VNx16BI 0 "registe
[gcc r15-1039] AArch64: add new tuning param and attribute for enabling conditional early clobber
https://gcc.gnu.org/g:35f17c680ca650f8658994f857358e5a529c0b93 commit r15-1039-g35f17c680ca650f8658994f857358e5a529c0b93 Author: Tamar Christina Date: Wed Jun 5 19:31:11 2024 +0100 AArch64: add new tuning param and attribute for enabling conditional early clobber This adds a new tuning parameter AARCH64_EXTRA_TUNE_AVOID_PRED_RMW for AArch64 to allow us to conditionally enable the early clobber alternatives based on the tuning models. gcc/ChangeLog: * config/aarch64/aarch64-tuning-flags.def (AVOID_PRED_RMW): New. * config/aarch64/aarch64.h (TARGET_SVE_PRED_CLOBBER): New. * config/aarch64/aarch64.md (pred_clobber): New. (arch_enabled): Use it. Diff: --- gcc/config/aarch64/aarch64-tuning-flags.def | 4 gcc/config/aarch64/aarch64.h| 5 + gcc/config/aarch64/aarch64.md | 18 -- 3 files changed, 25 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def index d5bcaebce77..a9f48f5d3d4 100644 --- a/gcc/config/aarch64/aarch64-tuning-flags.def +++ b/gcc/config/aarch64/aarch64-tuning-flags.def @@ -48,4 +48,8 @@ AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA) AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma", FULLY_PIPELINED_FMA) +/* Enable is the target prefers to use a fresh register for predicate outputs + rather than re-use an input predicate register. */ +AARCH64_EXTRA_TUNING_OPTION ("avoid_pred_rmw", AVOID_PRED_RMW) + #undef AARCH64_EXTRA_TUNING_OPTION diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index bbf11faaf4b..0997b82dbc0 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -495,6 +495,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = AARCH64_FL_SM_OFF; enabled through +gcs. */ #define TARGET_GCS (AARCH64_ISA_GCS) +/* Prefer different predicate registers for the output of a predicated + operation over re-using an existing input predicate. */ +#define TARGET_SVE_PRED_CLOBBER (TARGET_SVE \ +&& (aarch64_tune_params.extra_tuning_flags \ +& AARCH64_EXTRA_TUNE_AVOID_PRED_RMW)) /* Standard register usage. */ diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 9dff2d7a2b0..389a1906e23 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -445,6 +445,10 @@ ;; target-independent code. (define_attr "is_call" "no,yes" (const_string "no")) +;; Indicates whether we want to enable the pattern with an optional early +;; clobber for SVE predicates. +(define_attr "pred_clobber" "any,no,yes" (const_string "any")) + ;; [For compatibility with Arm in pipeline models] ;; Attribute that specifies whether or not the instruction touches fp ;; registers. @@ -460,7 +464,17 @@ (define_attr "arch_enabled" "no,yes" (if_then_else -(ior +(and + (ior + (and + (eq_attr "pred_clobber" "no") + (match_test "!TARGET_SVE_PRED_CLOBBER")) + (and + (eq_attr "pred_clobber" "yes") + (match_test "TARGET_SVE_PRED_CLOBBER")) + (eq_attr "pred_clobber" "any")) + + (ior (eq_attr "arch" "any") (and (eq_attr "arch" "rcpc8_4") @@ -488,7 +502,7 @@ (match_test "TARGET_SVE")) (and (eq_attr "arch" "sme") -(match_test "TARGET_SME"))) +(match_test "TARGET_SME" (const_string "yes") (const_string "no")))
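Restating the arch_enabled gating added above as a tiny self-checking truth table (the enum and helper are mine and mirror the RTL attribute logic rather than reproduce it; the result is additionally ANDed with the pre-existing arch tests): an alternative tagged pred_clobber "yes" is only live when the tuning flag is active, "no" only when it is not, and "any" always.

enum class pred_clobber { any, no, yes };

constexpr bool
alternative_enabled (pred_clobber attr, bool target_sve_pred_clobber)
{
  return attr == pred_clobber::any
         || (attr == pred_clobber::yes && target_sve_pred_clobber)
         || (attr == pred_clobber::no && !target_sve_pred_clobber);
}

static_assert (alternative_enabled (pred_clobber::yes, true), "tuned cores");
static_assert (!alternative_enabled (pred_clobber::yes, false), "generic tuning");
static_assert (alternative_enabled (pred_clobber::any, false), "untagged patterns");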
[gcc r15-1040] AArch64: add new alternative with early clobber to patterns
https://gcc.gnu.org/g:2de3bbde1ebea8689f3596967769f66bf903458e commit r15-1040-g2de3bbde1ebea8689f3596967769f66bf903458e Author: Tamar Christina Date: Wed Jun 5 19:31:39 2024 +0100 AArch64: add new alternative with early clobber to patterns This patch adds new alternatives to the patterns which are affected. The new alternatives with the conditional early clobbers are added before the normal ones in order for LRA to prefer them in the event that we have enough free registers to accommodate them. In case register pressure is too high the normal alternatives will be preferred before a reload is considered as we rather have the tie than a spill. Tests are in the next patch. gcc/ChangeLog: * config/aarch64/aarch64-sve.md (and3, @aarch64_pred__z, *3_cc, *3_ptest, aarch64_pred__z, *3_cc, *3_ptest, aarch64_pred__z, *3_cc, *3_ptest, @aarch64_pred_cmp, *cmp_cc, *cmp_ptest, @aarch64_pred_cmp_wide, *aarch64_pred_cmp_wide_cc, *aarch64_pred_cmp_wide_ptest, @aarch64_brk, *aarch64_brk_cc, *aarch64_brk_ptest, @aarch64_brk, *aarch64_brk_cc, *aarch64_brk_ptest, aarch64_rdffr_z, *aarch64_rdffr_z_ptest, *aarch64_rdffr_ptest, *aarch64_rdffr_z_cc, *aarch64_rdffr_cc): Add new early clobber alternative. * config/aarch64/aarch64-sve2.md (@aarch64_pred_): Likewise. Diff: --- gcc/config/aarch64/aarch64-sve.md | 178 + gcc/config/aarch64/aarch64-sve2.md | 6 +- 2 files changed, 124 insertions(+), 60 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index ca4d435e705..d902bce62fd 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -1161,8 +1161,10 @@ (reg:VNx16BI FFRT_REGNUM) (match_operand:VNx16BI 1 "register_operand")))] "TARGET_SVE && TARGET_NON_STREAMING" - {@ [ cons: =0, 1 ] - [ Upa , Upa ] rdffr\t%0.b, %1/z + {@ [ cons: =0, 1 ; attrs: pred_clobber ] + [ &Upa, Upa ; yes ] rdffr\t%0.b, %1/z + [ ?Upa, 0Upa; yes ] ^ + [ Upa , Upa ; no ] ^ } ) @@ -1179,8 +1181,10 @@ UNSPEC_PTEST)) (clobber (match_scratch:VNx16BI 0))] "TARGET_SVE && TARGET_NON_STREAMING" - {@ [ cons: =0, 1 ] - [ Upa , Upa ] rdffrs\t%0.b, %1/z + {@ [ cons: =0, 1 ; attrs: pred_clobber ] + [ &Upa, Upa ; yes ] rdffrs\t%0.b, %1/z + [ ?Upa, 0Upa; yes ] ^ + [ Upa , Upa ; no ] ^ } ) @@ -1195,8 +1199,10 @@ UNSPEC_PTEST)) (clobber (match_scratch:VNx16BI 0))] "TARGET_SVE && TARGET_NON_STREAMING" - {@ [ cons: =0, 1 ] - [ Upa , Upa ] rdffrs\t%0.b, %1/z + {@ [ cons: =0, 1 ; attrs: pred_clobber ] + [ &Upa, Upa ; yes ] rdffrs\t%0.b, %1/z + [ ?Upa, 0Upa; yes ] ^ + [ Upa , Upa ; no ] ^ } ) @@ -1216,8 +1222,10 @@ (reg:VNx16BI FFRT_REGNUM) (match_dup 1)))] "TARGET_SVE && TARGET_NON_STREAMING" - {@ [ cons: =0, 1 ] - [ Upa , Upa ] rdffrs\t%0.b, %1/z + {@ [ cons: =0, 1 ; attrs: pred_clobber ] + [ &Upa, Upa ; yes ] rdffrs\t%0.b, %1/z + [ ?Upa, 0Upa; yes ] ^ + [ Upa , Upa ; no ] ^ } ) @@ -1233,8 +1241,10 @@ (set (match_operand:VNx16BI 0 "register_operand") (reg:VNx16BI FFRT_REGNUM))] "TARGET_SVE && TARGET_NON_STREAMING" - {@ [ cons: =0, 1 ] - [ Upa , Upa ] rdffrs\t%0.b, %1/z + {@ [ cons: =0, 1 ; attrs: pred_clobber ] + [ &Upa, Upa ; yes ] rdffrs\t%0.b, %1/z + [ ?Upa, 0Upa; yes ] ^ + [ Upa , Upa ; no ] ^ } ) @@ -6651,8 +6661,10 @@ (and:PRED_ALL (match_operand:PRED_ALL 1 "register_operand") (match_operand:PRED_ALL 2 "register_operand")))] "TARGET_SVE" - {@ [ cons: =0, 1 , 2 ] - [ Upa , Upa, Upa ] and\t%0.b, %1/z, %2.b, %2.b + {@ [ cons: =0, 1 , 2 ; attrs: pred_clobber ] + [ &Upa, Upa , Upa ; yes ] and\t%0.b, %1/z, %2.b, %2.b + [ ?Upa, 0Upa, 0Upa; yes ] ^ + [ Upa , Upa , Upa ; no ] ^ } ) @@ -6679,8 
+6691,10 @@ (match_operand:PRED_ALL 3 "register_operand")) (match_operand:PRED_ALL 1 "register_operand")))] "TARGET_SVE" - {@ [ cons: =0, 1 , 2 , 3 ] - [ Upa , Upa, Upa, Upa ] \t%0.b, %1/z, %2.b, %3.b + {@ [ cons: =0, 1 , 2 , 3 ; attrs: pred_clobber ] + [ &Upa, Upa , Upa , Upa ; yes ] \t%0.b, %1/z, %2.b, %3.b + [ ?Upa, 0Upa, 0Upa, 0Upa; yes
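For readers less used to the constraint shorthand in these new alternatives: the '&' in the first alternative marks the output as earlyclobber, so it may not share a register with any input; the '?' in the second alternative mildly disparages it, and the '0' in its input constraints ties that input to operand 0, i.e. re-uses the output register; the final alternative is the original form, kept for the non-tuned case. Listing them in that order lets LRA take the earlyclobber form when enough predicate registers are free and fall back to the tie before it would ever consider a reload, which is the trade-off described in the commit message above.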
[gcc r15-1041] AArch64: enable new predicate tuning for Neoverse cores.
https://gcc.gnu.org/g:3eb9f6eab9802d5ae65ead6b1f2ae6fe0833e06e commit r15-1041-g3eb9f6eab9802d5ae65ead6b1f2ae6fe0833e06e Author: Tamar Christina Date: Wed Jun 5 19:32:16 2024 +0100 AArch64: enable new predicate tuning for Neoverse cores. This enables the new tuning flag for Neoverse V1, Neoverse V2 and Neoverse N2. It is kept off for generic codegen. Note the reason for the +sve even though they are in aarch64-sve.exp is if the testsuite is ran with a forced SVE off option, e.g. -march=armv8-a+nosve then the intrinsics end up being disabled because the -march is preferred over the -mcpu even though the -mcpu comes later. This prevents the tests from failing in such runs. gcc/ChangeLog: * config/aarch64/tuning_models/neoversen2.h (neoversen2_tunings): Add AARCH64_EXTRA_TUNE_AVOID_PRED_RMW. * config/aarch64/tuning_models/neoversev1.h (neoversev1_tunings): Add AARCH64_EXTRA_TUNE_AVOID_PRED_RMW. * config/aarch64/tuning_models/neoversev2.h (neoversev2_tunings): Add AARCH64_EXTRA_TUNE_AVOID_PRED_RMW. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/pred_clobber_1.c: New test. * gcc.target/aarch64/sve/pred_clobber_2.c: New test. * gcc.target/aarch64/sve/pred_clobber_3.c: New test. * gcc.target/aarch64/sve/pred_clobber_4.c: New test. Diff: --- gcc/config/aarch64/tuning_models/neoversen2.h | 3 ++- gcc/config/aarch64/tuning_models/neoversev1.h | 3 ++- gcc/config/aarch64/tuning_models/neoversev2.h | 3 ++- .../gcc.target/aarch64/sve/pred_clobber_1.c| 22 + .../gcc.target/aarch64/sve/pred_clobber_2.c| 22 + .../gcc.target/aarch64/sve/pred_clobber_3.c| 23 ++ .../gcc.target/aarch64/sve/pred_clobber_4.c| 22 + 7 files changed, 95 insertions(+), 3 deletions(-) diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h index 7e799bbe762..be9a48ac3ad 100644 --- a/gcc/config/aarch64/tuning_models/neoversen2.h +++ b/gcc/config/aarch64/tuning_models/neoversen2.h @@ -236,7 +236,8 @@ static const struct tune_params neoversen2_tunings = (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS - | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),/* tune_flags. */ + | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT + | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ &generic_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */ diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h b/gcc/config/aarch64/tuning_models/neoversev1.h index 9363f2ad98a..0fc41ce6a41 100644 --- a/gcc/config/aarch64/tuning_models/neoversev1.h +++ b/gcc/config/aarch64/tuning_models/neoversev1.h @@ -227,7 +227,8 @@ static const struct tune_params neoversev1_tunings = (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT - | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags. */ + | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND + | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ &generic_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS/* stp_policy_model. 
*/ diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h index bc01ed767c9..f76e4ef358f 100644 --- a/gcc/config/aarch64/tuning_models/neoversev2.h +++ b/gcc/config/aarch64/tuning_models/neoversev2.h @@ -236,7 +236,8 @@ static const struct tune_params neoversev2_tunings = (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS - | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),/* tune_flags. */ + | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT + | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ &generic_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c new file mode 100644 index 000..25129e8d6f2 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=neoverse-n2" } */ +/* { dg-final { check-function-bodies "**" "" } } */ + +#pragma GCC target "+sve" + +#include + +extern void use(svbool_t); + +/* +** foo: +** ... +** ptrue p([1-3]).b, all +** cmplo p0.h, p\1/z, z0.h, z[0-9]+.h +** ...
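For reference, a minimal compilable sketch of the kind of code the new tuning flag affects (this is my own illustration, not taken from the commit; consume() is a hypothetical external used only so the predicate result stays live, and it must be built with SVE enabled):

#include <arm_sve.h>

extern void consume (svbool_t);

/* An unsigned compare producing a predicate result; with
   -mcpu=neoverse-n2 and AARCH64_EXTRA_TUNE_AVOID_PRED_RMW the compiler
   is expected to prefer writing the result to a fresh predicate
   register rather than read-modify-writing the governing one.  */
void
demo (svuint16_t a, svuint16_t b)
{
  svbool_t pg = svptrue_b16 ();
  consume (svcmplt_u16 (pg, a, b));
}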
[gcc r15-1071] AArch64: correct constraint on Upl early clobber alternatives
https://gcc.gnu.org/g:afe85f8e22a703280b17c701f3490d89337f674a commit r15-1071-gafe85f8e22a703280b17c701f3490d89337f674a Author: Tamar Christina Date: Thu Jun 6 14:35:48 2024 +0100 AArch64: correct constraint on Upl early clobber alternatives I made an oversight in the previous patch, where I added a ?Upa alternative to the Upl cases. This causes it to create the tie between the larger register file rather than the constrained one. This fixes the affected patterns. gcc/ChangeLog: * config/aarch64/aarch64-sve.md (@aarch64_pred_cmp, *cmp_cc, *cmp_ptest, @aarch64_pred_cmp_wide, *aarch64_pred_cmp_wide_cc, *aarch64_pred_cmp_wide_ptest): Fix Upl tie alternative. * config/aarch64/aarch64-sve2.md (@aarch64_pred_): Fix Upl tie alternative. Diff: --- gcc/config/aarch64/aarch64-sve.md | 64 +++--- gcc/config/aarch64/aarch64-sve2.md | 2 +- 2 files changed, 33 insertions(+), 33 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index d902bce62fd..d69db34016a 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -8134,13 +8134,13 @@ UNSPEC_PRED_Z)) (clobber (reg:CC_NZC CC_REGNUM))] "TARGET_SVE" - {@ [ cons: =0 , 1 , 3 , 4; attrs: pred_clobber ] - [ &Upa , Upl , w , ; yes ] cmp\t%0., %1/z, %3., #%4 - [ ?Upa , 0Upl, w , ; yes ] ^ - [ Upa , Upl , w , ; no ] ^ - [ &Upa , Upl , w , w; yes ] cmp\t%0., %1/z, %3., %4. - [ ?Upa , 0Upl, w , w; yes ] ^ - [ Upa , Upl , w , w; no ] ^ + {@ [ cons: =0 , 1 , 3 , 4; attrs: pred_clobber ] + [ &Upa , Upl, w , ; yes ] cmp\t%0., %1/z, %3., #%4 + [ ?Upl , 0 , w , ; yes ] ^ + [ Upa , Upl, w , ; no ] ^ + [ &Upa , Upl, w , w; yes ] cmp\t%0., %1/z, %3., %4. + [ ?Upl , 0 , w , w; yes ] ^ + [ Upa , Upl, w , w; no ] ^ } ) @@ -8170,13 +8170,13 @@ UNSPEC_PRED_Z))] "TARGET_SVE && aarch64_sve_same_pred_for_ptest_p (&operands[4], &operands[6])" - {@ [ cons: =0 , 1, 2 , 3; attrs: pred_clobber ] - [ &Upa , Upl , w , ; yes ] cmp\t%0., %1/z, %2., #%3 - [ ?Upa , 0Upl, w , ; yes ] ^ - [ Upa , Upl , w , ; no ] ^ - [ &Upa , Upl , w , w; yes ] cmp\t%0., %1/z, %2., %3. - [ ?Upa , 0Upl, w , w; yes ] ^ - [ Upa , Upl , w , w; no ] ^ + {@ [ cons: =0 , 1 , 2 , 3; attrs: pred_clobber ] + [ &Upa , Upl, w , ; yes ] cmp\t%0., %1/z, %2., #%3 + [ ?Upl , 0 , w , ; yes ] ^ + [ Upa , Upl, w , ; no ] ^ + [ &Upa , Upl, w , w; yes ] cmp\t%0., %1/z, %2., %3. + [ ?Upl , 0 , w , w; yes ] ^ + [ Upa , Upl, w , w; no ] ^ } "&& !rtx_equal_p (operands[4], operands[6])" { @@ -8205,12 +8205,12 @@ "TARGET_SVE && aarch64_sve_same_pred_for_ptest_p (&operands[4], &operands[6])" {@ [ cons: =0, 1, 2 , 3; attrs: pred_clobber ] - [ &Upa, Upl , w , ; yes ] cmp\t%0., %1/z, %2., #%3 - [ ?Upa, 0Upl, w , ; yes ] ^ - [ Upa , Upl , w , ; no ] ^ - [ &Upa, Upl , w , w; yes ] cmp\t%0., %1/z, %2., %3. - [ ?Upa, 0Upl, w , w; yes ] ^ - [ Upa , Upl , w , w; no ] ^ + [ &Upa, Upl, w , ; yes ] cmp\t%0., %1/z, %2., #%3 + [ ?Upl, 0 , w , ; yes ] ^ + [ Upa , Upl, w , ; no ] ^ + [ &Upa, Upl, w , w; yes ] cmp\t%0., %1/z, %2., %3. + [ ?Upl, 0 , w , w; yes ] ^ + [ Upa , Upl, w , w; no ] ^ } "&& !rtx_equal_p (operands[4], operands[6])" { @@ -8263,10 +8263,10 @@ UNSPEC_PRED_Z)) (clobber (reg:CC_NZC CC_REGNUM))] "TARGET_SVE" - {@ [ cons: =0, 1, 2, 3, 4; attrs: pred_clobber ] - [ &Upa, Upl , , w, w; yes ] cmp\t%0., %1/z, %3., %4.d - [ ?Upa, 0Upl, , w, w; yes ] ^ - [ Upa , Upl , , w, w; no ] ^ + {@ [ cons: =0, 1 , 2, 3, 4; attrs: pred_clobber ] + [ &Upa
[gcc r15-4324] middle-end: support SLP early break
https://gcc.gnu.org/g:accb85345edb91368221fd07b74e74df427b7de0 commit r15-4324-gaccb85345edb91368221fd07b74e74df427b7de0 Author: Tamar Christina Date: Mon Oct 14 11:58:59 2024 +0100 middle-end: support SLP early break This patch introduces feature parity for early break int the SLP only vectorizer. The approach taken here is to treat the early exits as root statements for an SLP tree. This means that we don't need any changes to build_slp to support gconds. Codegen for the gcond itself now has to be done out of line but the body of the SLP blocks itself is simply driven by SLP scheduling. There is a slight awkwardness in having re-used vectorizable_early_exit for both SLP and non-SLP but I've documented the differences and when I did try to refactor it it wasn't really worth it given that this is a temporary state anyway. This version is restricted to lane = 1, as such we can re-use the existing move_early_break function instead of having to do safety update through scheduling. I have a branch where I'm working on that but lane > 1 is out of scope for GCC 15 anyway. The only reason I will try to get moving through scheduling done as a stretch goal is so we get epilogue vectorization back for early break. The example: unsigned test4(unsigned x) { unsigned ret = 0; for (int i = 0; i < N; i++) { vect_b[i] = x + i; if (vect_a[i]*2 != x) break; vect_a[i] = x; } return ret; } builds the following SLP instance for early break: note: Analyzing vectorizable control flow: if (patt_6 != 0) note: Starting SLP discovery for note: patt_6 = _4 != x_9(D); note: starting SLP discovery for node 0x63abc80 note: Build SLP for patt_6 = _4 != x_9(D); note: precomputed vectype: vector(4) note: nunits = 4 note: vect_is_simple_use: operand x_9(D), type of def: external note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0x _3 * 2, type of def: internal note: starting SLP discovery for node 0x63abdc0 note: Build SLP for _4 = _3 * 2; note: precomputed vectype: vector(4) unsigned int note: nunits = 4 note: vect_is_simple_use: operand # vect_aD.4416[i_15], type of def: internal note: vect_is_simple_use: operand 2, type of def: constant note: starting SLP discovery for node 0x63abe60 note: Build SLP for _3 = vect_a[i_15]; note: precomputed vectype: vector(4) unsigned int note: nunits = 4 note: SLP discovery for node 0x63abe60 succeeded note: SLP discovery for node 0x63abdc0 succeeded note: SLP discovery for node 0x63abc80 succeeded note: SLP size 3 vs. limit 10. note: Final SLP tree for instance 0x6474190: note: node 0x63abc80 (max_nunits=4, refcnt=2) vector(4) note: op template: patt_6 = _4 != x_9(D); note: stmt 0 patt_6 = _4 != x_9(D); note: children 0x63abd20 0x63abdc0 note: node (external) 0x63abd20 (max_nunits=1, refcnt=1) note: { x_9(D) } note: node 0x63abdc0 (max_nunits=4, refcnt=2) vector(4) unsigned int note: op template: _4 = _3 * 2; note: stmt 0 _4 = _3 * 2; note: children 0x63abe60 0x63abf00 note: node 0x63abe60 (max_nunits=4, refcnt=2) vector(4) unsigned int note: op template: _3 = vect_a[i_15]; note: stmt 0 _3 = vect_a[i_15]; note: load permutation { 0 } note: node (constant) 0x63abf00 (max_nunits=1, refcnt=1) note: { 2 } and during codegen: note: -->vectorizing SLP node starting from: patt_6 = _4 != x_9(D); note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0x _3 * 2, type of def: internal note: add new stmt: mask_patt_6.18_58 = _53 != vect__4.17_57; note:=== vectorizable_early_exit === note:transform early-exit. 
note: vectorizing stmts using SLP. note: Vectorizing SLP tree: note: node 0x63abfa0 (max_nunits=4, refcnt=1) vector(4) int note: op template: i_12 = i_15 + 1; note: stmt 0 i_12 = i_15 + 1; note: children 0x63aba00 0x63ac040 note: node 0x63aba00 (max_nunits=4, refcnt=2) vector(4) int note: op template: i_15 = PHI note: [l] stmt 0 i_15 = PHI note: children (nil) (nil) note: node (constant) 0x63ac040 (max_nunits=1, refcnt=1) vector(4) int note: { 1 } gcc/ChangeLog: * tree-vect-loop.cc (vect_analyze_loop_2): Handle SLP trees with no children. * tree-vectorizer.h (enum slp_instance_kind): Add slp_inst_kind_gcond. (LOOP_VINFO_EARLY_BREAKS_LIVE_IVS): New. (vectorizable
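The test4 snippet quoted above is not self-contained; a compilable variant, with the array and N definitions filled in as assumptions (they are not part of the commit), looks like this:

#define N 1024
unsigned vect_a[N];
unsigned vect_b[N];

/* Early-break loop with a single lane per iteration: the gcond
   (vect_a[i] * 2 != x) becomes the root statement of its own SLP
   instance, as described in the commit message.  */
unsigned
test4 (unsigned x)
{
  unsigned ret = 0;
  for (int i = 0; i < N; i++)
    {
      vect_b[i] = x + i;
      if (vect_a[i] * 2 != x)
        break;
      vect_a[i] = x;
    }
  return ret;
}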
[gcc r15-4353] AArch64: re-enable memory access costing after SLP change.
https://gcc.gnu.org/g:a1540bb843fd1a3e87f50d3f713386eaae454d1c commit r15-4353-ga1540bb843fd1a3e87f50d3f713386eaae454d1c Author: Tamar Christina Date: Tue Oct 15 11:22:26 2024 +0100 AArch64: re-enable memory access costing after SLP change. While chasing down a costing difference between SLP and non-SLP for memory access costing I noticed that at some point the SLP and non-SLP costing have diverged. It used to be we only supported LOAD_LANES in SLP and so the non-SLP costing was working fine. But with the change to SLP only we now lost costing. It looks like the vectorizer for non-SLP stores the VMAT type in STMT_VINFO_MEMORY_ACCESS_TYPE on the stmt_info, but for SLP it stores it in SLP_TREE_MEMORY_ACCESS_TYPE which is on the SLP node itself. While my first attempt of a patch was to just also store the VMAT in the stmt_info https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665295.html Richi pointed out that this goes wrong when the same access is used Hybrid. And so we have to do a backend specific fix. To help out other backends this also introduces a generic helper function suggested by Richi in that patch (I hope that's ok.. I didn't want to split out just the helper.) This successfully restores VMAT based costing in the new SLP only world. gcc/ChangeLog: * tree-vectorizer.h (vect_mem_access_type): New. * config/aarch64/aarch64.cc (aarch64_ld234_st234_vectors): Use it. (aarch64_detect_vector_stmt_subtype): Likewise. (aarch64_adjust_stmt_cost): Likewise. (aarch64_vector_costs::count_ops): Likewise. (aarch64_vector_costs::add_stmt_cost): Make SLP node named. Diff: --- gcc/config/aarch64/aarch64.cc | 54 +++ gcc/tree-vectorizer.h | 12 ++ 2 files changed, 41 insertions(+), 25 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 102680a0efca..5770491b30ce 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -16278,7 +16278,7 @@ public: private: void record_potential_advsimd_unrolling (loop_vec_info); void analyze_loop_vinfo (loop_vec_info); - void count_ops (unsigned int, vect_cost_for_stmt, stmt_vec_info, + void count_ops (unsigned int, vect_cost_for_stmt, stmt_vec_info, slp_tree, aarch64_vec_op_count *); fractional_cost adjust_body_cost_sve (const aarch64_vec_op_count *, fractional_cost, unsigned int, @@ -16595,11 +16595,13 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost, } } -/* Return true if an access of kind KIND for STMT_INFO represents one - vector of an LD[234] or ST[234] operation. Return the total number of - vectors (2, 3 or 4) if so, otherwise return a value outside that range. */ +/* Return true if an access of kind KIND for STMT_INFO (or NODE if SLP) + represents one vector of an LD[234] or ST[234] operation. Return the total + number of vectors (2, 3 or 4) if so, otherwise return a value outside that + range. 
*/ static int -aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info) +aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info, +slp_tree node) { if ((kind == vector_load || kind == unaligned_load @@ -16609,7 +16611,7 @@ aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info) { stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info); if (stmt_info - && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_LOAD_STORE_LANES) + && vect_mem_access_type (stmt_info, node) == VMAT_LOAD_STORE_LANES) return DR_GROUP_SIZE (stmt_info); } return 0; @@ -16847,14 +16849,15 @@ aarch64_detect_scalar_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind, } /* STMT_COST is the cost calculated by aarch64_builtin_vectorization_cost - for the vectorized form of STMT_INFO, which has cost kind KIND and which - when vectorized would operate on vector type VECTYPE. Try to subdivide - the target-independent categorization provided by KIND to get a more - accurate cost. WHERE specifies where the cost associated with KIND - occurs. */ + for the vectorized form of STMT_INFO possibly using SLP node NODE, which has + cost kind KIND and which when vectorized would operate on vector type + VECTYPE. Try to subdivide the target-independent categorization provided by + KIND to get a more accurate cost. WHERE specifies where the cost associated + with KIND occurs. */ static fractional_cost aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind, - stmt_vec_info stmt_info, tree vectype, + stmt_vec_info stmt_info, slp_tree
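As a hedged illustration (my own example, not from the patch) of the access kind whose costing this restores: an interleaved access with group size 3, which AArch64 typically vectorizes with LD3/ST3, i.e. VMAT_LOAD_STORE_LANES.

/* Group of three interleaved loads and stores per iteration; on AArch64
   this commonly becomes LD3/ST3, the case aarch64_ld234_st234_vectors
   is trying to detect when costing.  */
void
scale_rgb (unsigned char *rgb, int n, int f)
{
  for (int i = 0; i < n; i++)
    {
      rgb[3 * i + 0] = rgb[3 * i + 0] * f >> 8;
      rgb[3 * i + 1] = rgb[3 * i + 1] * f >> 8;
      rgb[3 * i + 2] = rgb[3 * i + 2] * f >> 8;
    }
}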
[gcc r15-4460] AArch64: support encoding integer immediates using floating point moves
https://gcc.gnu.org/g:87dc6b1992e7ee02e7a4a81c568754198c0f61f5 commit r15-4460-g87dc6b1992e7ee02e7a4a81c568754198c0f61f5 Author: Tamar Christina Date: Fri Oct 18 09:43:45 2024 +0100 AArch64: support encoding integer immediates using floating point moves This patch extends our immediate SIMD generation cases to support generating integer immediates using floating point operation if the integer immediate maps to an exact FP value. As an example: uint32x4_t f1() { return vdupq_n_u32(0x3f80); } currently generates: f1: adrpx0, .LC0 ldr q0, [x0, #:lo12:.LC0] ret i.e. a load, but with this change: f1: fmovv0.4s, 1.0e+0 ret Such immediates are common in e.g. our Math routines in glibc because they are created to extract or mark part of an FP immediate as masks. gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_sve_valid_immediate, aarch64_simd_valid_immediate): Refactor accepting modes and values. (aarch64_float_const_representable_p): Refactor and extract FP checks into ... (aarch64_real_float_const_representable_p): ...This and fix fail fallback from real_to_integer. (aarch64_advsimd_valid_immediate): Use it. gcc/testsuite/ChangeLog: * gcc.target/aarch64/const_create_using_fmov.c: New test. Diff: --- gcc/config/aarch64/aarch64.cc | 282 +++-- .../gcc.target/aarch64/const_create_using_fmov.c | 87 +++ 2 files changed, 241 insertions(+), 128 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 5770491b30ce..e65b24e2ad6a 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -22899,19 +22899,19 @@ aarch64_advsimd_valid_immediate_hs (unsigned int val32, return false; } -/* Return true if replicating VAL64 is a valid immediate for the +/* Return true if replicating VAL64 with mode MODE is a valid immediate for the Advanced SIMD operation described by WHICH. If INFO is nonnull, use it to describe valid immediates. */ static bool aarch64_advsimd_valid_immediate (unsigned HOST_WIDE_INT val64, +scalar_int_mode mode, simd_immediate_info *info, enum simd_immediate_check which) { unsigned int val32 = val64 & 0x; - unsigned int val16 = val64 & 0x; unsigned int val8 = val64 & 0xff; - if (val32 == (val64 >> 32)) + if (mode != DImode) { if ((which & AARCH64_CHECK_ORR) != 0 && aarch64_advsimd_valid_immediate_hs (val32, info, which, @@ -22924,9 +22924,7 @@ aarch64_advsimd_valid_immediate (unsigned HOST_WIDE_INT val64, return true; /* Try using a replicated byte. */ - if (which == AARCH64_CHECK_MOV - && val16 == (val32 >> 16) - && val8 == (val16 >> 8)) + if (which == AARCH64_CHECK_MOV && mode == QImode) { if (info) *info = simd_immediate_info (QImode, val8); @@ -22954,28 +22952,15 @@ aarch64_advsimd_valid_immediate (unsigned HOST_WIDE_INT val64, return false; } -/* Return true if replicating VAL64 gives a valid immediate for an SVE MOV - instruction. If INFO is nonnull, use it to describe valid immediates. */ +/* Return true if replicating IVAL with MODE gives a valid immediate for an SVE + MOV instruction. If INFO is nonnull, use it to describe valid + immediates. 
*/ static bool -aarch64_sve_valid_immediate (unsigned HOST_WIDE_INT val64, +aarch64_sve_valid_immediate (unsigned HOST_WIDE_INT ival, scalar_int_mode mode, simd_immediate_info *info) { - scalar_int_mode mode = DImode; - unsigned int val32 = val64 & 0x; - if (val32 == (val64 >> 32)) -{ - mode = SImode; - unsigned int val16 = val32 & 0x; - if (val16 == (val32 >> 16)) - { - mode = HImode; - unsigned int val8 = val16 & 0xff; - if (val8 == (val16 >> 8)) - mode = QImode; - } -} - HOST_WIDE_INT val = trunc_int_for_mode (val64, mode); + HOST_WIDE_INT val = trunc_int_for_mode (ival, mode); if (IN_RANGE (val, -0x80, 0x7f)) { /* DUP with no shift. */ @@ -22990,7 +22975,7 @@ aarch64_sve_valid_immediate (unsigned HOST_WIDE_INT val64, *info = simd_immediate_info (mode, val); return true; } - if (aarch64_bitmask_imm (val64, mode)) + if (aarch64_bitmask_imm (ival, mode)) { /* DUPM. */ if (info) @@ -23071,6 +23056,91 @@ aarch64_sve_pred_valid_immediate (rtx x, simd_immediate_info *info) return false; } +/* We can only represent floating point constants which will fit in + "quarter-precision" values. These values are characterised by +
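The constant in the f1 example above appears truncated by the digest formatting; assuming the intended value is 0x3f800000, the transformation is sound because that bit pattern is exactly the IEEE-754 binary32 encoding of 1.0, which this small standalone check (my own, not part of the commit) confirms:

#include <stdint.h>
#include <string.h>

int
main (void)
{
  /* 0x3f800000 is the binary32 encoding of 1.0f, so duplicating the
     float 1.0 across the .4s lanes (fmov v0.4s, 1.0e+0) reproduces the
     integer constant from vdupq_n_u32 bit-for-bit.  */
  float one = 1.0f;
  uint32_t bits;
  memcpy (&bits, &one, sizeof bits);
  return bits == 0x3f800000u ? 0 : 1;
}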
[gcc r15-4461] AArch64: use movi d0, #0 to clear SVE registers instead of mov z0.d, #0
https://gcc.gnu.org/g:453d3d90c374d3bb329f1431b7dfb8d0510a88b9 commit r15-4461-g453d3d90c374d3bb329f1431b7dfb8d0510a88b9 Author: Tamar Christina Date: Fri Oct 18 09:44:15 2024 +0100 AArch64: use movi d0, #0 to clear SVE registers instead of mov z0.d, #0 This patch changes SVE to use Adv. SIMD movi 0 to clear SVE registers when not in SVE streaming mode, as the Neoverse Software Optimization Guides indicate that SVE mov #0 is not a zero-cost move. When in streaming mode we continue to use SVE's mov to clear the registers. Tests have already been updated. gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_output_sve_mov_immediate): Use fmov for SVE zeros. Diff: --- gcc/config/aarch64/aarch64.cc | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index e65b24e2ad6a..3ab550acc7cd 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -25516,8 +25516,11 @@ aarch64_output_sve_mov_immediate (rtx const_vector) } } - snprintf (templ, sizeof (templ), "mov\t%%0.%c, #" HOST_WIDE_INT_PRINT_DEC, - element_char, INTVAL (info.u.mov.value)); + if (info.u.mov.value == const0_rtx && TARGET_NON_STREAMING) +snprintf (templ, sizeof (templ), "movi\t%%d0, #0"); + else +snprintf (templ, sizeof (templ), "mov\t%%0.%c, #" HOST_WIDE_INT_PRINT_DEC, + element_char, INTVAL (info.u.mov.value)); return templ; }
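As a hedged illustration of the user-visible effect (my own example, not from the commit), consider clearing an SVE vector through the ACLE, compiled with SVE enabled (e.g. -march=armv8.2-a+sve):

#include <arm_sve.h>

/* Returns an all-zero SVE vector; with this change, non-streaming code
   is expected to materialize the zero with the Adv. SIMD form
   "movi d0, #0" rather than the SVE "mov z0.s, #0".  */
svint32_t
zero_vec (void)
{
  return svdup_n_s32 (0);
}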
[gcc r15-4463] middle-end: Fix GSI for gcond root [PR117140]
https://gcc.gnu.org/g:51291ad0f1f89a81de917110af96e019dcd5690c commit r15-4463-g51291ad0f1f89a81de917110af96e019dcd5690c Author: Tamar Christina Date: Fri Oct 18 10:37:28 2024 +0100 middle-end: Fix GSI for gcond root [PR117140] When finding the gsi to use for code of the root statements we should use the one of the original statement rather than the gcond which may be inside a pattern. Without this the emitted instructions may be discarded later. gcc/ChangeLog: PR tree-optimization/117140 * tree-vect-slp.cc (vectorize_slp_instance_root_stmt): Use gsi from original statement. gcc/testsuite/ChangeLog: PR tree-optimization/117140 * gcc.dg/vect/vect-early-break_129-pr117140.c: New test. Diff: --- .../gcc.dg/vect/vect-early-break_129-pr117140.c| 94 ++ gcc/tree-vect-slp.cc | 2 +- 2 files changed, 95 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_129-pr117140.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_129-pr117140.c new file mode 100644 index ..eec7f8db40c7 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_129-pr117140.c @@ -0,0 +1,94 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +typedef signed char int8_t; +typedef short int int16_t; +typedef int int32_t; +typedef long long int int64_t; +typedef unsigned char uint8_t; +typedef short unsigned int uint16_t; +typedef unsigned int uint32_t; +typedef long long unsigned int uint64_t; + +void __attribute__ ((noinline, noclone)) +test_1_TYPE1_uint32_t (uint16_t *__restrict f, uint32_t *__restrict d, + uint16_t x, uint16_t x2, uint32_t y, int n) +{ +for (int i = 0; i < n; ++i) +{ +f[i * 2 + 0] = x; +f[i * 2 + 1] = x2; +d[i] = y; +} +} + +void __attribute__ ((noinline, noclone)) +test_1_TYPE1_int64_t (int32_t *__restrict f, int64_t *__restrict d, int32_t x, + int32_t x2, int64_t y, int n) +{ +for (int i = 0; i < n; ++i) +{ +f[i * 2 + 0] = x; +f[i * 2 + 1] = x2; +d[i] = y; +} +} + +int +main (void) +{ +// This part is necessary for ice to appear though running it by itself does not trigger an ICE +int n_3_TYPE1_uint32_t = 32; +uint16_t x_3_uint16_t = 233; +uint16_t x2_3_uint16_t = 78; +uint32_t y_3_uint32_t = 1234; +uint16_t f_3_uint16_t[33 * 2 + 1] = { 0} ; +uint32_t d_3_uint32_t[33] = { 0} ; +test_1_TYPE1_uint32_t (f_3_uint16_t, d_3_uint32_t, x_3_uint16_t, x2_3_uint16_t, y_3_uint32_t, n_3_TYPE1_uint32_t); +for (int i = 0; +i < n_3_TYPE1_uint32_t; +++i) { +if (f_3_uint16_t[i * 2 + 0] != x_3_uint16_t) __builtin_abort (); +if (f_3_uint16_t[i * 2 + 1] != x2_3_uint16_t) __builtin_abort (); +if (d_3_uint32_t[i] != y_3_uint32_t) __builtin_abort (); +} +for (int i = n_3_TYPE1_uint32_t; +i < n_3_TYPE1_uint32_t + 1; +++i) { +if (f_3_uint16_t[i * 2 + 0] != 0) __builtin_abort (); +if (f_3_uint16_t[i * 2 + 1] != 0) __builtin_abort (); +if (d_3_uint32_t[i] != 0) __builtin_abort (); +} +// If ran without the above section, a different ice appears. 
see below +int n_3_TYPE1_int64_t = 32; +int32_t x_3_int32_t = 233; +int32_t x2_3_int32_t = 78; +int64_t y_3_int64_t = 1234; +int32_t f_3_int32_t[33 * 2 + 1] = { 0 }; +int64_t d_3_int64_t[33] = { 0 }; +test_1_TYPE1_int64_t (f_3_int32_t, d_3_int64_t, x_3_int32_t, x2_3_int32_t, + y_3_int64_t, n_3_TYPE1_int64_t); +for (int i = 0; i < n_3_TYPE1_int64_t; ++i) +{ +if (f_3_int32_t[i * 2 + 0] != x_3_int32_t) +__builtin_abort (); +if (f_3_int32_t[i * 2 + 1] != x2_3_int32_t) +__builtin_abort (); +if (d_3_int64_t[i] != y_3_int64_t) +__builtin_abort (); +} + +for (int i = n_3_TYPE1_int64_t; i < n_3_TYPE1_int64_t + 1; ++i) +{ +if (f_3_int32_t[i * 2 + 0] != 0) +__builtin_abort (); +if (f_3_int32_t[i * 2 + 1] != 0) +__builtin_abort (); +if (d_3_int64_t[i] != 0) +__builtin_abort (); +} + +return 0; +} diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index d35c2ea02dce..9276662fa0f1 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -11167,7 +11167,7 @@ vectorize_slp_instance_root_stmt (vec_info *vinfo, slp_tree node, slp_instance i can't support lane
[gcc r15-4459] AArch64: update testsuite to account for new zero moves
https://gcc.gnu.org/g:fc3507927768c3df425a0b5c0e4051eb8bb1ccf0 commit r15-4459-gfc3507927768c3df425a0b5c0e4051eb8bb1ccf0 Author: Tamar Christina Date: Fri Oct 18 09:42:46 2024 +0100 AArch64: update testsuite to account for new zero moves The patch series will adjust how zeros are created. In principal it doesn't matter the exact lane size a zero gets created on but this makes the tests a bit fragile. This preparation patch will update the testsuite to accept multiple variants of ways to create vector zeros to accept both the current syntax and the one being transitioned to in the series. gcc/testsuite/ChangeLog: * gcc.target/aarch64/ldp_stp_18.c: Update zero regexpr. * gcc.target/aarch64/memset-corner-cases.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_bf16.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_f16.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_f32.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_f64.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_s16.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_s32.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_s64.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_s8.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_u16.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_u32.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_u64.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_u8.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acge_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acge_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acge_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acgt_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acgt_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acgt_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acle_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acle_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acle_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/aclt_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/aclt_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/aclt_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/bic_s8.c: Likewise. * gcc.target/aarch64/sve/acle/asm/bic_u8.c: Likewise. * gcc.target/aarch64/sve/acle/asm/cmpuo_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/cmpuo_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/cmpuo_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_s16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_s32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_s64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_s8.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_u16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_u32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_u64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_u8.c: Likewise. * gcc.target/aarch64/sve/const_fold_div_1.c: Likewise. * gcc.target/aarch64/sve/const_fold_mul_1.c: Likewise. * gcc.target/aarch64/sve/dup_imm_1.c: Likewise. * gcc.target/aarch64/sve/fdup_1.c: Likewise. * gcc.target/aarch64/sve/fold_div_zero.c: Likewise. * gcc.target/aarch64/sve/fold_mul_zero.c: Likewise. * gcc.target/aarch64/sve/pcs/args_2.c: Likewise. * gcc.target/aarch64/sve/pcs/args_3.c: Likewise. * gcc.target/aarch64/sve/pcs/args_4.c: Likewise. * gcc.target/aarch64/vect-fmovd-zero.c: Likewise. 
Diff: --- gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c | 2 +- .../gcc.target/aarch64/memset-corner-cases.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_bf16.c| 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_f16.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_f32.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_f64.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_s16.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_s32.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_s64.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_s8.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_u16.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_u32.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_u64.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_u8.c | 2 +- .../gc
[gcc r15-4462] middle-end: Fix VEC_PERM_EXPR lowering since relaxation of vector sizes
https://gcc.gnu.org/g:55f898008ec8235897cf56c89f5599c3ec1bc963 commit r15-4462-g55f898008ec8235897cf56c89f5599c3ec1bc963 Author: Tamar Christina Date: Fri Oct 18 10:36:19 2024 +0100 middle-end: Fix VEC_PERM_EXPR lowering since relaxation of vector sizes In GCC 14 VEC_PERM_EXPR was relaxed to be able to permute to a 2x larger vector than the size of the input vectors. However various passes and transformations were not updated to account for this. I have patches in these area that I will be upstreaming with individual patches that expose them. This one is that vectlower tries to lower based on the size of the input vectors rather than the size of the output. As a consequence it creates an invalid vector of half the size. Luckily we ICE because the resulting nunits doesn't match the vector size. gcc/ChangeLog: * tree-vect-generic.cc (lower_vec_perm): Use output vector size instead of input vector when determining output nunits. gcc/testsuite/ChangeLog: * gcc.dg/vec-perm-lower.c: New test. Diff: --- gcc/testsuite/gcc.dg/vec-perm-lower.c | 16 gcc/tree-vect-generic.cc | 7 --- 2 files changed, 20 insertions(+), 3 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vec-perm-lower.c b/gcc/testsuite/gcc.dg/vec-perm-lower.c new file mode 100644 index ..da738fbeed80 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vec-perm-lower.c @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-options "-fgimple -O2" } */ + +typedef char v8qi __attribute__ ((vector_size (8))); +typedef char v16qi __attribute__ ((vector_size (16))); + +v16qi __GIMPLE (ssa) +foo (v8qi a, v8qi b) +{ + v16qi _5; + + __BB(2): + _5 = __VEC_PERM (a, b, _Literal (unsigned char [[gnu::vector_size(16)]]) { _Literal (unsigned char) 0, _Literal (unsigned char) 16, _Literal (unsigned char) 1, _Literal (unsigned char) 17, _Literal (unsigned char) 2, _Literal (unsigned char) 18, _Literal (unsigned char) 3, _Literal (unsigned char) 19, _Literal (unsigned char) 4, _Literal (unsigned char) 20, _Literal (unsigned char) 5, _Literal (unsigned char) 21, _Literal (unsigned char) 6, _Literal (unsigned char) 22, _Literal (unsigned char) 7, _Literal (unsigned char) 23 }); + return _5; + +} diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc index 3041fb8fcf23..f86f7eabb255 100644 --- a/gcc/tree-vect-generic.cc +++ b/gcc/tree-vect-generic.cc @@ -1500,6 +1500,7 @@ lower_vec_perm (gimple_stmt_iterator *gsi) tree mask = gimple_assign_rhs3 (stmt); tree vec0 = gimple_assign_rhs1 (stmt); tree vec1 = gimple_assign_rhs2 (stmt); + tree res_vect_type = TREE_TYPE (gimple_assign_lhs (stmt)); tree vect_type = TREE_TYPE (vec0); tree mask_type = TREE_TYPE (mask); tree vect_elt_type = TREE_TYPE (vect_type); @@ -1512,7 +1513,7 @@ lower_vec_perm (gimple_stmt_iterator *gsi) location_t loc = gimple_location (gsi_stmt (*gsi)); unsigned i; - if (!TYPE_VECTOR_SUBPARTS (vect_type).is_constant (&elements)) + if (!TYPE_VECTOR_SUBPARTS (res_vect_type).is_constant (&elements)) return; if (TREE_CODE (mask) == SSA_NAME) @@ -1672,9 +1673,9 @@ lower_vec_perm (gimple_stmt_iterator *gsi) } if (constant_p) -constr = build_vector_from_ctor (vect_type, v); +constr = build_vector_from_ctor (res_vect_type, v); else -constr = build_constructor (vect_type, v); +constr = build_constructor (res_vect_type, v); gimple_assign_set_rhs_from_tree (gsi, constr); update_stmt (gsi_stmt (*gsi)); }
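The new testcase uses the GIMPLE front end; a plain-C analogue of the same shape (my own sketch, relying on __builtin_shufflevector, available in GCC 12 and later) is a permute whose result vector is twice as wide as either input:

typedef char v8qi __attribute__ ((vector_size (8)));
typedef char v16qi __attribute__ ((vector_size (16)));

/* Two 8-byte inputs, one 16-byte result: indices 0-7 select from a,
   indices 8-15 select from b, interleaving the two vectors.  */
v16qi
interleave (v8qi a, v8qi b)
{
  return __builtin_shufflevector (a, b, 0, 8, 1, 9, 2, 10, 3, 11,
                                  4, 12, 5, 13, 6, 14, 7, 15);
}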
[gcc r15-4326] AArch64: rename the SVE2 psel intrinsics to psel_lane [PR116371]
https://gcc.gnu.org/g:306834b7f74ab61160f205e04f5bf35b71f9ec52 commit r15-4326-g306834b7f74ab61160f205e04f5bf35b71f9ec52 Author: Tamar Christina Date: Mon Oct 14 13:58:09 2024 +0100 AArch64: rename the SVE2 psel intrinsics to psel_lane [PR116371] The psel intrinsics. similar to the pext, should be name psel_lane. This corrects the naming. gcc/ChangeLog: PR target/116371 * config/aarch64/aarch64-sve-builtins-sve2.cc (class svpsel_impl): Renamed to ... (class svpsel_lane_impl): ... This and adjust initialization. * config/aarch64/aarch64-sve-builtins-sve2.def (svpsel): Renamed to ... (svpsel_lane): ... This. * config/aarch64/aarch64-sve-builtins-sve2.h (svpsel): Renamed to svpsel_lane. gcc/testsuite/ChangeLog: PR target/116371 * gcc.target/aarch64/sme2/acle-asm/psel_b16.c, gcc.target/aarch64/sme2/acle-asm/psel_b32.c, gcc.target/aarch64/sme2/acle-asm/psel_b64.c, gcc.target/aarch64/sme2/acle-asm/psel_b8.c, gcc.target/aarch64/sme2/acle-asm/psel_c16.c, gcc.target/aarch64/sme2/acle-asm/psel_c32.c, gcc.target/aarch64/sme2/acle-asm/psel_c64.c, gcc.target/aarch64/sme2/acle-asm/psel_c8.c: Renamed to * gcc.target/aarch64/sme2/acle-asm/psel_lane_b16.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b32.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b64.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b8.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c16.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c32.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c64.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c8.c: ... These. Diff: --- gcc/config/aarch64/aarch64-sve-builtins-sve2.cc| 4 +- gcc/config/aarch64/aarch64-sve-builtins-sve2.def | 2 +- gcc/config/aarch64/aarch64-sve-builtins-sve2.h | 2 +- .../gcc.target/aarch64/sme2/acle-asm/psel_b16.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_b32.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_b64.c| 80 --- .../gcc.target/aarch64/sme2/acle-asm/psel_b8.c | 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c16.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c32.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c64.c| 80 --- .../gcc.target/aarch64/sme2/acle-asm/psel_c8.c | 89 -- .../aarch64/sme2/acle-asm/psel_lane_b16.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_b32.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_b64.c | 80 +++ .../aarch64/sme2/acle-asm/psel_lane_b8.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c16.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c32.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c64.c | 80 +++ .../aarch64/sme2/acle-asm/psel_lane_c8.c | 89 ++ 19 files changed, 698 insertions(+), 698 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc index 146a5459930f..6a20a613f832 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc @@ -234,7 +234,7 @@ public: } }; -class svpsel_impl : public function_base +class svpsel_lane_impl : public function_base { public: rtx @@ -625,7 +625,7 @@ FUNCTION (svpmullb, unspec_based_function, (-1, UNSPEC_PMULLB, -1)) FUNCTION (svpmullb_pair, unspec_based_function, (-1, UNSPEC_PMULLB_PAIR, -1)) FUNCTION (svpmullt, unspec_based_function, (-1, UNSPEC_PMULLT, -1)) FUNCTION (svpmullt_pair, unspec_based_function, (-1, UNSPEC_PMULLT_PAIR, -1)) -FUNCTION (svpsel, svpsel_impl,) +FUNCTION (svpsel_lane, svpsel_lane_impl,) FUNCTION (svqabs, rtx_code_function, (SS_ABS, UNKNOWN, UNKNOWN)) FUNCTION (svqcadd, svqcadd_impl,) FUNCTION (svqcvt, integer_conversion, (UNSPEC_SQCVT, UNSPEC_SQCVTU, diff --git 
a/gcc/config/aarch64/aarch64-sve-builtins-sve2.def b/gcc/config/aarch64/aarch64-sve-builtins-sve2.def index 4543402f836f..318dfff06f0d 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.def +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.def @@ -235,7 +235,7 @@ DEF_SVE_FUNCTION (svsm4ekey, binary, s_unsigned, none) | AARCH64_FL_SME \ | AARCH64_FL_SM_ON) DEF_SVE_FUNCTION (svclamp, clamp, all_integer, none) -DEF_SVE_FUNCTION (svpsel, select_pred, all_pred_count, none) +DEF_SVE_FUNCTION (svpsel_lane, select_pred, all_pred_count, none) DEF_SVE_FUNCTION (svre
[gcc r15-4327] simplify-rtx: Fix incorrect folding of shift and AND [PR117012]
https://gcc.gnu.org/g:be966baa353dfcc20b76b5a5586ab2494bb0a735 commit r15-4327-gbe966baa353dfcc20b76b5a5586ab2494bb0a735 Author: Tamar Christina Date: Mon Oct 14 14:00:25 2024 +0100 simplify-rtx: Fix incorrect folding of shift and AND [PR117012] The optimization added in r15-1047-g7876cde25cbd2f is using the wrong operation to check for uniform constant vectors. The author intended to check that all the lanes in the vector are the same and so used CONST_VECTOR_DUPLICATE_P. However this only checks that the vector is created from a pattern duplication, but doesn't say how many pattern alternatives make up the duplication. Normally one would need to check this separately or use const_vec_duplicate_p. Without this the optimization incorrectly triggers. gcc/ChangeLog: PR rtl-optimization/117012 * simplify-rtx.cc (simplify_context::simplify_binary_operation_1): Use const_vec_duplicate_p instead of CONST_VECTOR_DUPLICATE_P. gcc/testsuite/ChangeLog: PR rtl-optimization/117012 * gcc.target/aarch64/pr117012.c: New test. Diff: --- gcc/simplify-rtx.cc | 4 ++-- gcc/testsuite/gcc.target/aarch64/pr117012.c | 16 2 files changed, 18 insertions(+), 2 deletions(-) diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index dc0d192dd218..4d024ec523b1 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -4088,10 +4088,10 @@ simplify_context::simplify_binary_operation_1 (rtx_code code, if (VECTOR_MODE_P (mode) && GET_CODE (op0) == ASHIFTRT && (CONST_INT_P (XEXP (op0, 1)) || (GET_CODE (XEXP (op0, 1)) == CONST_VECTOR - && CONST_VECTOR_DUPLICATE_P (XEXP (op0, 1)) + && const_vec_duplicate_p (XEXP (op0, 1)) && CONST_INT_P (XVECEXP (XEXP (op0, 1), 0, 0 && GET_CODE (op1) == CONST_VECTOR - && CONST_VECTOR_DUPLICATE_P (op1) + && const_vec_duplicate_p (op1) && CONST_INT_P (XVECEXP (op1, 0, 0))) { unsigned HOST_WIDE_INT shift_count diff --git a/gcc/testsuite/gcc.target/aarch64/pr117012.c b/gcc/testsuite/gcc.target/aarch64/pr117012.c new file mode 100644 index ..537c0fa566c6 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr117012.c @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +#define vector16 __attribute__((vector_size(16))) + +vector16 unsigned char +g (vector16 unsigned char a) +{ + vector16 signed char b = (vector16 signed char)a; + b = b >> 7; + vector16 unsigned char c = (vector16 unsigned char)b; + vector16 unsigned char d = { 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0 }; + return c & d; +} + +/* { dg-final { scan-assembler-times {and\tv[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b} 1 } } */
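A hedged, self-contained runtime variant of the testcase above (my own, not part of the commit): the mask d is a pattern duplicate but not a uniform vector, so with the bogus fold the result could change in the lanes where d is 0.

#include <stdlib.h>

#define vector16 __attribute__((vector_size(16)))

/* Same shape as pr117012.c: arithmetic shift producing a sign mask,
   then an AND with a non-uniform (but pattern-duplicated) constant.  */
__attribute__((noipa)) vector16 unsigned char
g (vector16 unsigned char a)
{
  vector16 signed char b = (vector16 signed char)a;
  b = b >> 7;
  vector16 unsigned char c = (vector16 unsigned char)b;
  vector16 unsigned char d = { 1, 1, 0, 0, 0, 0, 0, 0,
                               1, 1, 0, 0, 0, 0, 0, 0 };
  return c & d;
}

int
main (void)
{
  vector16 unsigned char a;
  for (int i = 0; i < 16; i++)
    a[i] = 0x80;                    /* every lane has the sign bit set */
  vector16 unsigned char r = g (a);
  for (int i = 0; i < 16; i++)
    if (r[i] != ((i % 8) < 2))      /* expect the 1,1,0,...,0 pattern */
      abort ();
  return 0;
}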
[gcc r15-4328] middle-end: copy STMT_VINFO_STRIDED_P when DR is replaced [PR116956]
https://gcc.gnu.org/g:ec3d3ea60a55f25a743a037adda7d10d03ca73b2 commit r15-4328-gec3d3ea60a55f25a743a037adda7d10d03ca73b2 Author: Tamar Christina Date: Mon Oct 14 14:01:24 2024 +0100 middle-end: copy STMT_VINFO_STRIDED_P when DR is replaced [PR116956] When move_dr copies a DR from one statement to another, it seems we've forgotten to copy the STMT_VINFO_STRIDED_P flag. This leaves the new DR in a broken state where it has a non-constant stride but isn't marked as strided. This causes the ICE in the PR, because dataref analysis fails during epilogue vectorization: there is an assumption in place that while costing may fail for epilogue vectorization, DR analysis cannot fail if it succeeded for the main loop. gcc/ChangeLog: PR tree-optimization/116956 * tree-vectorizer.cc (vec_info::move_dr): Copy STMT_VINFO_STRIDED_P. gcc/testsuite/ChangeLog: PR tree-optimization/116956 * gfortran.dg/vect/pr116956.f90: New test. Diff: --- gcc/testsuite/gfortran.dg/vect/pr116956.f90 | 11 +++ gcc/tree-vectorizer.cc | 2 ++ 2 files changed, 13 insertions(+) diff --git a/gcc/testsuite/gfortran.dg/vect/pr116956.f90 b/gcc/testsuite/gfortran.dg/vect/pr116956.f90 new file mode 100644 index ..3ce4d1ab7927 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/vect/pr116956.f90 @@ -0,0 +1,11 @@ +! { dg-do compile } +! { dg-require-effective-target vect_int } +! { dg-additional-options "-mcpu=neoverse-v2 -Ofast" { target aarch64*-*-* } } + +SUBROUTINE nesting_offl_init(u, v, mask) + IMPLICIT NONE + real :: u(:) + real :: v(:) + integer :: mask(:) + u = MERGE( u, v, BTEST (mask, 1) ) +END SUBROUTINE nesting_offl_init diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc index fed12c41f9cb..0c471c5580d3 100644 --- a/gcc/tree-vectorizer.cc +++ b/gcc/tree-vectorizer.cc @@ -610,6 +610,8 @@ vec_info::move_dr (stmt_vec_info new_stmt_info, stmt_vec_info old_stmt_info) = STMT_VINFO_DR_WRT_VEC_LOOP (old_stmt_info); STMT_VINFO_GATHER_SCATTER_P (new_stmt_info) = STMT_VINFO_GATHER_SCATTER_P (old_stmt_info); + STMT_VINFO_STRIDED_P (new_stmt_info) += STMT_VINFO_STRIDED_P (old_stmt_info); } /* Permanently remove the statement described by STMT_INFO from the
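A rough C analogue of the Fortran testcase (my own sketch, not from the PR): the assumed-shape arrays give accesses whose stride is only known at run time, here modelled with an explicit stride parameter, combined with a select that the vectorizer recognizes via a pattern statement.

/* Runtime-stride accesses plus a bit-test driven select, mirroring
   u = MERGE(u, v, BTEST(mask, 1)); the stride is not a compile-time
   constant, so the DRs must stay marked strided when moved.  */
void
merge_strided (float *u, float *v, int *mask, long n, long stride)
{
  for (long i = 0; i < n; i++)
    u[i * stride] = (mask[i * stride] & 2) ? u[i * stride] : v[i * stride];
}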
[gcc r14-10909] AArch64: backport Neoverse and Cortex CPU definitions
https://gcc.gnu.org/g:05d54bcdc5395a9d3df36c8b640579a0558c89f0 commit r14-10909-g05d54bcdc5395a9d3df36c8b640579a0558c89f0 Author: Tamar Christina Date: Fri Nov 8 18:12:32 2024 + AArch64: backport Neoverse and Cortex CPU definitions This is a conservative backport of a few core definitions backporting only the core definitions and mapping them to their closest cost model that exist on the branches. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (cortex-a725, cortex-x925, neoverse-n3, neoverse-v3, neoverse-v3ae): New. * config/aarch64/aarch64-tune.md: Regenerate * doc/invoke.texi: Document them. Diff: --- gcc/config/aarch64/aarch64-cores.def | 6 ++ gcc/config/aarch64/aarch64-tune.md | 2 +- gcc/doc/invoke.texi | 10 ++ 3 files changed, 13 insertions(+), 5 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 1ab09ea5f720..a919ab7d8a5a 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -179,6 +179,7 @@ AARCH64_CORE("cortex-a710", cortexa710, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, AARCH64_CORE("cortex-a715", cortexa715, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd4d, -1) AARCH64_CORE("cortex-a720", cortexa720, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-a725", cortexa725, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd87, -1) AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd48, -1) @@ -186,11 +187,16 @@ AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8M AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-x925", cortexx925, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd85, -1) + AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1) AARCH64_CORE("cobalt-100", cobalt100, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1) +AARCH64_CORE("neoverse-n3", neoversen3, cortexa57, V9_2A, (SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd8e, -1) AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, -1) +AARCH64_CORE("neoverse-v3", neoversev3, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE), neoversev2, 0x41, 0xd84, -1) +AARCH64_CORE("neoverse-v3ae", neoversev3ae, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE), neoversev2, 0x41, 0xd83, -1) AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md index 06e8680607bd..35b27ddb8831 100644 --- a/gcc/config/aarch64/aarch64-tune.md +++ b/gcc/config/aarch64/aarch64-tune.md @@ -1,5 +1,5 @@ ;; -*- buffer-read-only: t -*- ;; Generated automatically by gentune.sh from aarch64-cores.def (define_attr "tune" - 
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,fujitsu_monaka,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,demeter,generic,generic_armv8_a,generic_armv9_a" + "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,fujitsu_monaka,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cort
[gcc r15-4802] middle-end: Lower all gconds during vector pattern matching [PR117176]
https://gcc.gnu.org/g:d2f9159cfe7ea904e6476cabefea0c6ac9532e29 commit r15-4802-gd2f9159cfe7ea904e6476cabefea0c6ac9532e29 Author: Tamar Christina Date: Thu Oct 31 12:50:23 2024 + middle-end: Lower all gconds during vector pattern matching [PR117176] I have been taking a look at boolean handling once more in the vectorizer. There are two situations to consider: 1. when the booleans being created are created from comparing data inputs, then for the resulting vector boolean we need to know the vector type and the precision. In this case, when we have an operation such as NOT on the data element, this has to be lowered to XOR because the truncation to the vector precision needs to be explicit. 2. when the boolean being created comes from another boolean operation, then we don't need to lower NOT, as the precision doesn't change. We don't do any lowering for these (as denoted in check_bool_pattern) and instead the precision is copied from the element feeding the boolean statement during VF analysis. For early break gcond lowering, in order to correctly handle the second scenario above, we punted the lowering of VECT_SCALAR_BOOLEAN_TYPE_P comparisons that were already in the right shape. e.g. e != 0 where e is a boolean does not need any lowering. The issue however is that the statement feeding e may need to be lowered in the case where it's a data expression. This patch changes a bit how we do the lowering. We now always emit an additional compare. e.g. if the input is: if (e != 0) where e is a boolean, we would punt on this before, but now we generate f = e != 0; if (f != 0) We then use the same infrastructure as recog_bool to ask it to lower f, and in doing so handle any boolean conversions that need to be lowered. Because we now guarantee that f is an internal def we can also simplify the SLP building code. When e is a boolean, the precision we build for f needs to reflect the precision of the operation feeding e. To get this value we use integer_type_for_mask the same way recog_bool does, and if it's defined (e.g. we have a data conversion somewhere) we pass that precision on instead. This gets us the correct VF on the newly lowered boolean expressions. gcc/ChangeLog: PR tree-optimization/117176 * tree-vect-patterns.cc (vect_recog_gcond_pattern): Lower all gconds. * tree-vect-slp.cc (vect_analyze_slp): No longer check for in vect def. gcc/testsuite/ChangeLog: PR tree-optimization/117176 * gcc.dg/vect/vect-early-break_130-pr117176.c: New test.
Diff: --- .../gcc.dg/vect/vect-early-break_130-pr117176.c| 21 gcc/tree-vect-patterns.cc | 19 ++- gcc/tree-vect-slp.cc | 39 +- 3 files changed, 40 insertions(+), 39 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_130-pr117176.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_130-pr117176.c new file mode 100644 index ..841dcce284dd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_130-pr117176.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +struct ColorSpace { + int componentCt; +}; + +struct Psnr { + double psnr[3]; +}; + +int f(struct Psnr psnr, struct ColorSpace colorSpace) { + int i, hitsTarget = 1; + + for (i = 1; i < colorSpace.componentCt && hitsTarget; ++i) +hitsTarget = !(psnr.psnr[i] < 1); + + return hitsTarget; +} diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index 945e7d2dc45d..a708234304fe 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -5426,17 +5426,19 @@ vect_recog_gcond_pattern (vec_info *vinfo, if (VECTOR_TYPE_P (scalar_type)) return NULL; - if (code == NE_EXPR - && zerop (rhs) - && VECT_SCALAR_BOOLEAN_TYPE_P (scalar_type)) -return NULL; + /* If the input is a boolean then try to figure out the precision that the + vector type should use. We cannot use the scalar precision as this would + later mismatch. This is similar to what recog_bool does. */ + if (VECT_SCALAR_BOOLEAN_TYPE_P (scalar_type)) +{ + if (tree stype = integer_type_for_mask (lhs, vinfo)) + scalar_type = stype; +} - tree vecitype = get_vectype_for_scalar_type (vinfo, scalar_type); - if (vecitype == NULL_TREE) + tree vectype = get_mask_type_for_scalar_type (vinfo, scalar_type); + if (vectype == NULL_TREE) return NULL; - tree vectype = truth_type_for (vecitype); - tree new_lhs = vect_recog_temp_ssa_var (boolean_type_node, NULL
[gcc r15-3792] middle-end: Insert invariant instructions before the gsi [PR116812]
https://gcc.gnu.org/g:09892448ebd8c396a26b2c09ba71f1e5a8dc42d7 commit r15-3792-g09892448ebd8c396a26b2c09ba71f1e5a8dc42d7 Author: Tamar Christina Date: Mon Sep 23 11:45:43 2024 +0100 middle-end: Insert invariant instructions before the gsi [PR116812] The new invariant statements should be inserted before the current statement and not after. This goes fine 99% of the time but when the current statement is a gcond the control flow gets corrupted. gcc/ChangeLog: PR tree-optimization/116812 * tree-vect-slp.cc (vect_slp_region): Fix insertion. gcc/testsuite/ChangeLog: PR tree-optimization/116812 * gcc.dg/vect/pr116812.c: New test. Diff: --- gcc/testsuite/gcc.dg/vect/pr116812.c | 17 + gcc/tree-vect-slp.cc | 6 ++ 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr116812.c b/gcc/testsuite/gcc.dg/vect/pr116812.c new file mode 100644 index ..3e83c13d94bd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr116812.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O2 -fno-tree-dce -fno-tree-dse" } */ + +int a, b, c, d, e, f[2], g, h; +int k(int j) { return 2 >> a ? 2 >> a : a; } +int main() { + int i; + for (; g; g = k(d = 0)) +; + if (a) +b && h; + for (e = 0; e < 2; e++) +c = d & 1 ? d : 0; + for (i = 0; i < 2; i++) +f[i] = 0; + return 0; +} diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 600987dd6e5d..7161492f5114 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -9168,10 +9168,8 @@ vect_slp_region (vec bbs, vec datarefs, dump_printf_loc (MSG_NOTE, vect_location, "-->generating invariant statements\n"); - gimple_stmt_iterator gsi; - gsi = gsi_after_labels (bb_vinfo->bbs[0]); - gsi_insert_seq_after (&gsi, bb_vinfo->inv_pattern_def_seq, - GSI_CONTINUE_LINKING); + bb_vinfo->insert_seq_on_entry (NULL, +bb_vinfo->inv_pattern_def_seq); } } else
[gcc r15-3767] aarch64: Take into account when VF is higher than known scalar iters
https://gcc.gnu.org/g:e84e5d034124c6733d3b36d8623c56090d4d17f7 commit r15-3767-ge84e5d034124c6733d3b36d8623c56090d4d17f7 Author: Tamar Christina Date: Sun Sep 22 13:34:10 2024 +0100 aarch64: Take into account when VF is higher than known scalar iters Consider low overhead loops like: void foo (char *restrict a, int *restrict b, int *restrict c, int n) { for (int i = 0; i < 9; i++) { int res = c[i]; int t = b[i]; if (a[i] != 0) res = t; c[i] = res; } } For such loops we use latency only costing since the loop bounds is known and small. The current costing however does not consider the case where niters < VF. So when comparing the scalar vs vector costs it doesn't keep in mind that the scalar code can't perform VF iterations. This makes it overestimate the cost for the scalar loop and we incorrectly vectorize. This patch takes the minimum of the VF and niters in such cases. Before the patch we generate: note: Original vector body cost = 46 note: Vector loop iterates at most 1 times note: Scalar issue estimate: note:load operations = 2 note:store operations = 1 note:general operations = 1 note:reduction latency = 0 note:estimated min cycles per iteration = 1.00 note:estimated cycles per vector iteration (for VF 32) = 32.00 note: SVE issue estimate: note:load operations = 5 note:store operations = 4 note:general operations = 11 note:predicate operations = 12 note:reduction latency = 0 note:estimated min cycles per iteration without predication = 5.50 note:estimated min cycles per iteration for predication = 12.00 note:estimated min cycles per iteration = 12.00 note: Low iteration count, so using pure latency costs note: Cost model analysis: vs after: note: Original vector body cost = 46 note: Known loop bounds, capping VF to 9 for analysis note: Vector loop iterates at most 1 times note: Scalar issue estimate: note:load operations = 2 note:store operations = 1 note:general operations = 1 note:reduction latency = 0 note:estimated min cycles per iteration = 1.00 note:estimated cycles per vector iteration (for VF 9) = 9.00 note: SVE issue estimate: note:load operations = 5 note:store operations = 4 note:general operations = 11 note:predicate operations = 12 note:reduction latency = 0 note:estimated min cycles per iteration without predication = 5.50 note:estimated min cycles per iteration for predication = 12.00 note:estimated min cycles per iteration = 12.00 note: Increasing body cost to 1472 because the scalar code could issue within the limit imposed by predicate operations note: Low iteration count, so using pure latency costs note: Cost model analysis: gcc/ChangeLog: * config/aarch64/aarch64.cc (adjust_body_cost): Cap VF for low iteration loops. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/asrdiv_4.c: Update bounds. * gcc.target/aarch64/sve/cond_asrd_2.c: Likewise. * gcc.target/aarch64/sve/cond_uxt_6.c: Likewise. * gcc.target/aarch64/sve/cond_uxt_7.c: Likewise. * gcc.target/aarch64/sve/cond_uxt_8.c: Likewise. * gcc.target/aarch64/sve/miniloop_1.c: Likewise. * gcc.target/aarch64/sve/spill_6.c: Likewise. * gcc.target/aarch64/sve/sve_iters_low_1.c: New test. * gcc.target/aarch64/sve/sve_iters_low_2.c: New test. 
Diff: --- gcc/config/aarch64/aarch64.cc| 13 + gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c | 12 ++-- gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c | 12 ++-- gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c| 8 gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c| 8 gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c| 8 gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c| 2 +- gcc/testsuite/gcc.target/aarch64/sve/spill_6.c | 8 .../gcc.target/aarch64/sve/sve_iters_low_1.c | 17 + .../gcc.target/aarch64/sve/sve_iters_low_2.c | 20 10 files changed, 79 insertions(+), 29 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 92763d403c75..68913beaee20 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -17565,6 +17565,19 @@ adjust_body_cost (loop_vec_info loop_vinfo, dump_printf_loc (MSG_NOTE, vect_location, "Origina
[gcc r15-3768] middle-end: lower COND_EXPR into gimple form in vect_recog_bool_pattern
https://gcc.gnu.org/g:4150bcd205ebb60b949224758c05012c0dfab7a7 commit r15-3768-g4150bcd205ebb60b949224758c05012c0dfab7a7 Author: Tamar Christina Date: Sun Sep 22 13:38:49 2024 +0100 middle-end: lower COND_EXPR into gimple form in vect_recog_bool_pattern Currently the vectorizer cheats when lowering COND_EXPR during bool recog. In the cases where the conditional is loop invariant or non-boolean it instead converts the operation back into GENERIC and hides much of the operation from the analysis part of the vectorizer. i.e. a ? b : c is transformed into a != 0 ? b : c; however, by doing so we can't perform any optimization on the mask as the masks aren't explicit until quite late during codegen. To fix this, this patch lowers booleans earlier and so ensures that we are always in GIMPLE. For when the value is a loop invariant boolean we have to generate an additional conversion from bool to the integer mask form. This is done by creating a loop invariant a ? -1 : 0 with the target mask precision and then doing a normal != 0 comparison on that. To support this the patch also adds the ability to create, during pattern matching, a loop invariant pattern that won't be seen by the vectorizer and will instead be materialized inside the loop preheader in the case of loops, or in the case of BB vectorization it materializes it in the first BB in the region. gcc/ChangeLog: * tree-vect-patterns.cc (append_inv_pattern_def_seq): New. (vect_recog_bool_pattern): Lower COND_EXPRs. * tree-vect-slp.cc (vect_slp_region): Materialize loop invariant statements. * tree-vect-loop.cc (vect_transform_loop): Likewise. * tree-vect-stmts.cc (vectorizable_comparison_1): Remove VECT_SCALAR_BOOLEAN_TYPE_P handling for vectype. * tree-vectorizer.cc (vec_info::vec_info): Initialize inv_pattern_def_seq. * tree-vectorizer.h (LOOP_VINFO_INV_PATTERN_DEF_SEQ): New. (class vec_info): Add inv_pattern_def_seq. gcc/testsuite/ChangeLog: * gcc.dg/vect/bb-slp-conditional_store_1.c: New test. * gcc.dg/vect/vect-conditional_store_5.c: New test. * gcc.dg/vect/vect-conditional_store_6.c: New test.
Diff: --- .../gcc.dg/vect/bb-slp-conditional_store_1.c | 15 + .../gcc.dg/vect/vect-conditional_store_5.c | 28 .../gcc.dg/vect/vect-conditional_store_6.c | 24 + gcc/tree-vect-loop.cc | 12 +++ gcc/tree-vect-patterns.cc | 39 -- gcc/tree-vect-slp.cc | 14 gcc/tree-vect-stmts.cc | 6 +--- gcc/tree-vectorizer.cc | 3 +- gcc/tree-vectorizer.h | 7 9 files changed, 139 insertions(+), 9 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-conditional_store_1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-conditional_store_1.c new file mode 100644 index ..650a3bfbfb1d --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-conditional_store_1.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_float } */ + +/* { dg-additional-options "-mavx2" { target avx2 } } */ +/* { dg-additional-options "-march=armv9-a" { target aarch64-*-* } } */ + +void foo3 (float *restrict a, int *restrict c) +{ +#pragma GCC unroll 8 + for (int i = 0; i < 8; i++) +c[i] = a[i] > 1.0; +} + +/* { dg-final { scan-tree-dump "vectorized using SLP" "slp1" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_5.c b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_5.c new file mode 100644 index ..37d60fa76351 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_5.c @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_masked_store } */ + +/* { dg-additional-options "-mavx2" { target avx2 } } */ +/* { dg-additional-options "-march=armv9-a" { target aarch64-*-* } } */ + +#include + +void foo3 (float *restrict a, int *restrict b, int *restrict c, int n, int stride) +{ + if (stride <= 1) +return; + + bool ai = a[0]; + + for (int i = 0; i < n; i++) +{ + int res = c[i]; + int t = b[i+stride]; + if (ai) +t = res; + c[i] = t; +} +} + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ +/* { dg-final { scan-tree-dump-not "VEC_COND_EXPR " "vect" { target aarch64-*-* } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_6.c b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_6.c new file mode 100644 index ..5e1aedf3726b --- /
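In source-level terms, the lowering of a loop-invariant boolean condition described above has roughly the shape of the C fragment below. This is only an illustration of the before/after form under that assumption: the real transformation operates on GIMPLE pattern statements, and the -1/0 invariant is materialized in the loop preheader (or in the first BB of the region for BB SLP).

#include <stdbool.h>
#include <stdint.h>

/* Illustrative hand-lowered version; the 32-bit mask precision is chosen
   here purely for the example.  */
void
cond_store_lowered (int *restrict c, const int *restrict b, bool ai, int n)
{
  /* The loop-invariant boolean is turned into an integer mask source once,
     conceptually in the loop preheader: a ? -1 : 0.  */
  int32_t inv = ai ? -1 : 0;

  for (int i = 0; i < n; i++)
    {
      int res = c[i];
      int t = b[i];
      /* The condition is now an explicit "!= 0" comparison on an integer,
         which bool recog can convert into a proper vector mask.  */
      if (inv != 0)
        t = res;
      c[i] = t;
    }
}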
[gcc r15-3800] aarch64: store signing key and signing method in DWARF _Unwind_FrameState
https://gcc.gnu.org/g:f531673917e4f80ad51eda0d806f0479c501a907 commit r15-3800-gf531673917e4f80ad51eda0d806f0479c501a907 Author: Matthieu Longo Date: Mon Sep 23 15:03:30 2024 +0100 aarch64: store signing key and signing method in DWARF _Unwind_FrameState This patch is only a refactoring of the existing implementation of PAuth and return-address signing. The existing behavior is preserved. _Unwind_FrameState already contains several pieces of CIE and FDE information (see the attributes below the comment "The information we care about from the CIE/FDE" in libgcc/unwind-dw2.h). The patch aims at moving the information from the DWARF CIE (signing key stored in the augmentation string) and FDE (the used signing method) into _Unwind_FrameState alongside the already-stored CIE and FDE information. Note: this information has to be saved in frame_state_reg_info instead of _Unwind_FrameState as it needs to be savable by DW_CFA_remember_state and restorable by DW_CFA_restore_state, which both rely on the attribute "prev". This new information in _Unwind_FrameState simplifies the look-up of the signing key when the return address is demangled. It also allows future signing methods to be easily added. _Unwind_FrameState is not a part of the public API of libunwind, so the change is backward compatible. A new architecture-specific handler MD_ARCH_EXTENSION_FRAME_INIT allows resetting values (if needed) in the frame state and unwind context before changing the frame state to the caller context. A new architecture-specific handler MD_ARCH_EXTENSION_CIE_AUG_HANDLER isolates the architecture-specific augmentation strings in the AArch64 backend, and allows other architectures to reuse augmentation strings that would have clashed with AArch64 DWARF extensions. The aarch64_demangle_return_addr, DW_CFA_AARCH64_negate_ra_state and DW_CFA_val_expression cases in libgcc/unwind-dw2-execute_cfa.h were documented to clarify where the value of the RA state register is stored (FS and CONTEXT respectively). libgcc/ChangeLog: * config/aarch64/aarch64-unwind.h (AARCH64_DWARF_RA_STATE_MASK): The mask for RA state register. (aarch64_ra_signing_method_t): The diversifiers used to sign a function's return address. (aarch64_pointer_auth_key): The key used to sign a function's return address. (aarch64_cie_signed_with_b_key): Deleted as the signing key is available now in _Unwind_FrameState. (MD_ARCH_EXTENSION_CIE_AUG_HANDLER): New CIE augmentation string handler for architecture extensions. (MD_ARCH_EXTENSION_FRAME_INIT): New architecture-extension initialization routine for DWARF frame state and context before execution of DWARF instructions. (aarch64_context_ra_state_get): Read RA state register from CONTEXT. (aarch64_ra_state_get): Read RA state register from FS. (aarch64_ra_state_set): Write RA state register into FS. (aarch64_ra_state_toggle): Toggle RA state register in FS. (aarch64_cie_aug_handler): Handle AArch64 augmentation strings. (aarch64_arch_extension_frame_init): Initialize defaults for the signing key (PAUTH_KEY_A), and RA state register (RA_no_signing). (aarch64_demangle_return_addr): Rely on the frame registers and the signing_key attribute in _Unwind_FrameState. * unwind-dw2-execute_cfa.h: Use the right alias DW_CFA_AARCH64_negate_ra_state for __aarch64__ instead of DW_CFA_GNU_window_save. (DW_CFA_AARCH64_negate_ra_state): Save the signing method in RA state register. Toggle RA state register without resetting 'how' to REG_UNSAVED.
* unwind-dw2.c: (extract_cie_info): Save the signing key in the current _Unwind_FrameState while parsing the augmentation data. (uw_frame_state_for): Reset some attributes related to architecture extensions in _Unwind_FrameState. (uw_update_context): Move authentication code to AArch64 unwinding. * unwind-dw2.h (enum register_rule): Give a name to the existing enum for the register rules, and replace 'unsigned char' by 'enum register_rule' to facilitate debugging in GDB. (_Unwind_FrameState): Add a new architecture-extension attribute to store the signing key. Diff: --- libgcc/config/aarch64/aarch64-unwind.h | 145 +++-- libgcc/unwind-dw2-execute_cfa.h| 26 +++--- libgcc/unwind-dw2.c| 19 +++-- libgcc/unwind-dw2.h| 17 +++- 4 files changed, 159 insertions(+), 48 deletions(-) diff --git a/libgcc/config/a
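As a rough data-layout picture of what the commit describes (simplified stand-in declarations, not the real libgcc headers), the signing key and the RA state now live next to the per-register rules, so DW_CFA_remember_state/DW_CFA_restore_state can save and restore them through the usual "prev" chain:

/* Simplified stand-ins; the real declarations are in
   libgcc/config/aarch64/aarch64-unwind.h and libgcc/unwind-dw2.h.  */

typedef enum
{
  sketch_ra_no_signing = 0,
  sketch_ra_signing_sp = 1   /* return address signed against SP (paciasp) */
} sketch_ra_signing_method_t;

typedef enum
{
  sketch_pauth_key_a,
  sketch_pauth_key_b
} sketch_pointer_auth_key;

struct sketch_frame_state_reg_info
{
  /* ... the existing per-register rules and CFA information ... */

  /* New in this refactoring: which key the CIE's augmentation string
     selected, and how (or whether) the return address is currently signed.
     Keeping these here rather than in the outer _Unwind_FrameState is what
     lets DW_CFA_remember_state / DW_CFA_restore_state preserve them.  */
  sketch_pointer_auth_key signing_key;
  sketch_ra_signing_method_t ra_state;

  struct sketch_frame_state_reg_info *prev;
};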
[gcc r15-3802] libgcc: hide CIE and FDE data for DWARF architecture extensions behind a handler.
https://gcc.gnu.org/g:bdf41d627c13bc5f0dc676991f4513daa9d9ae36 commit r15-3802-gbdf41d627c13bc5f0dc676991f4513daa9d9ae36 Author: Matthieu Longo Date: Mon Sep 23 15:03:37 2024 +0100 libgcc: hide CIE and FDE data for DWARF architecture extensions behind a handler. This patch provides a new handler MD_ARCH_FRAME_STATE_T to hide an architecture-specific structure containing CIE and FDE data related to DWARF architecture extensions. Hiding the architecture-specific attributes behind a handler has the following benefits: 1. isolating those data from the generic ones in _Unwind_FrameState 2. avoiding casts to custom types. 3. preserving typing information when debugging with GDB, and so facilitating their printing. This approach required to add a new header md-unwind-def.h included at the top of libgcc/unwind-dw2.h, and redirecting to the corresponding architecture header via a symbolic link. An obvious drawback is the increase in complexity with macros, and headers. It also caused a split of architecture definitions between md-unwind-def.h (types definitions used in unwind-dw2.h) and md-unwind.h (local types definitions and handlers implementations). The naming of md-unwind.h with .h extension is a bit misleading as the file is only included in the middle of unwind-dw2.c. Changing this naming would require modification of others backends, which I prefered to abstain from. Overall the benefits are worth the added complexity from my perspective. libgcc/ChangeLog: * Makefile.in: New target for symbolic link to md-unwind-def.h * config.host: New parameter md_unwind_def_header. Set it to aarch64/aarch64-unwind-def.h for AArch64 targets, or no-unwind.h by default. * config/aarch64/aarch64-unwind.h (aarch64_pointer_auth_key): Move to aarch64-unwind-def.h (aarch64_cie_aug_handler): Update. (aarch64_arch_extension_frame_init): Update. (aarch64_demangle_return_addr): Update. * configure.ac: New substitute variable md_unwind_def_header. * unwind-dw2.h (defined): MD_ARCH_FRAME_STATE_T. * config/aarch64/aarch64-unwind-def.h: New file. * configure: Regenerate. * config/no-unwind.h: Updated comment Diff: --- libgcc/Makefile.in | 6 - libgcc/config.host | 13 -- libgcc/config/aarch64/aarch64-unwind-def.h | 41 ++ libgcc/config/aarch64/aarch64-unwind.h | 14 -- libgcc/config/no-unwind.h | 3 ++- libgcc/configure | 2 ++ libgcc/configure.ac| 1 + libgcc/unwind-dw2.h| 6 +++-- 8 files changed, 71 insertions(+), 15 deletions(-) diff --git a/libgcc/Makefile.in b/libgcc/Makefile.in index 0e46e9ef7686..ffc45f212672 100644 --- a/libgcc/Makefile.in +++ b/libgcc/Makefile.in @@ -47,6 +47,7 @@ with_aix_soname = @with_aix_soname@ solaris_ld_v2_maps = @solaris_ld_v2_maps@ enable_execute_stack = @enable_execute_stack@ unwind_header = @unwind_header@ +md_unwind_def_header = @md_unwind_def_header@ md_unwind_header = @md_unwind_header@ sfp_machine_header = @sfp_machine_header@ thread_header = @thread_header@ @@ -358,13 +359,16 @@ SHLIBUNWIND_INSTALL = # Create links to files specified in config.host. 
-LIBGCC_LINKS = enable-execute-stack.c unwind.h md-unwind-support.h \ +LIBGCC_LINKS = enable-execute-stack.c \ + unwind.h md-unwind-def.h md-unwind-support.h \ sfp-machine.h gthr-default.h enable-execute-stack.c: $(srcdir)/$(enable_execute_stack) -$(LN_S) $< $@ unwind.h: $(srcdir)/$(unwind_header) -$(LN_S) $< $@ +md-unwind-def.h: $(srcdir)/config/$(md_unwind_def_header) + -$(LN_S) $< $@ md-unwind-support.h: $(srcdir)/config/$(md_unwind_header) -$(LN_S) $< $@ sfp-machine.h: $(srcdir)/config/$(sfp_machine_header) diff --git a/libgcc/config.host b/libgcc/config.host index 4fb4205478a8..5c6b656531ff 100644 --- a/libgcc/config.host +++ b/libgcc/config.host @@ -51,8 +51,10 @@ # If either is set, EXTRA_PARTS and # EXTRA_MULTILIB_PARTS inherited from the GCC # subdirectory will be ignored. -# md_unwind_headerThe name of a header file defining -# MD_FALLBACK_FRAME_STATE_FOR. +# md_unwind_def_header The name of a header file defining architecture +# -specific frame information types for unwinding. +# md_unwind_headerThe name of a header file defining architecture +# -specific handlers used in the unwinder. # sfp_machine_header The name of a sfp-machine.h header file for soft-fp. # Defaults to "$cpu_type/sfp-machine.h" if it exists,
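The mechanism is essentially "let each target contribute a typed member to the generic frame state through a per-target header". A minimal sketch of the pattern with invented names (the real contents live in libgcc/config/aarch64/aarch64-unwind-def.h and libgcc/unwind-dw2.h):

/* --- what an architecture's md-unwind-def.h could provide (invented) --- */
struct sketch_arch_frame_state
{
  int signing_key;   /* CIE data for the architecture extension */
  int ra_state;      /* FDE data for the architecture extension */
};
#define MD_ARCH_FRAME_STATE_T struct sketch_arch_frame_state

/* --- how the generic unwinder header could consume it (sketch) --- */
typedef struct
{
  /* ... generic CIE/FDE fields ... */
#ifdef MD_ARCH_FRAME_STATE_T
  /* Typed and visible to GDB; no casts to opaque storage needed.  */
  MD_ARCH_FRAME_STATE_T arch_ext;
#endif
} sketch_frame_state;

Targets without DWARF extensions get the default no-unwind.h link via config.host, which presumably leaves MD_ARCH_FRAME_STATE_T undefined so the extra member is never declared.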
[gcc r15-3801] aarch64: skip copy of RA state register into target context
https://gcc.gnu.org/g:ba3e597681b640f6f9a676ec4f6cd3ca3878cefc commit r15-3801-gba3e597681b640f6f9a676ec4f6cd3ca3878cefc Author: Matthieu Longo Date: Mon Sep 23 15:03:35 2024 +0100 aarch64: skip copy of RA state register into target context The RA state register is local to a frame, so it should not be copied to the target frame during the context installation. This patch adds a new backend handler that check whether a register needs to be skipped or not before its installation. libgcc/ChangeLog: * config/aarch64/aarch64-unwind.h (MD_FRAME_LOCAL_REGISTER_P): new handler checking whether a register from the current context needs to be skipped before installation into the target context. (aarch64_frame_local_register): Likewise. * unwind-dw2.c (uw_install_context_1): use MD_FRAME_LOCAL_REGISTER_P. Diff: --- libgcc/config/aarch64/aarch64-unwind.h | 11 +++ libgcc/unwind-dw2.c| 5 + 2 files changed, 16 insertions(+) diff --git a/libgcc/config/aarch64/aarch64-unwind.h b/libgcc/config/aarch64/aarch64-unwind.h index 94ea5891b4eb..52bfd5409798 100644 --- a/libgcc/config/aarch64/aarch64-unwind.h +++ b/libgcc/config/aarch64/aarch64-unwind.h @@ -53,6 +53,9 @@ typedef enum { #define MD_DEMANGLE_RETURN_ADDR(context, fs, addr) \ aarch64_demangle_return_addr (context, fs, addr) +#define MD_FRAME_LOCAL_REGISTER_P(reg) \ + aarch64_frame_local_register (reg) + static inline aarch64_ra_signing_method_t aarch64_context_ra_state_get (struct _Unwind_Context *context) { @@ -127,6 +130,14 @@ aarch64_arch_extension_frame_init (struct _Unwind_Context *context ATTRIBUTE_UNU aarch64_fs_ra_state_set (fs, aarch64_ra_no_signing); } +/* Before copying the current context to the target context, check whether + the register is local to this context and should not be forwarded. */ +static inline bool +aarch64_frame_local_register(long reg) +{ + return (reg == AARCH64_DWARF_REGNUM_RA_STATE); +} + /* Do AArch64 private extraction on ADDR_WORD based on context info CONTEXT and unwind frame info FS. If ADDR_WORD is signed, we do address authentication on it using CFA of current frame. diff --git a/libgcc/unwind-dw2.c b/libgcc/unwind-dw2.c index 40d64c0c0a39..5f33f80670ac 100644 --- a/libgcc/unwind-dw2.c +++ b/libgcc/unwind-dw2.c @@ -1423,6 +1423,11 @@ uw_install_context_1 (struct _Unwind_Context *current, void *c = (void *) (_Unwind_Internal_Ptr) current->reg[i]; void *t = (void *) (_Unwind_Internal_Ptr)target->reg[i]; +#ifdef MD_FRAME_LOCAL_REGISTER_P + if (MD_FRAME_LOCAL_REGISTER_P (i)) + continue; +#endif + gcc_assert (current->by_value[i] == 0); if (target->by_value[i] && c) {
[gcc r15-3804] dwarf2: add hooks for architecture-specific CFIs
https://gcc.gnu.org/g:9e1c71bab50d51a1a8ec1a75080ffde6ca3d854c commit r15-3804-g9e1c71bab50d51a1a8ec1a75080ffde6ca3d854c Author: Matthieu Longo Date: Mon Sep 23 15:34:57 2024 +0100 dwarf2: add hooks for architecture-specific CFIs Architecture-specific CFI directives are currently declared and processed among other architecture-independent CFI directives in the gcc/dwarf2* files. This approach creates confusion, specifically in the case of DWARF instructions in the vendor space that use the same instruction code. Such a clash currently happens between DW_CFA_GNU_window_save (used on SPARC) and DW_CFA_AARCH64_negate_ra_state (used on AArch64), both having the same instruction code 0x2d. AArch64 compilers then generate a SPARC CFI directive (.cfi_window_save) instead of .cfi_negate_ra_state, contrary to what is expected in [DWARF for the Arm 64-bit Architecture (AArch64)](https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst). This refactoring does not completely solve the problem, but improves the situation by moving some of the processing of those directives (more specifically their output in the assembly) to the backend via 2 target hooks: - DW_CFI_OPRND1_DESC: parse the first operand of the directive (if any). - OUTPUT_CFI_DIRECTIVE: output the CFI directive as a string. Additionally, this patch also contains a renaming of an enum used for return address mangling on AArch64. gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_output_cfi_directive): New hook for CFI directives. (aarch64_dw_cfi_oprnd1_desc): Same. (TARGET_OUTPUT_CFI_DIRECTIVE): Hook for output_cfi_directive. (TARGET_DW_CFI_OPRND1_DESC): Hook for dw_cfi_oprnd1_desc. * config/sparc/sparc.cc (sparc_output_cfi_directive): New hook for CFI directives. (sparc_dw_cfi_oprnd1_desc): Same. (TARGET_OUTPUT_CFI_DIRECTIVE): Hook for output_cfi_directive. (TARGET_DW_CFI_OPRND1_DESC): Hook for dw_cfi_oprnd1_desc. * coretypes.h (struct dw_cfi_node): Forward declaration of CFI type from gcc/dwarf2out.h. (enum dw_cfi_oprnd_type): Same. (enum dwarf_call_frame_info): Same. * doc/tm.texi: Regenerated from doc/tm.texi.in. * doc/tm.texi.in: Add doc for new target hooks. * dwarf2cfi.cc (struct dw_cfi_row): Update the description for window_save and ra_mangled. (dwarf2out_frame_debug_cfa_negate_ra_state): Use AArch64 CFI directive instead of the SPARC one. (change_cfi_row): Use the right CFI directive's name for RA mangling. (output_cfi): Remove explicit architecture-specific CFI directive DW_CFA_GNU_window_save that falls into default case. (output_cfi_directive): Use target hook as default. * dwarf2out.cc (dw_cfi_oprnd1_desc): Use target hook as default. * dwarf2out.h (enum dw_cfi_oprnd_type): specify underlying type of enum to allow forward declaration. (dw_cfi_oprnd1_desc): Call target hook. (output_cfi_directive): Use dw_cfi_ref instead of struct dw_cfi_node *. * hooks.cc (hook_bool_dwcfi_dwcfioprndtyperef_false): New. (hook_bool_FILEptr_dwcfiptr_false): New. * hooks.h (hook_bool_dwcfi_dwcfioprndtyperef_false): New. (hook_bool_FILEptr_dwcfiptr_false): New. * target.def: Documentation for new hooks. include/ChangeLog: * dwarf2.h (enum dwarf_call_frame_info): specify underlying type of enum to allow forward declaration. libffi/ChangeLog: * include/ffi_cfi.h (cfi_negate_ra_state): Declare AArch64 cfi directive. libgcc/ChangeLog: * config/aarch64/aarch64-asm.h (PACIASP): Replace SPARC CFI directive by AArch64 one. (AUTIASP): Same. libitm/ChangeLog: * config/aarch64/sjlj.S: Replace SPARC CFI directive by AArch64 one.
gcc/testsuite/ChangeLog: * g++.target/aarch64/pr94515-1.C: Replace SPARC CFI directive by AArch64 one. * g++.target/aarch64/pr94515-2.C: Same. Diff: --- gcc/config/aarch64/aarch64.cc| 33 ++ gcc/config/sparc/sparc.cc| 35 gcc/coretypes.h | 6 + gcc/doc/tm.texi | 16 - gcc/doc/tm.texi.in | 5 +++- gcc/dwarf2cfi.cc | 31 gcc/dwarf2out.cc | 13 +++ gcc/dwarf2o
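The diff above is truncated, but the default-hook names added to hooks.cc (hook_bool_FILEptr_dwcfiptr_false and hook_bool_dwcfi_dwcfioprndtyperef_false) suggest the rough shape of a backend implementation. The sketch below is inferred from those names and the commit message only, with simplified stand-in types; the actual aarch64.cc and sparc.cc hooks will differ in detail.

#include <stdio.h>
#include <stdbool.h>

/* Simplified stand-ins for GCC's CFI types -- inferred, not verbatim.  */
enum sketch_cfi_opc { SKETCH_DW_CFA_AARCH64_negate_ra_state = 0x2d };
struct sketch_dw_cfi_node { enum sketch_cfi_opc dw_cfi_opc; };
enum sketch_dw_cfi_oprnd_type { sketch_dw_cfi_oprnd_unused };

/* OUTPUT_CFI_DIRECTIVE-style hook: print the directive and return true if
   this target owns it, false to fall back to the generic handling.  */
static bool
sketch_output_cfi_directive (FILE *f, const struct sketch_dw_cfi_node *cfi)
{
  if (cfi->dw_cfi_opc == SKETCH_DW_CFA_AARCH64_negate_ra_state)
    {
      fprintf (f, "\t.cfi_negate_ra_state\n");
      return true;
    }
  return false;
}

/* DW_CFI_OPRND1_DESC-style hook: describe the first operand, if any.  */
static bool
sketch_dw_cfi_oprnd1_desc (enum sketch_cfi_opc opc,
                           enum sketch_dw_cfi_oprnd_type *desc)
{
  if (opc == SKETCH_DW_CFA_AARCH64_negate_ra_state)
    {
      *desc = sketch_dw_cfi_oprnd_unused;
      return true;
    }
  return false;
}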
[gcc r15-3803] Rename REG_CFA_TOGGLE_RA_MANGLE to REG_CFA_NEGATE_RA_STATE
https://gcc.gnu.org/g:4068096fbf5aef65883a7492f4940cea85b39f40 commit r15-3803-g4068096fbf5aef65883a7492f4940cea85b39f40 Author: Matthieu Longo Date: Mon Sep 23 15:31:18 2024 +0100 Rename REG_CFA_TOGGLE_RA_MANGLE to REG_CFA_NEGATE_RA_STATE The current name REG_CFA_TOGGLE_RA_MANGLE is not representative of what it really is, i.e. a register to represent several states, not only a binary one. Same for dwarf2out_frame_debug_cfa_toggle_ra_mangle. gcc/ChangeLog: * combine-stack-adj.cc (no_unhandled_cfa): Rename. * config/aarch64/aarch64.cc (aarch64_expand_prologue): Rename. (aarch64_expand_epilogue): Rename. * dwarf2cfi.cc (dwarf2out_frame_debug_cfa_toggle_ra_mangle): Rename this... (dwarf2out_frame_debug_cfa_negate_ra_state): To this. (dwarf2out_frame_debug): Rename. * reg-notes.def (REG_CFA_NOTE): Rename REG_CFA_TOGGLE_RA_MANGLE. Diff: --- gcc/combine-stack-adj.cc | 2 +- gcc/config/aarch64/aarch64.cc | 4 ++-- gcc/dwarf2cfi.cc | 8 gcc/reg-notes.def | 8 4 files changed, 11 insertions(+), 11 deletions(-) diff --git a/gcc/combine-stack-adj.cc b/gcc/combine-stack-adj.cc index 2da9bf2bc1ef..367d3b66b749 100644 --- a/gcc/combine-stack-adj.cc +++ b/gcc/combine-stack-adj.cc @@ -212,7 +212,7 @@ no_unhandled_cfa (rtx_insn *insn) case REG_CFA_SET_VDRAP: case REG_CFA_WINDOW_SAVE: case REG_CFA_FLUSH_QUEUE: - case REG_CFA_TOGGLE_RA_MANGLE: + case REG_CFA_NEGATE_RA_STATE: return false; } diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 68913beaee20..e41431d56ac4 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -9612,7 +9612,7 @@ aarch64_expand_prologue (void) default: gcc_unreachable (); } - add_reg_note (insn, REG_CFA_TOGGLE_RA_MANGLE, const0_rtx); + add_reg_note (insn, REG_CFA_NEGATE_RA_STATE, const0_rtx); RTX_FRAME_RELATED_P (insn) = 1; } @@ -10033,7 +10033,7 @@ aarch64_expand_epilogue (rtx_call_insn *sibcall) default: gcc_unreachable (); } - add_reg_note (insn, REG_CFA_TOGGLE_RA_MANGLE, const0_rtx); + add_reg_note (insn, REG_CFA_NEGATE_RA_STATE, const0_rtx); RTX_FRAME_RELATED_P (insn) = 1; } diff --git a/gcc/dwarf2cfi.cc b/gcc/dwarf2cfi.cc index 1231b5bb5f05..4ad9acbd6fd6 100644 --- a/gcc/dwarf2cfi.cc +++ b/gcc/dwarf2cfi.cc @@ -1547,13 +1547,13 @@ dwarf2out_frame_debug_cfa_window_save (void) cur_row->window_save = true; } -/* A subroutine of dwarf2out_frame_debug, process a REG_CFA_TOGGLE_RA_MANGLE. +/* A subroutine of dwarf2out_frame_debug, process a REG_CFA_NEGATE_RA_STATE. Note: DW_CFA_GNU_window_save dwarf opcode is reused for toggling RA mangle state, this is a target specific operation on AArch64 and can only be used on other targets if they don't use the window save operation otherwise. */ static void -dwarf2out_frame_debug_cfa_toggle_ra_mangle (void) +dwarf2out_frame_debug_cfa_negate_ra_state (void) { dw_cfi_ref cfi = new_cfi (); @@ -2341,8 +2341,8 @@ dwarf2out_frame_debug (rtx_insn *insn) handled_one = true; break; - case REG_CFA_TOGGLE_RA_MANGLE: - dwarf2out_frame_debug_cfa_toggle_ra_mangle (); + case REG_CFA_NEGATE_RA_STATE: + dwarf2out_frame_debug_cfa_negate_ra_state (); handled_one = true; break; diff --git a/gcc/reg-notes.def b/gcc/reg-notes.def index 5b878fb2a1cd..ddcf16b68be5 100644 --- a/gcc/reg-notes.def +++ b/gcc/reg-notes.def @@ -180,10 +180,10 @@ REG_CFA_NOTE (CFA_WINDOW_SAVE) the rest of the compiler as a CALL_INSN. */ REG_CFA_NOTE (CFA_FLUSH_QUEUE) -/* Attached to insns that are RTX_FRAME_RELATED_P, toggling the mangling status - of return address. Currently it's only used by AArch64. The argument is - ignored. 
*/ -REG_CFA_NOTE (CFA_TOGGLE_RA_MANGLE) +/* Attached to insns that are RTX_FRAME_RELATED_P, indicating an authentication + of the return address. Currently it's only used by AArch64. + The argument is ignored. */ +REG_CFA_NOTE (CFA_NEGATE_RA_STATE) /* Indicates what exception region an INSN belongs in. This is used to indicate what region to which a call may throw. REGION 0
[gcc r15-3805] aarch64 testsuite: explain expectations for pr94515* tests
https://gcc.gnu.org/g:fb475d3f25943beffac8e9c0c78247bad75287a1 commit r15-3805-gfb475d3f25943beffac8e9c0c78247bad75287a1 Author: Matthieu Longo Date: Mon Sep 23 15:35:02 2024 +0100 aarch64 testsuite: explain expectections for pr94515* tests gcc/testsuite/ChangeLog: * g++.target/aarch64/pr94515-1.C: Improve test documentation. * g++.target/aarch64/pr94515-2.C: Same. Diff: --- gcc/testsuite/g++.target/aarch64/pr94515-1.C | 8 ++ gcc/testsuite/g++.target/aarch64/pr94515-2.C | 39 +++- 2 files changed, 41 insertions(+), 6 deletions(-) diff --git a/gcc/testsuite/g++.target/aarch64/pr94515-1.C b/gcc/testsuite/g++.target/aarch64/pr94515-1.C index d5c114a83a82..359039e17536 100644 --- a/gcc/testsuite/g++.target/aarch64/pr94515-1.C +++ b/gcc/testsuite/g++.target/aarch64/pr94515-1.C @@ -15,12 +15,20 @@ void unwind (void) __attribute__((noinline, noipa, target("branch-protection=pac-ret"))) int test (int z) { + // paciasp -> cfi_negate_ra_state: RA_no_signing -> RA_signing_SP if (z) { asm volatile ("":::"x20","x21"); unwind (); +// autiasp -> cfi_negate_ra_state: RA_signing_SP -> RA_no_signing return 1; } else { +// 2nd cfi_negate_ra_state because the CFI directives are processed linearily. +// At this point, the unwinder would believe that the address is not signed +// due to the previous return. That's why the compiler has to emit second +// cfi_negate_ra_state to mean that the return address is still signed. +// cfi_negate_ra_state: RA_no_signing -> RA_signing_SP unwind (); +// autiasp -> cfi_negate_ra_state: RA_signing_SP -> RA_no_signing return 2; } } diff --git a/gcc/testsuite/g++.target/aarch64/pr94515-2.C b/gcc/testsuite/g++.target/aarch64/pr94515-2.C index f4abeed4..bdb65411a080 100644 --- a/gcc/testsuite/g++.target/aarch64/pr94515-2.C +++ b/gcc/testsuite/g++.target/aarch64/pr94515-2.C @@ -6,6 +6,7 @@ volatile int zero = 0; int global = 0; +/* This is a leaf function, so no .cfi_negate_ra_state directive is expected. */ __attribute__((noinline)) int bar(void) { @@ -13,29 +14,55 @@ int bar(void) return 0; } +/* This function does not return normally, so the address is signed but no + * authentication code is emitted. It means that only one CFI directive is + * supposed to be emitted at signing time. */ __attribute__((noinline, noreturn)) void unwind (void) { throw 42; } +/* This function has several return instructions, and alternates different RA + * states. 4 .cfi_negate_ra_state and a .cfi_remember_state/.cfi_restore_state + * should be emitted. + * + * Expected layout: + * A: path to return 0 without assignment to global + * B: global=y + branch back into A + * C: return 2 + * D: unwind + * Which gives with return pointer authentication: + * A: sign -> authenticate [2 negate_ra_states + remember_state for B] + * B: signed [restore_state] + * C: unsigned [negate_ra_state] + * D: signed [negate_ra_state] + */ __attribute__((noinline, noipa)) int test(int x) { - if (x==1) return 2; /* This return path may not use the stack. */ + // This return path may not use the stack. This means that the return address + // won't be signed. + if (x==1) return 2; + + // All the return paths of the code below must have RA mangle state set, and + // the return address must be signed. int y = bar(); if (y > global) global=y; - if (y==3) unwind(); /* This return path must have RA mangle state set. */ - return 0; + if (y==3) unwind(); // authentication of the return address is not required. + return 0; // authentication of the return address is required. } +/* This function requires only 2 .cfi_negate_ra_state. 
*/ int main () { + // paciasp -> cfi_negate_ra_state: RA_no_signing -> RA_signing_SP try { test (zero); -__builtin_abort (); +__builtin_abort (); // authentication of the return address is not required. } catch (...) { +// autiasp -> cfi_negate_ra_state: RA_signing_SP -> RA_no_signing return 0; } - __builtin_abort (); -} + __builtin_abort (); // authentication of the return address is not required. +} \ No newline at end of file
[gcc r15-3806] dwarf2: store the RA state in CFI row
https://gcc.gnu.org/g:2b7971448f122317ed012586f9f73ccc0537deb2 commit r15-3806-g2b7971448f122317ed012586f9f73ccc0537deb2 Author: Matthieu Longo Date: Mon Sep 23 15:35:07 2024 +0100 dwarf2: store the RA state in CFI row On AArch64, the RA state informs the unwinder whether the return address is mangled and how, or not. This information is encoded in a boolean in the CFI row. This binary approach prevents from expressing more complex configuration, as it is the case with PAuth_LR introduced in Armv9.5-A. This patch addresses this limitation by replacing the boolean by an enum. gcc/ChangeLog: * dwarf2cfi.cc (struct dw_cfi_row): Declare a new enum type to replace ra_mangled. (cfi_row_equal_p): Use ra_state instead of ra_mangled. (dwarf2out_frame_debug_cfa_negate_ra_state): Same. (change_cfi_row): Same. Diff: --- gcc/dwarf2cfi.cc | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/gcc/dwarf2cfi.cc b/gcc/dwarf2cfi.cc index f8d19d524299..1b94185a4966 100644 --- a/gcc/dwarf2cfi.cc +++ b/gcc/dwarf2cfi.cc @@ -57,6 +57,15 @@ along with GCC; see the file COPYING3. If not see #define DEFAULT_INCOMING_FRAME_SP_OFFSET INCOMING_FRAME_SP_OFFSET #endif + +/* Signing method used for return address authentication. + (AArch64 extension) */ +typedef enum +{ + ra_no_signing = 0x0, + ra_signing_sp = 0x1, +} ra_signing_method_t; + /* A collected description of an entire row of the abstract CFI table. */ struct GTY(()) dw_cfi_row { @@ -74,8 +83,8 @@ struct GTY(()) dw_cfi_row bool window_save; /* AArch64 extension for DW_CFA_AARCH64_negate_ra_state. - True if the return address is in a mangled state. */ - bool ra_mangled; + Enum which stores the return address state. */ + ra_signing_method_t ra_state; }; /* The caller's ORIG_REG is saved in SAVED_IN_REG. */ @@ -857,7 +866,7 @@ cfi_row_equal_p (dw_cfi_row *a, dw_cfi_row *b) if (a->window_save != b->window_save) return false; - if (a->ra_mangled != b->ra_mangled) + if (a->ra_state != b->ra_state) return false; return true; @@ -1554,8 +1563,11 @@ dwarf2out_frame_debug_cfa_negate_ra_state (void) { dw_cfi_ref cfi = new_cfi (); cfi->dw_cfi_opc = DW_CFA_AARCH64_negate_ra_state; + cur_row->ra_state += (cur_row->ra_state == ra_no_signing + ? ra_signing_sp + : ra_no_signing); add_cfi (cfi); - cur_row->ra_mangled = !cur_row->ra_mangled; } /* Record call frame debugging information for an expression EXPR, @@ -2412,12 +2424,12 @@ change_cfi_row (dw_cfi_row *old_row, dw_cfi_row *new_row) { dw_cfi_ref cfi = new_cfi (); - gcc_assert (!old_row->ra_mangled && !new_row->ra_mangled); + gcc_assert (!old_row->ra_state && !new_row->ra_state); cfi->dw_cfi_opc = DW_CFA_GNU_window_save; add_cfi (cfi); } - if (old_row->ra_mangled != new_row->ra_mangled) + if (old_row->ra_state != new_row->ra_state) { dw_cfi_ref cfi = new_cfi ();
[gcc r15-3738] testsuite: Update commandline for PR116628.c to use neoverse-v2 [PR116628]
https://gcc.gnu.org/g:0189ab205aa86b8e67ae982294f0fe58aa9c4774 commit r15-3738-g0189ab205aa86b8e67ae982294f0fe58aa9c4774 Author: Tamar Christina Date: Fri Sep 20 17:01:39 2024 +0100 testsuite: Update commandline for PR116628.c to use neoverse-v2 [PR116628] The testcase for this test needs Neoverse V2 to be used since, due to costing, the other cost models don't pick this particular SVE mode. Committed as obvious. Thanks, Tamar gcc/testsuite/ChangeLog: PR tree-optimization/116628 * gcc.dg/vect/pr116628.c: Update cmdline. Diff: --- gcc/testsuite/gcc.dg/vect/pr116628.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr116628.c b/gcc/testsuite/gcc.dg/vect/pr116628.c index 4068c657ac55..a38ffb33365a 100644 --- a/gcc/testsuite/gcc.dg/vect/pr116628.c +++ b/gcc/testsuite/gcc.dg/vect/pr116628.c @@ -1,7 +1,7 @@ /* { dg-do compile } */ /* { dg-require-effective-target vect_float } */ /* { dg-require-effective-target vect_masked_store } */ -/* { dg-additional-options "-Ofast -march=armv9-a" { target aarch64-*-* } } */ +/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" { target aarch64-*-* } } */ typedef float c; c a[2000], b[0];
[gcc r15-3739] AArch64: Define VECTOR_STORE_FLAG_VALUE.
https://gcc.gnu.org/g:33cb400b2e7266e65030869254366217e51494aa commit r15-3739-g33cb400b2e7266e65030869254366217e51494aa Author: Tamar Christina Date: Fri Sep 20 17:03:54 2024 +0100 AArch64: Define VECTOR_STORE_FLAG_VALUE. This defines VECTOR_STORE_FLAG_VALUE to CONST1_RTX for AArch64 so we simplify vector comparisons in AArch64. With this enabled res: moviv0.4s, 0 cmeqv0.4s, v0.4s, v0.4s ret is simplified to: res: mvniv0.4s, 0 ret gcc/ChangeLog: * config/aarch64/aarch64.h (VECTOR_STORE_FLAG_VALUE): New. gcc/testsuite/ChangeLog: * gcc.dg/rtl/aarch64/vector-eq.c: New test. Diff: --- gcc/config/aarch64/aarch64.h | 10 ++ gcc/testsuite/gcc.dg/rtl/aarch64/vector-eq.c | 29 2 files changed, 39 insertions(+) diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 2dfb999bea53..a99e7bb6c477 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -156,6 +156,16 @@ #define PCC_BITFIELD_TYPE_MATTERS 1 +/* Use the same RTL truth representation for vector elements as we do + for scalars. This maintains the property that a comparison like + eq:V4SI is a composition of 4 individual eq:SIs, just like plus:V4SI + is a composition of 4 individual plus:SIs. + + This means that Advanced SIMD comparisons are represented in RTL as + (neg (op ...)). */ + +#define VECTOR_STORE_FLAG_VALUE(MODE) CONST1_RTX (GET_MODE_INNER (MODE)) + #ifndef USED_FOR_TARGET /* Define an enum of all features (ISA modes, architectures and extensions). diff --git a/gcc/testsuite/gcc.dg/rtl/aarch64/vector-eq.c b/gcc/testsuite/gcc.dg/rtl/aarch64/vector-eq.c new file mode 100644 index ..8e0d7773620c --- /dev/null +++ b/gcc/testsuite/gcc.dg/rtl/aarch64/vector-eq.c @@ -0,0 +1,29 @@ +/* { dg-do compile { target aarch64-*-* } } */ +/* { dg-additional-options "-O2" } */ +/* { dg-final { check-function-bodies "**" "" "" } } */ + +/* +** foo: +** mvniv0.4s, 0 +** ret +*/ +__Uint32x4_t __RTL (startwith ("vregs")) foo (void) +{ +(function "foo" + (insn-chain +(block 2 + (edge-from entry (flags "FALLTHRU")) + (cnote 1 [bb 2] NOTE_INSN_BASIC_BLOCK) + (cnote 2 NOTE_INSN_FUNCTION_BEG) + (cinsn 3 (set (reg:V4SI <0>) (const_vector:V4SI [(const_int 0) (const_int 0) (const_int 0) (const_int 0)]))) + (cinsn 4 (set (reg:V4SI <1>) (reg:V4SI <0>))) + (cinsn 5 (set (reg:V4SI <2>) + (neg:V4SI (eq:V4SI (reg:V4SI <0>) (reg:V4SI <1>) + (cinsn 6 (set (reg:V4SI v0) (reg:V4SI <2>))) + (edge-to exit (flags "FALLTHRU")) +) + ) + (crtl (return_rtx (reg/i:V4SI v0))) +) +} +
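Conceptually the simplification applies to self-comparisons like the C intrinsics below. Note the committed test is written directly in RTL (__RTL startwith "vregs") precisely to pin down the RTL-level form, so this C version is only a loose illustration and may be folded earlier in practice.

#include <arm_neon.h>

/* Comparing a vector register with itself yields an all-true mask in every
   lane; with VECTOR_STORE_FLAG_VALUE defined, the (neg (eq ...)) form can be
   simplified to a single all-ones constant (mvni v0.4s, 0).  */
uint32x4_t
all_true_mask (void)
{
  uint32x4_t zero = vdupq_n_u32 (0);
  return vceqq_u32 (zero, zero);
}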
[gcc r15-3959] middle-end: check explicitly for external or constants when checking for loop invariant [PR116817]
https://gcc.gnu.org/g:87905f63a6521eef1f38082e2368e18c637ef092 commit r15-3959-g87905f63a6521eef1f38082e2368e18c637ef092 Author: Tamar Christina Date: Mon Sep 30 13:06:24 2024 +0100 middle-end: check explicitly for external or constants when checking for loop invariant [PR116817] The previous check for whether a value was external was checking !vect_get_internal_def (vinfo, var), but this of course isn't completely right as they could be reductions etc. This changes the check to just explicitly look at externals and constants. Note that reductions remain unhandled here, but we don't support codegen of boolean reductions today anyway. So at the time we do, this would have to be handled as well in lowering. gcc/ChangeLog: PR tree-optimization/116817 * tree-vect-patterns.cc (vect_recog_bool_pattern): Check for const or externals. gcc/testsuite/ChangeLog: PR tree-optimization/116817 * g++.dg/vect/pr116817.cc: New test. Diff: --- gcc/testsuite/g++.dg/vect/pr116817.cc | 16 gcc/tree-vect-patterns.cc | 5 - 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/g++.dg/vect/pr116817.cc b/gcc/testsuite/g++.dg/vect/pr116817.cc new file mode 100644 index ..7e28982fb138 --- /dev/null +++ b/gcc/testsuite/g++.dg/vect/pr116817.cc @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O3" } */ + +int main_ulData0; +unsigned *main_pSrcBuffer; +int main(void) { + int iSrc = 0; + bool bData0; + for (; iSrc < 4; iSrc++) { +if (bData0) + main_pSrcBuffer[iSrc] = main_ulData0; +else + main_pSrcBuffer[iSrc] = 0; +bData0 = !bData0; + } +} diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index e7e877dd2adb..b174ff1e705c 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -6062,12 +6062,15 @@ vect_recog_bool_pattern (vec_info *vinfo, if (get_vectype_for_scalar_type (vinfo, type) == NULL_TREE) return NULL; + enum vect_def_type dt; if (check_bool_pattern (var, vinfo, bool_stmts)) var = adjust_bool_stmts (vinfo, bool_stmts, type, stmt_vinfo); else if (integer_type_for_mask (var, vinfo)) return NULL; else if (TREE_CODE (TREE_TYPE (var)) == BOOLEAN_TYPE - && !vect_get_internal_def (vinfo, var)) + && vect_is_simple_use (var, vinfo, &dt) + && (dt == vect_external_def + || dt == vect_constant_def)) { /* If the condition is already a boolean then manually convert it to a mask of the given integer type but don't set a vectype. */
[gcc r14-10893] AArch64: rename the SVE2 psel intrinsics to psel_lane [PR116371]
https://gcc.gnu.org/g:97640e9632697b9f0ab31e4022d24d360d1ea2c9 commit r14-10893-g97640e9632697b9f0ab31e4022d24d360d1ea2c9 Author: Tamar Christina Date: Mon Oct 14 13:58:09 2024 +0100 AArch64: rename the SVE2 psel intrinsics to psel_lane [PR116371] The psel intrinsics. similar to the pext, should be name psel_lane. This corrects the naming. gcc/ChangeLog: PR target/116371 * config/aarch64/aarch64-sve-builtins-sve2.cc (class svpsel_impl): Renamed to ... (class svpsel_lane_impl): ... This and adjust initialization. * config/aarch64/aarch64-sve-builtins-sve2.def (svpsel): Renamed to ... (svpsel_lane): ... This. * config/aarch64/aarch64-sve-builtins-sve2.h (svpsel): Renamed to svpsel_lane. gcc/testsuite/ChangeLog: PR target/116371 * gcc.target/aarch64/sme2/acle-asm/psel_b16.c, gcc.target/aarch64/sme2/acle-asm/psel_b32.c, gcc.target/aarch64/sme2/acle-asm/psel_b64.c, gcc.target/aarch64/sme2/acle-asm/psel_b8.c, gcc.target/aarch64/sme2/acle-asm/psel_c16.c, gcc.target/aarch64/sme2/acle-asm/psel_c32.c, gcc.target/aarch64/sme2/acle-asm/psel_c64.c, gcc.target/aarch64/sme2/acle-asm/psel_c8.c: Renamed to * gcc.target/aarch64/sme2/acle-asm/psel_lane_b16.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b32.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b64.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b8.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c16.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c32.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c64.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c8.c: ... These. (cherry picked from commit 306834b7f74ab61160f205e04f5bf35b71f9ec52) Diff: --- gcc/config/aarch64/aarch64-sve-builtins-sve2.cc| 4 +- gcc/config/aarch64/aarch64-sve-builtins-sve2.def | 2 +- gcc/config/aarch64/aarch64-sve-builtins-sve2.h | 2 +- .../gcc.target/aarch64/sme2/acle-asm/psel_b16.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_b32.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_b64.c| 80 --- .../gcc.target/aarch64/sme2/acle-asm/psel_b8.c | 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c16.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c32.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c64.c| 80 --- .../gcc.target/aarch64/sme2/acle-asm/psel_c8.c | 89 -- .../aarch64/sme2/acle-asm/psel_lane_b16.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_b32.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_b64.c | 80 +++ .../aarch64/sme2/acle-asm/psel_lane_b8.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c16.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c32.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c64.c | 80 +++ .../aarch64/sme2/acle-asm/psel_lane_c8.c | 89 ++ 19 files changed, 698 insertions(+), 698 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc index 4f25cc680282..06d4d22fc0b2 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc @@ -234,7 +234,7 @@ public: } }; -class svpsel_impl : public function_base +class svpsel_lane_impl : public function_base { public: rtx @@ -625,7 +625,7 @@ FUNCTION (svpmullb, unspec_based_function, (-1, UNSPEC_PMULLB, -1)) FUNCTION (svpmullb_pair, unspec_based_function, (-1, UNSPEC_PMULLB_PAIR, -1)) FUNCTION (svpmullt, unspec_based_function, (-1, UNSPEC_PMULLT, -1)) FUNCTION (svpmullt_pair, unspec_based_function, (-1, UNSPEC_PMULLT_PAIR, -1)) -FUNCTION (svpsel, svpsel_impl,) +FUNCTION (svpsel_lane, svpsel_lane_impl,) FUNCTION (svqabs, rtx_code_function, (SS_ABS, UNKNOWN, UNKNOWN)) FUNCTION (svqcadd, svqcadd_impl,) FUNCTION 
(svqcvt, integer_conversion, (UNSPEC_SQCVT, UNSPEC_SQCVTU, diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.def b/gcc/config/aarch64/aarch64-sve-builtins-sve2.def index 4366925a9711..ef677a74020b 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.def +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.def @@ -235,7 +235,7 @@ DEF_SVE_FUNCTION (svsm4ekey, binary, s_unsigned, none) | AARCH64_FL_SME \ | AARCH64_FL_SM_ON) DEF_SVE_FUNCTION (svclamp, clamp, all_integer, none) -DEF_SVE_FUNCTION (svpsel, select_pred, all_pred_count, none) +DEF_SVE_FU
[gcc r15-5791] AArch64: Suppress default options when march or mcpu used is not affected by it.
https://gcc.gnu.org/g:5b0e4ed3081e6648460661ff5013e9f03e318505 commit r15-5791-g5b0e4ed3081e6648460661ff5013e9f03e318505 Author: Tamar Christina Date: Fri Nov 29 13:01:11 2024 + AArch64: Suppress default options when march or mcpu used is not affected by it. This patch makes it so that when you use any of the Cortex-A53 errata workarounds but have specified an -march or -mcpu we know is not affected by it that we suppress the errata workaround. This is a driver only patch as the linker invocation needs to be changed as well. The linker and cc SPECs are different because for the linker we didn't seem to add an inversion flag for the option. That said, it's also not possible to configure the linker with it on by default. So not passing the flag is sufficient to turn it off. For the compilers however we have an inversion flag using -mno-, which is needed to disable the workarounds when the compiler has been configured with it by default. In case it's unclear how the patch does what it does (it took me a while to figure out the syntax): * Early matching will replace any -march=native or -mcpu=native with their expanded forms and erases the native arguments from the buffer. * Due to the above if we ensure we handle the new code after this erasure then we only have to handle the expanded form. * The expanded form needs to handle -march=+extensions and -mcpu=+extensions and so we can't use normal string matching but instead use strstr with a custom driver function that's common between native and non-native builds. * For the compilers we output -mno- and for the linker we just erase the --fix- option. * The extra internal matching, e.g. the duplicate match of mcpu inside: mcpu=*:%{%:is_local_not_armv8_base(%{mcpu=*:%*}) is so we can extract the glob using %* because the outer match would otherwise reset at the %{. The reason for the outer glob at all is to skip the block early if no matches are found. The workaround has the effect of suppressing certain inlining and multiply-add formation which leads to about ~1% SPECCPU 2017 Intrate regression on modern cores. This patch is needed because most distros configure GCC with the workaround enabled by default. Expected output: > gcc -mcpu=neoverse-v1 -mfix-cortex-a53-835769 -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-mfix" | wc -l 0 > gcc -mfix-cortex-a53-835769 -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-mfix" | wc -l 5 > gcc -mfix-cortex-a53-835769 -march=armv8-a -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-mfix" | wc -l 5 > gcc -mfix-cortex-a53-835769 -march=armv8.1-a -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-mfix" | wc -l 0 > gcc -mfix-cortex-a53-835769 -march=armv8.1-a -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-\-fix" | wc -l 0 > gcc -mfix-cortex-a53-835769 -march=armv8-a -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-\-fix" | wc -l 1 > -gcc -mfix-cortex-a53-835769 -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-\-fix" | wc -l 1 gcc/ChangeLog: * config/aarch64/aarch64-errata.h (TARGET_SUPPRESS_OPT_SPEC, TARGET_TURN_OFF_OPT_SPEC, CA53_ERR_835769_COMPILE_SPEC, CA53_ERR_843419_COMPILE_SPEC): New. (CA53_ERR_835769_SPEC, CA53_ERR_843419_SPEC): Use them. * config/aarch64/aarch64-elf-raw.h (CC1_SPEC, CC1PLUS_SPEC): Add AARCH64_ERRATA_COMPILE_SPEC. * config/aarch64/aarch64-freebsd.h (CC1_SPEC, CC1PLUS_SPEC): Likewise. * config/aarch64/aarch64-gnu.h (CC1_SPEC, CC1PLUS_SPEC): Likewise. * config/aarch64/aarch64-linux.h (CC1_SPEC, CC1PLUS_SPEC): Likewise. * config/aarch64/aarch64-netbsd.h (CC1_SPEC, CC1PLUS_SPEC): Likewise. 
* common/config/aarch64/aarch64-common.cc (is_host_cpu_not_armv8_base): New. * config/aarch64/driver-aarch64.cc: Remove extra newline * config/aarch64/aarch64.h (is_host_cpu_not_armv8_base): New. (MCPU_TO_MARCH_SPEC_FUNCTIONS): Add is_local_not_armv8_base. (EXTRA_SPEC_FUNCTIONS): Add is_local_cpu_armv8_base. * doc/invoke.texi: Document it. gcc/testsuite/ChangeLog: * gcc.target/aarch64/cpunative/info_30: New test. * gcc.target/aarch64/cpunative/info_31: New test. * gcc.target/aarch64/cpunative/info_32: New test. * gcc.target/aarch64/cpunative/info_33: New test. * gcc.target/aarch64/cpunative/native_cpu_30.c: New test. * gcc.target/aarch64/cpunative/native_cpu_31.c: New test. * gcc.target/aarch64/cpunative/native_cpu_32.c: New test. * gcc.target/aarch64/cpunative/native_cpu_33.c: New test.
[gcc r15-5585] middle-end:For multiplication try swapping operands when matching complex multiply [PR116463]
https://gcc.gnu.org/g:a9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8 commit r15-5585-ga9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8 Author: Tamar Christina Date: Thu Nov 21 15:10:24 2024 + middle-end:For multiplication try swapping operands when matching complex multiply [PR116463] This commit fixes the failures of complex.exp=fast-math-complex-mls-*.c on the GCC 14 branch and some of the ones on the master. The current matching just looks for one order for multiplication and was relying on canonicalization to always give the right order because of the TWO_OPERANDS. However when it comes to the multiplication trying only one order is a bit fragile as they can be flipped. The failing tests on the branch are: void fms180snd(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { for (int i = 0; i < N; i++) c[i] -= a[i] * (b[i] * I * I); } void fms180fst(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { for (int i = 0; i < N; i++) c[i] -= (a[i] * I * I) * b[i]; } The issue is just a small difference in commutative operations. we look for {R,R} * {R,I} but found {R,I} * {R,R}. Since the DF analysis is cached, we should be able to swap operands and retry for multiply cheaply. There is a constraint being checked by vect_validate_multiplication for the data flow of the operands feeding the multiplications. So e.g. between the nodes: note: node 0x4d1d210 (max_nunits=2, refcnt=3) vector(2) double note: op template: _27 = _10 * _25; note: stmt 0 _27 = _10 * _25; note: stmt 1 _29 = _11 * _25; note: node 0x4d1d060 (max_nunits=2, refcnt=2) vector(2) double note: op template: _26 = _11 * _24; note: stmt 0 _26 = _11 * _24; note: stmt 1 _28 = _10 * _24; we require the lanes to come from the same source which vect_validate_multiplication checks. As such it doesn't make sense to flip them individually because that would invalidate the earlier linear_loads_p checks which have validated that the arguments all come from the same datarefs. This patch thus flips the operands in unison to still maintain this invariant, but also honor the commutative nature of multiplication. gcc/ChangeLog: PR tree-optimization/116463 * tree-vect-slp-patterns.cc (complex_mul_pattern::matches, complex_fms_pattern::matches): Try swapping operands on multiply. Diff: --- gcc/tree-vect-slp-patterns.cc | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/gcc/tree-vect-slp-patterns.cc b/gcc/tree-vect-slp-patterns.cc index d62682be43c9..2535d46db3e8 100644 --- a/gcc/tree-vect-slp-patterns.cc +++ b/gcc/tree-vect-slp-patterns.cc @@ -1076,7 +1076,15 @@ complex_mul_pattern::matches (complex_operation_t op, enum _conj_status status; if (!vect_validate_multiplication (perm_cache, compat_cache, left_op, right_op, false, &status)) -return IFN_LAST; +{ + /* Try swapping the order and re-trying since multiplication is +commutative. */ + std::swap (left_op[0], left_op[1]); + std::swap (right_op[0], right_op[1]); + if (!vect_validate_multiplication (perm_cache, compat_cache, left_op, +right_op, false, &status)) + return IFN_LAST; +} if (status == CONJ_NONE) { @@ -1293,7 +1301,15 @@ complex_fms_pattern::matches (complex_operation_t op, enum _conj_status status; if (!vect_validate_multiplication (perm_cache, compat_cache, right_op, left_op, true, &status)) -return IFN_LAST; +{ + /* Try swapping the order and re-trying since multiplication is +commutative. 
*/ + std::swap (left_op[0], left_op[1]); + std::swap (right_op[0], right_op[1]); + if (!vect_validate_multiplication (perm_cache, compat_cache, right_op, +left_op, true, &status)) + return IFN_LAST; +} if (status == CONJ_NONE) ifn = IFN_COMPLEX_FMS;
[gcc r15-5745] middle-end: rework vectorizable_store to iterate over single index [PR117557]
https://gcc.gnu.org/g:1b3bff737b2d5a7dc0d5977b77200c785fc53f01 commit r15-5745-g1b3bff737b2d5a7dc0d5977b77200c785fc53f01 Author: Tamar Christina Date: Thu Nov 28 10:23:14 2024 + middle-end: rework vectorizable_store to iterate over single index [PR117557] The testcase #include #include #define N 8 #define L 8 void f(const uint8_t * restrict seq1, const uint8_t *idx, uint8_t *seq_out) { for (int i = 0; i < L; ++i) { uint8_t h = idx[i]; memcpy((void *)&seq_out[i * N], (const void *)&seq1[h * N / 2], N / 2); } } compiled at -O3 -mcpu=neoverse-n1+sve miscompiles to: ld1wz31.s, p3/z, [x23, z29.s, sxtw] ld1wz29.s, p7/z, [x23, z30.s, sxtw] st1wz29.s, p7, [x24, z12.s, sxtw] st1wz31.s, p7, [x24, z12.s, sxtw] rather than ld1wz31.s, p3/z, [x23, z29.s, sxtw] ld1wz29.s, p7/z, [x23, z30.s, sxtw] st1wz29.s, p7, [x24, z12.s, sxtw] addvl x3, x24, #2 st1wz31.s, p3, [x3, z12.s, sxtw] Where two things go wrong, the wrong mask is used and the address pointers to the stores are wrong. This issue is happening because the codegen loop in vectorizable_store is a nested loop where in the outer loop we iterate over ncopies and in the inner loop we loop over vec_num. For SLP ncopies == 1 and vec_num == SLP_NUM_STMS, but the loop mask is determined by only the outerloop index and the pointer address is only updated in the outer loop. As such for SLP we always use the same predicate and the same memory location. This patch flattens the two loops and instead iterates over ncopies * vec_num and simplified the indexing. This does not fully fix the gcc_r miscompile error in SPECCPU 2017 as the error moves somewhere else. I will look at that next but fixes some other libraries that also started failing. gcc/ChangeLog: PR tree-optimization/117557 * tree-vect-stmts.cc (vectorizable_store): Flatten the ncopies and vec_num loops. gcc/testsuite/ChangeLog: PR tree-optimization/117557 * gcc.target/aarch64/pr117557.c: New test. Diff: --- gcc/testsuite/gcc.target/aarch64/pr117557.c | 29 ++ gcc/tree-vect-stmts.cc | 504 ++-- 2 files changed, 281 insertions(+), 252 deletions(-) diff --git a/gcc/testsuite/gcc.target/aarch64/pr117557.c b/gcc/testsuite/gcc.target/aarch64/pr117557.c new file mode 100644 index ..80b3fde41109 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr117557.c @@ -0,0 +1,29 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -mcpu=neoverse-n1+sve -fdump-tree-vect" } */ +/* { dg-final { check-function-bodies "**" "" } } */ + +#include +#include + +#define N 8 +#define L 8 + +/* +**f: +** ... +** ld1wz[0-9]+.s, p([0-9]+)/z, \[x[0-9]+, z[0-9]+.s, sxtw\] +** ld1wz[0-9]+.s, p([0-9]+)/z, \[x[0-9]+, z[0-9]+.s, sxtw\] +** st1wz[0-9]+.s, p\1, \[x[0-9]+, z[0-9]+.s, sxtw\] +** incbx([0-9]+), all, mul #2 +** st1wz[0-9]+.s, p\2, \[x\3, z[0-9]+.s, sxtw\] +** ret +** ... 
+*/ +void f(const uint8_t * restrict seq1, + const uint8_t *idx, uint8_t *seq_out) { + for (int i = 0; i < L; ++i) { +uint8_t h = idx[i]; +memcpy((void *)&seq_out[i * N], (const void *)&seq1[h * N / 2], N / 2); + } +} + diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index c2d5818b2786..4759c274f3cc 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -9228,7 +9228,8 @@ vectorizable_store (vec_info *vinfo, gcc_assert (!grouped_store); auto_vec vec_offsets; unsigned int inside_cost = 0, prologue_cost = 0; - for (j = 0; j < ncopies; j++) + int num_stmts = ncopies * vec_num; + for (j = 0; j < num_stmts; j++) { gimple *new_stmt; if (j == 0) @@ -9246,14 +9247,14 @@ vectorizable_store (vec_info *vinfo, vect_get_slp_defs (op_node, gvec_oprnds[0]); else vect_get_vec_defs_for_operand (vinfo, first_stmt_info, - ncopies, op, gvec_oprnds[0]); + num_stmts, op, gvec_oprnds[0]); if (mask) { if (slp_node) vect_get_slp_defs (mask_node, &vec_masks); else vect_get_vec_defs_for_operand (vinfo, stmt_info, - ncopies, + num_stmts, mask, &vec_masks, mask_vectype);
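The structural fix is easier to see stripped of the vectorizer data structures: with SLP, ncopies is 1 and vec_num is the number of SLP statements, so any state derived only from the outer index (the loop mask, the store address) stays constant across the group. A generic sketch of the flattening, with illustrative names only, not vectorizable_store itself:

/* Illustrative only -- not vectorizable_store.  */
static void
emit_group_stores (unsigned ncopies, unsigned vec_num)
{
  unsigned num_stmts = ncopies * vec_num;
  for (unsigned j = 0; j < num_stmts; j++)
    {
      /* Before the fix, the mask choice and the pointer bump were driven by
         the outer loop over ncopies only, so with SLP (ncopies == 1) every
         statement in the group reused the same mask and the same address.
         With a single flattened index both are refreshed per statement.  */
      unsigned mask_index = j;
      unsigned pointer_bump = j;
      (void) mask_index;
      (void) pointer_bump;
      /* ... emit one vector store for statement j here ... */
    }
}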
[gcc r14-11053] middle-end:For multiplication try swapping operands when matching complex multiply [PR116463]
https://gcc.gnu.org/g:f01f01f0ebf8f5207096cb9650354210d890fe0d commit r14-11053-gf01f01f0ebf8f5207096cb9650354210d890fe0d Author: Tamar Christina Date: Thu Nov 21 15:10:24 2024 + middle-end:For multiplication try swapping operands when matching complex multiply [PR116463] This commit fixes the failures of complex.exp=fast-math-complex-mls-*.c on the GCC 14 branch and some of the ones on the master. The current matching just looks for one order for multiplication and was relying on canonicalization to always give the right order because of the TWO_OPERANDS. However when it comes to the multiplication trying only one order is a bit fragile as they can be flipped. The failing tests on the branch are: void fms180snd(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { for (int i = 0; i < N; i++) c[i] -= a[i] * (b[i] * I * I); } void fms180fst(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { for (int i = 0; i < N; i++) c[i] -= (a[i] * I * I) * b[i]; } The issue is just a small difference in commutative operations. we look for {R,R} * {R,I} but found {R,I} * {R,R}. Since the DF analysis is cached, we should be able to swap operands and retry for multiply cheaply. There is a constraint being checked by vect_validate_multiplication for the data flow of the operands feeding the multiplications. So e.g. between the nodes: note: node 0x4d1d210 (max_nunits=2, refcnt=3) vector(2) double note: op template: _27 = _10 * _25; note: stmt 0 _27 = _10 * _25; note: stmt 1 _29 = _11 * _25; note: node 0x4d1d060 (max_nunits=2, refcnt=2) vector(2) double note: op template: _26 = _11 * _24; note: stmt 0 _26 = _11 * _24; note: stmt 1 _28 = _10 * _24; we require the lanes to come from the same source which vect_validate_multiplication checks. As such it doesn't make sense to flip them individually because that would invalidate the earlier linear_loads_p checks which have validated that the arguments all come from the same datarefs. This patch thus flips the operands in unison to still maintain this invariant, but also honor the commutative nature of multiplication. gcc/ChangeLog: PR tree-optimization/116463 * tree-vect-slp-patterns.cc (complex_mul_pattern::matches, complex_fms_pattern::matches): Try swapping operands on multiply. (cherry picked from commit a9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8) Diff: --- gcc/tree-vect-slp-patterns.cc | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/gcc/tree-vect-slp-patterns.cc b/gcc/tree-vect-slp-patterns.cc index 4a582ec9512e..3bb283a3b5b4 100644 --- a/gcc/tree-vect-slp-patterns.cc +++ b/gcc/tree-vect-slp-patterns.cc @@ -1069,7 +1069,15 @@ complex_mul_pattern::matches (complex_operation_t op, enum _conj_status status; if (!vect_validate_multiplication (perm_cache, compat_cache, left_op, right_op, false, &status)) -return IFN_LAST; +{ + /* Try swapping the order and re-trying since multiplication is +commutative. */ + std::swap (left_op[0], left_op[1]); + std::swap (right_op[0], right_op[1]); + if (!vect_validate_multiplication (perm_cache, compat_cache, left_op, +right_op, false, &status)) + return IFN_LAST; +} if (status == CONJ_NONE) { @@ -1286,7 +1294,15 @@ complex_fms_pattern::matches (complex_operation_t op, enum _conj_status status; if (!vect_validate_multiplication (perm_cache, compat_cache, right_op, left_op, true, &status)) -return IFN_LAST; +{ + /* Try swapping the order and re-trying since multiplication is +commutative. 
*/ + std::swap (left_op[0], left_op[1]); + std::swap (right_op[0], right_op[1]); + if (!vect_validate_multiplication (perm_cache, compat_cache, right_op, +left_op, true, &status)) + return IFN_LAST; +} if (status == CONJ_NONE) ifn = IFN_COMPLEX_FMS;
[gcc r15-6654] cfgexpand: Factor out getting the stack decl index
https://gcc.gnu.org/g:4b1a2878ba3241ec5c0a1bf05ff47bfcd09c3711 commit r15-6654-g4b1a2878ba3241ec5c0a1bf05ff47bfcd09c3711 Author: Andrew Pinski Date: Fri Nov 15 20:22:02 2024 -0800 cfgexpand: Factor out getting the stack decl index This is the first patch in improving this code. Since there are a few places which get the index and they check the same thing, let's factor that out into one function. Bootstrapped and tested on x86_64-linux-gnu. gcc/ChangeLog: * cfgexpand.cc (INVALID_STACK_INDEX): New defined. (decl_stack_index): New function. (visit_op): Use decl_stack_index. (visit_conflict): Likewise. (add_scope_conflicts_1): Likewise. Signed-off-by: Andrew Pinski Diff: --- gcc/cfgexpand.cc | 62 +--- 1 file changed, 37 insertions(+), 25 deletions(-) diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc index abab385293a5..cdebb00cd792 100644 --- a/gcc/cfgexpand.cc +++ b/gcc/cfgexpand.cc @@ -337,6 +337,8 @@ static unsigned stack_vars_alloc; static unsigned stack_vars_num; static hash_map *decl_to_stack_part; +#define INVALID_STACK_INDEX ((unsigned)-1) + /* Conflict bitmaps go on this obstack. This allows us to destroy all of them in one big sweep. */ static bitmap_obstack stack_var_bitmap_obstack; @@ -525,6 +527,27 @@ stack_var_conflict_p (unsigned x, unsigned y) return bitmap_bit_p (a->conflicts, y); } +/* Returns the DECL's index into the stack_vars array. + If the DECL does not exist return INVALID_STACK_INDEX. */ +static unsigned +decl_stack_index (tree decl) +{ + if (!decl) +return INVALID_STACK_INDEX; + if (!DECL_P (decl)) +return INVALID_STACK_INDEX; + if (DECL_RTL_IF_SET (decl) != pc_rtx) +return INVALID_STACK_INDEX; + unsigned *v = decl_to_stack_part->get (decl); + if (!v) +return INVALID_STACK_INDEX; + + unsigned indx = *v; + gcc_checking_assert (indx != INVALID_STACK_INDEX); + gcc_checking_assert (indx < stack_vars_num); + return indx; +} + /* Callback for walk_stmt_ops. If OP is a decl touched by add_stack_var enter its partition number into bitmap DATA. */ @@ -533,14 +556,9 @@ visit_op (gimple *, tree op, tree, void *data) { bitmap active = (bitmap)data; op = get_base_address (op); - if (op - && DECL_P (op) - && DECL_RTL_IF_SET (op) == pc_rtx) -{ - unsigned *v = decl_to_stack_part->get (op); - if (v) - bitmap_set_bit (active, *v); -} + unsigned idx = decl_stack_index (op); + if (idx != INVALID_STACK_INDEX) +bitmap_set_bit (active, idx); return false; } @@ -553,20 +571,15 @@ visit_conflict (gimple *, tree op, tree, void *data) { bitmap active = (bitmap)data; op = get_base_address (op); - if (op - && DECL_P (op) - && DECL_RTL_IF_SET (op) == pc_rtx) + unsigned num = decl_stack_index (op); + if (num != INVALID_STACK_INDEX + && bitmap_set_bit (active, num)) { - unsigned *v = decl_to_stack_part->get (op); - if (v && bitmap_set_bit (active, *v)) - { - unsigned num = *v; - bitmap_iterator bi; - unsigned i; - gcc_assert (num < stack_vars_num); - EXECUTE_IF_SET_IN_BITMAP (active, 0, i, bi) - add_stack_var_conflict (num, i); - } + bitmap_iterator bi; + unsigned i; + gcc_assert (num < stack_vars_num); + EXECUTE_IF_SET_IN_BITMAP (active, 0, i, bi) + add_stack_var_conflict (num, i); } return false; } @@ -638,15 +651,14 @@ add_scope_conflicts_1 (basic_block bb, bitmap work, bool for_conflict) if (gimple_clobber_p (stmt)) { tree lhs = gimple_assign_lhs (stmt); - unsigned *v; /* Handle only plain var clobbers. Nested functions lowering and C++ front-end inserts clobbers which are not just plain variables. 
*/ if (!VAR_P (lhs)) continue; - if (DECL_RTL_IF_SET (lhs) == pc_rtx - && (v = decl_to_stack_part->get (lhs))) - bitmap_clear_bit (work, *v); + unsigned indx = decl_stack_index (lhs); + if (indx != INVALID_STACK_INDEX) + bitmap_clear_bit (work, indx); } else if (!is_gimple_debug (stmt)) {
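The refactor above replaces three copies of the same "look up, validate, dereference" sequence with one helper that reports failure through a sentinel index. A minimal standalone analogue of that shape, with illustrative names rather than GCC's internals:

```
// Standalone analogue of the decl_stack_index() refactor: call sites that
// used to repeat "look up, check, dereference" share one helper that
// returns a sentinel on failure.  Names here are illustrative only.
#include <cstdio>
#include <string>
#include <unordered_map>

static const unsigned INVALID_INDEX = (unsigned) -1;
static std::unordered_map<std::string, unsigned> decl_to_part;

/* Return the index recorded for DECL, or INVALID_INDEX if none exists.  */
static unsigned
lookup_index (const std::string &decl)
{
  auto it = decl_to_part.find (decl);
  if (it == decl_to_part.end ())
    return INVALID_INDEX;
  return it->second;
}

int
main ()
{
  decl_to_part["a"] = 0;
  decl_to_part["b"] = 3;

  const char *names[] = { "a", "b", "c" };
  for (const char *name : names)
    {
      unsigned idx = lookup_index (name);
      if (idx != INVALID_INDEX)
        std::printf ("%s -> %u\n", name, idx);
      else
        std::printf ("%s -> <no stack slot>\n", name);
    }
  return 0;
}
```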
[gcc r15-6656] cfgexpand: Handle integral vector types and constructors for scope conflicts [PR105769]
https://gcc.gnu.org/g:4f4722b0722ec343df70e5ec5fd9d5c682ff8149 commit r15-6656-g4f4722b0722ec343df70e5ec5fd9d5c682ff8149 Author: Andrew Pinski Date: Fri Nov 15 20:22:04 2024 -0800 cfgexpand: Handle integral vector types and constructors for scope conflicts [PR105769] This is an expansion of the last patch to also track pointers via vector types and the constructor that are used with vector types. In this case we had: ``` _15 = (long unsigned int) &bias; _10 = (long unsigned int) &cov_jn; _12 = {_10, _15}; ... MEM[(struct vec *)&cov_jn] ={v} {CLOBBER(bob)}; bias ={v} {CLOBBER(bob)}; MEM[(struct function *)&D.6156] ={v} {CLOBBER(bob)}; ... MEM [(void *)&D.6172 + 32B] = _12; MEM[(struct function *)&D.6157] ={v} {CLOBBER(bob)}; ``` Anyways tracking the pointers via vector types to say they are alive at the point where the store of the vector happens fixes the bug by saying it is alive at the same time as another variable is alive. Bootstrapped and tested on x86_64-linux-gnu. PR tree-optimization/105769 gcc/ChangeLog: * cfgexpand.cc (vars_ssa_cache::operator()): For constructors walk over the elements. gcc/testsuite/ChangeLog: * g++.dg/torture/pr105769-1.C: New test. Signed-off-by: Andrew Pinski Diff: --- gcc/cfgexpand.cc | 20 +++-- gcc/testsuite/g++.dg/torture/pr105769-1.C | 67 +++ 2 files changed, 83 insertions(+), 4 deletions(-) diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc index f6c9f7755a4c..2b27076658fd 100644 --- a/gcc/cfgexpand.cc +++ b/gcc/cfgexpand.cc @@ -728,7 +728,7 @@ vars_ssa_cache::operator() (tree name) gcc_assert (TREE_CODE (name) == SSA_NAME); if (!POINTER_TYPE_P (TREE_TYPE (name)) - && !INTEGRAL_TYPE_P (TREE_TYPE (name))) + && !ANY_INTEGRAL_TYPE_P (TREE_TYPE (name))) return empty; if (exists (name)) @@ -758,7 +758,7 @@ vars_ssa_cache::operator() (tree name) continue; if (!POINTER_TYPE_P (TREE_TYPE (use)) - && !INTEGRAL_TYPE_P (TREE_TYPE (use))) + && !ANY_INTEGRAL_TYPE_P (TREE_TYPE (use))) continue; /* Mark the old ssa name needs to be update from the use. */ @@ -772,10 +772,22 @@ vars_ssa_cache::operator() (tree name) so we don't go into an infinite loop for some phi nodes with loops. */ create (use); + gimple *g = SSA_NAME_DEF_STMT (use); + + /* CONSTRUCTOR here is always a vector initialization, +walk each element too. */ + if (gimple_assign_single_p (g) + && TREE_CODE (gimple_assign_rhs1 (g)) == CONSTRUCTOR) + { + tree ctr = gimple_assign_rhs1 (g); + unsigned i; + tree elm; + FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (ctr), i, elm) + work_list.safe_push (std::make_pair (elm, use)); + } /* For assignments, walk each operand for possible addresses. For PHI nodes, walk each argument. */ - gimple *g = SSA_NAME_DEF_STMT (use); - if (gassign *a = dyn_cast (g)) + else if (gassign *a = dyn_cast (g)) { /* operand 0 is the lhs. */ for (unsigned i = 1; i < gimple_num_ops (g); i++) diff --git a/gcc/testsuite/g++.dg/torture/pr105769-1.C b/gcc/testsuite/g++.dg/torture/pr105769-1.C new file mode 100644 index ..3fe973656b84 --- /dev/null +++ b/gcc/testsuite/g++.dg/torture/pr105769-1.C @@ -0,0 +1,67 @@ +// { dg-do run } + +// PR tree-optimization/105769 + +// The partitioning code would incorrectly have bias +// and a temporary in the same partitioning because +// it was thought bias was not alive when those were alive +// do to vectorization of a store of pointers (that included bias). 
+ +#include + +template +struct vec { + T dat[n]; + vec() {} + explicit vec(const T& x) { for(size_t i = 0; i < n; i++) dat[i] = x; } + T& operator [](size_t i) { return dat[i]; } + const T& operator [](size_t i) const { return dat[i]; } +}; + +template +using mat = vec>; +template +using sq_mat = mat; +using map_t = std::function; +template +using est_t = std::function; +template using est2_t = std::function; +map_t id_map() { return [](size_t j) -> size_t { return j; }; } + +template +est2_t jacknife(const est_t> est, sq_mat& cov, vec& bias) { + return [est, &cov, &bias](map_t map) -> void + { +bias = est(map); +for(size_t i = 0; i < n; i++) +{ + bias[i].print(); +} + }; +} + +template +void print_cov_ratio() { + sq_mat<2, T> cov_jn; + vec<2, T> bias; + jacknife<2, T>([](map_t map) -> vec<2, T> { vec<2, T> retv; retv[0] = 1; retv[1] = 1; return retv; }, cov_jn, bias)(id_map()); +} +struct ab { + long long unsigned a; + short unsigned b; + double operator()() { return a; } + ab& operator=(double rhs) { a = rhs; return *this; } + void print(); +};
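The key point of the fix is that a vector CONSTRUCTOR on the right-hand side of a store has to be walked element by element, otherwise addresses packed into it (here &cov_jn and &bias) never reach the liveness bitmaps. A toy sketch of that walk, using a made-up expression node rather than GCC's tree/gimple types:

```
// Toy sketch (not GCC code) of the fix: when the stored value is a vector
// CONSTRUCTOR, every element must be walked too, otherwise addresses packed
// into the vector are missed and the variables look dead at the store.
#include <cstdio>
#include <string>
#include <vector>

struct node
{
  enum { ADDR, CTOR } kind;
  std::string var;          // for ADDR: the variable whose address is taken
  std::vector<node> elems;  // for CTOR: the packed elements
};

/* Collect every variable whose address is reachable from N.  */
static void
collect_addresses (const node &n, std::vector<std::string> &out)
{
  if (n.kind == node::ADDR)
    out.push_back (n.var);
  else
    for (const node &e : n.elems)   // the fix: walk constructor elements
      collect_addresses (e, out);
}

int
main ()
{
  //   _10 = &cov_jn;  _15 = &bias;  _12 = { _10, _15 };
  node ctor = { node::CTOR, "", { { node::ADDR, "cov_jn", {} },
                                  { node::ADDR, "bias", {} } } };
  std::vector<std::string> live;
  collect_addresses (ctor, live);
  for (const std::string &v : live)
    std::printf ("address of %s reaches the store\n", v.c_str ());
  return 0;
}
```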
[gcc r15-6655] cfgexpand: Rewrite add_scope_conflicts_2 to use cache and look back further [PR111422]
https://gcc.gnu.org/g:0014a858a14b825818d6b557c3d5193f85790bde commit r15-6655-g0014a858a14b825818d6b557c3d5193f85790bde Author: Andrew Pinski Date: Fri Nov 15 20:22:03 2024 -0800 cfgexpand: Rewrite add_scope_conflicts_2 to use cache and look back further [PR111422] After fixing loop-im to do the correct overflow rewriting for pointer types too. We end up with code like: ``` _9 = (unsigned long) &g; _84 = _9 + 18446744073709551615; _11 = _42 + _84; _44 = (signed char *) _11; ... *_44 = 10; g ={v} {CLOBBER(eos)}; ... n[0] = &f; *_44 = 8; g ={v} {CLOBBER(eos)}; ``` Which was not being recongized by the scope conflicts code. This was because it only handled one level walk backs rather than multiple ones. This fixes the issue by having a cache which records all references to addresses of stack variables. Unlike the previous patch, this only records and looks at addresses of stack variables. The cache uses a bitmap and uses the index as the bit to look at. PR middle-end/117426 PR middle-end/111422 gcc/ChangeLog: * cfgexpand.cc (struct vars_ssa_cache): New class. (vars_ssa_cache::vars_ssa_cache): New constructor. (vars_ssa_cache::~vars_ssa_cache): New deconstructor. (vars_ssa_cache::create): New method. (vars_ssa_cache::exists): New method. (vars_ssa_cache::add_one): New method. (vars_ssa_cache::update): New method. (vars_ssa_cache::dump): New method. (add_scope_conflicts_2): Factor mostly out to vars_ssa_cache::operator(). New cache argument. Walk the bitmap cache for the stack variables addresses. (vars_ssa_cache::operator()): New method factored out from add_scope_conflicts_2. Rewrite to be a full walk of all operands and use a worklist. (add_scope_conflicts_1): Add cache new argument for the addr cache. Just call add_scope_conflicts_2 for the phi result instead of calling for the uses and don't call walk_stmt_load_store_addr_ops for phis. Update call to add_scope_conflicts_2 to add cache argument. (add_scope_conflicts): Add cache argument and update calls to add_scope_conflicts_1. gcc/testsuite/ChangeLog: * gcc.dg/torture/pr117426-1.c: New test. Signed-off-by: Andrew Pinski Diff: --- gcc/cfgexpand.cc | 292 ++ gcc/testsuite/gcc.dg/torture/pr117426-1.c | 53 ++ 2 files changed, 308 insertions(+), 37 deletions(-) diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc index cdebb00cd792..f6c9f7755a4c 100644 --- a/gcc/cfgexpand.cc +++ b/gcc/cfgexpand.cc @@ -584,35 +584,243 @@ visit_conflict (gimple *, tree op, tree, void *data) return false; } -/* Helper function for add_scope_conflicts_1. For USE on - a stmt, if it is a SSA_NAME and in its SSA_NAME_DEF_STMT is known to be - based on some ADDR_EXPR, invoke VISIT on that ADDR_EXPR. */ +/* A cache for ssa name to address of stack variables. + When taking into account if a ssa name refers to an + address of a stack variable, we need to walk the + expressions backwards to find the addresses. This + cache is there so we don't need to walk the expressions + all the time. */ +struct vars_ssa_cache +{ +private: + /* Currently an entry is a bitmap of all of the known stack variables + addresses that are referenced by the ssa name. + When the bitmap is the nullptr, then there is no cache. + Currently only empty bitmaps are shared. + The reason for why empty cache is not just a null is so we know the + cache for an entry is filled in. 
*/ + struct entry + { +bitmap bmap = nullptr; + }; + entry *vars_ssa_caches; +public: -static inline void -add_scope_conflicts_2 (tree use, bitmap work, - walk_stmt_load_store_addr_fn visit) + vars_ssa_cache(); + ~vars_ssa_cache(); + const_bitmap operator() (tree name); + void dump (FILE *file); + +private: + /* Can't copy. */ + vars_ssa_cache(const vars_ssa_cache&) = delete; + vars_ssa_cache(vars_ssa_cache&&) = delete; + + /* The shared empty bitmap. */ + bitmap empty; + + /* Unshare the index, currently only need + to unshare if the entry was empty. */ + void unshare(int indx) + { +if (vars_ssa_caches[indx].bmap == empty) + vars_ssa_caches[indx].bmap = BITMAP_ALLOC (&stack_var_bitmap_obstack); + } + void create (tree); + bool exists (tree use); + void add_one (tree old_name, unsigned); + bool update (tree old_name, tree use); +}; + +/* Constructor of the cache, create the cache array. */ +vars_ssa_cache::vars_ssa_cache () +{ + vars_ssa_caches = new entry[num_ssa_names]{}; + + /* Create the shared empty bitmap too. */ + empty = BITMAP_ALLOC (&stack_var_bitmap_
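A standalone sketch of the caching scheme the comment describes, using invented names: every computed entry initially points at one shared empty set, which doubles as the "already filled in" marker, and an entry is unshared only when a stack-variable index actually has to be added to it:

```
// Standalone sketch of the shared-empty cache scheme (not the GCC
// implementation): entries start out pointing at one shared empty set,
// which marks "computed" without allocating; unshare only on first add.
#include <cassert>
#include <memory>
#include <set>
#include <vector>

struct ssa_addr_cache
{
  std::shared_ptr<std::set<unsigned>> empty
    = std::make_shared<std::set<unsigned>> ();
  std::vector<std::shared_ptr<std::set<unsigned>>> entries;

  explicit ssa_addr_cache (unsigned num_names) : entries (num_names) {}

  bool exists (unsigned name) const { return entries[name] != nullptr; }

  /* Mark NAME as computed; initially it references no stack variable.  */
  void create (unsigned name) { entries[name] = empty; }

  /* Record that NAME refers to stack variable VAR, unsharing if needed.  */
  void add_one (unsigned name, unsigned var)
  {
    if (entries[name] == empty)
      entries[name] = std::make_shared<std::set<unsigned>> ();
    entries[name]->insert (var);
  }
};

int
main ()
{
  ssa_addr_cache cache (4);
  cache.create (0);              // _0 references no stack variable
  cache.create (1);
  cache.add_one (1, 7);          // _1 refers to stack variable 7
  assert (cache.exists (1) && !cache.exists (2));
  assert (cache.entries[0]->empty () && cache.entries[1]->count (7));
  return 0;
}
```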
[gcc r15-6657] perform affine fold to unsigned on non address expressions. [PR114932]
https://gcc.gnu.org/g:405c99c17210a58df1a237219e773e689f17 commit r15-6657-g405c99c17210a58df1a237219e773e689f17 Author: Tamar Christina Date: Mon Jan 6 17:52:14 2025 + perform affine fold to unsigned on non address expressions. [PR114932] When the patch for PR114074 was applied we saw a good boost in exchange2. This boost was partially caused by a simplification of the addressing modes. With the patch applied IV opts saw the following form for the base addressing; Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36) vs what we normally get: Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4 This is because the patch promoted multiplies where one operand is a constant from a signed multiply to an unsigned one, to attempt to fold away the constant. This patch attempts the same but due to the various problems with SCEV and niters not being able to analyze the resulting forms (i.e. PR114322) we can't do it during SCEV or in the general form like in fold-const like extract_muldiv attempts. Instead this applies the simplification during IVopts initialization when we create the IV. This allows IV opts to see the simplified form without influencing the rest of the compiler. as mentioned in PR114074 it would be good to fix the missed optimization in the other passes so we can perform this in general. The reason this has a big impact on Fortran code is that Fortran doesn't seem to have unsigned integer types. As such all it's addressing are created with signed types and folding does not happen on them due to the possible overflow. concretely on AArch64 this changes the results from generation: mov x27, -108 mov x24, -72 mov x23, -36 add x21, x1, x0, lsl 2 add x19, x20, x22 .L5: add x0, x22, x19 add x19, x19, 324 ldr d1, [x0, x27] add v1.2s, v1.2s, v15.2s str d1, [x20, 216] ldr d0, [x0, x24] add v0.2s, v0.2s, v15.2s str d0, [x20, 252] ldr d31, [x0, x23] add v31.2s, v31.2s, v15.2s str d31, [x20, 288] bl digits_20_ cmp x21, x19 bne .L5 into: .L5: ldr d1, [x19, -108] add v1.2s, v1.2s, v15.2s str d1, [x20, 216] ldr d0, [x19, -72] add v0.2s, v0.2s, v15.2s str d0, [x20, 252] ldr d31, [x19, -36] add x19, x19, 324 add v31.2s, v31.2s, v15.2s str d31, [x20, 288] bl digits_20_ cmp x21, x19 bne .L5 The two patches together results in a 10% performance increase in exchange2 in SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in compile time. There's also a 5% performance improvement in fotonik3d and similar reduction in binary size. The patch folds every IV to unsigned to canonicalize them. At the end of the pass we match.pd will then remove unneeded conversions. Note that we cannot force everything to unsigned, IVops requires that array address expressions remain as such. Folding them results in them becoming pointer expressions for which some optimizations in IVopts do not run. gcc/ChangeLog: PR tree-optimization/114932 * tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold. gcc/testsuite/ChangeLog: PR tree-optimization/114932 * gcc.dg/tree-ssa/pr64705.c: Update dump file scan. * gcc.target/i386/pr115462.c: The testcase shares 3 IVs which calculates the same thing but with a slightly different increment offset. The test checks for 3 complex addressing loads, one for each IV. But with this change they now all share one IV. That is the loop now only has one complex addressing. This is ultimately driven by the backend costing and the current costing says this is preferred so updating the testcase. 
* gfortran.dg/addressing-modes_1.f90: New test. Diff: --- gcc/testsuite/gcc.dg/tree-ssa/pr64705.c | 2 +- gcc/testsuite/gcc.target/i386/pr115462.c | 2 +- gcc/testsuite/gfortran.dg/addressing-modes_1.f90 | 37 gcc/tree-ssa-loop-ivopts.cc | 20 ++--- 4 files changed, 49 insertions(+), 12 deletions(-) diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c index fd24e38a53e9..3c9c2e5deed1 100644
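A quick numerical check of the arithmetic behind the fold: the original signed base ((long)l0 * 81 + 9) * 4 and the folded unsigned base (unsigned long)l0 * 324 + 36 produce the same byte offset once both are reduced modulo 2^64, which is why the rewrite is safe to apply when the IV is created:

```
// Check that the two base expressions from the commit message agree
// bit-for-bit once evaluated in the unsigned (wrapping) domain.
#include <cassert>
#include <cstdint>

int
main ()
{
  for (std::int64_t l0 = -1000; l0 <= 1000; ++l0)
    {
      std::uint64_t folded = (std::uint64_t) l0 * 324 + 36;
      std::uint64_t orig = (std::uint64_t) (l0 * 81 + 9) * 4;
      assert (folded == orig);
    }
  return 0;
}
```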
[gcc r15-6597] AArch64: Implement four and eight chunk VLA concats [PR118272]
https://gcc.gnu.org/g:830bead4859cd00da87e1304ba249cf0d3eb5a5a commit r15-6597-g830bead4859cd00da87e1304ba249cf0d3eb5a5a Author: Tamar Christina Date: Mon Jan 6 09:24:36 2025 + AArch64: Implement four and eight chunk VLA concats [PR118272] The following testcase #pragma GCC target ("+sve") extern char __attribute__ ((simd, const)) fn3 (int, short); void test_fn3 (float *a, float *b, double *c, int n) { for (int i = 0; i < n; ++i) a[i] = fn3 (b[i], c[i]); } at -Ofast ICEs because my previous patch only added support for combining 2 partial SVE vectors into a bigger vector. However There can also 4 and 8 piece subvectors. This patch fixes this by implementing the missing expansions. gcc/ChangeLog: PR target/96342 PR target/118272 * config/aarch64/aarch64-sve.md (vec_init, vec_initvnx16qivnx2qi): New. * config/aarch64/aarch64.cc (aarch64_sve_expand_vector_init_subvector): Rewrite to support any arbitrary combinations. * config/aarch64/iterators.md (SVE_NO2E): Update to use SVE_NO4E (SVE_NO2E, Vquad): New. gcc/testsuite/ChangeLog: PR target/96342 PR target/118272 * gcc.target/aarch64/vect-simd-clone-3.c: New test. Diff: --- gcc/config/aarch64/aarch64-sve.md | 23 gcc/config/aarch64/aarch64.cc | 42 +- gcc/config/aarch64/iterators.md| 12 +-- .../gcc.target/aarch64/vect-simd-clone-3.c | 27 ++ 4 files changed, 93 insertions(+), 11 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index 6b65d4eae2f2..ba4b4d904c77 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -2839,6 +2839,7 @@ } ) +;; Vector constructor combining two half vectors { a, b } (define_expand "vec_init" [(match_operand:SVE_NO2E 0 "register_operand") (match_operand 1 "")] @@ -2849,6 +2850,28 @@ } ) +;; Vector constructor combining four quad vectors { a, b, c, d } +(define_expand "vec_init" + [(match_operand:SVE_NO4E 0 "register_operand") + (match_operand 1 "")] + "TARGET_SVE" + { +aarch64_sve_expand_vector_init_subvector (operands[0], operands[1]); +DONE; + } +) + +;; Vector constructor combining eight vectors { a, b, c, d, ... } +(define_expand "vec_initvnx16qivnx2qi" + [(match_operand:VNx16QI 0 "register_operand") + (match_operand 1 "")] + "TARGET_SVE" + { +aarch64_sve_expand_vector_init_subvector (operands[0], operands[1]); +DONE; + } +) + ;; Shift an SVE vector left and insert a scalar into element 0. (define_insn "vec_shl_insert_" [(set (match_operand:SVE_FULL 0 "register_operand") diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 916a00ce3325..9e69bc744499 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -24879,18 +24879,42 @@ aarch64_sve_expand_vector_init_subvector (rtx target, rtx vals) machine_mode mode = GET_MODE (target); int nelts = XVECLEN (vals, 0); - gcc_assert (nelts == 2); + gcc_assert (nelts % 2 == 0); - rtx arg0 = XVECEXP (vals, 0, 0); - rtx arg1 = XVECEXP (vals, 0, 1); - - /* If we have two elements and are concatting vector. */ - machine_mode elem_mode = GET_MODE (arg0); + /* We have to be concatting vector. */ + machine_mode elem_mode = GET_MODE (XVECEXP (vals, 0, 0)); gcc_assert (VECTOR_MODE_P (elem_mode)); - arg0 = force_reg (elem_mode, arg0); - arg1 = force_reg (elem_mode, arg1); - emit_insn (gen_aarch64_pack_partial (mode, target, arg0, arg1)); + auto_vec worklist; + machine_mode wider_mode = elem_mode; + + for (int i = 0; i < nelts; i++) +worklist.safe_push (force_reg (elem_mode, XVECEXP (vals, 0, i))); + + /* Keep widening pairwise to have maximum throughput. 
*/ + while (nelts >= 2) +{ + wider_mode + = related_vector_mode (wider_mode, GET_MODE_INNER (wider_mode), + GET_MODE_NUNITS (wider_mode) * 2).require (); + + for (int i = 0; i < nelts; i += 2) + { + rtx arg0 = worklist[i]; + rtx arg1 = worklist[i+1]; + gcc_assert (GET_MODE (arg0) == GET_MODE (arg1)); + + rtx tmp = gen_reg_rtx (wider_mode); + emit_insn (gen_aarch64_pack_partial (wider_mode, tmp, arg0, arg1)); + worklist[i / 2] = tmp; + } + + nelts /= 2; +} + + gcc_assert (wider_mode == mode); + emit_move_insn (target, worklist[0]); + return; } diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md index 7c9bc89d0ddd..ff0f34dd0430 100644 --- a/gcc/config/aarch64/iterators.md +++ b/gcc/config/aarch64/iterators.md @@ -140,9 +140,12 @@ ;; VQ without 2 element modes. (define_mode_iterator VQ_NO2E [V16QI V8HI V
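The widening loop above combines adjacent subvectors level by level, halving the worklist each round so the packs form a balanced tree rather than a serial chain. A standalone model of that ordering, using strings in place of RTL registers:

```
// Model of the pairwise-widening strategy: combine adjacent subvectors
// level by level, halving the element count each round, so packs at the
// same level are independent.  Strings stand in for RTL here.
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

int
main ()
{
  std::vector<std::string> worklist = { "a", "b", "c", "d",
                                        "e", "f", "g", "h" };
  std::size_t nelts = worklist.size ();

  while (nelts >= 2)
    {
      for (std::size_t i = 0; i < nelts; i += 2)
        {
          // One "pack" per pair; pairs at the same level can issue
          // in parallel, giving the maximum-throughput tree shape.
          worklist[i / 2] = "(" + worklist[i] + " " + worklist[i + 1] + ")";
          std::printf ("pack %s\n", worklist[i / 2].c_str ());
        }
      nelts /= 2;
    }

  std::printf ("result: %s\n", worklist[0].c_str ());
  return 0;
}
```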
[gcc r15-7453] testsuite: Fix two testisms on x86 after PFA [PR118754]
https://gcc.gnu.org/g:aaf5f5027d3f29c6c0d836753dddac16ba94a49a commit r15-7453-gaaf5f5027d3f29c6c0d836753dddac16ba94a49a Author: Tamar Christina Date: Mon Feb 10 09:32:29 2025 + testsuite: Fix two testisms on x86 after PFA [PR118754] These two tests now vectorize the result finding loop with PFA and so the number of loops checked fails. This fixes them by adding #pragma GCC novector to the testcases. gcc/testsuite/ChangeLog: PR testsuite/118754 * gcc.dg/vect/vect-tail-nomask-1.c: Add novector. * gcc.target/i386/pr106010-8c.c: Likewise. Diff: --- gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c | 2 ++ gcc/testsuite/gcc.target/i386/pr106010-8c.c| 1 + 2 files changed, 3 insertions(+) diff --git a/gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c b/gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c index ee9ab2e9d910..116a7aefca6c 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c +++ b/gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c @@ -72,6 +72,7 @@ run_test () init_data (a, b, c, SIZE); test_citer (a, b, c); +#pragma GCC novector for (i = 0; i < SIZE; i++) if (c[i] != a[i] + b[i]) __builtin_abort (); @@ -80,6 +81,7 @@ run_test () init_data (a, b, c, SIZE); test_viter (a, b, c, SIZE); +#pragma GCC novector for (i = 0; i < SIZE; i++) if (c[i] != a[i] + b[i]) __builtin_abort (); diff --git a/gcc/testsuite/gcc.target/i386/pr106010-8c.c b/gcc/testsuite/gcc.target/i386/pr106010-8c.c index 61ae131829dc..76a3b3cbb628 100644 --- a/gcc/testsuite/gcc.target/i386/pr106010-8c.c +++ b/gcc/testsuite/gcc.target/i386/pr106010-8c.c @@ -30,6 +30,7 @@ do_test (void) __builtin_memset (ph_dst, 0, 2 * N * sizeof (_Float16)); foo_ph (ph_dst); +#pragma GCC novector for (int i = 0; i != N; i++) { if (ph_dst[i] != ph_src)
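The pattern of the fix, shown as a self-contained example rather than the actual testcases: the compute loop is left free to vectorize, while the verification loop is pinned to scalar code with the pragma (this needs a GCC recent enough to know #pragma GCC novector), so the loop counts expected by the dump scans stay stable:

```
// Minimal example of the pattern used in the fix: vectorizable compute
// loop, scalar verification loop guarded by #pragma GCC novector.
#include <cstdlib>

#define N 1024
int a[N], b[N], c[N];

int
main ()
{
  for (int i = 0; i < N; i++)
    {
      a[i] = i;
      b[i] = 2 * i;
    }

  for (int i = 0; i < N; i++)   /* candidate for vectorization */
    c[i] = a[i] + b[i];

#pragma GCC novector
  for (int i = 0; i < N; i++)   /* verification loop, kept scalar */
    if (c[i] != 3 * i)
      std::abort ();
  return 0;
}
```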
[gcc r15-7395] middle-end: Remove unused internal function after IVopts cleanup [PR118756]
https://gcc.gnu.org/g:8d19fbb2be487f19ed1c48699e17cafe19520525 commit r15-7395-g8d19fbb2be487f19ed1c48699e17cafe19520525 Author: Tamar Christina Date: Thu Feb 6 17:46:52 2025 + middle-end: Remove unused internal function after IVopts cleanup [PR118756] It seems that after my IVopts patches the function contain_complex_addr_expr became unused and clang is rightfully complaining about it. This removes the unused internal function. gcc/ChangeLog: PR tree-optimization/118756 * tree-ssa-loop-ivopts.cc (contain_complex_addr_expr): Remove. Diff: --- gcc/tree-ssa-loop-ivopts.cc | 28 1 file changed, 28 deletions(-) diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc index 989321137df9..e37b24062f73 100644 --- a/gcc/tree-ssa-loop-ivopts.cc +++ b/gcc/tree-ssa-loop-ivopts.cc @@ -1149,34 +1149,6 @@ determine_base_object (struct ivopts_data *data, tree expr) return obj; } -/* Return true if address expression with non-DECL_P operand appears - in EXPR. */ - -static bool -contain_complex_addr_expr (tree expr) -{ - bool res = false; - - STRIP_NOPS (expr); - switch (TREE_CODE (expr)) -{ -case POINTER_PLUS_EXPR: -case PLUS_EXPR: -case MINUS_EXPR: - res |= contain_complex_addr_expr (TREE_OPERAND (expr, 0)); - res |= contain_complex_addr_expr (TREE_OPERAND (expr, 1)); - break; - -case ADDR_EXPR: - return (!DECL_P (TREE_OPERAND (expr, 0))); - -default: - return false; -} - - return res; -} - /* Allocates an induction variable with given initial value BASE and step STEP for loop LOOP. NO_OVERFLOW implies the iv doesn't overflow. */
[gcc r13-9373] AArch64: Fix GCC 13 backport of big.Little CPU detection [PR118800]
https://gcc.gnu.org/g:fa5aedd841105329b2f65cb0ff418cb4427f255e commit r13-9373-gfa5aedd841105329b2f65cb0ff418cb4427f255e Author: Tamar Christina Date: Wed Feb 12 10:38:21 2025 + AArch64: Fix GCC 13 backport of big.Little CPU detection [PR118800] On the GCC-13 branch the backport caused a failure due to the branch not having generic-armv8-a and also it still treating the generic cpu special. This made it return NULL when trying to find the default CPU. In GCC 13 we still had multiple structures with the same information and in this case aarch64_cpu_data was missing the generic CPU which is in all_cores. This corrects it by using "generc" instead and also adding it to aarch64_cpu_data. gcc/ChangeLog: PR target/118800 * config/aarch64/driver-aarch64.cc (DEFAULT_CPU): Use generic instead of generic-armv8-a. (aarch64_cpu_data): Add generic. gcc/testsuite/ChangeLog: PR target/118800 * gcc.target/aarch64/cpunative/native_cpu_34.c: Update order. Diff: --- gcc/config/aarch64/driver-aarch64.cc | 3 ++- gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_34.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/driver-aarch64.cc b/gcc/config/aarch64/driver-aarch64.cc index ff4660f469cd..acc44536629e 100644 --- a/gcc/config/aarch64/driver-aarch64.cc +++ b/gcc/config/aarch64/driver-aarch64.cc @@ -60,7 +60,7 @@ struct aarch64_core_data #define ALL_VARIANTS ((unsigned)-1) /* Default architecture to use if -mcpu=native did not detect a known CPU. */ #define DEFAULT_ARCH "8A" -#define DEFAULT_CPU "generic-armv8-a" +#define DEFAULT_CPU "generic" #define AARCH64_CORE(CORE_NAME, CORE_IDENT, SCHED, ARCH, FLAGS, COSTS, IMP, PART, VARIANT) \ { CORE_NAME, #ARCH, IMP, PART, VARIANT, feature_deps::cpu_##CORE_IDENT }, @@ -68,6 +68,7 @@ struct aarch64_core_data static CONSTEXPR const aarch64_core_data aarch64_cpu_data[] = { #include "aarch64-cores.def" + { "generic", "armv8-a", 0, 0, ALL_VARIANTS, 0}, { NULL, NULL, INVALID_IMP, INVALID_CORE, ALL_VARIANTS, 0 } }; diff --git a/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_34.c b/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_34.c index 168140002a0f..d2ff8156d8fc 100644 --- a/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_34.c +++ b/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_34.c @@ -7,6 +7,6 @@ int main() return 0; } -/* { dg-final { scan-assembler {\.arch armv8-a\+dotprod\+crc\+crypto\+sve2\n} } } */ +/* { dg-final { scan-assembler {\.arch armv8-a\+crc\+dotprod\+crypto\+sve2\n} } } */ /* Test a normal looking procinfo. */
[gcc r15-6109] middle-end: Add initial support for poly_int64 BIT_FIELD_REF in expand pass [PR96342]
https://gcc.gnu.org/g:b6242bd122757ec6c75c73a4921f24a9a382b090 commit r15-6109-gb6242bd122757ec6c75c73a4921f24a9a382b090 Author: Victor Do Nascimento Date: Wed Dec 11 12:00:58 2024 + middle-end: Add initial support for poly_int64 BIT_FIELD_REF in expand pass [PR96342] While `poly_int64' has been the default representation of bitfield size and offset for some time, there was a lack of support for the use of non-constant `poly_int64' values for those values throughout the compiler, limiting the applicability of the BIT_FIELD_REF rtl expression for variable length vectors, such as those used by SVE. This patch starts work on extending the functionality of relevant functions in the expand pass such as to enable their use by the compiler for such vectors. gcc/ChangeLog: PR target/96342 * expr.cc (store_constructor): Enable poly_{u}int64 type usage. (get_inner_reference): Ditto. Co-authored-by: Tamar Christina Diff: --- gcc/expr.cc | 29 +++-- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/gcc/expr.cc b/gcc/expr.cc index 88fa56cb299d..babf00f34dcf 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -7901,15 +7901,14 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, { unsigned HOST_WIDE_INT idx; constructor_elt *ce; - int i; bool need_to_clear; insn_code icode = CODE_FOR_nothing; tree elt; tree elttype = TREE_TYPE (type); int elt_size = vector_element_bits (type); machine_mode eltmode = TYPE_MODE (elttype); - HOST_WIDE_INT bitsize; - HOST_WIDE_INT bitpos; + poly_int64 bitsize; + poly_int64 bitpos; rtvec vector = NULL; poly_uint64 n_elts; unsigned HOST_WIDE_INT const_n_elts; @@ -8006,7 +8005,7 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, ? TREE_TYPE (CONSTRUCTOR_ELT (exp, 0)->value) : elttype); if (VECTOR_TYPE_P (val_type)) - bitsize = tree_to_uhwi (TYPE_SIZE (val_type)); + bitsize = tree_to_poly_uint64 (TYPE_SIZE (val_type)); else bitsize = elt_size; @@ -8019,12 +8018,12 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, need_to_clear = true; else { - unsigned HOST_WIDE_INT count = 0, zero_count = 0; + poly_uint64 count = 0, zero_count = 0; tree value; FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (exp), idx, value) { - int n_elts_here = bitsize / elt_size; + poly_int64 n_elts_here = exact_div (bitsize, elt_size); count += n_elts_here; if (mostly_zeros_p (value)) zero_count += n_elts_here; @@ -8033,7 +8032,7 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, /* Clear the entire vector first if there are any missing elements, or if the incidence of zero elements is >= 75%. */ need_to_clear = (maybe_lt (count, n_elts) -|| 4 * zero_count >= 3 * count); +|| maybe_gt (4 * zero_count, 3 * count)); } if (need_to_clear && maybe_gt (size, 0) && !vector) @@ -8060,9 +8059,13 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, /* Store each element of the constructor into the corresponding element of TARGET, determined by counting the elements. 
*/ - for (idx = 0, i = 0; -vec_safe_iterate (CONSTRUCTOR_ELTS (exp), idx, &ce); -idx++, i += bitsize / elt_size) + HOST_WIDE_INT chunk_size = 0; + bool chunk_multiple_p = constant_multiple_p (bitsize, elt_size, +&chunk_size); + gcc_assert (chunk_multiple_p || vec_vec_init_p); + + for (idx = 0; vec_safe_iterate (CONSTRUCTOR_ELTS (exp), idx, &ce); +idx++) { HOST_WIDE_INT eltpos; tree value = ce->value; @@ -8073,7 +8076,7 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, if (ce->index) eltpos = tree_to_uhwi (ce->index); else - eltpos = i; + eltpos = idx * chunk_size; if (vector) { @@ -8461,10 +8464,8 @@ get_inner_reference (tree exp, poly_int64 *pbitsize, if (size_tree != 0) { - if (! tree_fits_uhwi_p (size_tree)) + if (!poly_int_tree_p (size_tree, pbitsize)) mode = BLKmode, *pbitsize = -1; - else - *pbitsize = tree_to_uhwi (size_tree); } *preversep = reverse_storage_order_for_component_p (exp);
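To illustrate what the constant_multiple_p check in the new loop buys, here is a toy model (not GCC's poly_int) of sizes of the form c0 + c1*N, where N is the runtime vector scale: such a size yields a usable element count only when it is the same constant multiple of the element size for every N:

```
// Toy model of "constant multiple" for runtime-scaled sizes c0 + c1*N.
#include <cassert>
#include <cstdint>

struct poly
{
  std::int64_t c0, c1;   // value = c0 + c1 * N
};

/* If A == M * B for a compile-time constant M, store M and return true.  */
static bool
constant_multiple (poly a, poly b, std::int64_t *m)
{
  if (b.c0 == 0 && b.c1 == 0)
    return false;
  // Try the ratio suggested by whichever coefficient of B is nonzero.
  std::int64_t cand = b.c1 ? a.c1 / b.c1 : a.c0 / b.c0;
  if (a.c0 == cand * b.c0 && a.c1 == cand * b.c1)
    {
      *m = cand;
      return true;
    }
  return false;
}

int
main ()
{
  poly bitsize = { 0, 32 };   // e.g. a variable-length subvector: 32*N bits
  poly eltsize = { 0, 8 };    // one variable-length element: 8*N bits
  std::int64_t m;
  assert (constant_multiple (bitsize, eltsize, &m) && m == 4);

  poly fixed = { 64, 0 };     // a fixed 64-bit size is not a constant
  assert (!constant_multiple (fixed, eltsize, &m));   // multiple of 8*N
  return 0;
}
```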
[gcc r15-6108] middle-end: add vec_init support for variable length subvector concatenation. [PR96342]
https://gcc.gnu.org/g:d069eb91d5696a8642bd5fc44a6d47fd7f74d18b commit r15-6108-gd069eb91d5696a8642bd5fc44a6d47fd7f74d18b Author: Victor Do Nascimento Date: Wed Dec 11 12:00:09 2024 + middle-end: add vec_init support for variable length subvector concatenation. [PR96342] For architectures where the vector-length is a compile-time variable, rather representing a runtime constant, as is the case with SVE it is perfectly reasonable that such vector be made up of two (or more) subvector components of a compatible sub-length variable. One example of this would be the concatenation of two VNx4QI vectors into a single VNx8QI vector. This patch adds initial support for the enablement of this feature in the middle-end, removing the `.is_constant()' constraint on the vector's number of elements, instead making the constant no. of elements the multiple of the number of subvectors (which must then also be of variable length, such that their polynomial ratio then results in a compile-time constant) required to fill the vector. gcc/ChangeLog: PR target/96342 * expr.cc (store_constructor): add support for variable-length vectors. Co-authored-by: Tamar Christina Diff: --- gcc/expr.cc | 38 +++--- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/gcc/expr.cc b/gcc/expr.cc index 980ac415cfc7..88fa56cb299d 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -7966,12 +7966,9 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, n_elts = TYPE_VECTOR_SUBPARTS (type); if (REG_P (target) - && VECTOR_MODE_P (mode) - && n_elts.is_constant (&const_n_elts)) + && VECTOR_MODE_P (mode)) { - machine_mode emode = eltmode; - bool vector_typed_elts_p = false; - + const_n_elts = 0; if (CONSTRUCTOR_NELTS (exp) && (TREE_CODE (TREE_TYPE (CONSTRUCTOR_ELT (exp, 0)->value)) == VECTOR_TYPE)) @@ -7980,23 +7977,26 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, gcc_assert (known_eq (CONSTRUCTOR_NELTS (exp) * TYPE_VECTOR_SUBPARTS (etype), n_elts)); - emode = TYPE_MODE (etype); - vector_typed_elts_p = true; + + icode = convert_optab_handler (vec_init_optab, mode, + TYPE_MODE (etype)); + const_n_elts = CONSTRUCTOR_NELTS (exp); + vec_vec_init_p = icode != CODE_FOR_nothing; } - icode = convert_optab_handler (vec_init_optab, mode, emode); - if (icode != CODE_FOR_nothing) + else if (exact_div (n_elts, GET_MODE_NUNITS (eltmode)) + .is_constant (&const_n_elts)) { - unsigned int n = const_n_elts; - - if (vector_typed_elts_p) - { - n = CONSTRUCTOR_NELTS (exp); - vec_vec_init_p = true; - } - vector = rtvec_alloc (n); - for (unsigned int k = 0; k < n; k++) - RTVEC_ELT (vector, k) = CONST0_RTX (emode); + /* For a non-const type vector, we check it is made up of + similarly non-const type vectors. */ + icode = convert_optab_handler (vec_init_optab, mode, eltmode); } + + if (const_n_elts && icode != CODE_FOR_nothing) + { + vector = rtvec_alloc (const_n_elts); + for (unsigned int k = 0; k < const_n_elts; k++) + RTVEC_ELT (vector, k) = CONST0_RTX (eltmode); + } } /* Compute the size of the elements in the CTOR. It differs
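A sketch of the observation behind removing the .is_constant() restriction: even when the subvector length is only known at run time, the number of subvectors in the constructor is a compile-time constant, so a fixed sequence of concatenations still works. Runtime-sized std::vector pieces stand in for the VNx4QI halves here:

```
// The piece count is a compile-time constant even when each piece's
// length is only known at run time.
#include <cassert>
#include <cstddef>
#include <vector>

/* Concatenate a fixed number of equally sized runtime-length pieces.  */
template <std::size_t PIECES>
static std::vector<int>
concat (const std::vector<int> (&sub)[PIECES])
{
  std::vector<int> out;
  for (std::size_t k = 0; k < PIECES; k++)          // PIECES is constant
    out.insert (out.end (), sub[k].begin (), sub[k].end ());
  return out;
}

int
main ()
{
  std::size_t n = 4;                                // runtime vector length
  std::vector<int> sub[2] = { std::vector<int> (n, 1),
                              std::vector<int> (n, 2) };
  std::vector<int> full = concat (sub);
  assert (full.size () == 2 * n);                   // two halves, one vector
  return 0;
}
```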
[gcc r15-6104] middle-end: refactor type to be explicit in operand_equal_p [PR114932]
https://gcc.gnu.org/g:3c32575e5b6370270d38a80a7fa8eaa144e083d0 commit r15-6104-g3c32575e5b6370270d38a80a7fa8eaa144e083d0 Author: Tamar Christina Date: Wed Dec 11 11:45:36 2024 + middle-end: refactor type to be explicit in operand_equal_p [PR114932] This is a refactoring with no expected behavioral change. The goal with this is to make the type of the expressions being used explicit. I did not change all the recursive calls to operand_equal_p () to recurse directly to the new function but instead this goes through the top level call which re-extracts the types. This was done because in most of the cases where we recurse type == arg. The second patch makes use of this new flexibility to implement an overload of operand_equal_p which checks for equality under two's complement. gcc/ChangeLog: PR tree-optimization/114932 * fold-const.cc (operand_compare::operand_equal_p): Split into one that takes explicit type parameters and use that in public one. * fold-const.h (class operand_compare): Add operand_equal_p private overload. Diff: --- gcc/fold-const.cc | 99 --- gcc/fold-const.h | 6 2 files changed, 57 insertions(+), 48 deletions(-) diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc index af2851ec0919..33dc3a731e45 100644 --- a/gcc/fold-const.cc +++ b/gcc/fold-const.cc @@ -3168,6 +3168,17 @@ combine_comparisons (location_t loc, bool operand_compare::operand_equal_p (const_tree arg0, const_tree arg1, unsigned int flags) +{ + return operand_equal_p (TREE_TYPE (arg0), arg0, TREE_TYPE (arg1), arg1, flags); +} + +/* The same as operand_equal_p however the type of ARG0 and ARG1 are assumed to be + the TYPE0 and TYPE1 respectively. */ + +bool +operand_compare::operand_equal_p (tree type0, const_tree arg0, + tree type1, const_tree arg1, + unsigned int flags) { bool r; if (verify_hash_value (arg0, arg1, flags, &r)) @@ -3178,25 +3189,25 @@ operand_compare::operand_equal_p (const_tree arg0, const_tree arg1, /* If either is ERROR_MARK, they aren't equal. */ if (TREE_CODE (arg0) == ERROR_MARK || TREE_CODE (arg1) == ERROR_MARK - || TREE_TYPE (arg0) == error_mark_node - || TREE_TYPE (arg1) == error_mark_node) + || type0 == error_mark_node + || type1 == error_mark_node) return false; /* Similar, if either does not have a type (like a template id), they aren't equal. */ - if (!TREE_TYPE (arg0) || !TREE_TYPE (arg1)) + if (!type0 || !type1) return false; /* Bitwise identity makes no sense if the values have different layouts. */ if ((flags & OEP_BITWISE) - && !tree_nop_conversion_p (TREE_TYPE (arg0), TREE_TYPE (arg1))) + && !tree_nop_conversion_p (type0, type1)) return false; /* We cannot consider pointers to different address space equal. */ - if (POINTER_TYPE_P (TREE_TYPE (arg0)) - && POINTER_TYPE_P (TREE_TYPE (arg1)) - && (TYPE_ADDR_SPACE (TREE_TYPE (TREE_TYPE (arg0))) - != TYPE_ADDR_SPACE (TREE_TYPE (TREE_TYPE (arg1) + if (POINTER_TYPE_P (type0) + && POINTER_TYPE_P (type1) + && (TYPE_ADDR_SPACE (TREE_TYPE (type0)) + != TYPE_ADDR_SPACE (TREE_TYPE (type1 return false; /* Check equality of integer constants before bailing out due to @@ -3216,19 +3227,20 @@ operand_compare::operand_equal_p (const_tree arg0, const_tree arg1, because they may change the signedness of the arguments. As pointers strictly don't have a signedness, require either two pointers or two non-pointers as well. 
*/ - if (TYPE_UNSIGNED (TREE_TYPE (arg0)) != TYPE_UNSIGNED (TREE_TYPE (arg1)) - || POINTER_TYPE_P (TREE_TYPE (arg0)) -!= POINTER_TYPE_P (TREE_TYPE (arg1))) + if (TYPE_UNSIGNED (type0) != TYPE_UNSIGNED (type1) + || POINTER_TYPE_P (type0) != POINTER_TYPE_P (type1)) return false; /* If both types don't have the same precision, then it is not safe to strip NOPs. */ - if (element_precision (TREE_TYPE (arg0)) - != element_precision (TREE_TYPE (arg1))) + if (element_precision (type0) != element_precision (type1)) return false; STRIP_NOPS (arg0); STRIP_NOPS (arg1); + + type0 = TREE_TYPE (arg0); + type1 = TREE_TYPE (arg1); } #if 0 /* FIXME: Fortran FE currently produce ADDR_EXPR of NOP_EXPR. Enable the @@ -3287,9 +3299,9 @@ operand_compare::operand_equal_p (const_tree arg0, const_tree arg1, /* When not checking adddresses, this is needed for conversions and for COMPONENT_REF. Might as well play it safe and always test this. */ - if (TREE_CODE (TREE_TYPE (arg0)) == ERROR_MARK - || TREE_CODE (TREE_TYPE (arg1)) == ERROR_MARK -
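The shape of the refactor, with illustrative names rather than GCC's: the public entry point keeps its signature and re-derives the types itself, while the work moves into a private overload that takes both types explicitly, so a later caller can substitute types that differ from the operands' own:

```
// Illustrative shape of the refactor: a public comparison that delegates
// to a private overload taking the two types explicitly.
#include <cassert>
#include <string>

struct expr
{
  std::string type;
  long value;
};

class comparer
{
public:
  bool equal (const expr &a, const expr &b) const
  {
    return equal (a.type, a, b.type, b);          // old behaviour preserved
  }

private:
  /* Compare A and B as if they had types TYPE0 and TYPE1 respectively.  */
  bool equal (const std::string &type0, const expr &a,
              const std::string &type1, const expr &b) const
  {
    return type0 == type1 && a.value == b.value;
  }
};

int
main ()
{
  comparer cmp;
  expr x = { "int", 42 }, y = { "int", 42 }, z = { "unsigned", 42 };
  assert (cmp.equal (x, y));
  assert (!cmp.equal (x, z));     // the follow-up patch relaxes this case
  return 0;
}
```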
[gcc r15-6105] middle-end: use two's complement equality when comparing IVs during candidate selection [PR114932]
https://gcc.gnu.org/g:9403b035befe3537c343f7430e321468c0f2c28b commit r15-6105-g9403b035befe3537c343f7430e321468c0f2c28b Author: Tamar Christina Date: Wed Dec 11 11:47:49 2024 + middle-end: use two's complement equality when comparing IVs during candidate selection [PR114932] IVOPTS normally uses affine trees to perform comparisons between different IVs, but these seem to have been missing in two key spots and instead normal tree equivalencies used. In some cases where we have a two-complements equivalence but not a strict signedness equivalencies we end up generating both a signed and unsigned IV for the same candidate. This patch implements a new OEP flag called OEP_ASSUME_WRAPV. This flag will check if the operands would produce the same bit values after the computations even if the final sign is different. This happens quite a lot with Fortran but can also happen in C because this came code is unable to figure out when one expression is a multiple of another. As an example in the attached testcase we get: Initial set of candidates: cost: 24 (complexity 3) reg_cost: 9 cand_cost: 15 cand_group_cost: 0 (complexity 3) candidates: 1, 6, 8 group:0 --> iv_cand:6, cost=(0,1) group:1 --> iv_cand:1, cost=(0,0) group:2 --> iv_cand:8, cost=(0,1) group:3 --> iv_cand:8, cost=(0,1) invariant variables: 6 invariant expressions: 1, 2 : inv_expr 1: stride.3_27 * 4 inv_expr 2: (unsigned long) stride.3_27 * 4 These end up being used in the same group: Group 1: cand costcompl. inv.expr. inv.vars 1 0 0 NIL;6 2 0 0 NIL;6 3 0 0 NIL;6 which ends up with IV opts picking the signed and unsigned IVs: Improved to: cost: 24 (complexity 3) reg_cost: 9 cand_cost: 15 cand_group_cost: 0 (complexity 3) candidates: 1, 6, 8 group:0 --> iv_cand:6, cost=(0,1) group:1 --> iv_cand:1, cost=(0,0) group:2 --> iv_cand:8, cost=(0,1) group:3 --> iv_cand:8, cost=(0,1) invariant variables: 6 invariant expressions: 1, 2 and so generates the same IV as both signed and unsigned: ;; basic block 21, loop depth 3, count 214748368 (estimated locally, freq 58.2545), maybe hot ;;prev block 28, next block 31, flags: (NEW, REACHABLE, VISITED) ;;pred: 28 [always] count:23622320 (estimated locally, freq 6.4080) (FALLTHRU,EXECUTABLE) ;;25 [always] count:191126046 (estimated locally, freq 51.8465) (FALLTHRU,DFS_BACK,EXECUTABLE) # .MEM_66 = PHI <.MEM_34(28), .MEM_22(25)> # ivtmp.22_41 = PHI <0(28), ivtmp.22_82(25)> # ivtmp.26_51 = PHI # ivtmp.28_90 = PHI ... ;; basic block 24, loop depth 3, count 214748366 (estimated locally, freq 58.2545), maybe hot ;;prev block 22, next block 25, flags: (NEW, REACHABLE, VISITED)' ;;pred: 22 [always] count:95443719 (estimated locally, freq 25.8909) (FALLTHRU) ;;21 [33.3% (guessed)] count:71582790 (estimated locally, freq 19.4182) (TRUE_VALUE,EXECUTABLE) ;;31 [33.3% (guessed)] count:47721860 (estimated locally, freq 12.9455) (TRUE_VALUE,EXECUTABLE) # .MEM_22 = PHI <.MEM_44(22), .MEM_31(21), .MEM_79(31)> ivtmp.22_82 = ivtmp.22_41 + 1; ivtmp.26_72 = ivtmp.26_51 + _80; ivtmp.28_98 = ivtmp.28_90 + _39; These two IVs are always used as unsigned, so IV ops generates: _73 = stride.3_27 * 4; _80 = (unsigned long) _73; _54 = (unsigned long) stride.3_27; _39 = _54 * 4; Which means that in e.g. exchange2 we generate a lot of duplicate code. This is because candidate 6 and 8 are equivalent under two's complement but have different signs. This patch changes it so that if you have two IVs that are affine equivalent to just pick one over the other. 
IV already has code for this, so the patch just uses affine trees instead of tree for the check. With it we get: : inv_expr 1: stride.3_27 * 4 : Group 0: cand costcompl. inv.expr. inv.vars 5 0 2 NIL;NIL; 6 0 3 NIL;NIL; Group 1: cand costcompl. inv.expr. inv.vars 1 0 0 NIL;6 2 0 0 NIL;6 3 0 0 NIL;6 4 0 0 NIL;6 Initial set of candidates: cost: 16 (complexity 3) reg_cost: 6 cand_cost: 10 cand_group_cost: 0 (complexity 3) candidates: 1, 6 group:0 --> iv_cand:6, cost=(0,3) group:1 --> iv_cand:1, cost=(0,0) invariant variables: 6 invariant expressions:
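A numerical check of the equivalence the patch exploits: the two invariant expressions from the dump, stride * 4 evaluated signed and then widened versus (unsigned long)stride * 4, have identical bit patterns for every stride under two's complement, which is why carrying both IV candidates is redundant:

```
// Check the two's complement equivalence of the two invariant expressions.
#include <cassert>
#include <cstdint>

int
main ()
{
  for (std::int64_t stride = -100000; stride <= 100000; stride += 7)
    {
      std::uint64_t signed_then_cast = (std::uint64_t) (stride * 4);
      std::uint64_t unsigned_mul = (std::uint64_t) stride * 4;
      assert (signed_then_cast == unsigned_mul);
    }
  return 0;
}
```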
[gcc r15-6107] middle-end: Fix mask length arg in call to vect_get_loop_mask [PR96342]
https://gcc.gnu.org/g:240cbd2f26c0f1c1f83cfc3b69cc0271b56172e2 commit r15-6107-g240cbd2f26c0f1c1f83cfc3b69cc0271b56172e2 Author: Victor Do Nascimento Date: Wed Dec 11 11:58:55 2024 + middle-end: Fix mask length arg in call to vect_get_loop_mask [PR96342] When issuing multiple calls to a simdclone in a vectorized loop, TYPE_VECTOR_SUBPARTS(vectype) gives the incorrect number when compared to the TYPE_VECTOR_SUBPARTS result we get from the mask type derived from the relevant `rgroup_controls' entry within `vect_get_loop_mask'. By passing `masktype' instead, we are able to get the correct number of vector subparts and thu eliminate the ICE in the call to `vect_get_loop_mask' when the data type for which we retrieve the mask is wider than the one used when defining the mask at mask registration time. gcc/ChangeLog: PR target/96342 * tree-vect-stmts.cc (vectorizable_simd_clone_call): s/vectype/masktype/ in call to vect_get_loop_mask. Diff: --- gcc/tree-vect-stmts.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 497a31322acc..be1139a423c8 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -4964,7 +4964,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info, { vec_loop_masks *loop_masks = &LOOP_VINFO_MASKS (loop_vinfo); mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks, -ncopies, vectype, j); +ncopies, masktype, j); } else mask = vect_build_all_ones_mask (vinfo, stmt_info, masktype);
[gcc r15-6106] middle-end: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE [PR96342]
https://gcc.gnu.org/g:561ef7c8477ba58ea64de259af9c2d0870afd9d4 commit r15-6106-g561ef7c8477ba58ea64de259af9c2d0870afd9d4 Author: Andre Vieira Date: Wed Dec 11 11:50:22 2024 + middle-end: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE [PR96342] This patch adds stmt_vec_info to TARGET_SIMD_CLONE_USABLE to make sure the target can reject a simd_clone based on the vector mode it is using. This is needed because for VLS SVE vectorization the vectorizer accepts Advanced SIMD simd clones when vectorizing using SVE types because the simdlens might match. This will cause type errors later on. Other targets do not currently need to use this argument. gcc/ChangeLog: PR target/96342 * target.def (TARGET_SIMD_CLONE_USABLE): Add argument. * tree-vect-stmts.cc (vectorizable_simd_clone_call): Pass stmt_info to call TARGET_SIMD_CLONE_USABLE. * config/aarch64/aarch64.cc (aarch64_simd_clone_usable): Add argument and use it to reject the use of SVE simd clones with Advanced SIMD modes. * config/gcn/gcn.cc (gcn_simd_clone_usable): Add unused argument. * config/i386/i386.cc (ix86_simd_clone_usable): Likewise. * doc/tm.texi: Regenerate Co-authored-by: Victor Do Nascimento Co-authored-by: Tamar Christina Diff: --- gcc/config/aarch64/aarch64.cc | 4 ++-- gcc/config/gcn/gcn.cc | 3 ++- gcc/config/i386/i386.cc | 2 +- gcc/doc/tm.texi | 8 gcc/target.def| 8 gcc/tree-vect-stmts.cc| 9 - 6 files changed, 21 insertions(+), 13 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 4d1b3cca0c42..77a2a6bfa3a3 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -29490,12 +29490,12 @@ aarch64_simd_clone_adjust (struct cgraph_node *node) /* Implement TARGET_SIMD_CLONE_USABLE. */ static int -aarch64_simd_clone_usable (struct cgraph_node *node) +aarch64_simd_clone_usable (struct cgraph_node *node, machine_mode vector_mode) { switch (node->simdclone->vecsize_mangle) { case 'n': - if (!TARGET_SIMD) + if (!TARGET_SIMD || aarch64_sve_mode_p (vector_mode)) return -1; return 0; default: diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc index d017f22d1bc4..634171a0a93b 100644 --- a/gcc/config/gcn/gcn.cc +++ b/gcc/config/gcn/gcn.cc @@ -5653,7 +5653,8 @@ gcn_simd_clone_adjust (struct cgraph_node *ARG_UNUSED (node)) /* Implement TARGET_SIMD_CLONE_USABLE. */ static int -gcn_simd_clone_usable (struct cgraph_node *ARG_UNUSED (node)) +gcn_simd_clone_usable (struct cgraph_node *ARG_UNUSED (node), + machine_mode ARG_UNUSED (vector_mode)) { /* We don't need to do anything here because gcn_simd_clone_compute_vecsize_and_simdlen currently only returns one diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 62f758b32ef5..ca763e1eb334 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -25721,7 +25721,7 @@ ix86_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, slightly less desirable, etc.). */ static int -ix86_simd_clone_usable (struct cgraph_node *node) +ix86_simd_clone_usable (struct cgraph_node *node, machine_mode) { switch (node->simdclone->vecsize_mangle) { diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 7e8e02e3f423..d7170f452068 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -6531,11 +6531,11 @@ This hook should add implicit @code{attribute(target("..."))} attribute to SIMD clone @var{node} if needed. 
@end deftypefn -@deftypefn {Target Hook} int TARGET_SIMD_CLONE_USABLE (struct cgraph_node *@var{}) +@deftypefn {Target Hook} int TARGET_SIMD_CLONE_USABLE (struct cgraph_node *@var{}, @var{machine_mode}) This hook should return -1 if SIMD clone @var{node} shouldn't be used -in vectorized loops in current function, or non-negative number if it is -usable. In that case, the smaller the number is, the more desirable it is -to use it. +in vectorized loops in current function with @var{vector_mode}, or +non-negative number if it is usable. In that case, the smaller the number +is, the more desirable it is to use it. @end deftypefn @deftypefn {Target Hook} int TARGET_SIMT_VF (void) diff --git a/gcc/target.def b/gcc/target.def index 5ee33bf0cf91..8cf29c57eaee 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -1645,10 +1645,10 @@ void, (struct cgraph_node *), NULL) DEFHOOK (usable, "This hook should return -1 if SIMD clone @var{node} shouldn't be used\n\ -in vectorized loops in current function, or non-negative number if it is\n\ -usable. In that case, the smaller the number is, the more desirable it is\n\ -to use it.", -int, (struct cgraph_node *), NULL) +in vectorized loops in current function with @var{vector_mode}, or\n\ +non-ne
[gcc r15-6262] arm: fix bootstrap after MVE changes
https://gcc.gnu.org/g:7b5599dbd75fe1ee7d861d4cfc6ea655a126bef3 commit r15-6262-g7b5599dbd75fe1ee7d861d4cfc6ea655a126bef3 Author: Tamar Christina Date: Sun Dec 15 13:21:44 2024 + arm: fix bootstrap after MVE changes The recent commits for MVE on Saturday have broken armhf bootstrap due to a -Werror false positive: inlined from 'virtual rtx_def* {anonymous}::vstrq_scatter_base_impl::expand(arm_mve::function_expander&) const' at /gcc/config/arm/arm-mve-builtins-base.cc:352:17: ./genrtl.h:38:16: error: 'new_base' may be used uninitialized [-Werror=maybe-uninitialized] 38 | XEXP (rt, 1) = arg1; /gcc/config/arm/arm-mve-builtins-base.cc: In member function 'virtual rtx_def* {anonymous}::vstrq_scatter_base_impl::expand(arm_mve::function_expander&) const': /gcc/config/arm/arm-mve-builtins-base.cc:311:26: note: 'new_base' was declared here 311 | rtx insns, base_ptr, new_base; | ^~~~ In function 'rtx_def* init_rtx_fmt_ee(rtx, machine_mode, rtx, rtx)', inlined from 'rtx_def* gen_rtx_fmt_ee_stat(rtx_code, machine_mode, rtx, rtx)' at ./genrtl.h:50:26, inlined from 'virtual rtx_def* {anonymous}::vldrq_gather_base_impl::expand(arm_mve::function_expander&) const' at /gcc/config/arm/arm-mve-builtins-base.cc:527:17: ./genrtl.h:38:16: error: 'new_base' may be used uninitialized [-Werror=maybe-uninitialized] 38 | XEXP (rt, 1) = arg1; /gcc/config/arm/arm-mve-builtins-base.cc: In member function 'virtual rtx_def* {anonymous}::vldrq_gather_base_impl::expand(arm_mve::function_expander&) const': /gcc/config/arm/arm-mve-builtins-base.cc:486:26: note: 'new_base' was declared here 486 | rtx insns, base_ptr, new_base; To fix it I just initialize the value. gcc/ChangeLog: * config/arm/arm-mve-builtins-base.cc (expand): Initialize new_base. Diff: --- gcc/config/arm/arm-mve-builtins-base.cc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/gcc/config/arm/arm-mve-builtins-base.cc b/gcc/config/arm/arm-mve-builtins-base.cc index 723004b53d7b..ef3c504b1b30 100644 --- a/gcc/config/arm/arm-mve-builtins-base.cc +++ b/gcc/config/arm/arm-mve-builtins-base.cc @@ -308,7 +308,7 @@ public: rtx expand (function_expander &e) const override { insn_code icode; -rtx insns, base_ptr, new_base; +rtx insns, base_ptr, new_base = NULL_RTX; machine_mode base_mode; if ((e.mode_suffix_id != MODE_none) @@ -483,7 +483,7 @@ public: rtx expand (function_expander &e) const override { insn_code icode; -rtx insns, base_ptr, new_base; +rtx insns, base_ptr, new_base = NULL_RTX; machine_mode base_mode; if ((e.mode_suffix_id != MODE_none)
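The generic shape of this kind of fix, as a hypothetical example rather than the MVE builtin code: the variable is only assigned on paths the compiler cannot always correlate with its later uses, so it is given a harmless initial value to keep -Wmaybe-uninitialized quiet under -Werror:

```
// Hypothetical example of the warning pattern and the fix: initialize the
// conditionally-assigned variable so -Wmaybe-uninitialized stays quiet.
#include <cstdio>

static int *
expand (bool has_offset, int &storage)
{
  int *new_base = nullptr;      // explicit init silences the false positive
  if (has_offset)
    new_base = &storage;
  // ... later code only uses new_base when it was actually set.
  return has_offset ? new_base : &storage;
}

int
main ()
{
  int slot = 5;
  std::printf ("%d\n", *expand (true, slot));
  std::printf ("%d\n", *expand (false, slot));
  return 0;
}
```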
[gcc r15-6217] AArch64: Set L1 data cache size according to size on CPUs
https://gcc.gnu.org/g:6a5a1b8175e07ff578204476cd5d8a071cbc commit r15-6217-g6a5a1b8175e07ff578204476cd5d8a071cbc Author: Tamar Christina Date: Fri Dec 13 11:20:18 2024 + AArch64: Set L1 data cache size according to size on CPUs This sets the L1 data cache size for some cores based on their size in their Technical Reference Manuals. Today the port minimum is 256 bytes as explained in commit g:9a99559a478111f7fbeec29bd78344df7651c707, however like Neoverse V2 most cores actually define the L1 cache size as 64-bytes. The generic Armv9-A model was already changed in g:f000cb8cbc58b23a91c84d47d69481904981a1d9 and this change follows suite for a few other cores based on their TRMs. This results in less memory pressure when running on large core count machines. gcc/ChangeLog: * config/aarch64/tuning_models/cortexx925.h: Set L1 cache size to 64b. * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. * config/aarch64/tuning_models/neoversen1.h: Likewise. * config/aarch64/tuning_models/neoversen2.h: Likewise. * config/aarch64/tuning_models/neoversen3.h: Likewise. * config/aarch64/tuning_models/neoversev1.h: Likewise. * config/aarch64/tuning_models/neoversev2.h: Likewise. (neoversev2_prefetch_tune): Removed. * config/aarch64/tuning_models/neoversev3.h: Likewise. * config/aarch64/tuning_models/neoversev3ae.h: Likewise. Diff: --- gcc/config/aarch64/tuning_models/cortexx925.h | 2 +- gcc/config/aarch64/tuning_models/neoverse512tvb.h | 2 +- gcc/config/aarch64/tuning_models/neoversen1.h | 2 +- gcc/config/aarch64/tuning_models/neoversen2.h | 2 +- gcc/config/aarch64/tuning_models/neoversen3.h | 2 +- gcc/config/aarch64/tuning_models/neoversev1.h | 2 +- gcc/config/aarch64/tuning_models/neoversev2.h | 15 +-- gcc/config/aarch64/tuning_models/neoversev3.h | 2 +- gcc/config/aarch64/tuning_models/neoversev3ae.h | 2 +- 9 files changed, 9 insertions(+), 22 deletions(-) diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h index ef4c7d1a8323..5ebaf66e986c 100644 --- a/gcc/config/aarch64/tuning_models/cortexx925.h +++ b/gcc/config/aarch64/tuning_models/cortexx925.h @@ -224,7 +224,7 @@ static const struct tune_params cortexx925_tunings = | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ - &generic_prefetch_tune, + &generic_armv9a_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */ }; diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h index f72505918f3a..007f987154c4 100644 --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h @@ -158,7 +158,7 @@ static const struct tune_params neoverse512tvb_tunings = (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),/* tune_flags. */ - &generic_prefetch_tune, + &generic_armv9a_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversen1.h b/gcc/config/aarch64/tuning_models/neoversen1.h index 3079eb2d9ec3..14b9ac9a734d 100644 --- a/gcc/config/aarch64/tuning_models/neoversen1.h +++ b/gcc/config/aarch64/tuning_models/neoversen1.h @@ -52,7 +52,7 @@ static const struct tune_params neoversen1_tunings = 0, /* max_case_values. 
*/ tune_params::AUTOPREFETCHER_WEAK,/* autoprefetcher_model. */ (AARCH64_EXTRA_TUNE_BASE), /* tune_flags. */ - &generic_prefetch_tune, + &generic_armv9a_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS/* stp_policy_model. */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h index 141c994df381..32560d2f5f88 100644 --- a/gcc/config/aarch64/tuning_models/neoversen2.h +++ b/gcc/config/aarch64/tuning_models/neoversen2.h @@ -222,7 +222,7 @@ static const struct tune_params neoversen2_tunings = | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ - &generic_prefetch_tune, + &generic_armv9a_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h b/gcc/config/aarch64/tuning_models/neoversen3.h index b3e31
[gcc r15-6216] AArch64: Add CMP+CSEL and CMP+CSET for cores that support it
https://gcc.gnu.org/g:4a9427f75b9f5dfbd9edd0ec8e0a07f868754b65 commit r15-6216-g4a9427f75b9f5dfbd9edd0ec8e0a07f868754b65 Author: Tamar Christina Date: Fri Dec 13 11:17:55 2024 + AArch64: Add CMP+CSEL and CMP+CSET for cores that support it GCC 15 added two new fusions CMP+CSEL and CMP+CSET. This patch enables them for cores that support based on their Software Optimization Guides and generically on Armv9-A. Even if a core does not support it there's no negative performance impact. gcc/ChangeLog: * config/aarch64/aarch64-fusion-pairs.def (AARCH64_FUSE_NEOVERSE_BASE): New. * config/aarch64/tuning_models/neoverse512tvb.h: Use it. * config/aarch64/tuning_models/neoversen2.h: Use it. * config/aarch64/tuning_models/neoversen3.h: Use it. * config/aarch64/tuning_models/neoversev1.h: Use it. * config/aarch64/tuning_models/neoversev2.h: Use it. * config/aarch64/tuning_models/neoversev3.h: Use it. * config/aarch64/tuning_models/neoversev3ae.h: Use it. * config/aarch64/tuning_models/cortexx925.h: Add fusions. * config/aarch64/tuning_models/generic_armv9_a.h: Add fusions. Diff: --- gcc/config/aarch64/aarch64-fusion-pairs.def| 4 gcc/config/aarch64/tuning_models/cortexx925.h | 4 +++- gcc/config/aarch64/tuning_models/generic_armv9_a.h | 4 +++- gcc/config/aarch64/tuning_models/neoverse512tvb.h | 2 +- gcc/config/aarch64/tuning_models/neoversen2.h | 2 +- gcc/config/aarch64/tuning_models/neoversen3.h | 2 +- gcc/config/aarch64/tuning_models/neoversev1.h | 2 +- gcc/config/aarch64/tuning_models/neoversev2.h | 2 +- gcc/config/aarch64/tuning_models/neoversev3.h | 2 +- gcc/config/aarch64/tuning_models/neoversev3ae.h| 2 +- 10 files changed, 17 insertions(+), 9 deletions(-) diff --git a/gcc/config/aarch64/aarch64-fusion-pairs.def b/gcc/config/aarch64/aarch64-fusion-pairs.def index f8413ab0c802..0123430d988b 100644 --- a/gcc/config/aarch64/aarch64-fusion-pairs.def +++ b/gcc/config/aarch64/aarch64-fusion-pairs.def @@ -45,4 +45,8 @@ AARCH64_FUSION_PAIR ("cmp+cset", CMP_CSET) /* Baseline fusion settings suitable for all cores. */ #define AARCH64_FUSE_BASE (AARCH64_FUSE_CMP_BRANCH | AARCH64_FUSE_AES_AESMC) +/* Baseline fusion settings suitable for all Neoverse cores. */ +#define AARCH64_FUSE_NEOVERSE_BASE (AARCH64_FUSE_BASE | AARCH64_FUSE_CMP_CSEL \ + | AARCH64_FUSE_CMP_CSET) + #define AARCH64_FUSE_MOVK (AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_MOVK_MOVK) diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h index b2ff716157a4..ef4c7d1a8323 100644 --- a/gcc/config/aarch64/tuning_models/cortexx925.h +++ b/gcc/config/aarch64/tuning_models/cortexx925.h @@ -205,7 +205,9 @@ static const struct tune_params cortexx925_tunings = 2 /* store_pred. */ }, /* memmov_cost. */ 10, /* issue_rate */ - AARCH64_FUSE_BASE, /* fusible_ops */ + (AARCH64_FUSE_BASE + | AARCH64_FUSE_CMP_CSEL + | AARCH64_FUSE_CMP_CSET), /* fusible_ops */ "32:16", /* function_align. */ "4", /* jump_align. */ "32:16", /* loop_align. */ diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h index a05a9ab92a27..785e00946bc4 100644 --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h @@ -236,7 +236,9 @@ static const struct tune_params generic_armv9_a_tunings = 1 /* store_pred. */ }, /* memmov_cost. */ 3, /* issue_rate */ - AARCH64_FUSE_BASE, /* fusible_ops */ + (AARCH64_FUSE_BASE + | AARCH64_FUSE_CMP_CSEL + | AARCH64_FUSE_CMP_CSET), /* fusible_ops */ "32:16", /* function_align. */ "4", /* jump_align. 
*/ "32:16", /* loop_align. */ diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h index c407b89a22f1..f72505918f3a 100644 --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h @@ -143,7 +143,7 @@ static const struct tune_params neoverse512tvb_tunings = 1 /* store_pred. */ }, /* memmov_cost. */ 3, /* issue_rate */ - AARCH64_FUSE_BASE, /* fusible_ops */ + AARCH64_FUSE_NEOVERSE_BASE, /* fusible_ops */ "32:16", /* function_align. */ "4", /* jump_align. */ "32:16", /* loop_align. */ diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h index fd5f8f373705..141c994df381 100644 --- a/gcc/config/aarch64/tuning_models/neoversen2.h +++ b/gcc/config/aarch64/tuning_models/neoversen2.h @@ -205,7 +205,7 @@ static const struct tune_params neoversen2_tunings = 1 /* store_pred. */ }, /* memmov_cost. */ 5, /* issue_rate */
[gcc r15-6391] AArch64: Disable `omp declare variant' tests for aarch64 [PR96342]
https://gcc.gnu.org/g:6ecb365d4c3f36eaf684c38fc5d9008a1409c725 commit r15-6391-g6ecb365d4c3f36eaf684c38fc5d9008a1409c725 Author: Tamar Christina Date: Fri Dec 20 14:25:50 2024 + AArch64: Disable `omp declare variant' tests for aarch64 [PR96342] These tests are x86 specific and shouldn't be run for aarch64. gcc/testsuite/ChangeLog: PR target/96342 * c-c++-common/gomp/declare-variant-14.c: Make i?86 and x86_64 target only test. * gfortran.dg/gomp/declare-variant-14.f90: Likewise. Diff: --- gcc/testsuite/c-c++-common/gomp/declare-variant-14.c | 13 + gcc/testsuite/gfortran.dg/gomp/declare-variant-14.f90 | 11 --- 2 files changed, 9 insertions(+), 15 deletions(-) diff --git a/gcc/testsuite/c-c++-common/gomp/declare-variant-14.c b/gcc/testsuite/c-c++-common/gomp/declare-variant-14.c index e3668893afe3..8a6bf09d3cf6 100644 --- a/gcc/testsuite/c-c++-common/gomp/declare-variant-14.c +++ b/gcc/testsuite/c-c++-common/gomp/declare-variant-14.c @@ -1,6 +1,5 @@ -/* { dg-do compile { target vect_simd_clones } } */ -/* { dg-additional-options "-fdump-tree-gimple -fdump-tree-optimized" } */ -/* { dg-additional-options "-mno-sse3" { target { i?86-*-* x86_64-*-* } } } */ +/* { dg-do compile { target { { i?86-*-* x86_64-*-* } && vect_simd_clones } } } */ +/* { dg-additional-options "-mno-sse3 -fdump-tree-gimple -fdump-tree-optimized" } */ int f01 (int); int f02 (int); @@ -15,15 +14,13 @@ int test1 (int x) { /* At gimplification time, we can't decide yet which function to call. */ - /* { dg-final { scan-tree-dump-times "f04 \\\(x" 2 "gimple" { target { !aarch64*-*-* } } } } */ + /* { dg-final { scan-tree-dump-times "f04 \\\(x" 2 "gimple" } } */ /* After simd clones are created, the original non-clone test1 shall call f03 (score 6), the sse2/avx/avx2 clones too, but avx512f clones shall call f01 with score 8. */ /* { dg-final { scan-tree-dump-not "f04 \\\(x" "optimized" } } */ - /* { dg-final { scan-tree-dump-times "f03 \\\(x" 14 "optimized" { target { !aarch64*-*-* } } } } */ - /* { dg-final { scan-tree-dump-times "f03 \\\(x" 10 "optimized" { target { aarch64*-*-* } } } } */ - /* { dg-final { scan-tree-dump-times "f01 \\\(x" 4 "optimized" { target { !aarch64*-*-* } } } } */ - /* { dg-final { scan-tree-dump-times "f01 \\\(x" 0 "optimized" { target { aarch64*-*-* } } } } */ + /* { dg-final { scan-tree-dump-times "f03 \\\(x" 14 "optimized" } } */ + /* { dg-final { scan-tree-dump-times "f01 \\\(x" 4 "optimized" } } */ int a = f04 (x); int b = f04 (x); return a + b; diff --git a/gcc/testsuite/gfortran.dg/gomp/declare-variant-14.f90 b/gcc/testsuite/gfortran.dg/gomp/declare-variant-14.f90 index 6319df0558f3..e154d93d73a5 100644 --- a/gcc/testsuite/gfortran.dg/gomp/declare-variant-14.f90 +++ b/gcc/testsuite/gfortran.dg/gomp/declare-variant-14.f90 @@ -1,6 +1,5 @@ -! { dg-do compile { target vect_simd_clones } } -! { dg-additional-options "-O0 -fdump-tree-gimple -fdump-tree-optimized" } -! { dg-additional-options "-mno-sse3" { target { i?86-*-* x86_64-*-* } } } +! { dg-do compile { target { { i?86-*-* x86_64-*-* } && vect_simd_clones } } } */ +! { dg-additional-options "-mno-sse3 -O0 -fdump-tree-gimple -fdump-tree-optimized" } module main implicit none @@ -40,10 +39,8 @@ contains ! call f03 (score 6), the sse2/avx/avx2 clones too, but avx512f clones ! shall call f01 with score 8. ! { dg-final { scan-tree-dump-not "f04 \\\(x" "optimized" } } -! { dg-final { scan-tree-dump-times "f03 \\\(x" 14 "optimized" { target { !aarch64*-*-* } } } } -! 
{ dg-final { scan-tree-dump-times "f03 \\\(x" 6 "optimized" { target { aarch64*-*-* } } } } -! { dg-final { scan-tree-dump-times "f01 \\\(x" 4 "optimized" { target { !aarch64*-*-* } } } } -! { dg-final { scan-tree-dump-times "f01 \\\(x" 0 "optimized" { target { aarch64*-*-* } } } } +! { dg-final { scan-tree-dump-times "f03 \\\(x" 14 "optimized" } } +! { dg-final { scan-tree-dump-times "f01 \\\(x" 4 "optimized" } } a = f04 (x) b = f04 (x) test1 = a + b
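For context, the test is inherently x86-specific because its context selectors score x86 ISA traits. A simplified sketch of the shape used in declare-variant-14.c (not a verbatim quote; selector details trimmed):

int f01 (int);   /* chosen when avx512f is selectable (highest score) */
int f03 (int);   /* chosen for the sse2/sse3/sse4.2 cases */

#pragma omp declare variant (f01) match (device={isa("avx512f")})
#pragma omp declare variant (f03) match (device={isa(sse2,"sse3")})
int f04 (int);   /* base function whose calls the dump scans count */

On aarch64 none of these selectors can match, so the expected dump counts have no meaning there and restricting the test to i?86/x86_64 is cleaner than carrying aarch64-specific counts.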
[gcc r15-6392] AArch64: Add SVE support for simd clones [PR96342]
https://gcc.gnu.org/g:d7d3dfe7a2a26e370805ddf835bfd00c51d32f1b commit r15-6392-gd7d3dfe7a2a26e370805ddf835bfd00c51d32f1b Author: Tamar Christina Date: Fri Dec 20 14:27:25 2024 + AArch64: Add SVE support for simd clones [PR96342] This patch finalizes adding support for the generation of SVE simd clones when no simdlen is provided, following the ABI rules where the widest data type determines the minimum amount of elements in a length agnostic vector. gcc/ChangeLog: PR target/96342 * config/aarch64/aarch64-protos.h (add_sve_type_attribute): Declare. * config/aarch64/aarch64-sve-builtins.cc (add_sve_type_attribute): Make visibility global and support use for non_acle types. * config/aarch64/aarch64.cc (aarch64_simd_clone_compute_vecsize_and_simdlen): Create VLA simd clone when no simdlen is provided, according to ABI rules. (simd_clone_adjust_sve_vector_type): New helper function. (aarch64_simd_clone_adjust): Add '+sve' attribute to SVE simd clones and modify types to use SVE types. * omp-simd-clone.cc (simd_clone_mangle): Print 'x' for VLA simdlen. (simd_clone_adjust): Adapt safelen check to be compatible with VLA simdlen. gcc/testsuite/ChangeLog: PR target/96342 * gcc.target/aarch64/declare-simd-2.c: Add SVE clone scan. * gcc.target/aarch64/vect-simd-clone-1.c: New test. * g++.target/aarch64/vect-simd-clone-1.C: New test. Co-authored-by: Victor Do Nascimento Co-authored-by: Tamar Christina Diff: --- gcc/config/aarch64/aarch64-protos.h| 2 + gcc/config/aarch64/aarch64-sve-builtins.cc | 9 +- gcc/config/aarch64/aarch64.cc | 175 + gcc/omp-simd-clone.cc | 13 +- .../g++.target/aarch64/vect-simd-clone-1.C | 88 +++ gcc/testsuite/gcc.target/aarch64/declare-simd-2.c | 1 + .../gcc.target/aarch64/vect-simd-clone-1.c | 89 +++ 7 files changed, 342 insertions(+), 35 deletions(-) diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h index bd17486e9128..7ab1316cf568 100644 --- a/gcc/config/aarch64/aarch64-protos.h +++ b/gcc/config/aarch64/aarch64-protos.h @@ -1151,6 +1151,8 @@ namespace aarch64_sve { #ifdef GCC_TARGET_H bool verify_type_context (location_t, type_context_kind, const_tree, bool); #endif + void add_sve_type_attribute (tree, unsigned int, unsigned int, + const char *, const char *); } extern void aarch64_split_combinev16qi (rtx operands[3]); diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc index 5acc56f99c65..e93c3a78e6d6 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc @@ -1032,15 +1032,18 @@ static GTY(()) hash_map *overload_names[2]; /* Record that TYPE is an ABI-defined SVE type that contains NUM_ZR SVE vectors and NUM_PR SVE predicates. MANGLED_NAME, if nonnull, is the ABI-defined - mangling of the type. ACLE_NAME is the name of the type. */ -static void + mangling of the type. mangling of the type. ACLE_NAME is the + name of the type, or null if does not provide the type. */ +void add_sve_type_attribute (tree type, unsigned int num_zr, unsigned int num_pr, const char *mangled_name, const char *acle_name) { tree mangled_name_tree = (mangled_name ? get_identifier (mangled_name) : NULL_TREE); + tree acle_name_tree += (acle_name ? 
get_identifier (acle_name) : NULL_TREE); - tree value = tree_cons (NULL_TREE, get_identifier (acle_name), NULL_TREE); + tree value = tree_cons (NULL_TREE, acle_name_tree, NULL_TREE); value = tree_cons (NULL_TREE, mangled_name_tree, value); value = tree_cons (NULL_TREE, size_int (num_pr), value); value = tree_cons (NULL_TREE, size_int (num_zr), value); diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 77a2a6bfa3a3..de4c0a078391 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -29323,7 +29323,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, int num, bool explicit_p) { tree t, ret_type; - unsigned int nds_elt_bits; + unsigned int nds_elt_bits, wds_elt_bits; unsigned HOST_WIDE_INT const_simdlen; if (!TARGET_SIMD) @@ -29368,10 +29368,14 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, if (TREE_CODE (ret_type) != VOID_TYPE) { nds_elt_bits = lane_size (SIMD_CLONE_ARG_TYPE_VECTOR, ret_type); + wds_elt_bits = nds_elt_bits; vec_elts.safe_push (std::make_pair (ret_type, nds_elt_bits)); } else -nds_elt_bits = POINT
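A hedged illustration of the effect (the function name is made up; the expected mangling mirrors the vect-simd-clone-4.c test added later in this series): when no simdlen clause is given, GCC can now emit a vector-length-agnostic SVE clone alongside the fixed-length Advanced SIMD clones, and the simdlen position of its mangled name becomes the letter 'x':

__attribute__ ((simd, const)) char fn (short x);

/* Illustrative expected clone symbol: _ZGVsMxv_fn
   's' = SVE ISA, 'M' = masked, 'x' = length-agnostic simdlen,
   'v' = one vector parameter.  Per the ABI rule described above, the
   minimum number of lanes is derived from the widest data type used
   by the clone.  */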
[gcc r15-6393] AArch64: Implement vector concat of partial SVE vectors [PR96342]
https://gcc.gnu.org/g:89b2c7dc96c4944c306131b665a4738a8a99413e commit r15-6393-g89b2c7dc96c4944c306131b665a4738a8a99413e Author: Tamar Christina Date: Fri Dec 20 14:34:32 2024 + AArch64: Implement vector concat of partial SVE vectors [PR96342] This patch adds support for vector constructor from two partial SVE vectors into a full SVE vector. It also implements support for the standard vec_init obtab to do this. gcc/ChangeLog: PR target/96342 * config/aarch64/aarch64-protos.h (aarch64_sve_expand_vector_init_subvector): New. * config/aarch64/aarch64-sve.md (vec_init): New. (@aarch64_pack_partial): New. * config/aarch64/aarch64.cc (aarch64_sve_expand_vector_init_subvector): New. * config/aarch64/iterators.md (SVE_NO2E): New. (VHALF, Vhalf): Add SVE partial vectors. gcc/testsuite/ChangeLog: PR target/96342 * gcc.target/aarch64/vect-simd-clone-2.c: New test. Diff: --- gcc/config/aarch64/aarch64-protos.h| 1 + gcc/config/aarch64/aarch64-sve.md | 23 + gcc/config/aarch64/aarch64.cc | 24 ++ gcc/config/aarch64/iterators.md| 20 -- .../gcc.target/aarch64/vect-simd-clone-2.c | 13 5 files changed, 79 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h index 7ab1316cf568..18764e407c13 100644 --- a/gcc/config/aarch64/aarch64-protos.h +++ b/gcc/config/aarch64/aarch64-protos.h @@ -1028,6 +1028,7 @@ rtx aarch64_replace_reg_mode (rtx, machine_mode); void aarch64_split_sve_subreg_move (rtx, rtx, rtx); void aarch64_expand_prologue (void); void aarch64_expand_vector_init (rtx, rtx); +void aarch64_sve_expand_vector_init_subvector (rtx, rtx); void aarch64_sve_expand_vector_init (rtx, rtx); void aarch64_init_cumulative_args (CUMULATIVE_ARGS *, const_tree, rtx, const_tree, unsigned, bool = false); diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index a72ca2a500d3..6659bb4fcab3 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -2839,6 +2839,16 @@ } ) +(define_expand "vec_init" + [(match_operand:SVE_NO2E 0 "register_operand") + (match_operand 1 "")] + "TARGET_SVE" + { +aarch64_sve_expand_vector_init_subvector (operands[0], operands[1]); +DONE; + } +) + ;; Shift an SVE vector left and insert a scalar into element 0. (define_insn "vec_shl_insert_" [(set (match_operand:SVE_FULL 0 "register_operand") @@ -9289,6 +9299,19 @@ "uzp1\t%0., %1., %2." ) +;; Integer partial pack packing two partial SVE types into a single full SVE +;; type of the same element type. Use UZP1 on the wider type, which discards +;; the high part of each wide element. This allows to concat SVE partial types +;; into a wider vector. +(define_insn "@aarch64_pack_partial" + [(set (match_operand:SVE_NO2E 0 "register_operand" "=w") + (vec_concat:SVE_NO2E + (match_operand: 1 "register_operand" "w") + (match_operand: 2 "register_operand" "w")))] + "TARGET_SVE" + "uzp1\t%0., %1., %2." +) + ;; - ;; [INT<-INT] Unpacks ;; - diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index de4c0a078391..41cc2eeec9a4 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -24870,6 +24870,30 @@ aarch64_sve_expand_vector_init (rtx target, rtx vals) aarch64_sve_expand_vector_init_insert_elems (target, v, nelts); } +/* Initialize register TARGET from the two vector subelements in PARALLEL + rtx VALS. 
*/ + +void +aarch64_sve_expand_vector_init_subvector (rtx target, rtx vals) +{ + machine_mode mode = GET_MODE (target); + int nelts = XVECLEN (vals, 0); + + gcc_assert (nelts == 2); + + rtx arg0 = XVECEXP (vals, 0, 0); + rtx arg1 = XVECEXP (vals, 0, 1); + + /* If we have two elements and are concatting vector. */ + machine_mode elem_mode = GET_MODE (arg0); + gcc_assert (VECTOR_MODE_P (elem_mode)); + + arg0 = force_reg (elem_mode, arg0); + arg1 = force_reg (elem_mode, arg1); + emit_insn (gen_aarch64_pack_partial (mode, target, arg0, arg1)); + return; +} + /* Check whether VALUE is a vector constant in which every element is either a power of 2 or a negated power of 2. If so, return a constant vector of log2s, and flip CODE between PLUS and MINUS diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md index 89c72b24aeb7..34200b05a3ab 100644 --- a/gcc/config/aarch64/iterators.md +++ b/gcc/config/aarch64/iterators.md @@ -140,6 +140,10 @@ ;; VQ without 2 element modes. (define_
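To see why a single UZP1 implements the concatenation, here is a hedged scalar model in plain C (not from the patch; it assumes the usual little-endian layout in which a partial SVE vector keeps its payload in the low half of each wide container, i.e. at the even narrow-element indices):

/* Model of "uzp1 out.<T>, a.<T>, b.<T>" for vectors of n narrow elements:
   the even-indexed elements of a fill the low half of out and the
   even-indexed elements of b fill the high half, which is exactly the
   payloads of the two partial vectors laid end to end.  */
void
uzp1_model (unsigned out[], const unsigned a[], const unsigned b[], int n)
{
  for (int i = 0; i < n / 2; i++)
    out[i] = a[2 * i];
  for (int i = 0; i < n / 2; i++)
    out[n / 2 + i] = b[2 * i];
}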
[gcc r15-5565] middle-end: Pass along SLP node when costing vector loads/stores
https://gcc.gnu.org/g:dbc38dd9e96a9995298da2478041bdbbf247c479 commit r15-5565-gdbc38dd9e96a9995298da2478041bdbbf247c479 Author: Tamar Christina Date: Thu Nov 21 12:49:35 2024 + middle-end: Pass along SLP node when costing vector loads/stores With the support to SLP only we now pass the VMAT through the SLP node, however the majority of the costing calls inside vectorizable_load and vectorizable_store do no pass the SLP node along. Due to this the backend costing never sees the VMAT for these cases anymore. Additionally the helper around record_stmt_cost when both SLP and stmt_vinfo are passed would only pass the SLP node along. However the SLP node doesn't contain all the info available in the stmt_vinfo and we'd have to go through the SLP_TREE_REPRESENTATIVE anyway. As such I changed the function to just Always pass both along. Unlike the VMAT changes, I don't believe there to be a correctness issue here but would minimize the number of churn in the backend costing until vectorizer costing as a whole is revisited in GCC 16. These changes re-enable the cost model on AArch64 and also correctly find the VMATs on loads and stores fixing testcases such as sve_iters_low_2.c. gcc/ChangeLog: * tree-vect-data-refs.cc (vect_get_data_access_cost): Pass NULL for SLP node. * tree-vect-stmts.cc (record_stmt_cost): Expose. (vect_get_store_cost, vect_get_load_cost): Extend with SLP node. (vectorizable_store, vectorizable_load): Pass SLP node to all costing. * tree-vectorizer.h (record_stmt_cost): Always pass both SLP node and stmt_vinfo to costing. (vect_get_load_cost, vect_get_store_cost): Extend with SLP node. Diff: --- gcc/tree-vect-data-refs.cc | 12 ++--- gcc/tree-vect-stmts.cc | 109 + gcc/tree-vectorizer.h | 16 +++ 3 files changed, 76 insertions(+), 61 deletions(-) diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc index a32343c0022b..35c946ab2d4e 100644 --- a/gcc/tree-vect-data-refs.cc +++ b/gcc/tree-vect-data-refs.cc @@ -1729,12 +1729,14 @@ vect_get_data_access_cost (vec_info *vinfo, dr_vec_info *dr_info, ncopies = vect_get_num_copies (loop_vinfo, STMT_VINFO_VECTYPE (stmt_info)); if (DR_IS_READ (dr_info->dr)) -vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme, - misalignment, true, inside_cost, - outside_cost, prologue_cost_vec, body_cost_vec, false); +vect_get_load_cost (vinfo, stmt_info, NULL, ncopies, + alignment_support_scheme, misalignment, true, + inside_cost, outside_cost, prologue_cost_vec, + body_cost_vec, false); else -vect_get_store_cost (vinfo,stmt_info, ncopies, alignment_support_scheme, -misalignment, inside_cost, body_cost_vec); +vect_get_store_cost (vinfo,stmt_info, NULL, ncopies, +alignment_support_scheme, misalignment, inside_cost, +body_cost_vec); if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 75973c77236e..e500902a8be9 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -93,7 +93,7 @@ stmt_in_inner_loop_p (vec_info *vinfo, class _stmt_vec_info *stmt_info) target model or by saving it in a vector for later processing. Return a preliminary estimate of the statement's cost. */ -static unsigned +unsigned record_stmt_cost (stmt_vector_for_cost *body_cost_vec, int count, enum vect_cost_for_stmt kind, stmt_vec_info stmt_info, slp_tree node, @@ -1008,8 +1008,8 @@ cfun_returns (tree decl) /* Calculate cost of DR's memory access. 
*/ void -vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, int ncopies, -dr_alignment_support alignment_support_scheme, +vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, slp_tree slp_node, +int ncopies, dr_alignment_support alignment_support_scheme, int misalignment, unsigned int *inside_cost, stmt_vector_for_cost *body_cost_vec) @@ -1019,7 +1019,7 @@ vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, int ncopies, case dr_aligned: { *inside_cost += record_stmt_cost (body_cost_vec, ncopies, - vector_store, stmt_info, 0, + vector_store, stmt_info, slp_node, 0, vect_body); if (dump_enabled_p ()) @@ -1032,7 +1032,7 @@ vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, int ncopies, { /* Here, we assign an additi
[gcc r15-6752] AArch64: Fix costing of emulated gathers/scatters [PR118188]
https://gcc.gnu.org/g:08b6e875c6b1b52c6e98f4a2e37124bf8c6a6ccb commit r15-6752-g08b6e875c6b1b52c6e98f4a2e37124bf8c6a6ccb Author: Tamar Christina Date: Thu Jan 9 21:31:05 2025 + AArch64: Fix costing of emulated gathers/scatters [PR118188] When a target does not support gathers and scatters the vectorizer tries to emulate these using scalar loads/stores and a reconstruction of vectors from scalar. The loads are still marked with VMAT_GATHER_SCATTER to indicate that they are gather/scatters, however the vectorizer also asks the target to cost the instruction that generates the indexes for the emulated instructions. This is done by asking the target to cost vec_to_scalar and vec_construct with a stmt_vinfo being the VMAT_GATHER_SCATTER. Since Adv. SIMD does not have an LD1 variant that takes an Adv. SIMD Scalar element the operation is lowered entirely into a sequence of GPR loads to create the x registers for the indexes. At the moment however we don't cost these, and so the vectorizer things that when it emulates the instructions that it's much cheaper than using an actual gather/scatter with SVE. Consider: #define iterations 10 #define LEN_1D 32000 float a[LEN_1D], b[LEN_1D]; float s4115 (int *ip) { float sum = 0.; for (int i = 0; i < LEN_1D; i++) { sum += a[i] * b[ip[i]]; } return sum; } which before this patch with -mcpu= generates: .L2: add x3, x0, x1 ldrsw x4, [x0, x1] ldrsw x6, [x3, 4] ldpsw x3, x5, [x3, 8] ldr s1, [x2, x4, lsl 2] ldr s30, [x2, x6, lsl 2] ldr s31, [x2, x5, lsl 2] ldr s29, [x2, x3, lsl 2] uzp1v30.2s, v30.2s, v31.2s ldr q31, [x7, x1] add x1, x1, 16 uzp1v1.2s, v1.2s, v29.2s zip1v30.4s, v1.4s, v30.4s fmlav0.4s, v31.4s, v30.4s cmp x1, x8 bne .L2 but during costing: a[i_18] 1 times vector_load costs 4 in body *_4 1 times unaligned_load (misalign -1) costs 4 in body b[_5] 4 times vec_to_scalar costs 32 in body b[_5] 4 times scalar_load costs 16 in body b[_5] 1 times vec_construct costs 3 in body _1 * _6 1 times vector_stmt costs 2 in body _7 + sum_16 1 times scalar_to_vec costs 4 in prologue _7 + sum_16 1 times vector_stmt costs 2 in epilogue _7 + sum_16 1 times vec_to_scalar costs 4 in epilogue _7 + sum_16 1 times vector_stmt costs 2 in body Here we see that the latency for the vec_to_scalar is very high. We know the intermediate vector isn't usable by the target ISA and will always be elided. However these latencies need to remain high because when costing gather/scatters IFNs we still pass the nunits of the type along. In other words, the vectorizer is still costing vector gather/scatters as scalar load/stores. Lowering the cost for the emulated gathers would result in emulation being seemingly cheaper. So while the emulated costs are very high, they need to be higher than those for the IFN costing. i.e. the vectorizer generates: vect__5.9_8 = MEM [(intD.7 *)vectp_ip.7_14]; _35 = BIT_FIELD_REF ; _36 = (sizetype) _35; _37 = _36 * 4; _38 = _34 + _37; _39 = (voidD.55 *) _38; # VUSE <.MEM_10(D)> _40 = MEM[(floatD.32 *)_39]; which after IVopts is: _63 = &MEM [(int *)ip_11(D) + ivtmp.19_27 * 1]; _47 = BIT_FIELD_REF [(int *)_63], 32, 64>; _41 = BIT_FIELD_REF [(int *)_63], 32, 32>; _35 = BIT_FIELD_REF [(int *)_63], 32, 0>; _53 = BIT_FIELD_REF [(int *)_63], 32, 96>; Which we correctly lower in RTL to individual loads to avoid the repeated umov. 
As such, we should cost the vec_to_scalar as GPR loads and also do so for the throughput, which at the moment we cost as: note: Vector issue estimate: note:load operations = 6 note:store operations = 0 note:general operations = 6 note:reduction latency = 2 note:estimated min cycles per iteration = 2.00 Which means 3 loads for the GPR indexes are missing, making it seem like the emulated loop has a much lower cycles per iter than it actually does since the bottleneck on the load units is not modelled. But worse, because the vectorizer costs gather/scatter IFNs as scalar load/stores the number of loads required for an SVE gather is always much higher than the equivalent emulated variant. gcc/ChangeLog: PR target/118188 * config/aarch64/aarch64.cc (aarch64_vector_costs::count_ops): Adjust throughput of emu
[gcc r14-11199] AArch64: correct Cortex-X4 MIDR
https://gcc.gnu.org/g:26f78a4249b051c7755a44ba1ab1743f4133b0c2 commit r14-11199-g26f78a4249b051c7755a44ba1ab1743f4133b0c2 Author: Tamar Christina Date: Fri Jan 10 21:33:57 2025 + AArch64: correct Cortex-X4 MIDR The Parts Num field for the MIDR for Cortex-X4 is wrong. It's currently the parts number for a Cortex-A720 (which does have the right number). The correct number can be found in the Cortex-X4 Technical Reference Manual [1] on page 382 in Issue Number 5. [1] https://developer.arm.com/documentation/102484/latest/ gcc/ChangeLog: * config/aarch64/aarch64-cores.def (AARCH64_CORE): Fix cortex-x4 parts num. Diff: --- gcc/config/aarch64/aarch64-cores.def | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index a919ab7d8a5a..b1eaf5512b57 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -185,7 +185,7 @@ AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8M AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd4e, -1) -AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd82, -1) AARCH64_CORE("cortex-x925", cortexx925, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd85, -1)
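For reference, -mcpu=native matches the implementer and part fields that the Linux kernel exposes in /proc/cpuinfo. After this fix a Cortex-X4 core is identified from lines of the following shape (illustrative excerpt, other fields omitted):

CPU implementer : 0x41
CPU part        : 0xd82

With the old table value (0xd81, the Cortex-A720 part number) a real Cortex-X4 reporting 0xd82 matched no entry at all and fell back to the generic handling.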
[gcc r15-7094] aarch64: Drop ILP32 from default elf multilibs after deprecation
https://gcc.gnu.org/g:9fd190c70976638eb8ae239f09d9f73da26d3021 commit r15-7094-g9fd190c70976638eb8ae239f09d9f73da26d3021 Author: Tamar Christina Date: Tue Jan 21 10:27:13 2025 + aarch64: Drop ILP32 from default elf multilibs after deprecation Following the deprecation of ILP32 *-elf builds fail now due to -Werror on the deprecation warning. This is because on embedded builds ILP32 is part of the default multilib. This patch removed it from the default target as the build would fail anyway. gcc/ChangeLog: * config.gcc (aarch64-*-elf): Drop ILP32 from default multilibs. Diff: --- gcc/config.gcc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/config.gcc b/gcc/config.gcc index c0e66a26f953..6f9f7313e132 100644 --- a/gcc/config.gcc +++ b/gcc/config.gcc @@ -1210,7 +1210,7 @@ aarch64*-*-elf | aarch64*-*-fuchsia* | aarch64*-*-rtems*) esac aarch64_multilibs="${with_multilib_list}" if test "$aarch64_multilibs" = "default"; then - aarch64_multilibs="lp64,ilp32" + aarch64_multilibs="lp64" fi aarch64_multilibs=`echo $aarch64_multilibs | sed -e 's/,/ /g'` for aarch64_multilib in ${aarch64_multilibs}; do
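For builds that still need the old behaviour, the multilib set can be requested explicitly at configure time (hedged example; the build directory layout and target triplet are illustrative, the option is the with_multilib_list knob read in the hunk above):

../gcc/configure --target=aarch64-none-elf --with-multilib-list=lp64,ilp32 ...

In other words the change only makes --with-multilib-list=lp64 the effective default; asking for ilp32 again is still possible but will likely run into the same -Werror'd deprecation warning that motivated the change.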
[gcc r15-7018] AArch64: Use standard names for saturating arithmetic
https://gcc.gnu.org/g:aa361611490947eb228e5b625a3f0f23ff647dbd commit r15-7018-gaa361611490947eb228e5b625a3f0f23ff647dbd Author: Akram Ahmad Date: Fri Jan 17 17:43:49 2025 + AArch64: Use standard names for saturating arithmetic This renames the existing {s,u}q{add,sub} instructions to use the standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and IFN_SAT_SUB. The NEON intrinsics for saturating arithmetic and their corresponding builtins are changed to use these standard names too. Using the standard names for the instructions causes 32 and 64-bit unsigned scalar saturating arithmetic to use the NEON instructions, resulting in an additional (and inefficient) FMOV to be generated when the original operands are in GP registers. This patch therefore also restores the original behaviour of using the adds/subs instructions in this circumstance. Additional tests are written for the scalar and Adv. SIMD cases to ensure that the correct instructions are used. The NEON intrinsics are already tested elsewhere. gcc/ChangeLog: * config/aarch64/aarch64-builtins.cc: Expand iterators. * config/aarch64/aarch64-simd-builtins.def: Use standard names * config/aarch64/aarch64-simd.md: Use standard names, split insn definitions on signedness of operator and type of operands. * config/aarch64/arm_neon.h: Use standard builtin names. * config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to simplify splitting of insn for unsigned scalar arithmetic. gcc/testsuite/ChangeLog: * gcc.target/aarch64/scalar_intrinsics.c: Update testcases. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc: Template file for unsigned vector saturating arithmetic tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c: 8-bit vector type tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c: 16-bit vector type tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c: 32-bit vector type tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c: 64-bit vector type tests. * gcc.target/aarch64/saturating_arithmetic.inc: Template file for scalar saturating arithmetic tests. * gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests. * gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests. * gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests. * gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests. 
Co-authored-by: Tamar Christina Diff: --- gcc/config/aarch64/aarch64-builtins.cc | 12 + gcc/config/aarch64/aarch64-simd-builtins.def | 8 +- gcc/config/aarch64/aarch64-simd.md | 207 +++- gcc/config/aarch64/arm_neon.h | 96 gcc/config/aarch64/iterators.md| 4 + .../saturating_arithmetic_autovect.inc | 58 + .../saturating_arithmetic_autovect_1.c | 79 ++ .../saturating_arithmetic_autovect_2.c | 79 ++ .../saturating_arithmetic_autovect_3.c | 75 ++ .../saturating_arithmetic_autovect_4.c | 77 ++ .../aarch64/saturating-arithmetic-signed.c | 270 + .../gcc.target/aarch64/saturating_arithmetic.inc | 39 +++ .../gcc.target/aarch64/saturating_arithmetic_1.c | 36 +++ .../gcc.target/aarch64/saturating_arithmetic_2.c | 36 +++ .../gcc.target/aarch64/saturating_arithmetic_3.c | 30 +++ .../gcc.target/aarch64/saturating_arithmetic_4.c | 30 +++ .../gcc.target/aarch64/scalar_intrinsics.c | 32 +-- 17 files changed, 1096 insertions(+), 72 deletions(-) diff --git a/gcc/config/aarch64/aarch64-builtins.cc b/gcc/config/aarch64/aarch64-builtins.cc index 86eebc168859..6d5479c2e449 100644 --- a/gcc/config/aarch64/aarch64-builtins.cc +++ b/gcc/config/aarch64/aarch64-builtins.cc @@ -5039,6 +5039,18 @@ aarch64_general_gimple_fold_builtin (unsigned int fcode, gcall *stmt, new_stmt = gimple_build_assign (gimple_call_lhs (stmt), LSHIFT_EXPR, args[0], args[1]); break; + /* lower saturating add/sub neon builtins to gimple. */ + BUILTIN_VSDQ_I (BINOP, ssadd, 3, DEFAULT) + BUILTIN_VSDQ_I (BINOPU, usadd, 3, DEFAULT) + new_stmt = gimple_build_call_internal (IFN_SAT_ADD, 2, args[0], args[1]); + gimple_call_set_lhs (new_stmt, gimple_call_lhs (stmt)); + break; + BUILTIN_VSDQ_I (BINOP, sssub, 3, DEFAULT) + BUILTIN_VSDQ_I (BINOPU, ussub, 3, DEFAULT) + new_stmt = gimple_build_call_intern
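For reference, the C idiom these standard names serve (the same shape as the saturating_arithmetic.inc templates elsewhere in this series): GCC matches it to IFN_SAT_ADD/IFN_SAT_SUB and, with this patch, expands it through the us{add,sub}<mode>3 and ss{add,sub}<mode>3 patterns, typically adds+csinv or subs+csel for scalar GPR operands and uqadd/uqsub for vector code:

#include <stdint.h>

uint32_t sat_addu (uint32_t a, uint32_t b)
{
  uint32_t sum = a + b;
  return sum < a ? UINT32_MAX : sum;   /* clamp to UINT32_MAX on overflow */
}

uint32_t sat_subu (uint32_t a, uint32_t b)
{
  uint32_t dif = a - b;
  return dif > a ? 0 : dif;            /* clamp to 0 on underflow */
}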
[gcc r15-7015] Revert "AArch64: Use standard names for SVE saturating arithmetic"
https://gcc.gnu.org/g:8787f63de6e51bc43f86bb08c8a5f4a370246a90 commit r15-7015-g8787f63de6e51bc43f86bb08c8a5f4a370246a90 Author: Tamar Christina Date: Sat Jan 18 11:12:35 2025 + Revert "AArch64: Use standard names for SVE saturating arithmetic" This reverts commit 26b2d9f27ca24f0705641a85f29d179fa0600869. Diff: --- gcc/config/aarch64/aarch64-sve.md | 4 +- .../aarch64/sve/saturating_arithmetic.inc | 68 -- .../aarch64/sve/saturating_arithmetic_1.c | 60 --- .../aarch64/sve/saturating_arithmetic_2.c | 60 --- .../aarch64/sve/saturating_arithmetic_3.c | 62 .../aarch64/sve/saturating_arithmetic_4.c | 62 6 files changed, 2 insertions(+), 314 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index e975286a0190..ba4b4d904c77 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -4449,7 +4449,7 @@ ;; - ;; Unpredicated saturating signed addition and subtraction. -(define_insn "s3" +(define_insn "@aarch64_sve_" [(set (match_operand:SVE_FULL_I 0 "register_operand") (SBINQOPS:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand") @@ -4465,7 +4465,7 @@ ) ;; Unpredicated saturating unsigned addition and subtraction. -(define_insn "s3" +(define_insn "@aarch64_sve_" [(set (match_operand:SVE_FULL_I 0 "register_operand") (UBINQOPS:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand") diff --git a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc deleted file mode 100644 index 0b3ebbcb0d6f.. --- a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc +++ /dev/null @@ -1,68 +0,0 @@ -/* Template file for vector saturating arithmetic validation. - - This file defines saturating addition and subtraction functions for a given - scalar type, testing the auto-vectorization of these two operators. This - type, along with the corresponding minimum and maximum values for that type, - must be defined by any test file which includes this template file. */ - -#ifndef SAT_ARIT_AUTOVEC_INC -#define SAT_ARIT_AUTOVEC_INC - -#include -#include - -#ifndef UT -#define UT uint32_t -#define UMAX UINT_MAX -#define UMIN 0 -#endif - -void uaddq (UT *out, UT *a, UT *b, int n) -{ - for (int i = 0; i < n; i++) -{ - UT sum = a[i] + b[i]; - out[i] = sum < a[i] ? UMAX : sum; -} -} - -void uaddq2 (UT *out, UT *a, UT *b, int n) -{ - for (int i = 0; i < n; i++) -{ - UT sum; - if (!__builtin_add_overflow(a[i], b[i], &sum)) - out[i] = sum; - else - out[i] = UMAX; -} -} - -void uaddq_imm (UT *out, UT *a, int n) -{ - for (int i = 0; i < n; i++) -{ - UT sum = a[i] + 50; - out[i] = sum < a[i] ? UMAX : sum; -} -} - -void usubq (UT *out, UT *a, UT *b, int n) -{ - for (int i = 0; i < n; i++) -{ - UT sum = a[i] - b[i]; - out[i] = sum > a[i] ? UMIN : sum; -} -} - -void usubq_imm (UT *out, UT *a, int n) -{ - for (int i = 0; i < n; i++) -{ - UT sum = a[i] - 50; - out[i] = sum > a[i] ? UMIN : sum; -} -} - -#endif \ No newline at end of file diff --git a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c deleted file mode 100644 index 6936e9a27044.. --- a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c +++ /dev/null @@ -1,60 +0,0 @@ -/* { dg-do compile { target { aarch64*-*-* } } } */ -/* { dg-options "-O2 --save-temps -ftree-vectorize" } */ -/* { dg-final { check-function-bodies "**" "" "" } } */ - -/* -** uaddq: -** ... 
-** ld1b\tz([0-9]+)\.b, .* -** ld1b\tz([0-9]+)\.b, .* -** uqadd\tz\2.b, z\1\.b, z\2\.b -** ... -** ldr\tb([0-9]+), .* -** ldr\tb([0-9]+), .* -** uqadd\tb\4, b\3, b\4 -** ... -*/ -/* -** uaddq2: -** ... -** ld1b\tz([0-9]+)\.b, .* -** ld1b\tz([0-9]+)\.b, .* -** uqadd\tz\2.b, z\1\.b, z\2\.b -** ... -** ldr\tb([0-9]+), .* -** ldr\tb([0-9]+), .* -** uqadd\tb\4, b\3, b\4 -** ... -*/ -/* -** uaddq_imm: -** ... -** ld1b\tz([0-9]+)\.b, .* -** uqadd\tz\1.b, z\1\.b, #50 -** ... -** movi\tv([0-9]+)\.8b, 0x32 -** ... -** ldr\tb([0-9]+), .* -** uqadd\tb\3, b\3, b\2 -** ... -*/ -/* -** usubq: { xfail *-*-* } -** ... -** ld1b\tz([0-9]+)\.b, .* -** ld1b\tz([0-9]+)\.b, .* -** uqsub\tz\2.b, z\1\.b, z\2\.b -** ... -** ldr\tb([0-9]+), .* -** ldr\tb([0-9]+), .* -** uqsub\tb\4, b\3, b\4 -** ... -*/ - -#include - -#define UT unsigned char -#define UMAX UCHAR_MAX -#define UMIN 0 - -#include "saturating_arithmetic.inc" \ No newline at end of file diff --git a/gcc/testsuite/gcc
[gcc r15-7016] Revert "AArch64: Use standard names for saturating arithmetic"
https://gcc.gnu.org/g:1775a7280a230776927897147f1b07964cf5cfc7 commit r15-7016-g1775a7280a230776927897147f1b07964cf5cfc7 Author: Tamar Christina Date: Sat Jan 18 11:12:38 2025 + Revert "AArch64: Use standard names for saturating arithmetic" This reverts commit 5f5833a4107ddfbcd87651bf140151de043f4c36. Diff: --- gcc/config/aarch64/aarch64-builtins.cc | 12 - gcc/config/aarch64/aarch64-simd-builtins.def | 8 +- gcc/config/aarch64/aarch64-simd.md | 207 +--- gcc/config/aarch64/arm_neon.h | 96 gcc/config/aarch64/iterators.md| 4 - .../saturating_arithmetic_autovect.inc | 58 - .../saturating_arithmetic_autovect_1.c | 79 -- .../saturating_arithmetic_autovect_2.c | 79 -- .../saturating_arithmetic_autovect_3.c | 75 -- .../saturating_arithmetic_autovect_4.c | 77 -- .../aarch64/saturating-arithmetic-signed.c | 270 - .../gcc.target/aarch64/saturating_arithmetic.inc | 39 --- .../gcc.target/aarch64/saturating_arithmetic_1.c | 36 --- .../gcc.target/aarch64/saturating_arithmetic_2.c | 36 --- .../gcc.target/aarch64/saturating_arithmetic_3.c | 30 --- .../gcc.target/aarch64/saturating_arithmetic_4.c | 30 --- .../gcc.target/aarch64/scalar_intrinsics.c | 32 +-- 17 files changed, 72 insertions(+), 1096 deletions(-) diff --git a/gcc/config/aarch64/aarch64-builtins.cc b/gcc/config/aarch64/aarch64-builtins.cc index 6d5479c2e449..86eebc168859 100644 --- a/gcc/config/aarch64/aarch64-builtins.cc +++ b/gcc/config/aarch64/aarch64-builtins.cc @@ -5039,18 +5039,6 @@ aarch64_general_gimple_fold_builtin (unsigned int fcode, gcall *stmt, new_stmt = gimple_build_assign (gimple_call_lhs (stmt), LSHIFT_EXPR, args[0], args[1]); break; - /* lower saturating add/sub neon builtins to gimple. */ - BUILTIN_VSDQ_I (BINOP, ssadd, 3, DEFAULT) - BUILTIN_VSDQ_I (BINOPU, usadd, 3, DEFAULT) - new_stmt = gimple_build_call_internal (IFN_SAT_ADD, 2, args[0], args[1]); - gimple_call_set_lhs (new_stmt, gimple_call_lhs (stmt)); - break; - BUILTIN_VSDQ_I (BINOP, sssub, 3, DEFAULT) - BUILTIN_VSDQ_I (BINOPU, ussub, 3, DEFAULT) - new_stmt = gimple_build_call_internal (IFN_SAT_SUB, 2, args[0], args[1]); - gimple_call_set_lhs (new_stmt, gimple_call_lhs (stmt)); - break; - BUILTIN_VSDQ_I_DI (BINOP, sshl, 0, DEFAULT) BUILTIN_VSDQ_I_DI (BINOP_UUS, ushl, 0, DEFAULT) { diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def index 6cc45b18a723..286272a33118 100644 --- a/gcc/config/aarch64/aarch64-simd-builtins.def +++ b/gcc/config/aarch64/aarch64-simd-builtins.def @@ -71,10 +71,10 @@ BUILTIN_VSDQ_I (BINOP, sqrshl, 0, DEFAULT) BUILTIN_VSDQ_I (BINOP_UUS, uqrshl, 0, DEFAULT) /* Implemented by aarch64_. */ - BUILTIN_VSDQ_I (BINOP, ssadd, 3, DEFAULT) - BUILTIN_VSDQ_I (BINOPU, usadd, 3, DEFAULT) - BUILTIN_VSDQ_I (BINOP, sssub, 3, DEFAULT) - BUILTIN_VSDQ_I (BINOPU, ussub, 3, DEFAULT) + BUILTIN_VSDQ_I (BINOP, sqadd, 0, DEFAULT) + BUILTIN_VSDQ_I (BINOPU, uqadd, 0, DEFAULT) + BUILTIN_VSDQ_I (BINOP, sqsub, 0, DEFAULT) + BUILTIN_VSDQ_I (BINOPU, uqsub, 0, DEFAULT) /* Implemented by aarch64_qadd. 
*/ BUILTIN_VSDQ_I (BINOP_SSU, suqadd, 0, DEFAULT) BUILTIN_VSDQ_I (BINOP_UUS, usqadd, 0, DEFAULT) diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index e2afe87e5130..eeb626f129a8 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -5162,214 +5162,15 @@ ) ;; q -(define_insn "s3" - [(set (match_operand:VSDQ_I_QI_HI 0 "register_operand" "=w") - (BINQOPS:VSDQ_I_QI_HI - (match_operand:VSDQ_I_QI_HI 1 "register_operand" "w") - (match_operand:VSDQ_I_QI_HI 2 "register_operand" "w")))] +(define_insn "aarch64_q" + [(set (match_operand:VSDQ_I 0 "register_operand" "=w") + (BINQOPS:VSDQ_I (match_operand:VSDQ_I 1 "register_operand" "w") + (match_operand:VSDQ_I 2 "register_operand" "w")))] "TARGET_SIMD" "q\\t%0, %1, %2" [(set_attr "type" "neon_q")] ) -(define_expand "s3" - [(parallel -[(set (match_operand:GPI 0 "register_operand") - (SBINQOPS:GPI (match_operand:GPI 1 "register_operand") - (match_operand:GPI 2 "aarch64_plus_operand"))) -(clobber (scratch:GPI)) -(clobber (reg:CC CC_REGNUM))])] -) - -;; Introducing a temporary GP reg allows signed saturating arithmetic with GPR -;; operands to be calculated without the use of costly transfers to and from FP -;; registers. For example, saturating addition usually uses three FMOVs: -;; -;; fmov d0, x0 -;; fmov d1, x1 -;; sqadd d0, d0, d1 -;; fmov x0, d0 -;; -;
[gcc r15-7017] AArch64: Use standard names for SVE saturating arithmetic
https://gcc.gnu.org/g:8f8ca83f2f6f165c4060ee1fc18ed3c74571ab7a commit r15-7017-g8f8ca83f2f6f165c4060ee1fc18ed3c74571ab7a Author: Akram Ahmad Date: Fri Jan 17 17:44:23 2025 + AArch64: Use standard names for SVE saturating arithmetic Rename the existing SVE unpredicated saturating arithmetic instructions to use standard names which are used by IFN_SAT_ADD and IFN_SAT_SUB. gcc/ChangeLog: * config/aarch64/aarch64-sve.md: Rename insns gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/saturating_arithmetic.inc: Template file for auto-vectorizer tests. * gcc.target/aarch64/sve/saturating_arithmetic_1.c: Instantiate 8-bit vector tests. * gcc.target/aarch64/sve/saturating_arithmetic_2.c: Instantiate 16-bit vector tests. * gcc.target/aarch64/sve/saturating_arithmetic_3.c: Instantiate 32-bit vector tests. * gcc.target/aarch64/sve/saturating_arithmetic_4.c: Instantiate 64-bit vector tests. Diff: --- gcc/config/aarch64/aarch64-sve.md | 4 +- .../aarch64/sve/saturating_arithmetic.inc | 68 ++ .../aarch64/sve/saturating_arithmetic_1.c | 60 +++ .../aarch64/sve/saturating_arithmetic_2.c | 60 +++ .../aarch64/sve/saturating_arithmetic_3.c | 62 .../aarch64/sve/saturating_arithmetic_4.c | 62 6 files changed, 314 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index ba4b4d904c77..e975286a0190 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -4449,7 +4449,7 @@ ;; - ;; Unpredicated saturating signed addition and subtraction. -(define_insn "@aarch64_sve_" +(define_insn "s3" [(set (match_operand:SVE_FULL_I 0 "register_operand") (SBINQOPS:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand") @@ -4465,7 +4465,7 @@ ) ;; Unpredicated saturating unsigned addition and subtraction. -(define_insn "@aarch64_sve_" +(define_insn "s3" [(set (match_operand:SVE_FULL_I 0 "register_operand") (UBINQOPS:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand") diff --git a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc new file mode 100644 index ..0b3ebbcb0d6f --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc @@ -0,0 +1,68 @@ +/* Template file for vector saturating arithmetic validation. + + This file defines saturating addition and subtraction functions for a given + scalar type, testing the auto-vectorization of these two operators. This + type, along with the corresponding minimum and maximum values for that type, + must be defined by any test file which includes this template file. */ + +#ifndef SAT_ARIT_AUTOVEC_INC +#define SAT_ARIT_AUTOVEC_INC + +#include +#include + +#ifndef UT +#define UT uint32_t +#define UMAX UINT_MAX +#define UMIN 0 +#endif + +void uaddq (UT *out, UT *a, UT *b, int n) +{ + for (int i = 0; i < n; i++) +{ + UT sum = a[i] + b[i]; + out[i] = sum < a[i] ? UMAX : sum; +} +} + +void uaddq2 (UT *out, UT *a, UT *b, int n) +{ + for (int i = 0; i < n; i++) +{ + UT sum; + if (!__builtin_add_overflow(a[i], b[i], &sum)) + out[i] = sum; + else + out[i] = UMAX; +} +} + +void uaddq_imm (UT *out, UT *a, int n) +{ + for (int i = 0; i < n; i++) +{ + UT sum = a[i] + 50; + out[i] = sum < a[i] ? UMAX : sum; +} +} + +void usubq (UT *out, UT *a, UT *b, int n) +{ + for (int i = 0; i < n; i++) +{ + UT sum = a[i] - b[i]; + out[i] = sum > a[i] ? UMIN : sum; +} +} + +void usubq_imm (UT *out, UT *a, int n) +{ + for (int i = 0; i < n; i++) +{ + UT sum = a[i] - 50; + out[i] = sum > a[i] ? 
UMIN : sum; +} +} + +#endif \ No newline at end of file diff --git a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c new file mode 100644 index ..6936e9a27044 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c @@ -0,0 +1,60 @@ +/* { dg-do compile { target { aarch64*-*-* } } } */ +/* { dg-options "-O2 --save-temps -ftree-vectorize" } */ +/* { dg-final { check-function-bodies "**" "" "" } } */ + +/* +** uaddq: +** ... +** ld1b\tz([0-9]+)\.b, .* +** ld1b\tz([0-9]+)\.b, .* +** uqadd\tz\2.b, z\1\.b, z\2\.b +** ... +** ldr\tb([0-9]+), .* +** ldr\tb([0-9]+), .* +** uqadd\tb\4, b\3, b\4 +** ... +*/ +/* +** uaddq2: +** ... +** ld1b\tz([0-9]+)\.b, .* +** ld1b\tz([0-9]+)\.b, .* +** uqadd\tz\2.b, z\1\.b,
[gcc r13-9351] AArch64: have -mcpu=native detect architecture extensions for unknown non-homogenous systems [PR113257]
https://gcc.gnu.org/g:eb45b829bb3fb658aa34a340264dee9755d34e69 commit r13-9351-geb45b829bb3fb658aa34a340264dee9755d34e69 Author: Tamar Christina Date: Thu Jan 16 19:25:26 2025 + AArch64: have -mcpu=native detect architecture extensions for unknown non-homogenous systems [PR113257] in g:e91a17fe39c39e98cebe6e1cbc8064ee6846a3a7 we added the ability for -mcpu=native on unknown CPUs to still enable architecture extensions. This has worked great but was only added for homogenous systems. However the same thing works for big.LITTLE as in such system the cores must have the same extensions otherwise it doesn't fundamentally work. i.e. task migration from one core to the other wouldn't work. This extends the same handling to non-homogenous systems. gcc/ChangeLog: PR target/113257 * config/aarch64/driver-aarch64.cc (get_cpu_from_id, DEFAULT_CPU): New. (host_detect_local_cpu): Use it. gcc/testsuite/ChangeLog: PR target/113257 * gcc.target/aarch64/cpunative/info_34: New test. * gcc.target/aarch64/cpunative/native_cpu_34.c: New test. * gcc.target/aarch64/cpunative/info_35: New test. * gcc.target/aarch64/cpunative/native_cpu_35.c: New test. Co-authored-by: Richard Sandiford (cherry picked from commit 1ff85affe46623fe1a970de95887df22f4da9d16) Diff: --- gcc/config/aarch64/driver-aarch64.cc | 52 -- gcc/testsuite/gcc.target/aarch64/cpunative/info_34 | 18 gcc/testsuite/gcc.target/aarch64/cpunative/info_35 | 18 .../gcc.target/aarch64/cpunative/native_cpu_34.c | 12 + .../gcc.target/aarch64/cpunative/native_cpu_35.c | 13 ++ 5 files changed, 99 insertions(+), 14 deletions(-) diff --git a/gcc/config/aarch64/driver-aarch64.cc b/gcc/config/aarch64/driver-aarch64.cc index 8e318892b10a..ff4660f469cd 100644 --- a/gcc/config/aarch64/driver-aarch64.cc +++ b/gcc/config/aarch64/driver-aarch64.cc @@ -60,6 +60,7 @@ struct aarch64_core_data #define ALL_VARIANTS ((unsigned)-1) /* Default architecture to use if -mcpu=native did not detect a known CPU. */ #define DEFAULT_ARCH "8A" +#define DEFAULT_CPU "generic-armv8-a" #define AARCH64_CORE(CORE_NAME, CORE_IDENT, SCHED, ARCH, FLAGS, COSTS, IMP, PART, VARIANT) \ { CORE_NAME, #ARCH, IMP, PART, VARIANT, feature_deps::cpu_##CORE_IDENT }, @@ -106,6 +107,19 @@ get_arch_from_id (const char* id) return NULL; } +/* Return an aarch64_core_data for the cpu described + by ID, or NULL if ID describes something we don't know about. */ + +static const aarch64_core_data * +get_cpu_from_id (const char* name) +{ + for (unsigned i = 0; aarch64_cpu_data[i].name != NULL; i++) +if (strcmp (name, aarch64_cpu_data[i].name) == 0) + return &aarch64_cpu_data[i]; + + return NULL; +} + /* Check wether the CORE array is the same as the big.LITTLE BL_CORE. For an example CORE={0xd08, 0xd03} and BL_CORE=AARCH64_BIG_LITTLE (0xd08, 0xd03) will return true. */ @@ -394,18 +408,11 @@ host_detect_local_cpu (int argc, const char **argv) || variants[0] == aarch64_cpu_data[i].variant)) break; - if (aarch64_cpu_data[i].name == NULL) + if (arch) { - auto arch_info = get_arch_from_id (DEFAULT_ARCH); - - gcc_assert (arch_info); - - res = concat ("-march=", arch_info->name, NULL); - default_flags = arch_info->flags; - } - else if (arch) - { - const char *arch_id = aarch64_cpu_data[i].arch; + const char *arch_id = (aarch64_cpu_data[i].name +? aarch64_cpu_data[i].arch +: DEFAULT_ARCH); auto arch_info = get_arch_from_id (arch_id); /* We got some arch indentifier that's not in aarch64-arches.def? 
*/ @@ -415,12 +422,15 @@ host_detect_local_cpu (int argc, const char **argv) res = concat ("-march=", arch_info->name, NULL); default_flags = arch_info->flags; } - else + else if (cpu || aarch64_cpu_data[i].name) { - default_flags = aarch64_cpu_data[i].flags; + auto cpu_info = (aarch64_cpu_data[i].name + ? &aarch64_cpu_data[i] + : get_cpu_from_id (DEFAULT_CPU)); + default_flags = cpu_info->flags; res = concat ("-m", cpu ? "cpu" : "tune", "=", - aarch64_cpu_data[i].name, + cpu_info->name, NULL); } } @@ -440,6 +450,20 @@ host_detect_local_cpu (int argc, const char **argv) break; } } + + /* On big.LITTLE if we find any unknown CPUs we can still pick arch +features as the cores should have the same features. So just pick +the feature flags from any of the cpus.
[gcc r13-9352] AArch64: don't override march to assembler with mcpu if march is specified [PR110901]
https://gcc.gnu.org/g:57a9595f05efe2839a39e711c6cf3ce21ca1ff33 commit r13-9352-g57a9595f05efe2839a39e711c6cf3ce21ca1ff33 Author: Tamar Christina Date: Thu Jan 16 19:23:50 2025 + AArch64: don't override march to assembler with mcpu if march is specified [PR110901] When both -mcpu and -march are specified, the value of -march wins out. This is done correctly for the calls to cc1 and for the assembler directives we put out in assembly files. However in the call to as we don't do this and instead use the arch from the cpu. This leads to a situation that GCC cannot reliably be used to compile assembly files which don't have a .arch directive. This is quite common with .S files which use macros to selectively enable codepath based on what the preprocessor sees. The fix is to change MCPU_TO_MARCH_SPEC to not override the march if an march is already specified. gcc/ChangeLog: PR target/110901 * config/aarch64/aarch64.h (MCPU_TO_MARCH_SPEC): Don't override if march is set. gcc/testsuite/ChangeLog: PR target/110901 * gcc.target/aarch64/options_set_29.c: New test. (cherry picked from commit 773beeaafb0ea31bd4e308b64781731d64b571ce) Diff: --- gcc/config/aarch64/aarch64.h | 2 +- gcc/testsuite/gcc.target/aarch64/options_set_29.c | 11 +++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 996a261334a6..77e40c17e354 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -1233,7 +1233,7 @@ extern const char *host_detect_local_cpu (int argc, const char **argv); CONFIG_TUNE_SPEC #define MCPU_TO_MARCH_SPEC \ - " %{mcpu=*:-march=%:rewrite_mcpu(%{mcpu=*:%*})}" + "%{!march=*:%{mcpu=*:-march=%:rewrite_mcpu(%{mcpu=*:%*})}}" extern const char *aarch64_rewrite_mcpu (int argc, const char **argv); #define MCPU_TO_MARCH_SPEC_FUNCTIONS \ diff --git a/gcc/testsuite/gcc.target/aarch64/options_set_29.c b/gcc/testsuite/gcc.target/aarch64/options_set_29.c new file mode 100644 index ..0a68550951ce --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/options_set_29.c @@ -0,0 +1,11 @@ +/* { dg-do assemble } */ +/* { dg-additional-options "-march=armv8.2-a+sve -mcpu=cortex-a72 -O1 -w -###" } */ + +int main () +{ + return 0; +} + +/* { dg-message "-march=armv8-a\+crc" "no arch from cpu" { xfail *-*-* } 0 } */ +/* { dg-message "-march=armv8\\.2-a\\+sve" "using only sve" { target *-*-* } 0 } */ +/* { dg-excess-errors "" } */
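A hedged illustration of the failure mode (the file name and the exact macro test are assumptions; the flags mirror the options_set_29.c test above). Consider a .S file with no .arch directive that gates code on a feature macro visible to the preprocessor:

/* sve_bits.S - deliberately no .arch directive */
#ifdef __ARM_FEATURE_SVE
        ptrue   p0.b
#endif

When built as gcc -march=armv8.2-a+sve -mcpu=cortex-a72 -c sve_bits.S, the preprocessor sees SVE enabled and keeps the instruction, but previously the assembler was driven with the architecture implied by cortex-a72 (armv8-a+crc) and rejected it. With MCPU_TO_MARCH_SPEC fixed, the explicit -march=armv8.2-a+sve now reaches the assembler too.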
[gcc r14-11255] AArch64: don't override march to assembler with mcpu if march is specified [PR110901]
https://gcc.gnu.org/g:f8daec2ad9a20c31a98efb4602080e1e5d0c19fe commit r14-11255-gf8daec2ad9a20c31a98efb4602080e1e5d0c19fe Author: Tamar Christina Date: Thu Jan 16 19:23:50 2025 + AArch64: don't override march to assembler with mcpu if march is specified [PR110901] When both -mcpu and -march are specified, the value of -march wins out. This is done correctly for the calls to cc1 and for the assembler directives we put out in assembly files. However in the call to as we don't do this and instead use the arch from the cpu. This leads to a situation that GCC cannot reliably be used to compile assembly files which don't have a .arch directive. This is quite common with .S files which use macros to selectively enable codepath based on what the preprocessor sees. The fix is to change MCPU_TO_MARCH_SPEC to not override the march if an march is already specified. gcc/ChangeLog: PR target/110901 * config/aarch64/aarch64.h (MCPU_TO_MARCH_SPEC): Don't override if march is set. gcc/testsuite/ChangeLog: PR target/110901 * gcc.target/aarch64/options_set_29.c: New test. (cherry picked from commit 773beeaafb0ea31bd4e308b64781731d64b571ce) Diff: --- gcc/config/aarch64/aarch64.h | 2 +- gcc/testsuite/gcc.target/aarch64/options_set_29.c | 11 +++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 4fa1dfc79065..fe02a02a57b3 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -1448,7 +1448,7 @@ extern const char *host_detect_local_cpu (int argc, const char **argv); CONFIG_TUNE_SPEC #define MCPU_TO_MARCH_SPEC \ - " %{mcpu=*:-march=%:rewrite_mcpu(%{mcpu=*:%*})}" + "%{!march=*:%{mcpu=*:-march=%:rewrite_mcpu(%{mcpu=*:%*})}}" extern const char *aarch64_rewrite_mcpu (int argc, const char **argv); #define MCPU_TO_MARCH_SPEC_FUNCTIONS \ diff --git a/gcc/testsuite/gcc.target/aarch64/options_set_29.c b/gcc/testsuite/gcc.target/aarch64/options_set_29.c new file mode 100644 index ..0a68550951ce --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/options_set_29.c @@ -0,0 +1,11 @@ +/* { dg-do assemble } */ +/* { dg-additional-options "-march=armv8.2-a+sve -mcpu=cortex-a72 -O1 -w -###" } */ + +int main () +{ + return 0; +} + +/* { dg-message "-march=armv8-a\+crc" "no arch from cpu" { xfail *-*-* } 0 } */ +/* { dg-message "-march=armv8\\.2-a\\+sve" "using only sve" { target *-*-* } 0 } */ +/* { dg-excess-errors "" } */
[gcc r14-11254] AArch64: have -mcpu=native detect architecture extensions for unknown non-homogenous systems [PR113257]
https://gcc.gnu.org/g:7c6fde4bac6c20e0b04c3feb820abe5ce0e48d9b commit r14-11254-g7c6fde4bac6c20e0b04c3feb820abe5ce0e48d9b Author: Tamar Christina Date: Thu Jan 16 19:25:26 2025 + AArch64: have -mcpu=native detect architecture extensions for unknown non-homogenous systems [PR113257] in g:e91a17fe39c39e98cebe6e1cbc8064ee6846a3a7 we added the ability for -mcpu=native on unknown CPUs to still enable architecture extensions. This has worked great but was only added for homogenous systems. However the same thing works for big.LITTLE as in such system the cores must have the same extensions otherwise it doesn't fundamentally work. i.e. task migration from one core to the other wouldn't work. This extends the same handling to non-homogenous systems. gcc/ChangeLog: PR target/113257 * config/aarch64/driver-aarch64.cc (get_cpu_from_id, DEFAULT_CPU): New. (host_detect_local_cpu): Use it. gcc/testsuite/ChangeLog: PR target/113257 * gcc.target/aarch64/cpunative/info_34: New test. * gcc.target/aarch64/cpunative/native_cpu_34.c: New test. * gcc.target/aarch64/cpunative/info_35: New test. * gcc.target/aarch64/cpunative/native_cpu_35.c: New test. Co-authored-by: Richard Sandiford (cherry picked from commit 1ff85affe46623fe1a970de95887df22f4da9d16) Diff: --- gcc/config/aarch64/driver-aarch64.cc | 52 -- gcc/testsuite/gcc.target/aarch64/cpunative/info_34 | 18 gcc/testsuite/gcc.target/aarch64/cpunative/info_35 | 18 .../gcc.target/aarch64/cpunative/native_cpu_34.c | 12 + .../gcc.target/aarch64/cpunative/native_cpu_35.c | 13 ++ 5 files changed, 99 insertions(+), 14 deletions(-) diff --git a/gcc/config/aarch64/driver-aarch64.cc b/gcc/config/aarch64/driver-aarch64.cc index b620351e5720..fa0c57e60749 100644 --- a/gcc/config/aarch64/driver-aarch64.cc +++ b/gcc/config/aarch64/driver-aarch64.cc @@ -60,6 +60,7 @@ struct aarch64_core_data #define ALL_VARIANTS ((unsigned)-1) /* Default architecture to use if -mcpu=native did not detect a known CPU. */ #define DEFAULT_ARCH "8A" +#define DEFAULT_CPU "generic-armv8-a" #define AARCH64_CORE(CORE_NAME, CORE_IDENT, SCHED, ARCH, FLAGS, COSTS, IMP, PART, VARIANT) \ { CORE_NAME, #ARCH, IMP, PART, VARIANT, feature_deps::cpu_##CORE_IDENT }, @@ -106,6 +107,19 @@ get_arch_from_id (const char* id) return NULL; } +/* Return an aarch64_core_data for the cpu described + by ID, or NULL if ID describes something we don't know about. */ + +static const aarch64_core_data * +get_cpu_from_id (const char* name) +{ + for (unsigned i = 0; aarch64_cpu_data[i].name != NULL; i++) +if (strcmp (name, aarch64_cpu_data[i].name) == 0) + return &aarch64_cpu_data[i]; + + return NULL; +} + /* Check wether the CORE array is the same as the big.LITTLE BL_CORE. For an example CORE={0xd08, 0xd03} and BL_CORE=AARCH64_BIG_LITTLE (0xd08, 0xd03) will return true. */ @@ -399,18 +413,11 @@ host_detect_local_cpu (int argc, const char **argv) || variants[0] == aarch64_cpu_data[i].variant)) break; - if (aarch64_cpu_data[i].name == NULL) + if (arch) { - auto arch_info = get_arch_from_id (DEFAULT_ARCH); - - gcc_assert (arch_info); - - res = concat ("-march=", arch_info->name, NULL); - default_flags = arch_info->flags; - } - else if (arch) - { - const char *arch_id = aarch64_cpu_data[i].arch; + const char *arch_id = (aarch64_cpu_data[i].name +? aarch64_cpu_data[i].arch +: DEFAULT_ARCH); auto arch_info = get_arch_from_id (arch_id); /* We got some arch indentifier that's not in aarch64-arches.def? 
*/ @@ -420,12 +427,15 @@ host_detect_local_cpu (int argc, const char **argv) res = concat ("-march=", arch_info->name, NULL); default_flags = arch_info->flags; } - else + else if (cpu || aarch64_cpu_data[i].name) { - default_flags = aarch64_cpu_data[i].flags; + auto cpu_info = (aarch64_cpu_data[i].name + ? &aarch64_cpu_data[i] + : get_cpu_from_id (DEFAULT_CPU)); + default_flags = cpu_info->flags; res = concat ("-m", cpu ? "cpu" : "tune", "=", - aarch64_cpu_data[i].name, + cpu_info->name, NULL); } } @@ -445,6 +455,20 @@ host_detect_local_cpu (int argc, const char **argv) break; } } + + /* On big.LITTLE if we find any unknown CPUs we can still pick arch +features as the cores should have the same features. So just pick +the feature flags from any of the cpus
[gcc r15-7095] middle-end: use ncopies both when registering and reading masks [PR118273]
https://gcc.gnu.org/g:1dd79f44dfb64b441f3d6c64e7f909d73441bd05 commit r15-7095-g1dd79f44dfb64b441f3d6c64e7f909d73441bd05 Author: Tamar Christina Date: Tue Jan 21 10:29:08 2025 + middle-end: use ncopies both when registering and reading masks [PR118273] When registering masks for SIMD clone we end up using nmasks instead of nvectors where nmasks seems to compute the number of input masks required for the call given the current simdlen. This is however wrong as vect_record_loop_mask wants to know how many masks you want to create from the given vectype. i.e. which level of rgroups to create. This ends up mismatching with vect_get_loop_mask which uses nvectors and if the return type is narrower than the input types there will be a mismatch which causes us to try to read from the given rgroup. It only happens to work if the function had an additional argument that's wider or if all elements and return types are the same size. This fixes it by using nvectors during registration as well, which has already taken into account SLP and VF. gcc/ChangeLog: PR middle-end/118273 * tree-vect-stmts.cc (vectorizable_simd_clone_call): Use nvectors when doing mask registrations. gcc/testsuite/ChangeLog: PR middle-end/118273 * gcc.target/aarch64/vect-simd-clone-4.c: New test. Diff: --- gcc/testsuite/gcc.target/aarch64/vect-simd-clone-4.c | 15 +++ gcc/tree-vect-stmts.cc | 11 +++ 2 files changed, 18 insertions(+), 8 deletions(-) diff --git a/gcc/testsuite/gcc.target/aarch64/vect-simd-clone-4.c b/gcc/testsuite/gcc.target/aarch64/vect-simd-clone-4.c new file mode 100644 index ..9b52af703933 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/vect-simd-clone-4.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-std=c99" } */ +/* { dg-additional-options "-O3 -march=armv8-a" } */ + +#pragma GCC target ("+sve") + +extern char __attribute__ ((simd, const)) fn3 (short); +void test_fn3 (float *a, float *b, double *c, int n) +{ + for (int i = 0; i < n; ++i) +a[i] = fn3 (c[i]); +} + +/* { dg-final { scan-assembler {\s+_ZGVsMxv_fn3\n} } } */ + diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 833029fcb001..21fb5cf5bd47 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -4561,14 +4561,9 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info, case SIMD_CLONE_ARG_TYPE_MASK: if (loop_vinfo && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) - { - unsigned nmasks - = exact_div (ncopies * bestn->simdclone->simdlen, -TYPE_VECTOR_SUBPARTS (vectype)).to_constant (); - vect_record_loop_mask (loop_vinfo, -&LOOP_VINFO_MASKS (loop_vinfo), -nmasks, vectype, op); - } + vect_record_loop_mask (loop_vinfo, + &LOOP_VINFO_MASKS (loop_vinfo), + ncopies, vectype, op); break; }