[gcc r15-3477] docs: double mention of armv9-a.
https://gcc.gnu.org/g:240be78237c6d70e0b30ed187c559e359ce81557 commit r15-3477-g240be78237c6d70e0b30ed187c559e359ce81557 Author: Tamar Christina Date: Thu Sep 5 10:35:18 2024 +0100 docs: double mention of armv9-a. The list of available architecture for Arm is incorrectly listing armv9-a twice. This removes the duplicate armv9-a enumeration from the part of the list having M-profile targets. gcc/ChangeLog: * doc/invoke.texi: Remove duplicate armv9-a mention. Diff: --- gcc/doc/invoke.texi | 1 - 1 file changed, 1 deletion(-) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 43afb0984e5..193db761d64 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -23025,7 +23025,6 @@ Permissible names are: @samp{armv7-m}, @samp{armv7e-m}, @samp{armv8-m.base}, @samp{armv8-m.main}, @samp{armv8.1-m.main}, -@samp{armv9-a}, @samp{iwmmxt} and @samp{iwmmxt2}. Additionally, the following architectures, which lack support for the
[gcc r15-3478] testsuite: remove -fwrapv from signbit-5.c
https://gcc.gnu.org/g:67eaf67360e434dd5969e1c66f043e3c751f9f52 commit r15-3478-g67eaf67360e434dd5969e1c66f043e3c751f9f52 Author: Tamar Christina Date: Thu Sep 5 10:36:02 2024 +0100 testsuite: remove -fwrapv from signbit-5.c The meaning of the testcase was changed by passing it -fwrapv. The reason for the test failures on some platform was because the test was testing some implementation defined behavior wrt INT_MIN in generic code. Instead of using -fwrapv this just removes the border case from the test so all the values now have a defined semantic. It still relies on the handling of shifting a negative value right, but that wasn't changed with -fwrapv anyway. The -fwrapv case is being handled already by other testcases. gcc/testsuite/ChangeLog: * gcc.dg/signbit-5.c: Remove -fwrapv and change INT_MIN to INT_MIN+1. Diff: --- gcc/testsuite/gcc.dg/signbit-5.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/gcc/testsuite/gcc.dg/signbit-5.c b/gcc/testsuite/gcc.dg/signbit-5.c index 57e29e3ca63..2601582ed4e 100644 --- a/gcc/testsuite/gcc.dg/signbit-5.c +++ b/gcc/testsuite/gcc.dg/signbit-5.c @@ -1,5 +1,5 @@ /* { dg-do run } */ -/* { dg-options "-O3 -fwrapv" } */ +/* { dg-options "-O3" } */ /* This test does not work when the truth type does not match vector type. */ /* { dg-additional-options "-march=armv8-a" { target aarch64_sve } } */ @@ -42,8 +42,8 @@ int main () TYPE a[N]; TYPE b[N]; - a[0] = INT_MIN; - b[0] = INT_MIN; + a[0] = INT_MIN+1; + b[0] = INT_MIN+1; for (int i = 1; i < N; ++i) {
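For context, a minimal sketch of the sign-bit idiom the test exercises (the function name sign_mask is illustrative; this is not the actual signbit-5.c body). The arithmetic right shift of a negative value is implementation-defined but unaffected by -fwrapv, which is why the commit can drop -fwrapv and simply move the boundary value from INT_MIN to INT_MIN+1 so every remaining input has defined semantics.

#include <limits.h>

/* Illustrative only: -1 for negative x, 0 otherwise, via an arithmetic
   right shift of the sign bit.  This part of the idiom did not depend
   on -fwrapv.  */
int sign_mask (int x)
{
  return x >> (sizeof (int) * CHAR_BIT - 1);
}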
[gcc r15-3479] middle-end: have vect_recog_cond_store_pattern use pattern statement for cond if available
https://gcc.gnu.org/g:a50f54c0d06139d791b875e09471f2fc03af5b04 commit r15-3479-ga50f54c0d06139d791b875e09471f2fc03af5b04 Author: Tamar Christina Date: Thu Sep 5 10:36:55 2024 +0100 middle-end: have vect_recog_cond_store_pattern use pattern statement for cond if available When vectorizing a conditional operation we rely on the bool_recog pattern to hit and convert the bool of the operand to a valid mask. However we are currently not using the converted operand as this is in a pattern statement. This change updates it to look at the actual statement to be vectorized so we pick up the pattern. Note that there are no tests here since vectorization will fail until we correctly lower all boolean conditionals early. Tests for these are in the next patch, namely vect-conditional_store_5.c and vect-conditional_store_6.c. And the existing vect-conditional_store_[1-4].c checks that the other cases are still handled correctly. gcc/ChangeLog: * tree-vect-patterns.cc (vect_recog_cond_store_pattern): Use pattern statement. Diff: --- gcc/tree-vect-patterns.cc | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index 3162250bbdd..f7c3c623ea4 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -6670,7 +6670,15 @@ vect_recog_cond_store_pattern (vec_info *vinfo, if (TREE_CODE (st_rhs) != SSA_NAME) return NULL; - gassign *cond_stmt = dyn_cast (SSA_NAME_DEF_STMT (st_rhs)); + auto cond_vinfo = vinfo->lookup_def (st_rhs); + + /* If the condition isn't part of the loop then bool recog wouldn't have seen + it and so this transformation may not be valid. */ + if (!cond_vinfo) +return NULL; + + cond_vinfo = vect_stmt_to_vectorize (cond_vinfo); + gassign *cond_stmt = dyn_cast (STMT_VINFO_STMT (cond_vinfo)); if (!cond_stmt || gimple_assign_rhs_code (cond_stmt) != COND_EXPR) return NULL;
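A hypothetical loop of the shape this pattern handles (not one of the vect-conditional_store_5/6.c tests referenced above): the store is guarded by a boolean comparison, which vect_recog_bool_pattern rewrites into a proper vector mask in a pattern statement; the fix makes vect_recog_cond_store_pattern consult that pattern statement so it sees the converted operand.

/* Sketch only: the COND_EXPR feeding the store uses a boolean produced
   from a comparison, so the mask conversion lives in a pattern
   statement rather than on the original definition.  */
void foo (int *restrict a, int *restrict b, int *restrict c, int n)
{
  for (int i = 0; i < n; i++)
    {
      int res = c[i];
      if (a[i] > b[i])          /* boolean condition needing mask conversion */
        res = b[i];
      c[i] = res;               /* conditional store, MASK_STORE when vectorized */
    }
}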
[gcc r15-3518] middle-end: check that the lhs of a COND_EXPR is an SSA_NAME in cond_store recognition [PR116628]
https://gcc.gnu.org/g:2c4438d39156493b5b382eb48b1f884ca5ab7ed4 commit r15-3518-g2c4438d39156493b5b382eb48b1f884ca5ab7ed4 Author: Tamar Christina Date: Fri Sep 6 14:05:43 2024 +0100 middle-end: check that the lhs of a COND_EXPR is an SSA_NAME in cond_store recognition [PR116628] Because the vect_recog_bool_pattern can at the moment still transition out of GIMPLE and back into GENERIC the vect_recog_cond_store_pattern can end up using an expression as a mask rather than an SSA_NAME. This adds an explicit check that we have a mask and not an expression. gcc/ChangeLog: PR tree-optimization/116628 * tree-vect-patterns.cc (vect_recog_cond_store_pattern): Add SSA_NAME check on expression. gcc/testsuite/ChangeLog: PR tree-optimization/116628 * gcc.dg/vect/pr116628.c: New test. Diff: --- gcc/testsuite/gcc.dg/vect/pr116628.c | 14 ++ gcc/tree-vect-patterns.cc| 3 +++ 2 files changed, 17 insertions(+) diff --git a/gcc/testsuite/gcc.dg/vect/pr116628.c b/gcc/testsuite/gcc.dg/vect/pr116628.c new file mode 100644 index 000..4068c657ac5 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr116628.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_float } */ +/* { dg-require-effective-target vect_masked_store } */ +/* { dg-additional-options "-Ofast -march=armv9-a" { target aarch64-*-* } } */ + +typedef float c; +c a[2000], b[0]; +void d() { + for (int e = 0; e < 2000; e++) +if (b[e]) + a[e] = b[e]; +} + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index f7c3c623ea4..3a0d4cb7092 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -6685,6 +6685,9 @@ vect_recog_cond_store_pattern (vec_info *vinfo, /* Check if the else value matches the original loaded one. */ bool invert = false; tree cmp_ls = gimple_arg (cond_stmt, 0); + if (TREE_CODE (cmp_ls) != SSA_NAME) +return NULL; + tree cond_arg1 = gimple_arg (cond_stmt, 1); tree cond_arg2 = gimple_arg (cond_stmt, 2);
[gcc r15-1808] ivopts: fix wide_int_constant_multiple_p when VAL and DIV are 0. [PR114932]
https://gcc.gnu.org/g:25127123100f04c2d5d70c6933a5f5aedcd69c40 commit r15-1808-g25127123100f04c2d5d70c6933a5f5aedcd69c40 Author: Tamar Christina Date: Wed Jul 3 09:30:28 2024 +0100 ivopts: fix wide_int_constant_multiple_p when VAL and DIV are 0. [PR114932] wide_int_constant_multiple_p tries to check if for two tree expressions a and b that there is a multiplier which makes a == b * c. This code however seems to think that there's no c where a=0 and b=0 are equal which is of course wrong. This fixes it and also fixes the comment. gcc/ChangeLog: PR tree-optimization/114932 * tree-affine.cc (wide_int_constant_multiple_p): Support 0 and 0 being multiples. Diff: --- gcc/tree-affine.cc | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/gcc/tree-affine.cc b/gcc/tree-affine.cc index d6309c43903..76117aa4fd6 100644 --- a/gcc/tree-affine.cc +++ b/gcc/tree-affine.cc @@ -880,11 +880,11 @@ free_affine_expand_cache (hash_map **cache) *cache = NULL; } -/* If VAL != CST * DIV for any constant CST, returns false. - Otherwise, if *MULT_SET is true, additionally compares CST and MULT, - and if they are different, returns false. Finally, if neither of these - two cases occur, true is returned, and CST is stored to MULT and MULT_SET - is set to true. */ +/* If VAL == CST * DIV for any constant CST, returns true. + and if *MULT_SET is true, additionally compares CST and MULT + and if they are different, returns false. If true is returned, CST is + stored to MULT and MULT_SET is set to true unless VAL and DIV are both zero + in which case neither MULT nor MULT_SET are updated. */ static bool wide_int_constant_multiple_p (const poly_widest_int &val, @@ -895,6 +895,9 @@ wide_int_constant_multiple_p (const poly_widest_int &val, if (known_eq (val, 0)) { + if (known_eq (div, 0)) + return true; + if (*mult_set && maybe_ne (*mult, 0)) return false; *mult_set = true;
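A toy scalar model of the contract described in the new comment (toy_constant_multiple_p is an illustrative name, not GCC code): VAL == CST * DIV for some constant CST, with the 0/0 case now answering true and leaving MULT untouched, as the patch documents.

#include <stdbool.h>
#include <stdint.h>

/* Toy version of the documented contract: true iff there exists a
   constant cst with val == cst * div.  For val == 0 and div == 0 any
   cst works, so return true without updating *mult.  */
static bool toy_constant_multiple_p (int64_t val, int64_t div, int64_t *mult)
{
  if (val == 0)
    {
      if (div == 0)
        return true;            /* the case fixed by this patch */
      *mult = 0;                /* 0 == 0 * div */
      return true;
    }
  if (div == 0 || val % div != 0)
    return false;
  *mult = val / div;
  return true;
}

int main (void)
{
  int64_t m = 42;
  return !(toy_constant_multiple_p (0, 0, &m) && m == 42   /* 0/0: true, m untouched */
           && toy_constant_multiple_p (12, 3, &m) && m == 4);
}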
[gcc r15-1809] ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]
https://gcc.gnu.org/g:735edbf1e2479fa2323a2b4a9714fae1a0925f74 commit r15-1809-g735edbf1e2479fa2323a2b4a9714fae1a0925f74 Author: Tamar Christina Date: Wed Jul 3 09:31:09 2024 +0100 ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932] The current implementation of constant_multiple_of is doing a more limited version of aff_combination_constant_multiple_p. The only non-debug usage of constant_multiple_of will proceed with the values as affine trees. There is scope for further optimization here, namely I believe that if constant_multiple_of returns the aff_tree after the conversion then get_computation_aff_1 can use it instead of manually creating the aff_tree. However I think it makes sense to first commit this smaller change and then incrementally change things. gcc/ChangeLog: PR tree-optimization/114932 * tree-ssa-loop-ivopts.cc (constant_multiple_of): Use aff_combination_constant_multiple_p instead. Diff: --- gcc/tree-ssa-loop-ivopts.cc | 66 ++--- 1 file changed, 8 insertions(+), 58 deletions(-) diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc index 7cae5bdefea..c3218a3e8ee 100644 --- a/gcc/tree-ssa-loop-ivopts.cc +++ b/gcc/tree-ssa-loop-ivopts.cc @@ -2146,65 +2146,15 @@ idx_record_use (tree base, tree *idx, static bool constant_multiple_of (tree top, tree bot, widest_int *mul) { - tree mby; - enum tree_code code; - unsigned precision = TYPE_PRECISION (TREE_TYPE (top)); - widest_int res, p0, p1; - - STRIP_NOPS (top); - STRIP_NOPS (bot); - - if (operand_equal_p (top, bot, 0)) -{ - *mul = 1; - return true; -} - - code = TREE_CODE (top); - switch (code) -{ -case MULT_EXPR: - mby = TREE_OPERAND (top, 1); - if (TREE_CODE (mby) != INTEGER_CST) - return false; - - if (!constant_multiple_of (TREE_OPERAND (top, 0), bot, &res)) - return false; - - *mul = wi::sext (res * wi::to_widest (mby), precision); - return true; - -case PLUS_EXPR: -case MINUS_EXPR: - if (!constant_multiple_of (TREE_OPERAND (top, 0), bot, &p0) - || !constant_multiple_of (TREE_OPERAND (top, 1), bot, &p1)) - return false; - - if (code == MINUS_EXPR) - p1 = -p1; - *mul = wi::sext (p0 + p1, precision); - return true; - -case INTEGER_CST: - if (TREE_CODE (bot) != INTEGER_CST) - return false; - - p0 = widest_int::from (wi::to_wide (top), SIGNED); - p1 = widest_int::from (wi::to_wide (bot), SIGNED); - if (p1 == 0) - return false; - *mul = wi::sext (wi::divmod_trunc (p0, p1, SIGNED, &res), precision); - return res == 0; - -default: - if (POLY_INT_CST_P (top) - && POLY_INT_CST_P (bot) - && constant_multiple_p (wi::to_poly_widest (top), - wi::to_poly_widest (bot), mul)) - return true; + aff_tree aff_top, aff_bot; + tree_to_aff_combination (top, TREE_TYPE (top), &aff_top); + tree_to_aff_combination (bot, TREE_TYPE (bot), &aff_bot); + poly_widest_int poly_mul; + if (aff_combination_constant_multiple_p (&aff_top, &aff_bot, &poly_mul) + && poly_mul.is_constant (mul)) +return true; - return false; -} + return false; } /* Return true if memory reference REF with step STEP may be unaligned. */
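A worked example of the relation the affine check answers (the values are illustrative, not from the commit): both TOP and BOT become affine combinations, and aff_combination_constant_multiple_p succeeds when one constant scales every coefficient of BOT onto the matching coefficient of TOP.

#include <stdio.h>

/* Illustrative only: TOP = 8*i + 12, BOT = 2*i + 3.  Every coefficient
   of TOP is 4 times the matching coefficient of BOT, so MUL = 4.  */
int main (void)
{
  int top[2] = { 8, 12 }, bot[2] = { 2, 3 };
  int mul = top[0] / bot[0];
  int ok = (top[0] == mul * bot[0] && top[1] == mul * bot[1]);
  printf ("mul=%d ok=%d\n", mul, ok);   /* prints mul=4 ok=1 */
  return 0;
}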
[gcc r15-1841] c++ frontend: check for missing condition for novector [PR115623]
https://gcc.gnu.org/g:84acbfbecbdbc3fb2a395bd97e338b2b26fad374 commit r15-1841-g84acbfbecbdbc3fb2a395bd97e338b2b26fad374 Author: Tamar Christina Date: Thu Jul 4 11:01:55 2024 +0100 c++ frontend: check for missing condition for novector [PR115623] It looks like I forgot to check in the C++ frontend if a condition exist for the loop being adorned with novector. This causes a segfault because cond isn't expected to be null. This fixes it by issuing ignoring the pragma when there's no loop condition the same way we do in the C frontend. gcc/cp/ChangeLog: PR c++/115623 * semantics.cc (finish_for_cond): Add check for C++ cond. gcc/testsuite/ChangeLog: PR c++/115623 * g++.dg/vect/vect-novector-pragma_2.cc: New test. Diff: --- gcc/cp/semantics.cc | 2 +- gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc | 10 ++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc index 12d79bdbb3f..cd3df13772d 100644 --- a/gcc/cp/semantics.cc +++ b/gcc/cp/semantics.cc @@ -1510,7 +1510,7 @@ finish_for_cond (tree cond, tree for_stmt, bool ivdep, tree unroll, build_int_cst (integer_type_node, annot_expr_unroll_kind), unroll); - if (novector && cond != error_mark_node) + if (novector && cond && cond != error_mark_node) FOR_COND (for_stmt) = build3 (ANNOTATE_EXPR, TREE_TYPE (FOR_COND (for_stmt)), FOR_COND (for_stmt), diff --git a/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc b/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc new file mode 100644 index 000..d2a8eee8d71 --- /dev/null +++ b/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc @@ -0,0 +1,10 @@ +/* { dg-do compile } */ + +void f (char *a, int i) +{ +#pragma GCC novector + for (;;i++) +a[i] *= 2; +} + +
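For contrast with the new test, a sketch of the ordinary case (assumed typical usage, not part of the commit): when the for loop does have a condition, finish_for_cond wraps it in an ANNOTATE_EXPR carrying the no-vector request; the fix simply skips that wrapping when there is no condition to annotate, matching the C front end.

/* Hypothetical well-formed counterpart of the test above: a loop
   condition exists, so the novector annotation is attached to it.  */
void g (char *a, int n)
{
#pragma GCC novector
  for (int i = 0; i < n; i++)
    a[i] *= 2;
}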
[gcc r14-10378] c++ frontend: check for missing condition for novector [PR115623]
https://gcc.gnu.org/g:1742b699c31e3ac4dadbedb6036ee2498b569259 commit r14-10378-g1742b699c31e3ac4dadbedb6036ee2498b569259 Author: Tamar Christina Date: Thu Jul 4 11:01:55 2024 +0100 c++ frontend: check for missing condition for novector [PR115623] It looks like I forgot to check in the C++ frontend if a condition exist for the loop being adorned with novector. This causes a segfault because cond isn't expected to be null. This fixes it by issuing ignoring the pragma when there's no loop condition the same way we do in the C frontend. gcc/cp/ChangeLog: PR c++/115623 * semantics.cc (finish_for_cond): Add check for C++ cond. gcc/testsuite/ChangeLog: PR c++/115623 * g++.dg/vect/vect-novector-pragma_2.cc: New test. (cherry picked from commit 84acbfbecbdbc3fb2a395bd97e338b2b26fad374) Diff: --- gcc/cp/semantics.cc | 2 +- gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc | 10 ++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc index b18fc7c61be..ec741c0b203 100644 --- a/gcc/cp/semantics.cc +++ b/gcc/cp/semantics.cc @@ -1501,7 +1501,7 @@ finish_for_cond (tree cond, tree for_stmt, bool ivdep, tree unroll, build_int_cst (integer_type_node, annot_expr_unroll_kind), unroll); - if (novector && cond != error_mark_node) + if (novector && cond && cond != error_mark_node) FOR_COND (for_stmt) = build3 (ANNOTATE_EXPR, TREE_TYPE (FOR_COND (for_stmt)), FOR_COND (for_stmt), diff --git a/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc b/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc new file mode 100644 index 000..d2a8eee8d71 --- /dev/null +++ b/gcc/testsuite/g++.dg/vect/vect-novector-pragma_2.cc @@ -0,0 +1,10 @@ +/* { dg-do compile } */ + +void f (char *a, int i) +{ +#pragma GCC novector + for (;;i++) +a[i] *= 2; +} + +
[gcc r15-1842] testsuite: Update test for PR115537 to use SVE.
https://gcc.gnu.org/g:adcfb4fb8fb20a911c795312ff5f5284dba05275 commit r15-1842-gadcfb4fb8fb20a911c795312ff5f5284dba05275 Author: Tamar Christina Date: Thu Jul 4 11:19:20 2024 +0100 testsuite: Update test for PR115537 to use SVE . The PR was about SVE codegen, the testcase accidentally used neoverse-n1 instead of neoverse-v1 as was the original report. This updates the tool options. gcc/testsuite/ChangeLog: PR tree-optimization/115537 * gcc.dg/vect/pr115537.c: Update flag from neoverse-n1 to neoverse-v1. Diff: --- gcc/testsuite/gcc.dg/vect/pr115537.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr115537.c b/gcc/testsuite/gcc.dg/vect/pr115537.c index 99ed467feb8..9f7347a5f2a 100644 --- a/gcc/testsuite/gcc.dg/vect/pr115537.c +++ b/gcc/testsuite/gcc.dg/vect/pr115537.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-additional-options "-mcpu=neoverse-n1" { target aarch64*-*-* } } */ +/* { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } } */ char *a; int b;
[gcc r15-1855] AArch64: remove aarch64_simd_vec_unpack_lo_
https://gcc.gnu.org/g:6ff698106644af39da9e0eda51974fdcd111280d commit r15-1855-g6ff698106644af39da9e0eda51974fdcd111280d Author: Tamar Christina Date: Fri Jul 5 12:09:21 2024 +0100 AArch64: remove aarch64_simd_vec_unpack_lo_ The fix for PR18127 reworked the uxtl to zip optimization. In doing so it undid the changes in aarch64_simd_vec_unpack_lo_ and this now no longer matches aarch64_simd_vec_unpack_hi_. It still works because the RTL generated by aarch64_simd_vec_unpack_lo_ overlaps with the general zero extend RTL and so because that one is listed before the lo pattern recog picks it instead. This removes aarch64_simd_vec_unpack_lo_. gcc/ChangeLog: * config/aarch64/aarch64-simd.md (aarch64_simd_vec_unpack_lo_): Remove. (vec_unpack_lo__lo_" - [(set (match_operand: 0 "register_operand" "=w") -(ANY_EXTEND: (vec_select: - (match_operand:VQW 1 "register_operand" "w") - (match_operand:VQW 2 "vect_par_cnst_lo_half" "") - )))] - "TARGET_SIMD" - "xtl\t%0., %1." - [(set_attr "type" "neon_shift_imm_long")] -) - (define_insn_and_split "aarch64_simd_vec_unpack_hi_" [(set (match_operand: 0 "register_operand" "=w") (ANY_EXTEND: (vec_select: @@ -1952,14 +1941,11 @@ ) (define_expand "vec_unpack_lo_" - [(match_operand: 0 "register_operand") - (ANY_EXTEND: (match_operand:VQW 1 "register_operand"))] + [(set (match_operand: 0 "register_operand") + (ANY_EXTEND: (match_operand:VQW 1 "register_operand")))] "TARGET_SIMD" { -rtx p = aarch64_simd_vect_par_cnst_half (mode, , false); -emit_insn (gen_aarch64_simd_vec_unpack_lo_ (operands[0], - operands[1], p)); -DONE; +operands[1] = lowpart_subreg (mode, operands[1], mode); } ) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 6b106a72e49..469eb938953 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -23188,7 +23188,8 @@ aarch64_gen_shareable_zero (machine_mode mode) to split without that restriction and instead recombine shared zeros if they turn out not to be worthwhile. This would allow splits in single-block functions and would also cope more naturally with - rematerialization. */ + rematerialization. The downside of not doing this is that we lose the + optimizations for vector epilogues as well. */ bool aarch64_split_simd_shift_p (rtx_insn *insn)
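A generic loop shape (illustrative, not from the commit; the function name widen is assumed) that goes through the vec_unpack_lo/vec_unpack_hi expanders touched above: each narrow vector half is zero-extended to the wider element type, which after the PR18127 rework is emitted via the generic zero-extend/zip RTL rather than the removed lo pattern.

/* Sketch only: widening unsigned char to unsigned short makes the
   vectorizer emit VEC_UNPACK_LO_EXPR / VEC_UNPACK_HI_EXPR.  */
void widen (unsigned short *restrict dst, const unsigned char *restrict src,
            int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
}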
[gcc r15-1856] AArch64: lower 2 reg TBL permutes with one zero register to 1 reg TBL.
https://gcc.gnu.org/g:97fcfeac3dcc433b792711fd840b92fa3e860733 commit r15-1856-g97fcfeac3dcc433b792711fd840b92fa3e860733 Author: Tamar Christina Date: Fri Jul 5 12:10:39 2024 +0100 AArch64: lower 2 reg TBL permutes with one zero register to 1 reg TBL. When a two reg TBL is performed with one operand being a zero vector we can instead use a single reg TBL and map the indices for accessing the zero vector to an out of range constant. On AArch64 out of range indices into a TBL have a defined semantics of setting the element to zero. Many uArches have a slower 2-reg TBL than 1-reg TBL. Before this change we had: typedef unsigned int v4si __attribute__ ((vector_size (16))); v4si f1 (v4si a) { v4si zeros = {0,0,0,0}; return __builtin_shufflevector (a, zeros, 0, 5, 1, 6); } which generates: f1: mov v30.16b, v0.16b moviv31.4s, 0 adrpx0, .LC0 ldr q0, [x0, #:lo12:.LC0] tbl v0.16b, {v30.16b - v31.16b}, v0.16b ret .LC0: .byte 0 .byte 1 .byte 2 .byte 3 .byte 20 .byte 21 .byte 22 .byte 23 .byte 4 .byte 5 .byte 6 .byte 7 .byte 24 .byte 25 .byte 26 .byte 27 and with the patch: f1: adrpx0, .LC0 ldr q31, [x0, #:lo12:.LC0] tbl v0.16b, {v0.16b}, v31.16b ret .LC0: .byte 0 .byte 1 .byte 2 .byte 3 .byte -1 .byte -1 .byte -1 .byte -1 .byte 4 .byte 5 .byte 6 .byte 7 .byte -1 .byte -1 .byte -1 .byte -1 This sequence is generated often by openmp and aside from the strict performance impact of this change, it also gives better register allocation as we no longer have the consecutive register limitation. gcc/ChangeLog: * config/aarch64/aarch64.cc (struct expand_vec_perm_d): Add zero_op0_p and zero_op_p1. (aarch64_evpc_tbl): Implement register value remapping. (aarch64_vectorize_vec_perm_const): Detect if operand is a zero dup before it's forced to a reg. gcc/testsuite/ChangeLog: * gcc.target/aarch64/tbl_with_zero_1.c: New test. * gcc.target/aarch64/tbl_with_zero_2.c: New test. Diff: --- gcc/config/aarch64/aarch64.cc | 40 ++ gcc/testsuite/gcc.target/aarch64/tbl_with_zero_1.c | 40 ++ gcc/testsuite/gcc.target/aarch64/tbl_with_zero_2.c | 20 +++ 3 files changed, 94 insertions(+), 6 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 469eb938953..7f0cc47d0f0 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -25413,6 +25413,7 @@ struct expand_vec_perm_d unsigned int vec_flags; unsigned int op_vec_flags; bool one_vector_p; + bool zero_op0_p, zero_op1_p; bool testing_p; }; @@ -25909,13 +25910,38 @@ aarch64_evpc_tbl (struct expand_vec_perm_d *d) /* to_constant is safe since this routine is specific to Advanced SIMD vectors. */ unsigned int nelt = d->perm.length ().to_constant (); + + /* If one register is the constant vector of 0 then we only need + a one reg TBL and we map any accesses to the vector of 0 to -1. We can't + do this earlier since vec_perm_indices clamps elements to within range so + we can only do it during codegen. */ + if (d->zero_op0_p) +d->op0 = d->op1; + else if (d->zero_op1_p) +d->op1 = d->op0; + for (unsigned int i = 0; i < nelt; ++i) -/* If big-endian and two vectors we end up with a weird mixed-endian - mode on NEON. Reverse the index within each word but not the word - itself. to_constant is safe because we checked is_constant above. */ -rperm[i] = GEN_INT (BYTES_BIG_ENDIAN - ? d->perm[i].to_constant () ^ (nelt - 1) - : d->perm[i].to_constant ()); +{ + auto val = d->perm[i].to_constant (); + + /* If we're selecting from a 0 vector, we can just use an out of range +index instead. 
*/ + if ((d->zero_op0_p && val < nelt) || (d->zero_op1_p && val >= nelt)) + rperm[i] = constm1_rtx; + else + { + /* If we are remapping a zero register as the first parameter we need +to adjust the indices of the non-zero register. */ + if (d->zero_op0_p) + val = val % nelt; + + /* If big-endian and two vectors we end up with a
[gcc r15-2099] middle-end: fix 0 offset creation and folding [PR115936]
https://gcc.gnu.org/g:0135a90de5a99b51001b6152d8b548151ebfa1c3 commit r15-2099-g0135a90de5a99b51001b6152d8b548151ebfa1c3 Author: Tamar Christina Date: Wed Jul 17 16:22:14 2024 +0100 middle-end: fix 0 offset creation and folding [PR115936] As shown in PR115936 SCEV and IVOPTS create an invalidate IV when the IV is a pointer type: ivtmp.39_65 = ivtmp.39_59 + 0B; where the IVs are DI mode and the offset is a pointer. This comes from this weird candidate: Candidate 8: Var befor: ivtmp.39_59 Var after: ivtmp.39_65 Incr POS: before exit test IV struct: Type: sizetype Base: 0 Step: 0B Biv:N Overflowness wrto loop niter: No-overflow This IV was always created just ended up not being used. This is created by SCEV. simple_iv_with_niters in the case where no CHREC is found creates an IV with base == ev, offset == 0; however in this case EV is a POINTER_PLUS_EXPR and so the type is a pointer. it ends up creating an unusable expression. gcc/ChangeLog: PR tree-optimization/115936 * tree-scalar-evolution.cc (simple_iv_with_niters): Use sizetype for pointers. Diff: --- gcc/tree-scalar-evolution.cc | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/gcc/tree-scalar-evolution.cc b/gcc/tree-scalar-evolution.cc index 5aa95a2497a3..abb2bad77737 100644 --- a/gcc/tree-scalar-evolution.cc +++ b/gcc/tree-scalar-evolution.cc @@ -3243,7 +3243,11 @@ simple_iv_with_niters (class loop *wrto_loop, class loop *use_loop, if (tree_does_not_contain_chrecs (ev)) { iv->base = ev; - iv->step = build_int_cst (TREE_TYPE (ev), 0); + tree ev_type = TREE_TYPE (ev); + if (POINTER_TYPE_P (ev_type)) + ev_type = sizetype; + + iv->step = build_int_cst (ev_type, 0); iv->no_overflow = true; return true; }
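A generic pointer-walking loop (illustrative; the PR testcase itself is not quoted in the commit) of the kind where an evolution is pointer-typed: when simple_iv_with_niters falls back to base == ev, step == 0, building the zero step in the pointer type produces the invalid "pointer + 0B" candidate shown above, so the step is now built in sizetype instead.

/* Illustrative only: p's evolution is a pointer expression, so any zero
   step attached to it must be sizetype for POINTER_PLUS_EXPR folding to
   stay well-formed.  */
void copy_until (int *restrict dst, const int *restrict src, const int *end)
{
  for (const int *p = src; p != end; p++)
    *dst++ = *p;
}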
[gcc r15-2191] middle-end: Implement conditional store vectorizer pattern [PR115531]
https://gcc.gnu.org/g:af792f0226e479b165a49de5e8f9e1d16a4b26c0 commit r15-2191-gaf792f0226e479b165a49de5e8f9e1d16a4b26c0 Author: Tamar Christina Date: Mon Jul 22 10:26:14 2024 +0100 middle-end: Implement conditonal store vectorizer pattern [PR115531] This adds a conditional store optimization for the vectorizer as a pattern. The vectorizer already supports modifying memory accesses because of the pattern based gather/scatter recognition. Doing it in the vectorizer allows us to still keep the ability to vectorize such loops for architectures that don't have MASK_STORE support, whereas doing this in ifcvt makes us commit to MASK_STORE. Concretely for this loop: void foo1 (char *restrict a, int *restrict b, int *restrict c, int n, int stride) { if (stride <= 1) return; for (int i = 0; i < n; i++) { int res = c[i]; int t = b[i+stride]; if (a[i] != 0) res = t; c[i] = res; } } today we generate: .L3: ld1bz29.s, p7/z, [x0, x5] ld1wz31.s, p7/z, [x2, x5, lsl 2] ld1wz30.s, p7/z, [x1, x5, lsl 2] cmpne p15.b, p6/z, z29.b, #0 sel z30.s, p15, z30.s, z31.s st1wz30.s, p7, [x2, x5, lsl 2] add x5, x5, x4 whilelo p7.s, w5, w3 b.any .L3 which in gimple is: vect_res_18.9_68 = .MASK_LOAD (vectp_c.7_65, 32B, loop_mask_67); vect_t_20.12_74 = .MASK_LOAD (vectp.10_72, 32B, loop_mask_67); vect__9.15_77 = .MASK_LOAD (vectp_a.13_75, 8B, loop_mask_67); mask__34.16_79 = vect__9.15_77 != { 0, ... }; vect_res_11.17_80 = VEC_COND_EXPR ; .MASK_STORE (vectp_c.18_81, 32B, loop_mask_67, vect_res_11.17_80); A MASK_STORE is already conditional, so there's no need to perform the load of the old values and the VEC_COND_EXPR. This patch makes it so we generate: vect_res_18.9_68 = .MASK_LOAD (vectp_c.7_65, 32B, loop_mask_67); vect__9.15_77 = .MASK_LOAD (vectp_a.13_75, 8B, loop_mask_67); mask__34.16_79 = vect__9.15_77 != { 0, ... }; .MASK_STORE (vectp_c.18_81, 32B, mask__34.16_79, vect_res_18.9_68); which generates: .L3: ld1bz30.s, p7/z, [x0, x5] ld1wz31.s, p7/z, [x1, x5, lsl 2] cmpne p7.b, p7/z, z30.b, #0 st1wz31.s, p7, [x2, x5, lsl 2] add x5, x5, x4 whilelo p7.s, w5, w3 b.any .L3 gcc/ChangeLog: PR tree-optimization/115531 * tree-vect-patterns.cc (vect_cond_store_pattern_same_ref): New. (vect_recog_cond_store_pattern): New. (vect_vect_recog_func_ptrs): Use it. * target.def (conditional_operation_is_expensive): New. * doc/tm.texi: Regenerate. * doc/tm.texi.in: Document it. * targhooks.cc (default_conditional_operation_is_expensive): New. * targhooks.h (default_conditional_operation_is_expensive): New. Diff: --- gcc/doc/tm.texi | 7 ++ gcc/doc/tm.texi.in| 2 + gcc/target.def| 12 gcc/targhooks.cc | 8 +++ gcc/targhooks.h | 1 + gcc/tree-vect-patterns.cc | 159 ++ 6 files changed, 189 insertions(+) diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index f10d9a59c667..c7535d07f4dd 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -6449,6 +6449,13 @@ The default implementation returns a @code{MODE_VECTOR_INT} with the same size and number of elements as @var{mode}, if such a mode exists. @end deftypefn +@deftypefn {Target Hook} bool TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE (unsigned @var{ifn}) +This hook returns true if masked operation @var{ifn} (really of +type @code{internal_fn}) should be considered more expensive to use than +implementing the same operation without masking. GCC can then try to use +unconditional operations instead with extra selects. 
+@end deftypefn + @deftypefn {Target Hook} bool TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE (unsigned @var{ifn}) This hook returns true if masked internal function @var{ifn} (really of type @code{internal_fn}) should be considered expensive when the mask is diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 24596eb2f6b4..64cea3b1edaf 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4290,6 +4290,8 @@ address; but often a machine-dependent strategy can generate better code. @hook TARGET_VECTORIZE_GET_MASK_MODE +@hook TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE + @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE @hook TARGET_VECTORIZE_CREATE_COSTS diff --git a/gcc/target.def b/gcc/target.def index ce4d1ecd58be..3de1aad4c84d 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -2033,6 +2033,18 @@ sam
[gcc r15-2192] AArch64: implement TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE [PR115531].
https://gcc.gnu.org/g:0c5c0c959c2e592b84739f19ca771fa69eb8dfee commit r15-2192-g0c5c0c959c2e592b84739f19ca771fa69eb8dfee Author: Tamar Christina Date: Mon Jul 22 10:28:19 2024 +0100 AArch64: implement TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE [PR115531]. This implements the new target hook indicating that for AArch64 when possible we prefer masked operations for any type vs doing LOAD + SELECT or SELECT + STORE. Thanks, Tamar gcc/ChangeLog: PR tree-optimization/115531 * config/aarch64/aarch64.cc (aarch64_conditional_operation_is_expensive): New. (TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE): New. gcc/testsuite/ChangeLog: PR tree-optimization/115531 * gcc.dg/vect/vect-conditional_store_1.c: New test. * gcc.dg/vect/vect-conditional_store_2.c: New test. * gcc.dg/vect/vect-conditional_store_3.c: New test. * gcc.dg/vect/vect-conditional_store_4.c: New test. Diff: --- gcc/config/aarch64/aarch64.cc | 12 ++ .../gcc.dg/vect/vect-conditional_store_1.c | 24 +++ .../gcc.dg/vect/vect-conditional_store_2.c | 24 +++ .../gcc.dg/vect/vect-conditional_store_3.c | 24 +++ .../gcc.dg/vect/vect-conditional_store_4.c | 28 ++ 5 files changed, 112 insertions(+) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 0d41a193ec18..89eb66348f77 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -28211,6 +28211,15 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load, return true; } +/* Implement TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE. Assume that + predicated operations when available are beneficial. */ + +static bool +aarch64_conditional_operation_is_expensive (unsigned) +{ + return false; +} + /* Implement TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE. Assume for now that it isn't worth branching around empty masked ops (including masked stores). 
*/ @@ -30898,6 +30907,9 @@ aarch64_libgcc_floating_mode_supported_p #define TARGET_VECTORIZE_RELATED_MODE aarch64_vectorize_related_mode #undef TARGET_VECTORIZE_GET_MASK_MODE #define TARGET_VECTORIZE_GET_MASK_MODE aarch64_get_mask_mode +#undef TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE +#define TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE \ + aarch64_conditional_operation_is_expensive #undef TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE #define TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE \ aarch64_empty_mask_is_expensive diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c new file mode 100644 index ..03128b1f19b2 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_masked_store } */ + +/* { dg-additional-options "-mavx2" { target avx2 } } */ +/* { dg-additional-options "-march=armv9-a" { target aarch64-*-* } } */ + +void foo1 (char *restrict a, int *restrict b, int *restrict c, int n, int stride) +{ + if (stride <= 1) +return; + + for (int i = 0; i < n; i++) +{ + int res = c[i]; + int t = b[i+stride]; + if (a[i] != 0) +res = t; + c[i] = res; +} +} + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ +/* { dg-final { scan-tree-dump-not "VEC_COND_EXPR " "vect" { target aarch64-*-* } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_2.c b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_2.c new file mode 100644 index ..a03898793c0b --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_2.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_masked_store } */ + +/* { dg-additional-options "-mavx2" { target avx2 } } */ +/* { dg-additional-options "-march=armv9-a" { target aarch64-*-* } } */ + +void foo2 (char *restrict a, int *restrict b, int *restrict c, int n, int stride) +{ + if (stride <= 1) +return; + + for (int i = 0; i < n; i++) +{ + int res = c[i]; + int t = b[i+stride]; + if (a[i] != 0) +t = res; + c[i] = t; +} +} + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ +/* { dg-final { scan-tree-dump-not "VEC_COND_EXPR " "vect" { target aarch64-*-* } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_3.c b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_3.c new file mode 100644 index ..8a898755c1ca --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_3.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg
[gcc r14-9493] match.pd: Only merge truncation with conversion for -fno-signed-zeros
https://gcc.gnu.org/g:7dd3b2b09cbeb6712ec680a0445cb0ad41070423 commit r14-9493-g7dd3b2b09cbeb6712ec680a0445cb0ad41070423 Author: Joe Ramsay Date: Fri Mar 15 09:20:45 2024 + match.pd: Only merge truncation with conversion for -fno-signed-zeros This optimisation does not honour signed zeros, so should not be enabled except with -fno-signed-zeros. gcc/ChangeLog: * match.pd: Fix truncation pattern for -fno-signed-zeroes gcc/testsuite/ChangeLog: * gcc.target/aarch64/no_merge_trunc_signed_zero.c: New test. Diff: --- gcc/match.pd | 1 + .../aarch64/no_merge_trunc_signed_zero.c | 24 ++ 2 files changed, 25 insertions(+) diff --git a/gcc/match.pd b/gcc/match.pd index 9ce313323a3..15a1e7350d4 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -4858,6 +4858,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (simplify (float (fix_trunc @0)) (if (!flag_trapping_math + && !HONOR_SIGNED_ZEROS (type) && types_match (type, TREE_TYPE (@0)) && direct_internal_fn_supported_p (IFN_TRUNC, type, OPTIMIZE_FOR_BOTH)) diff --git a/gcc/testsuite/gcc.target/aarch64/no_merge_trunc_signed_zero.c b/gcc/testsuite/gcc.target/aarch64/no_merge_trunc_signed_zero.c new file mode 100644 index 000..b2c93e55567 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/no_merge_trunc_signed_zero.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fno-trapping-math -fsigned-zeros" } */ + +#include + +float +f1 (float x) +{ + return (int) rintf(x); +} + +double +f2 (double x) +{ + return (long) rint(x); +} + +/* { dg-final { scan-assembler "frintx\\ts\[0-9\]+, s\[0-9\]+" } } */ +/* { dg-final { scan-assembler "cvtzs\\ts\[0-9\]+, s\[0-9\]+" } } */ +/* { dg-final { scan-assembler "scvtf\\ts\[0-9\]+, s\[0-9\]+" } } */ +/* { dg-final { scan-assembler "frintx\\td\[0-9\]+, d\[0-9\]+" } } */ +/* { dg-final { scan-assembler "cvtzs\\td\[0-9\]+, d\[0-9\]+" } } */ +/* { dg-final { scan-assembler "scvtf\\td\[0-9\]+, d\[0-9\]+" } } */ +
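A small demonstrator of why the fold is unsafe with signed zeros (an assumed example, not part of the commit): for negative inputs that truncate to zero, the integer round-trip yields +0.0 while trunc yields -0.0, so rewriting (float)(fix_trunc x) into IFN_TRUNC would change the sign of zero.

/* Compile without -ffast-math / -fno-signed-zeros to see the difference. */
#include <math.h>
#include <stdio.h>

int main (void)
{
  double x = -0.25;
  double via_int = (double) (long) x;   /* (long) -0.25 == 0, so +0.0 */
  double via_trunc = trunc (x);         /* IEEE trunc of -0.25 is -0.0 */
  printf ("%+g %+g\n", copysign (1.0, via_int), copysign (1.0, via_trunc));
  return 0;
}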
[gcc r14-9969] middle-end: adjust loop upper bounds when peeling for gaps and early break [PR114403].
https://gcc.gnu.org/g:85002f8085c25bb3e74ab013581a74e7c7ae006b commit r14-9969-g85002f8085c25bb3e74ab013581a74e7c7ae006b Author: Tamar Christina Date: Mon Apr 15 12:06:21 2024 +0100 middle-end: adjust loop upper bounds when peeling for gaps and early break [PR114403]. This fixes a bug with the interaction between peeling for gaps and early break. Before I go further, I'll first explain how I understand this to work for loops with a single exit. When peeling for gaps we peel N < VF iterations to scalar. This happens by removing N iterations from the calculation of niters such that vect_iters * VF == niters is always false. In other words, when we exit the vector loop we always fall to the scalar loop. The loop bounds adjustment guarantees this. Because of this we potentially execute a vector loop iteration less. That is, if you're at the boundary condition where niters % VF by peeling one or more scalar iterations the vector loop executes one less. This is accounted for by the adjustments in vect_transform_loops. This adjustment happens differently based on whether the the vector loop can be partial or not: Peeling for gaps sets the bias to 0 and then: when not partial: we take the floor of (scalar_upper_bound / VF) - 1 to get the vector latch iteration count. when loop is partial: For a single exit this means the loop is masked, we take the ceil to account for the fact that the loop can handle the final partial iteration using masking. Note that there's no difference between ceil an floor on the boundary condition. There is a difference however when you're slightly above it. i.e. if scalar iterates 14 times and VF = 4 and we peel 1 iteration for gaps. The partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations. and in effect the partial iteration is ignored and it's done as scalar. This is fine because the niters modification has capped the vector iteration at 2. So that when we reduce the induction values you end up entering the scalar code with ind_var.2 = ind_var.1 + 2 * VF. Now lets look at early breaks. To make it esier I'll focus on the specific testcase: char buffer[64]; __attribute__ ((noipa)) buff_t *copy (buff_t *first, buff_t *last) { char *buffer_ptr = buffer; char *const buffer_end = &buffer[SZ-1]; int store_size = sizeof(first->Val); while (first != last && (buffer_ptr + store_size) <= buffer_end) { const char *value_data = (const char *)(&first->Val); __builtin_memcpy(buffer_ptr, value_data, store_size); buffer_ptr += store_size; ++first; } if (first == last) return 0; return first; } Here the first, early exit is on the condition: (buffer_ptr + store_size) <= buffer_end and the main exit is on condition: first != last This is important, as this bug only manifests itself when the first exit has a known constant iteration count that's lower than the latch exit count. because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up processing 16 bytes per iteration. So the exit has a known bounds of 8 + 1. The vectorizer correctly analizes this: Statement (exit)if (ivtmp_21 != 0) is executed at most 8 (bounded by 8) + 1 times in loop 1. and as a consequence the IV is bound by 9: # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)> ... 
vect_ivtmp_21.16_124 = vect_vec_iv_.14_117 + { 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615 }; mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 }; if (mask_patt_22.17_126 == { -1, -1, -1, -1 }) goto ; [88.89%] else goto ; [11.11%] The imporant bits are this: In this example the value of last - first = 416. the calculated vector iteration count, is: x = (((ptr2 - ptr1) - 16) / 16) + 1 = 27 the bounds generated, adjusting for gaps: x == (((x - 1) >> 2) << 2) which means we'll always fall through to the scalar code. as intended. Here are two key things to note: 1. In this loop, the early exit will always be the one taken. When it's taken we enter the scalar loop with the correct induction value to apply the gap peeling. 2. If the main exit is taken, the induction values assumes you've finished all vector iterations. i.e. it assumes you have completed 24 iterations, as we treat the main exit the same for normal loop vect and early break when not PEELED. This means the induction value is adjusted to ind_
[gcc r13-8604] AArch64: Do not allow SIMD clones with simdlen 1 [PR113552]
https://gcc.gnu.org/g:1e08e39c743692afdd5d3546b2223474beac1dbc commit r13-8604-g1e08e39c743692afdd5d3546b2223474beac1dbc Author: Tamar Christina Date: Mon Apr 15 12:11:48 2024 +0100 AArch64: Do not allow SIMD clones with simdlen 1 [PR113552] This is a backport of g:306713c953d509720dc394c43c0890548bb0ae07. The AArch64 vector PCS does not allow simd calls with simdlen 1, however due to a bug we currently do allow it for num == 0. This causes us to emit a symbol that doesn't exist and we fail to link. gcc/ChangeLog: PR tree-optimization/113552 * config/aarch64/aarch64.cc (aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1. gcc/testsuite/ChangeLog: PR tree-optimization/113552 * gcc.target/aarch64/pr113552.c: New test. * gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check. Diff: --- gcc/config/aarch64/aarch64.cc | 16 +--- gcc/testsuite/gcc.target/aarch64/pr113552.c | 17 + gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c | 4 ++-- 3 files changed, 32 insertions(+), 5 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index f6d14cd791a..b8a4ab1b980 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -27029,7 +27029,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, bool explicit_p) { tree t, ret_type; - unsigned int elt_bits, count; + unsigned int elt_bits, count = 0; unsigned HOST_WIDE_INT const_simdlen; poly_uint64 vec_bits; @@ -27102,8 +27102,17 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, elt_bits = GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type)); if (known_eq (clonei->simdlen, 0U)) { - count = 2; - vec_bits = (num == 0 ? 64 : 128); + /* We don't support simdlen == 1. */ + if (known_eq (elt_bits, 64)) + { + count = 1; + vec_bits = 128; + } + else + { + count = 2; + vec_bits = (num == 0 ? 
64 : 128); + } clonei->simdlen = exact_div (vec_bits, elt_bits); } else @@ -27123,6 +27132,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, return 0; } } + clonei->vecsize_int = vec_bits; clonei->vecsize_float = vec_bits; return count; diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c b/gcc/testsuite/gcc.target/aarch64/pr113552.c new file mode 100644 index 000..9c96b061ed2 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -march=armv8-a" } */ + +__attribute__ ((__simd__ ("notinbranch"), const)) +double cos (double); + +void foo (float *a, double *b) +{ +for (int i = 0; i < 12; i+=3) + { +b[i] = cos (5.0 * a[i]); +b[i+1] = cos (5.0 * a[i+1]); +b[i+2] = cos (5.0 * a[i+2]); + } +} + +/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c index 95f6a6803e8..c6dac6b104c 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c +++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c @@ -18,7 +18,7 @@ double foo(double x) } /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnM1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnN1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */
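For readers decoding the symbols in the tests above, the names follow the AArch64 vector function ABI mangling (this decoding comes from that ABI, not from the commit itself); the simdlen-1 variants are exactly the clones the vector PCS does not allow, which is why the scan-assembler checks now reject them.

/* Name layout (AArch64 vector function ABI, stated here as background):
     _ZGV <isa> <mask> <simdlen> <params> _ <scalar-name>
   so _ZGVnN2v_cos means: 'n' Advanced SIMD, 'N' unmasked (notinbranch),
   simdlen 2, one vector parameter 'v', of the scalar function cos.
   A simdlen-1 clone such as _ZGVnN1v_foo has no defined vector-PCS
   calling convention, which is why the tests assert it is absent.  */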
[gcc r12-10329] AArch64: Do not allow SIMD clones with simdlen 1 [PR113552]
https://gcc.gnu.org/g:642cfd049780f03335da9fe0a51415f130232334 commit r12-10329-g642cfd049780f03335da9fe0a51415f130232334 Author: Tamar Christina Date: Mon Apr 15 12:16:53 2024 +0100 AArch64: Do not allow SIMD clones with simdlen 1 [PR113552] This is a backport of g:306713c953d509720dc394c43c0890548bb0ae07. The AArch64 vector PCS does not allow simd calls with simdlen 1, however due to a bug we currently do allow it for num == 0. This causes us to emit a symbol that doesn't exist and we fail to link. gcc/ChangeLog: PR tree-optimization/113552 * config/aarch64/aarch64.cc (aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1. gcc/testsuite/ChangeLog: PR tree-optimization/113552 * gcc.target/aarch64/pr113552.c: New test. * gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check. Diff: --- gcc/config/aarch64/aarch64.cc | 16 +--- gcc/testsuite/gcc.target/aarch64/pr113552.c | 17 + gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c | 4 ++-- 3 files changed, 32 insertions(+), 5 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 2bbba323770..96976abdbf4 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -26898,7 +26898,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, tree base_type, int num) { tree t, ret_type; - unsigned int elt_bits, count; + unsigned int elt_bits, count = 0; unsigned HOST_WIDE_INT const_simdlen; poly_uint64 vec_bits; @@ -26966,8 +26966,17 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, elt_bits = GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type)); if (known_eq (clonei->simdlen, 0U)) { - count = 2; - vec_bits = (num == 0 ? 64 : 128); + /* We don't support simdlen == 1. */ + if (known_eq (elt_bits, 64)) + { + count = 1; + vec_bits = 128; + } + else + { + count = 2; + vec_bits = (num == 0 ? 
64 : 128); + } clonei->simdlen = exact_div (vec_bits, elt_bits); } else @@ -26985,6 +26994,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, return 0; } } + clonei->vecsize_int = vec_bits; clonei->vecsize_float = vec_bits; return count; diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c b/gcc/testsuite/gcc.target/aarch64/pr113552.c new file mode 100644 index 000..9c96b061ed2 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -march=armv8-a" } */ + +__attribute__ ((__simd__ ("notinbranch"), const)) +double cos (double); + +void foo (float *a, double *b) +{ +for (int i = 0; i < 12; i+=3) + { +b[i] = cos (5.0 * a[i]); +b[i+1] = cos (5.0 * a[i+1]); +b[i+2] = cos (5.0 * a[i+2]); + } +} + +/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c index 95f6a6803e8..c6dac6b104c 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c +++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c @@ -18,7 +18,7 @@ double foo(double x) } /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnM1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnN1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */
[gcc r11-11323] [AArch64]: Do not allow SIMD clones with simdlen 1 [PR113552]
https://gcc.gnu.org/g:0c2fcf3ddfe93d1f403962c4bacbb5d55ab7d19d commit r11-11323-g0c2fcf3ddfe93d1f403962c4bacbb5d55ab7d19d Author: Tamar Christina Date: Mon Apr 15 12:32:24 2024 +0100 [AArch64]: Do not allow SIMD clones with simdlen 1 [PR113552] This is a backport of g:306713c953d509720dc394c43c0890548bb0ae07. The AArch64 vector PCS does not allow simd calls with simdlen 1, however due to a bug we currently do allow it for num == 0. This causes us to emit a symbol that doesn't exist and we fail to link. gcc/ChangeLog: PR tree-optimization/113552 * config/aarch64/aarch64.c (aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1. gcc/testsuite/ChangeLog: PR tree-optimization/113552 * gcc.target/aarch64/pr113552.c: New test. * gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check. Diff: --- gcc/config/aarch64/aarch64.c | 18 ++ gcc/testsuite/gcc.target/aarch64/pr113552.c| 17 + .../gcc.target/aarch64/simd_pcs_attribute-3.c | 4 ++-- 3 files changed, 33 insertions(+), 6 deletions(-) diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 9bbbc5043af..4df72339952 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -25556,7 +25556,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, tree base_type, int num) { tree t, ret_type; - unsigned int elt_bits, count; + unsigned int elt_bits, count = 0; unsigned HOST_WIDE_INT const_simdlen; poly_uint64 vec_bits; @@ -25624,11 +25624,20 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, elt_bits = GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type)); if (known_eq (clonei->simdlen, 0U)) { - count = 2; - vec_bits = (num == 0 ? 64 : 128); + /* We don't support simdlen == 1. */ + if (known_eq (elt_bits, 64)) + { + count = 1; + vec_bits = 128; + } + else + { + count = 2; + vec_bits = (num == 0 ? 
64 : 128); + } clonei->simdlen = exact_div (vec_bits, elt_bits); } - else + else if (maybe_ne (clonei->simdlen, 1U)) { count = 1; vec_bits = clonei->simdlen * elt_bits; @@ -25643,6 +25652,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, return 0; } } + clonei->vecsize_int = vec_bits; clonei->vecsize_float = vec_bits; return count; diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c b/gcc/testsuite/gcc.target/aarch64/pr113552.c new file mode 100644 index 000..9c96b061ed2 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -march=armv8-a" } */ + +__attribute__ ((__simd__ ("notinbranch"), const)) +double cos (double); + +void foo (float *a, double *b) +{ +for (int i = 0; i < 12; i+=3) + { +b[i] = cos (5.0 * a[i]); +b[i+1] = cos (5.0 * a[i+1]); +b[i+2] = cos (5.0 * a[i+2]); + } +} + +/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c index 95f6a6803e8..c6dac6b104c 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c +++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c @@ -18,7 +18,7 @@ double foo(double x) } /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnM1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnN1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */
[gcc r14-9997] testsuite: Fix data check loop on vect-early-break_124-pr114403.c
https://gcc.gnu.org/g:f438acf7ce2e6cb862cf62f2543c36639e2af233 commit r14-9997-gf438acf7ce2e6cb862cf62f2543c36639e2af233 Author: Tamar Christina Date: Tue Apr 16 20:56:26 2024 +0100 testsuite: Fix data check loop on vect-early-break_124-pr114403.c The testcase had the wrong indices in the buffer check loop. gcc/testsuite/ChangeLog: PR tree-optimization/114403 * gcc.dg/vect/vect-early-break_124-pr114403.c: Fix check loop. Diff: --- gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c index 1751296ab81..51abf245ccb 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c @@ -68,8 +68,8 @@ int main () int store_size = sizeof(PV); #pragma GCC novector - for (int i = 0; i < NUM - 1; i+=store_size) -if (0 != __builtin_memcmp (buffer+i, (char*)&tmp[i].Val, store_size)) + for (int i = 0; i < NUM - 1; i++) +if (0 != __builtin_memcmp (buffer+(i*store_size), (char*)&tmp[i].Val, store_size)) __builtin_abort (); return 0;
[gcc r14-10014] AArch64: remove reliance on register allocator for simd/gpreg costing. [PR114741]
https://gcc.gnu.org/g:a2f4be3dae04fa8606d1cc8451f0b9d450f7e6e6 commit r14-10014-ga2f4be3dae04fa8606d1cc8451f0b9d450f7e6e6 Author: Tamar Christina Date: Thu Apr 18 11:47:42 2024 +0100 AArch64: remove reliance on register allocator for simd/gpreg costing. [PR114741] In PR114741 we see that we have a regression in codegen when SVE is enable where the simple testcase: void foo(unsigned v, unsigned *p) { *p = v & 1; } generates foo: fmovs31, w0 and z31.s, z31.s, #1 str s31, [x1] ret instead of: foo: and w0, w0, 1 str w0, [x1] ret This causes an impact it not just codesize but also performance. This is caused by the use of the ^ constraint modifier in the pattern 3. The documentation states that this modifier should only have an effect on the alternative costing in that a particular alternative is to be preferred unless a non-psuedo reload is needed. The pattern was trying to convey that whenever both r and w are required, that it should prefer r unless a reload is needed. This is because if a reload is needed then we can construct the constants more flexibly on the SIMD side. We were using this so simplify the implementation and to get generic cases such as: double negabs (double x) { unsigned long long y; memcpy (&y, &x, sizeof(double)); y = y | (1UL << 63); memcpy (&x, &y, sizeof(double)); return x; } which don't go through an expander. However the implementation of ^ in the register allocator is not according to the documentation in that it also has an effect during coloring. During initial register class selection it applies a penalty to a class, similar to how ? does. In this example the penalty makes the use of GP regs expensive enough that it no longer considers them: r106: preferred FP_REGS, alternative NO_REGS, allocno FP_REGS ;;3--> b 0: i 9 r106=r105&0x1 :cortex_a53_slot_any:GENERAL_REGS+0(-1)FP_REGS+1(1)PR_LO_REGS+0(0) PR_HI_REGS+0(0):model 4 which is not the expected behavior. For GCC 14 this is a conservative fix. 1. we remove the ^ modifier from the logical optabs. 2. In order not to regress copysign we then move the copysign expansion to directly use the SIMD variant. Since copysign only supports floating point modes this is fine and no longer relies on the register allocator to select the right alternative. It once again regresses the general case, but this case wasn't optimized in earlier GCCs either so it's not a regression in GCC 14. This change gives strict better codegen than earlier GCCs and still optimizes the important cases. gcc/ChangeLog: PR target/114741 * config/aarch64/aarch64.md (3): Remove ^ from alt 2. (copysign3): Use SIMD version of IOR directly. gcc/testsuite/ChangeLog: PR target/114741 * gcc.target/aarch64/fneg-abs_2.c: Update codegen. * gcc.target/aarch64/fneg-abs_4.c: xfail for now. * gcc.target/aarch64/pr114741.c: New test. Diff: --- gcc/config/aarch64/aarch64.md | 23 + gcc/testsuite/gcc.target/aarch64/fneg-abs_2.c | 5 ++--- gcc/testsuite/gcc.target/aarch64/fneg-abs_4.c | 4 ++-- gcc/testsuite/gcc.target/aarch64/pr114741.c | 29 +++ 4 files changed, 48 insertions(+), 13 deletions(-) diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 385a669b9b3..dbde066f747 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -4811,7 +4811,7 @@ "" {@ [ cons: =0 , 1 , 2; attrs: type , arch ] [ r, %r , r; logic_reg , * ] \t%0, %1, %2 - [ rk , ^r , ; logic_imm , * ] \t%0, %1, %2 + [ rk , r , ; logic_imm , * ] \t%0, %1, %2 [ w, 0 , ; * , sve ] \t%Z0., %Z0., #%2 [ w, w , w; neon_logic , simd ] \t%0., %1., %2. 
} @@ -7192,22 +7192,29 @@ (match_operand:GPF 2 "nonmemory_operand")] "TARGET_SIMD" { - machine_mode int_mode = mode; - rtx bitmask = gen_reg_rtx (int_mode); - emit_move_insn (bitmask, GEN_INT (HOST_WIDE_INT_M1U - << (GET_MODE_BITSIZE (mode) - 1))); + rtx signbit_const = GEN_INT (HOST_WIDE_INT_M1U + << (GET_MODE_BITSIZE (mode) - 1)); /* copysign (x, -1) should instead be expanded as orr with the sign bit. */ rtx op2_elt = unwrap_const_vec_duplicate (operands[2]); if (GET_CODE (op2_elt) == CONST_DOUBLE && real_isneg (CONST_DOUBLE_REAL_VALUE (op2_e
[gcc r14-10040] middle-end: refactor vect_recog_absolute_difference to simplify flow [PR114769]
https://gcc.gnu.org/g:1216460e7023cd8ec49933866107417c70e933c9 commit r14-10040-g1216460e7023cd8ec49933866107417c70e933c9 Author: Tamar Christina Date: Fri Apr 19 15:22:13 2024 +0100 middle-end: refactory vect_recog_absolute_difference to simplify flow [PR114769] Hi All, As the reporter in PR114769 points out the control flow for the abd detection is hard to follow. This is because vect_recog_absolute_difference has two different ways it can return true. 1. It can return true when the widening operation is matched, in which case unprom is set, half_type is not NULL and diff_stmt is not set. 2. It can return true when the widening operation is not matched, but the stmt being checked is a minus. In this case unprom is not set, half_type is set to NULL and diff_stmt is set. This because to get to diff_stmt you have to dig through the abs statement and any possible promotions. This however leads to complicated uses of the function at the call sites as the exact semantic needs to be known to use it safely. vect_recog_absolute_difference has two callers: 1. vect_recog_sad_pattern where if you return true with unprom not set, then *half_type will be NULL. The call to vect_supportable_direct_optab_p will always reject it since there's no vector mode for NULL. Note that if looking at the dump files, the convention in the dump files have always been that we first indicate that a pattern could possibly be recognize and then check that it's supported. This change somewhat incorrectly makes the diagnostic message get printed for "invalid" patterns. 2. vect_recog_abd_pattern, where if half_type is NULL, it then uses diff_stmt to set them. This refactors the code, it now only has 1 success condition, and diff_stmt is always set to the minus statement in the abs if there is one. The function now only returns success if the widening minus is found, in which case unprom and half_type set. This then leaves it up to the caller to decide if they want to do anything with diff_stmt. Thanks, Tamar gcc/ChangeLog: PR tree-optimization/114769 * tree-vect-patterns.cc: (vect_recog_absolute_difference): Have only one success condition. (vect_recog_abd_pattern): Handle further checks if vect_recog_absolute_difference fails. Diff: --- gcc/tree-vect-patterns.cc | 43 --- 1 file changed, 16 insertions(+), 27 deletions(-) diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index 4f491c6b833..87c2acff386 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -797,8 +797,7 @@ vect_split_statement (vec_info *vinfo, stmt_vec_info stmt2_info, tree new_rhs, HALF_TYPE and UNPROM will be set should the statement be found to be a widened operation. DIFF_STMT will be set to the MINUS_EXPR - statement that precedes the ABS_STMT unless vect_widened_op_tree - succeeds. + statement that precedes the ABS_STMT if it is a MINUS_EXPR.. */ static bool vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt, @@ -843,6 +842,12 @@ vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt, if (!diff_stmt_vinfo) return false; + gassign *diff = dyn_cast (STMT_VINFO_STMT (diff_stmt_vinfo)); + if (diff_stmt && diff + && gimple_assign_rhs_code (diff) == MINUS_EXPR + && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (abs_oprnd))) +*diff_stmt = diff; + /* FORNOW. Can continue analyzing the def-use chain when this stmt in a phi inside the loop (in case we are analyzing an outer-loop). 
*/ if (vect_widened_op_tree (vinfo, diff_stmt_vinfo, @@ -850,17 +855,6 @@ vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt, false, 2, unprom, half_type)) return true; - /* Failed to find a widen operation so we check for a regular MINUS_EXPR. */ - gassign *diff = dyn_cast (STMT_VINFO_STMT (diff_stmt_vinfo)); - if (diff_stmt && diff - && gimple_assign_rhs_code (diff) == MINUS_EXPR - && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (abs_oprnd))) -{ - *diff_stmt = diff; - *half_type = NULL_TREE; - return true; -} - return false; } @@ -1499,27 +1493,22 @@ vect_recog_abd_pattern (vec_info *vinfo, tree out_type = TREE_TYPE (gimple_assign_lhs (last_stmt)); vect_unpromoted_value unprom[2]; - gassign *diff_stmt; - tree half_type; - if (!vect_recog_absolute_difference (vinfo, last_stmt, &half_type, + gassign *diff_stmt = NULL; + tree abd_in_type; + if (!vect_recog_absolute_difference (vinfo, last_stmt, &abd_in_type, unprom, &diff_stmt)) -return NULL; - - tree abd_in_type, abd_out_type; - -
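For orientation, a hedged example (mine, not taken from the patch or its tests) of the kind of source loop these recognisers target; the widening subtraction feeding the ABS_EXPR is the diff_stmt that the refactored function now always reports, leaving the caller to decide whether to use it.

// Hypothetical input loop for the ABD/SAD pattern recognisers.
#include <cstdint>
#include <cstdlib>

void
abd_loop (std::uint8_t *__restrict out, const std::uint8_t *__restrict a,
          const std::uint8_t *__restrict b, int n)
{
  for (int i = 0; i < n; ++i)
    {
      int diff = a[i] - b[i];      // widened MINUS_EXPR: the diff_stmt
      out[i] = std::abs (diff);    // ABS_EXPR wrapping the difference
    }
}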
[gcc r15-2336] middle-end: check for vector mode before calling get_mask_mode [PR116074]
https://gcc.gnu.org/g:29e4e4bdb674118b898d50ce7751c183aa0a44ee commit r15-2336-g29e4e4bdb674118b898d50ce7751c183aa0a44ee Author: Tamar Christina Date: Fri Jul 26 13:02:53 2024 +0100 middle-end: check for vector mode before calling get_mask_mode [PR116074] For historical reasons AArch64 has TI mode vector types but does not consider TImode a vector mode. What's happening in the PR is that get_vectype_for_scalar_type is returning vector(1) TImode for a TImode scalar. This then fails when we call targetm.vectorize.get_mask_mode (vecmode).exists (&) on the TYPE_MODE. This checks for vector mode before using the results of get_vectype_for_scalar_type. gcc/ChangeLog: PR target/116074 * tree-vect-patterns.cc (vect_recog_cond_store_pattern): Check vector mode. gcc/testsuite/ChangeLog: PR target/116074 * g++.target/aarch64/pr116074.C: New test. Diff: --- gcc/testsuite/g++.target/aarch64/pr116074.C | 24 gcc/tree-vect-patterns.cc | 3 ++- 2 files changed, 26 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/g++.target/aarch64/pr116074.C b/gcc/testsuite/g++.target/aarch64/pr116074.C new file mode 100644 index ..54cf561510c4 --- /dev/null +++ b/gcc/testsuite/g++.target/aarch64/pr116074.C @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O3" } */ + +int m[40]; + +template struct j { + int length; + k *e; + void operator[](int) { +if (length) + __builtin___memcpy_chk(m, m+3, sizeof (k), -1); + } +}; + +j> o; + +int *q; + +void ao(int i) { + for (; i > 0; i--) { +o[1]; +*q = 1; + } +} diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index b0821c74c1d8..5fbd1a4fa6b4 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -6624,7 +6624,8 @@ vect_recog_cond_store_pattern (vec_info *vinfo, machine_mode mask_mode; machine_mode vecmode = TYPE_MODE (vectype); - if (targetm.vectorize.conditional_operation_is_expensive (IFN_MASK_STORE) + if (!VECTOR_MODE_P (vecmode) + || targetm.vectorize.conditional_operation_is_expensive (IFN_MASK_STORE) || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode) || !can_vec_mask_load_store_p (vecmode, mask_mode, false)) return NULL;
[gcc r15-2638] AArch64: Update Neoverse V2 cost model to release costs
https://gcc.gnu.org/g:7e7c1e38829d45667748db68f15584bdd16fcad6 commit r15-2638-g7e7c1e38829d45667748db68f15584bdd16fcad6 Author: Tamar Christina Date: Thu Aug 1 16:53:22 2024 +0100 AArch64: Update Neoverse V2 cost model to release costs This updates the cost for Neoverse V2 to reflect the updated Software Optimization Guide. It also makes Cortex-X3 use the Neoverse V2 cost model. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (cortex-x3): Use Neoverse-V2 costs. * config/aarch64/tuning_models/neoversev2.h: Update costs. Diff: --- gcc/config/aarch64/aarch64-cores.def | 2 +- gcc/config/aarch64/tuning_models/neoversev2.h | 38 +-- 2 files changed, 20 insertions(+), 20 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index e58bc0f27de3..34307fe0c172 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -186,7 +186,7 @@ AARCH64_CORE("cortex-a720", cortexa720, cortexa57, V9_2A, (SVE2_BITPERM, MEMTA AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd48, -1) -AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd4e, -1) +AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversev2, 0x41, 0xd4e, -1) AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h index f76e4ef358f7..c9c3019dd01a 100644 --- a/gcc/config/aarch64/tuning_models/neoversev2.h +++ b/gcc/config/aarch64/tuning_models/neoversev2.h @@ -57,13 +57,13 @@ static const advsimd_vec_cost neoversev2_advsimd_vector_cost = 2, /* ld2_st2_permute_cost */ 2, /* ld3_st3_permute_cost */ 3, /* ld4_st4_permute_cost */ - 3, /* permute_cost */ + 2, /* permute_cost */ 4, /* reduc_i8_cost */ 4, /* reduc_i16_cost */ 2, /* reduc_i32_cost */ 2, /* reduc_i64_cost */ 6, /* reduc_f16_cost */ - 3, /* reduc_f32_cost */ + 4, /* reduc_f32_cost */ 2, /* reduc_f64_cost */ 2, /* store_elt_extra_cost */ /* This value is just inherited from the Cortex-A57 table. */ @@ -86,22 +86,22 @@ static const sve_vec_cost neoversev2_sve_vector_cost = { 2, /* int_stmt_cost */ 2, /* fp_stmt_cost */ -3, /* ld2_st2_permute_cost */ +2, /* ld2_st2_permute_cost */ 3, /* ld3_st3_permute_cost */ -4, /* ld4_st4_permute_cost */ -3, /* permute_cost */ +3, /* ld4_st4_permute_cost */ +2, /* permute_cost */ /* Theoretically, a reduction involving 15 scalar ADDs could - complete in ~3 cycles and would have a cost of 15. [SU]ADDV - completes in 11 cycles, so give it a cost of 15 + 8. */ -21, /* reduc_i8_cost */ -/* Likewise for 7 scalar ADDs (~2 cycles) vs. 9: 7 + 7. */ -14, /* reduc_i16_cost */ -/* Likewise for 3 scalar ADDs (~2 cycles) vs. 8: 3 + 4. */ + complete in ~5 cycles and would have a cost of 15. [SU]ADDV + completes in 9 cycles, so give it a cost of 15 + 4. */ +19, /* reduc_i8_cost */ +/* Likewise for 7 scalar ADDs (~3 cycles) vs. 8: 7 + 5. */ +12, /* reduc_i16_cost */ +/* Likewise for 3 scalar ADDs (~2 cycles) vs. 6: 3 + 4. */ 7, /* reduc_i32_cost */ -/* Likewise for 1 scalar ADD (~1 cycles) vs. 2: 1 + 1. */ -2, /* reduc_i64_cost */ +/* Likewise for 1 scalar ADDs (~1 cycles) vs. 4: 1 + 3. */ +4, /* reduc_i64_cost */ /* Theoretically, a reduction involving 7 scalar FADDs could - complete in ~6 cycles and would have a cost of 14. 
FADDV + complete in ~6 cycles and would have a cost of 14. FADDV completes in 8 cycles, so give it a cost of 14 + 2. */ 16, /* reduc_f16_cost */ /* Likewise for 3 scalar FADDs (~4 cycles) vs. 6: 6 + 2. */ @@ -127,7 +127,7 @@ static const sve_vec_cost neoversev2_sve_vector_cost = /* A strided Advanced SIMD x64 load would take two parallel FP loads (8 cycles) plus an insertion (2 cycles). Assume a 64-bit SVE gather is 1 cycle more. The Advanced SIMD version is costed as 2 scalar loads - (cost 8) and a vec_construct (cost 2). Add a full vector operation + (cost 8) and a vec_construct (cost 4). Add a full vector operation (cost 2) to that, to avoid the difference being lost in rounding. There is no easy comparison between a strided Advanced SIMD x32 load @@ -165,14 +165,14 @@ static const aarch64_sve_vec_issue_info neoversev2_sve_issue_info = { { { - 3, /* loads_per_cycle */ + 3, /* loads_stores_per_cycle */ 2, /* stores_per_cycle */ 4, /* general_ops_per_cycle */ 0, /* fp_simd_load_general_ops */ 1 /* fp_simd_sto
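All of the reduction-cost comments above follow the same arithmetic, which is easy to lose in the diff noise: the vector cost is the cost of the equivalent scalar sequence plus the difference between the vector instruction's latency and the scalar sequence's estimated latency. A small self-checking restatement (the helper is mine; the numbers are the new Neoverse V2 SVE figures quoted above):

// vec_cost = scalar_cost + (vector_latency - estimated_scalar_latency).
constexpr int reduc_cost (int scalar_cost, int scalar_cycles, int vec_cycles)
{
  return scalar_cost + (vec_cycles - scalar_cycles);
}

static_assert (reduc_cost (15, 5, 9) == 19, "reduc_i8_cost: [SU]ADDV");
static_assert (reduc_cost (7, 3, 8) == 12, "reduc_i16_cost");
static_assert (reduc_cost (3, 2, 6) == 7, "reduc_i32_cost");
static_assert (reduc_cost (1, 1, 4) == 4, "reduc_i64_cost");
static_assert (reduc_cost (14, 6, 8) == 16, "reduc_f16_cost: FADDV");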
[gcc r15-2640] AArch64: Add Neoverse V3AE core definition and cost model
https://gcc.gnu.org/g:7ca2a803c4a0d8e894f0b36625a2c838c54fb4cd commit r15-2640-g7ca2a803c4a0d8e894f0b36625a2c838c54fb4cd Author: Tamar Christina Date: Thu Aug 1 16:53:59 2024 +0100 AArch64: Add Neoverse V3AE core definition and cost model This adds a cost model and core definition for Neoverse V3AE. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (neoverse-v3ae): New. * config/aarch64/aarch64-tune.md: Regenerate. * config/aarch64/tuning_models/neoversev3ae.h: New file. * config/aarch64/aarch64.cc: Use it. * doc/invoke.texi: Document it. Diff: --- gcc/config/aarch64/aarch64-cores.def| 1 + gcc/config/aarch64/aarch64-tune.md | 2 +- gcc/config/aarch64/aarch64.cc | 1 + gcc/config/aarch64/tuning_models/neoversev3ae.h | 246 gcc/doc/invoke.texi | 2 +- 5 files changed, 250 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 96c74657a199..092be6eb01e6 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -196,6 +196,7 @@ AARCH64_CORE("cobalt-100", cobalt100, cortexa57, V9A, (I8MM, BF16, SVE2_BITPER AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, -1) AARCH64_CORE("neoverse-v3", neoversev3, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE), neoversev3, 0x41, 0xd84, -1) +AARCH64_CORE("neoverse-v3ae", neoversev3ae, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE), neoversev3ae, 0x41, 0xd83, -1) AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md index 0c3339b53e42..b02e891086cc 100644 --- a/gcc/config/aarch64/aarch64-tune.md +++ b/gcc/config/aarch64/aarch64-tune.md @@ -1,5 +1,5 @@ ;; -*- buffer-read-only: t -*- ;; Generated automatically by gentune.sh from aarch64-cores.def (define_attr "tune" - "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,neoversev3,demeter,generic,generic_armv8_a,generic_armv9_a" + 
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,neoversev3,neoversev3ae,demeter,generic,generic_armv8_a,generic_armv9_a" (const (symbol_ref "((enum attr_tune) aarch64_tune)"))) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index f29dcf7fe173..54b27cdff43b 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -411,6 +411,7 @@ static const struct aarch64_flag_desc aarch64_tuning_flags[] = #include "tuning_models/neoversen2.h" #include "tuning_models/neoversev2.h" #include "tuning_models/neoversev3.h" +#include "tuning_models/neoversev3ae.h" #include "tuning_models/a64fx.h" /* Support for fine-grained override of the tuning structures. */ diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h b/gcc/config/aarch64/tuning_models/neoversev3ae.h new file mode 100644 index ..96d7ccf03cd9 --- /dev/null +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h @@ -0,0 +1,246 @@ +/* T
[gcc r15-2641] AArch64: Add Neoverse N3 and Cortex-A725 core definition and cost model
https://gcc.gnu.org/g:488395f9513233944e488fae59372da4de4324c3 commit r15-2641-g488395f9513233944e488fae59372da4de4324c3 Author: Tamar Christina Date: Thu Aug 1 16:54:15 2024 +0100 AArch64: Add Neoverse N3 and Cortex-A725 core definition and cost model This adds a cost model and core definition for Neoverse N3 and Cortex-A725. It also makes Cortex-A725 use the Neoverse N3 cost model. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (neoverse-n3, cortex-a725): New. * config/aarch64/aarch64-tune.md: Regenerate. * config/aarch64/tuning_models/neoversen3.h: New file. * config/aarch64/aarch64.cc: Use it. * doc/invoke.texi: Document it. Diff: --- gcc/config/aarch64/aarch64-cores.def | 2 + gcc/config/aarch64/aarch64-tune.md| 2 +- gcc/config/aarch64/aarch64.cc | 1 + gcc/config/aarch64/tuning_models/neoversen3.h | 245 ++ gcc/doc/invoke.texi | 3 +- 5 files changed, 251 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 092be6eb01e6..4d6f5a701eee 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -183,6 +183,7 @@ AARCH64_CORE("cortex-a710", cortexa710, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, AARCH64_CORE("cortex-a715", cortexa715, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd4d, -1) AARCH64_CORE("cortex-a720", cortexa720, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-a725", cortexa725, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen3, 0x41, 0xd87, -1) AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd48, -1) @@ -192,6 +193,7 @@ AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, P AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1) AARCH64_CORE("cobalt-100", cobalt100, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1) +AARCH64_CORE("neoverse-n3", neoversen3, cortexa57, V9_2A, (SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen3, 0x41, 0xd8e, -1) AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, -1) diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md index b02e891086cc..d71c631b01c7 100644 --- a/gcc/config/aarch64/aarch64-tune.md +++ b/gcc/config/aarch64/aarch64-tune.md @@ -1,5 +1,5 @@ ;; -*- buffer-read-only: t -*- ;; Generated automatically by gentune.sh from aarch64-cores.def (define_attr "tune" - 
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,neoversev3,neoversev3ae,demeter,generic,generic_armv8_a,generic_armv9_a" + "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexa725,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversen3,neoversev2,grace,neoversev3,neoversev3ae,demeter,generic,generic_armv8_a,generic_armv9_a" (const (symbol_ref "((enum attr_tune) aarch64_tune)"))) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aar
[gcc r15-2642] AArch64: Update Generic Armv9-a cost model to release costs
https://gcc.gnu.org/g:3b0bac451110bf1591ce9085b66857448d099a8c commit r15-2642-g3b0bac451110bf1591ce9085b66857448d099a8c Author: Tamar Christina Date: Thu Aug 1 16:54:31 2024 +0100 AArch64: Update Generic Armv9-a cost model to release costs this updates the costs for gener-armv9-a based on the updated costs for Neoverse V2 and Neoverse N2. gcc/ChangeLog: * config/aarch64/tuning_models/generic_armv9_a.h: Update costs. Diff: --- gcc/config/aarch64/tuning_models/generic_armv9_a.h | 50 +++--- 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h index 0a08c4b43473..7156dbe5787e 100644 --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h @@ -58,7 +58,7 @@ static const advsimd_vec_cost generic_armv9_a_advsimd_vector_cost = 2, /* ld2_st2_permute_cost */ 2, /* ld3_st3_permute_cost */ 3, /* ld4_st4_permute_cost */ - 3, /* permute_cost */ + 2, /* permute_cost */ 4, /* reduc_i8_cost */ 4, /* reduc_i16_cost */ 2, /* reduc_i32_cost */ @@ -87,28 +87,28 @@ static const sve_vec_cost generic_armv9_a_sve_vector_cost = { 2, /* int_stmt_cost */ 2, /* fp_stmt_cost */ -3, /* ld2_st2_permute_cost */ -4, /* ld3_st3_permute_cost */ -4, /* ld4_st4_permute_cost */ -3, /* permute_cost */ +2, /* ld2_st2_permute_cost */ +3, /* ld3_st3_permute_cost */ +3, /* ld4_st4_permute_cost */ +2, /* permute_cost */ /* Theoretically, a reduction involving 15 scalar ADDs could complete in ~5 cycles and would have a cost of 15. [SU]ADDV - completes in 11 cycles, so give it a cost of 15 + 6. */ -21, /* reduc_i8_cost */ -/* Likewise for 7 scalar ADDs (~3 cycles) vs. 9: 7 + 6. */ -13, /* reduc_i16_cost */ -/* Likewise for 3 scalar ADDs (~2 cycles) vs. 8: 3 + 6. */ -9, /* reduc_i32_cost */ -/* Likewise for 1 scalar ADD (~1 cycles) vs. 2: 1 + 1. */ -2, /* reduc_i64_cost */ + completes in 9 cycles, so give it a cost of 15 + 4. */ +19, /* reduc_i8_cost */ +/* Likewise for 7 scalar ADDs (~3 cycles) vs. 8: 7 + 5. */ +12, /* reduc_i16_cost */ +/* Likewise for 3 scalar ADDs (~2 cycles) vs. 6: 3 + 4. */ +7, /* reduc_i32_cost */ +/* Likewise for 1 scalar ADDs (~1 cycles) vs. 4: 1 + 3. */ +4, /* reduc_i64_cost */ /* Theoretically, a reduction involving 7 scalar FADDs could - complete in ~8 cycles and would have a cost of 14. FADDV - completes in 6 cycles, so give it a cost of 14 - 2. */ -12, /* reduc_f16_cost */ -/* Likewise for 3 scalar FADDs (~4 cycles) vs. 4: 6 - 0. */ -6, /* reduc_f32_cost */ -/* Likewise for 1 scalar FADD (~2 cycles) vs. 2: 2 - 0. */ -2, /* reduc_f64_cost */ + complete in ~8 cycles and would have a cost of 14. FADDV + completes in 8 cycles, so give it a cost of 14 + 0. */ +14, /* reduc_f16_cost */ +/* Likewise for 3 scalar FADDs (~4 cycles) vs. 6: 6 + 2. */ +8, /* reduc_f32_cost */ +/* Likewise for 1 scalar FADD (~2 cycles) vs. 4: 2 + 2. */ +4, /* reduc_f64_cost */ 2, /* store_elt_extra_cost */ /* This value is just inherited from the Cortex-A57 table. */ 8, /* vec_to_scalar_cost */ @@ -128,7 +128,7 @@ static const sve_vec_cost generic_armv9_a_sve_vector_cost = /* A strided Advanced SIMD x64 load would take two parallel FP loads (8 cycles) plus an insertion (2 cycles). Assume a 64-bit SVE gather is 1 cycle more. The Advanced SIMD version is costed as 2 scalar loads - (cost 8) and a vec_construct (cost 2). Add a full vector operation + (cost 8) and a vec_construct (cost 4). 
Add a full vector operation (cost 2) to that, to avoid the difference being lost in rounding. There is no easy comparison between a strided Advanced SIMD x32 load @@ -166,14 +166,14 @@ static const aarch64_sve_vec_issue_info generic_armv9_a_sve_issue_info = { { { - 3, /* loads_per_cycle */ + 3, /* loads_stores_per_cycle */ 2, /* stores_per_cycle */ 2, /* general_ops_per_cycle */ 0, /* fp_simd_load_general_ops */ 1 /* fp_simd_store_general_ops */ }, 2, /* ld2_st2_general_ops */ -3, /* ld3_st3_general_ops */ +2, /* ld3_st3_general_ops */ 3 /* ld4_st4_general_ops */ }, 2, /* pred_ops_per_cycle */ @@ -191,7 +191,7 @@ static const aarch64_vec_issue_info generic_armv9_a_vec_issue_info = &generic_armv9_a_sve_issue_info }; -/* Neoverse N2 costs for vector insn classes. */ +/* Generic_armv9_a costs for vector insn classes. */ static const struct cpu_vector_cost generic_armv9_a_vector_cost = { 1, /* scalar_int_stmt_cost */ @@ -228,7 +228,7 @@ static const struct tune_params generic_armv9_a_tunings = "32:16", /* loop_ali
[gcc r15-2643] AArch64: Update Neoverse N2 cost model to release costs
https://gcc.gnu.org/g:f88cb43aed5c7db5676732c755ec4fee960ecbed commit r15-2643-gf88cb43aed5c7db5676732c755ec4fee960ecbed Author: Tamar Christina Date: Thu Aug 1 16:54:49 2024 +0100 AArch64: Update Neoverse N2 cost model to release costs This updates the cost for Neoverse N2 to reflect the updated Software Optimization Guide. gcc/ChangeLog: * config/aarch64/tuning_models/neoversen2.h: Update costs. Diff: --- gcc/config/aarch64/tuning_models/neoversen2.h | 46 +-- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h index be9a48ac3adc..d41e714aa045 100644 --- a/gcc/config/aarch64/tuning_models/neoversen2.h +++ b/gcc/config/aarch64/tuning_models/neoversen2.h @@ -57,7 +57,7 @@ static const advsimd_vec_cost neoversen2_advsimd_vector_cost = 2, /* ld2_st2_permute_cost */ 2, /* ld3_st3_permute_cost */ 3, /* ld4_st4_permute_cost */ - 3, /* permute_cost */ + 2, /* permute_cost */ 4, /* reduc_i8_cost */ 4, /* reduc_i16_cost */ 2, /* reduc_i32_cost */ @@ -86,27 +86,27 @@ static const sve_vec_cost neoversen2_sve_vector_cost = { 2, /* int_stmt_cost */ 2, /* fp_stmt_cost */ -3, /* ld2_st2_permute_cost */ -4, /* ld3_st3_permute_cost */ -4, /* ld4_st4_permute_cost */ -3, /* permute_cost */ +2, /* ld2_st2_permute_cost */ +3, /* ld3_st3_permute_cost */ +3, /* ld4_st4_permute_cost */ +2, /* permute_cost */ /* Theoretically, a reduction involving 15 scalar ADDs could complete in ~5 cycles and would have a cost of 15. [SU]ADDV - completes in 11 cycles, so give it a cost of 15 + 6. */ -21, /* reduc_i8_cost */ -/* Likewise for 7 scalar ADDs (~3 cycles) vs. 9: 7 + 6. */ -13, /* reduc_i16_cost */ -/* Likewise for 3 scalar ADDs (~2 cycles) vs. 8: 3 + 6. */ -9, /* reduc_i32_cost */ -/* Likewise for 1 scalar ADD (~1 cycles) vs. 2: 1 + 1. */ -2, /* reduc_i64_cost */ + completes in 9 cycles, so give it a cost of 15 + 4. */ +19, /* reduc_i8_cost */ +/* Likewise for 7 scalar ADDs (~3 cycles) vs. 8: 7 + 5. */ +12, /* reduc_i16_cost */ +/* Likewise for 3 scalar ADDs (~2 cycles) vs. 6: 3 + 4. */ +7, /* reduc_i32_cost */ +/* Likewise for 1 scalar ADDs (~1 cycles) vs. 4: 1 + 3. */ +4, /* reduc_i64_cost */ /* Theoretically, a reduction involving 7 scalar FADDs could - complete in ~8 cycles and would have a cost of 14. FADDV - completes in 6 cycles, so give it a cost of 14 - 2. */ + complete in ~8 cycles and would have a cost of 14. FADDV + completes in 6 cycles, so give it a cost of 14 + -2. */ 12, /* reduc_f16_cost */ -/* Likewise for 3 scalar FADDs (~4 cycles) vs. 4: 6 - 0. */ +/* Likewise for 3 scalar FADDs (~4 cycles) vs. 4: 6 + 0. */ 6, /* reduc_f32_cost */ -/* Likewise for 1 scalar FADD (~2 cycles) vs. 2: 2 - 0. */ +/* Likewise for 1 scalar FADD (~2 cycles) vs. 2: 2 + 0. */ 2, /* reduc_f64_cost */ 2, /* store_elt_extra_cost */ /* This value is just inherited from the Cortex-A57 table. */ @@ -127,7 +127,7 @@ static const sve_vec_cost neoversen2_sve_vector_cost = /* A strided Advanced SIMD x64 load would take two parallel FP loads (8 cycles) plus an insertion (2 cycles). Assume a 64-bit SVE gather is 1 cycle more. The Advanced SIMD version is costed as 2 scalar loads - (cost 8) and a vec_construct (cost 2). Add a full vector operation + (cost 8) and a vec_construct (cost 4). Add a full vector operation (cost 2) to that, to avoid the difference being lost in rounding. 
There is no easy comparison between a strided Advanced SIMD x32 load @@ -165,14 +165,14 @@ static const aarch64_sve_vec_issue_info neoversen2_sve_issue_info = { { { - 3, /* loads_per_cycle */ + 3, /* loads_stores_per_cycle */ 2, /* stores_per_cycle */ 2, /* general_ops_per_cycle */ 0, /* fp_simd_load_general_ops */ 1 /* fp_simd_store_general_ops */ }, 2, /* ld2_st2_general_ops */ -3, /* ld3_st3_general_ops */ +2, /* ld3_st3_general_ops */ 3 /* ld4_st4_general_ops */ }, 2, /* pred_ops_per_cycle */ @@ -190,7 +190,7 @@ static const aarch64_vec_issue_info neoversen2_vec_issue_info = &neoversen2_sve_issue_info }; -/* Neoverse N2 costs for vector insn classes. */ +/* Neoversen2 costs for vector insn classes. */ static const struct cpu_vector_cost neoversen2_vector_cost = { 1, /* scalar_int_stmt_cost */ @@ -220,7 +220,7 @@ static const struct tune_params neoversen2_tunings = 6, /* load_pred. */ 1 /* store_pred. */ }, /* memmov_cost. */ - 3, /* issue_rate */ + 5, /* issue_rate */ (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops */ "32:16", /* function_align. *
[gcc r15-2639] AArch64: Add Neoverse V3 core definition and cost model
https://gcc.gnu.org/g:729000b90300a31ef9ed405635a0be761c5e168b commit r15-2639-g729000b90300a31ef9ed405635a0be761c5e168b Author: Tamar Christina Date: Thu Aug 1 16:53:41 2024 +0100 AArch64: Add Neoverse V3 core definition and cost model This adds a cost model and core definition for Neoverse V3. It also makes Cortex-X4 use the Neoverse V3 cost model. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (cortex-x4): Update. (neoverse-v3): New. * config/aarch64/aarch64-tune.md: Regenerate. * config/aarch64/tuning_models/neoversev3.h: New file. * config/aarch64/aarch64.cc: Use it. * doc/invoke.texi: Document it. Diff: --- gcc/config/aarch64/aarch64-cores.def | 3 +- gcc/config/aarch64/aarch64-tune.md| 2 +- gcc/config/aarch64/aarch64.cc | 1 + gcc/config/aarch64/tuning_models/neoversev3.h | 246 ++ gcc/doc/invoke.texi | 1 + 5 files changed, 251 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 34307fe0c172..96c74657a199 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -188,13 +188,14 @@ AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8M AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversev2, 0x41, 0xd4e, -1) -AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversev3, 0x41, 0xd81, -1) AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1) AARCH64_CORE("cobalt-100", cobalt100, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1) AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, -1) +AARCH64_CORE("neoverse-v3", neoversev3, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE), neoversev3, 0x41, 0xd84, -1) AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md index 719fd3dc62a5..0c3339b53e42 100644 --- a/gcc/config/aarch64/aarch64-tune.md +++ b/gcc/config/aarch64/aarch64-tune.md @@ -1,5 +1,5 @@ ;; -*- buffer-read-only: t -*- ;; Generated automatically by gentune.sh from aarch64-cores.def (define_attr "tune" - 
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,demeter,generic,generic_armv8_a,generic_armv9_a" + "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,neoversev3,demeter,generic,generic_armv8_a,generic_armv9_a" (const (symbol_ref "((enum attr_tune) aarch64_tune)"))) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 9810f2c03900..f29dcf7fe173 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -410,6 +410,7 @@ static const struct
[gcc r15-2644] AArch64: Add Cortex-X925 core definition and cost model
https://gcc.gnu.org/g:1f53319cae81aea438b6c0ba55f49e5669acf1c8 commit r15-2644-g1f53319cae81aea438b6c0ba55f49e5669acf1c8 Author: Tamar Christina Date: Thu Aug 1 16:55:10 2024 +0100 AArch64: Add Cortex-X925 core definition and cost model This adds a cost model and core definition for Cortex-X925. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (cortex-x925): New. * config/aarch64/aarch64-tune.md: Regenerate. * config/aarch64/tuning_models/cortexx925.h: New file. * config/aarch64/aarch64.cc: Use it. * doc/invoke.texi: Document it. Diff: --- gcc/config/aarch64/aarch64-cores.def | 1 + gcc/config/aarch64/aarch64-tune.md| 2 +- gcc/config/aarch64/aarch64.cc | 1 + gcc/config/aarch64/tuning_models/cortexx925.h | 246 ++ gcc/doc/invoke.texi | 2 +- 5 files changed, 250 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 4d6f5a701eee..cc2260036887 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -190,6 +190,7 @@ AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8M AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversev2, 0x41, 0xd4e, -1) AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversev3, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-x925", cortexx925, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), cortexx925, 0x41, 0xd85, -1) AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1) AARCH64_CORE("cobalt-100", cobalt100, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1) diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md index d71c631b01c7..4fce0c507f6c 100644 --- a/gcc/config/aarch64/aarch64-tune.md +++ b/gcc/config/aarch64/aarch64-tune.md @@ -1,5 +1,5 @@ ;; -*- buffer-read-only: t -*- ;; Generated automatically by gentune.sh from aarch64-cores.def (define_attr "tune" - "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexa725,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversen3,neoversev2,grace,neoversev3,neoversev3ae,demeter,generic,generic_armv8_a,generic_armv9_a" + 
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexa725,cortexx2,cortexx3,cortexx4,cortexx925,neoversen2,cobalt100,neoversen3,neoversev2,grace,neoversev3,neoversev3ae,demeter,generic,generic_armv8_a,generic_armv9_a" (const (symbol_ref "((enum attr_tune) aarch64_tune)"))) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index f1a57159d471..113ebb45cfda 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -392,6 +392,7 @@ static const struct aarch64_flag_desc aarch64_tuning_flags[] = #include "tuning_models/cortexa57.h" #include "tuning_models/cortexa72.h" #include "tuning_models/cortexa73.h" +#include "tuning_models/cortexx925.h" #include "tuning_models/exynosm1.h" #include "tuning_models/thunderxt88.h" #include "tuning_models/thunderx.h" diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h new file mode 100644 index ..6cae5b7de5ca --- /dev/null +++ b/gcc/config/aarch64/tuning_models/cortexx925.h @@ -0,0 +1,246 @@
[gcc r15-2768] AArch64: take gather/scatter decode overhead into account
https://gcc.gnu.org/g:a50916a6c0a6c73c1537d033509d4f7034341f75 commit r15-2768-ga50916a6c0a6c73c1537d033509d4f7034341f75 Author: Tamar Christina Date: Tue Aug 6 22:41:10 2024 +0100 AArch64: take gather/scatter decode overhead into account Gather and scatters are not usually beneficial when the loop count is small. This is because there's not only a cost to their execution within the loop but there is also some cost to enter loops with them. As such this patch models this overhead. For generic tuning we however still prefer gathers/scatters when the loop costs work out. gcc/ChangeLog: * config/aarch64/aarch64-protos.h (struct sve_vec_cost): Add gather_load_x32_init_cost and gather_load_x64_init_cost. * config/aarch64/aarch64.cc (aarch64_vector_costs): Add m_sve_gather_scatter_init_cost. (aarch64_vector_costs::add_stmt_cost): Use them. (aarch64_vector_costs::finish_cost): Likewise. * config/aarch64/tuning_models/a64fx.h: Update. * config/aarch64/tuning_models/cortexx925.h: Update. * config/aarch64/tuning_models/generic.h: Update. * config/aarch64/tuning_models/generic_armv8_a.h: Update. * config/aarch64/tuning_models/generic_armv9_a.h: Update. * config/aarch64/tuning_models/neoverse512tvb.h: Update. * config/aarch64/tuning_models/neoversen2.h: Update. * config/aarch64/tuning_models/neoversen3.h: Update. * config/aarch64/tuning_models/neoversev1.h: Update. * config/aarch64/tuning_models/neoversev2.h: Update. * config/aarch64/tuning_models/neoversev3.h: Update. * config/aarch64/tuning_models/neoversev3ae.h: Update. Diff: --- gcc/config/aarch64/aarch64-protos.h| 10 + gcc/config/aarch64/aarch64.cc | 26 ++ gcc/config/aarch64/tuning_models/a64fx.h | 2 ++ gcc/config/aarch64/tuning_models/cortexx925.h | 2 ++ gcc/config/aarch64/tuning_models/generic.h | 2 ++ gcc/config/aarch64/tuning_models/generic_armv8_a.h | 2 ++ gcc/config/aarch64/tuning_models/generic_armv9_a.h | 2 ++ gcc/config/aarch64/tuning_models/neoverse512tvb.h | 2 ++ gcc/config/aarch64/tuning_models/neoversen2.h | 2 ++ gcc/config/aarch64/tuning_models/neoversen3.h | 2 ++ gcc/config/aarch64/tuning_models/neoversev1.h | 2 ++ gcc/config/aarch64/tuning_models/neoversev2.h | 2 ++ gcc/config/aarch64/tuning_models/neoversev3.h | 2 ++ gcc/config/aarch64/tuning_models/neoversev3ae.h| 2 ++ 14 files changed, 60 insertions(+) diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h index f64afe288901..44b881b5c57a 100644 --- a/gcc/config/aarch64/aarch64-protos.h +++ b/gcc/config/aarch64/aarch64-protos.h @@ -262,6 +262,8 @@ struct sve_vec_cost : simd_vec_cost unsigned int fadda_f64_cost, unsigned int gather_load_x32_cost, unsigned int gather_load_x64_cost, + unsigned int gather_load_x32_init_cost, + unsigned int gather_load_x64_init_cost, unsigned int scatter_store_elt_cost) : simd_vec_cost (base), clast_cost (clast_cost), @@ -270,6 +272,8 @@ struct sve_vec_cost : simd_vec_cost fadda_f64_cost (fadda_f64_cost), gather_load_x32_cost (gather_load_x32_cost), gather_load_x64_cost (gather_load_x64_cost), + gather_load_x32_init_cost (gather_load_x32_init_cost), + gather_load_x64_init_cost (gather_load_x64_init_cost), scatter_store_elt_cost (scatter_store_elt_cost) {} @@ -289,6 +293,12 @@ struct sve_vec_cost : simd_vec_cost const int gather_load_x32_cost; const int gather_load_x64_cost; + /* Additional loop initialization cost of using a gather load instruction. The x32 + value is for loads of 32-bit elements and the x64 value is for loads of + 64-bit elements. 
*/ + const int gather_load_x32_init_cost; + const int gather_load_x64_init_cost; + /* The per-element cost of a scatter store. */ const int scatter_store_elt_cost; }; diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 9e12bd9711cd..2ac5a22c848e 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -16231,6 +16231,10 @@ private: supported by Advanced SIMD and SVE2. */ bool m_has_avg = false; + /* Additional initialization costs for using gather or scatter operation in + the current loop. */ + unsigned int m_sve_gather_scatter_init_cost = 0; + /* True if the vector body contains a store to a decl and if the function is known to have a vld1 from the same decl. @@ -17295,6 +17299,23 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
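A minimal sketch of the modelling, assuming a stripped-down cost record (the two field names match the new sve_vec_cost members in the diff, but the helper below is hypothetical rather than GCC's actual m_sve_gather_scatter_init_cost bookkeeping): the decode overhead is charged once against loop entry instead of per vector iteration, so loops with small trip counts stop looking artificially cheap when they use gathers or scatters.

struct sve_gather_costs
{
  unsigned gather_load_x32_init_cost;  // one-off cost, 32-bit element gathers
  unsigned gather_load_x64_init_cost;  // one-off cost, 64-bit element gathers
};

// Charged against the loop prologue, not multiplied by the iteration count.
unsigned
loop_entry_penalty (const sve_gather_costs &c, bool uses_x32_gather,
                    bool uses_x64_gather)
{
  unsigned penalty = 0;
  if (uses_x32_gather)
    penalty += c.gather_load_x32_init_cost;
  if (uses_x64_gather)
    penalty += c.gather_load_x64_init_cost;
  return penalty;
}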
[gcc r15-2839] AArch64: Fix signbit mask creation after late combine [PR116229]
https://gcc.gnu.org/g:2c24e0568392e51a77ebdaab629d631969ce8966 commit r15-2839-g2c24e0568392e51a77ebdaab629d631969ce8966 Author: Tamar Christina Date: Thu Aug 8 18:51:30 2024 +0100 AArch64: Fix signbit mask creation after late combine [PR116229] The optimization to generate a Di signbit constant by using fneg was relying on nothing being able to push the constant into the negate. It's run quite late for this reason. However late combine now runs after it and triggers RTL simplification based on the neg. When -fno-signed-zeros this ends up dropping the - from the -0.0 and thus producing incorrect code. This change adds a new unspec FNEG on DI mode which prevents this simplication. gcc/ChangeLog: PR target/116229 * config/aarch64/aarch64-simd.md (aarch64_fnegv2di2): New. * config/aarch64/aarch64.cc (aarch64_maybe_generate_simd_constant): Update call to gen_aarch64_fnegv2di2. * config/aarch64/iterators.md: New UNSPEC_FNEG. gcc/testsuite/ChangeLog: PR target/116229 * gcc.target/aarch64/pr116229.c: New test. Diff: --- gcc/config/aarch64/aarch64-simd.md | 9 + gcc/config/aarch64/aarch64.cc | 4 ++-- gcc/config/aarch64/iterators.md | 1 + gcc/testsuite/gcc.target/aarch64/pr116229.c | 20 4 files changed, 32 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 816f499e9634..cc612ec2ca0e 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -2629,6 +2629,15 @@ [(set_attr "type" "neon_fp_neg_")] ) +(define_insn "aarch64_fnegv2di2" + [(set (match_operand:V2DI 0 "register_operand" "=w") + (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "w")] + UNSPEC_FNEG))] + "TARGET_SIMD" + "fneg\\t%0.2d, %1.2d" + [(set_attr "type" "neon_fp_neg_d")] +) + (define_insn "abs2" [(set (match_operand:VHSDF 0 "register_operand" "=w") (abs:VHSDF (match_operand:VHSDF 1 "register_operand" "w")))] diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 2ac5a22c848e..bfd7bcdef7cb 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -11808,8 +11808,8 @@ aarch64_maybe_generate_simd_constant (rtx target, rtx val, machine_mode mode) /* Use the same base type as aarch64_gen_shareable_zero. */ rtx zero = CONST0_RTX (V4SImode); emit_move_insn (lowpart_subreg (V4SImode, target, mode), zero); - rtx neg = lowpart_subreg (V2DFmode, target, mode); - emit_insn (gen_negv2df2 (neg, copy_rtx (neg))); + rtx neg = lowpart_subreg (V2DImode, target, mode); + emit_insn (gen_aarch64_fnegv2di2 (neg, copy_rtx (neg))); return true; } diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md index aaa4afefe2ce..20a318e023b6 100644 --- a/gcc/config/aarch64/iterators.md +++ b/gcc/config/aarch64/iterators.md @@ -689,6 +689,7 @@ UNSPEC_FMINNMV ; Used in aarch64-simd.md. UNSPEC_FMINV ; Used in aarch64-simd.md. UNSPEC_FADDV ; Used in aarch64-simd.md. +UNSPEC_FNEG; Used in aarch64-simd.md. UNSPEC_ADDV; Used in aarch64-simd.md. UNSPEC_SMAXV ; Used in aarch64-simd.md. UNSPEC_SMINV ; Used in aarch64-simd.md. 
diff --git a/gcc/testsuite/gcc.target/aarch64/pr116229.c b/gcc/testsuite/gcc.target/aarch64/pr116229.c new file mode 100644 index ..cc42078478f7 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr116229.c @@ -0,0 +1,20 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -fno-signed-zeros" } */ + +typedef __attribute__((__vector_size__ (8))) unsigned long V; + +V __attribute__((__noipa__)) +foo (void) +{ + return (V){ 0x8000 }; +} + +V ref = (V){ 0x8000 }; + +int +main () +{ + V v = foo (); + if (v[0] != ref[0]) +__builtin_abort(); +}
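A standalone illustration (plain C++, not the GCC expansion) of the bit trick the new UNSPEC pattern protects: negating +0.0 flips only the sign bit, so an FNEG of a zeroed 64-bit lane materialises the 0x8000000000000000 mask. Under -fno-signed-zeros a compiler is allowed to treat -0.0 as 0.0, which is precisely the simplification late combine applied to the old arithmetic neg and that the unspec now blocks.

#include <cstdint>
#include <cstdio>
#include <cstring>

int main ()
{
  double zero = 0.0;
  double neg = -zero;                      // what FNEG computes per lane
  std::uint64_t bits;
  std::memcpy (&bits, &neg, sizeof bits);  // reinterpret the lane as bits
  std::printf ("%#llx\n", (unsigned long long) bits);  // 0x8000000000000000
  return 0;
}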
[gcc r15-1038] AArch64: convert several predicate patterns to new compact syntax
https://gcc.gnu.org/g:fd4898891ae0c73d6b7aa433cd1ef4539aaa2457 commit r15-1038-gfd4898891ae0c73d6b7aa433cd1ef4539aaa2457 Author: Tamar Christina Date: Wed Jun 5 19:30:39 2024 +0100 AArch64: convert several predicate patterns to new compact syntax This converts the single alternative patterns to the new compact syntax such that when I add the new alternatives it's clearer what's being changed. Note that this will spew out a bunch of warnings from geninsn as it'll warn that @ is useless for a single alternative pattern. These are not fatal so won't break the build and are only temporary. No change in functionality is expected with this patch. gcc/ChangeLog: * config/aarch64/aarch64-sve.md (and3, @aarch64_pred__z, *3_cc, *3_ptest, aarch64_pred__z, *3_cc, *3_ptest, aarch64_pred__z, *3_cc, *3_ptest, *cmp_ptest, @aarch64_pred_cmp_wide, *aarch64_pred_cmp_wide_cc, *aarch64_pred_cmp_wide_ptest, *aarch64_brk_cc, *aarch64_brk_ptest, @aarch64_brk, *aarch64_brk_cc, *aarch64_brk_ptest, aarch64_rdffr_z, *aarch64_rdffr_z_ptest, *aarch64_rdffr_ptest, *aarch64_rdffr_z_cc, *aarch64_rdffr_cc): Convert to compact syntax. * config/aarch64/aarch64-sve2.md (@aarch64_pred_): Likewise. Diff: --- gcc/config/aarch64/aarch64-sve.md | 262 ++--- gcc/config/aarch64/aarch64-sve2.md | 12 +- 2 files changed, 161 insertions(+), 113 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index 0434358122d..ca4d435e705 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -1156,76 +1156,86 @@ ;; Likewise with zero predication. (define_insn "aarch64_rdffr_z" - [(set (match_operand:VNx16BI 0 "register_operand" "=Upa") + [(set (match_operand:VNx16BI 0 "register_operand") (and:VNx16BI (reg:VNx16BI FFRT_REGNUM) - (match_operand:VNx16BI 1 "register_operand" "Upa")))] + (match_operand:VNx16BI 1 "register_operand")))] "TARGET_SVE && TARGET_NON_STREAMING" - "rdffr\t%0.b, %1/z" + {@ [ cons: =0, 1 ] + [ Upa , Upa ] rdffr\t%0.b, %1/z + } ) ;; Read the FFR to test for a fault, without using the predicate result. (define_insn "*aarch64_rdffr_z_ptest" [(set (reg:CC_NZC CC_REGNUM) (unspec:CC_NZC - [(match_operand:VNx16BI 1 "register_operand" "Upa") + [(match_operand:VNx16BI 1 "register_operand") (match_dup 1) (match_operand:SI 2 "aarch64_sve_ptrue_flag") (and:VNx16BI (reg:VNx16BI FFRT_REGNUM) (match_dup 1))] UNSPEC_PTEST)) - (clobber (match_scratch:VNx16BI 0 "=Upa"))] + (clobber (match_scratch:VNx16BI 0))] "TARGET_SVE && TARGET_NON_STREAMING" - "rdffrs\t%0.b, %1/z" + {@ [ cons: =0, 1 ] + [ Upa , Upa ] rdffrs\t%0.b, %1/z + } ) ;; Same for unpredicated RDFFR when tested with a known PTRUE. (define_insn "*aarch64_rdffr_ptest" [(set (reg:CC_NZC CC_REGNUM) (unspec:CC_NZC - [(match_operand:VNx16BI 1 "register_operand" "Upa") + [(match_operand:VNx16BI 1 "register_operand") (match_dup 1) (const_int SVE_KNOWN_PTRUE) (reg:VNx16BI FFRT_REGNUM)] UNSPEC_PTEST)) - (clobber (match_scratch:VNx16BI 0 "=Upa"))] + (clobber (match_scratch:VNx16BI 0))] "TARGET_SVE && TARGET_NON_STREAMING" - "rdffrs\t%0.b, %1/z" + {@ [ cons: =0, 1 ] + [ Upa , Upa ] rdffrs\t%0.b, %1/z + } ) ;; Read the FFR with zero predication and test the result. 
(define_insn "*aarch64_rdffr_z_cc" [(set (reg:CC_NZC CC_REGNUM) (unspec:CC_NZC - [(match_operand:VNx16BI 1 "register_operand" "Upa") + [(match_operand:VNx16BI 1 "register_operand") (match_dup 1) (match_operand:SI 2 "aarch64_sve_ptrue_flag") (and:VNx16BI (reg:VNx16BI FFRT_REGNUM) (match_dup 1))] UNSPEC_PTEST)) - (set (match_operand:VNx16BI 0 "register_operand" "=Upa") + (set (match_operand:VNx16BI 0 "register_operand") (and:VNx16BI (reg:VNx16BI FFRT_REGNUM) (match_dup 1)))] "TARGET_SVE && TARGET_NON_STREAMING" - "rdffrs\t%0.b, %1/z" + {@ [ cons: =0, 1 ] + [ Upa , Upa ] rdffrs\t%0.b, %1/z + } ) ;; Same for unpredicated RDFFR when tested with a known PTRUE. (define_insn "*aarch64_rdffr_cc" [(set (reg:CC_NZC CC_REGNUM) (unspec:CC_NZC - [(match_operand:VNx16BI 1 "register_operand" "Upa") + [(match_operand:VNx16BI 1 "register_operand") (match_dup 1) (const_int SVE_KNOWN_PTRUE) (reg:VNx16BI FFRT_REGNUM)] UNSPEC_PTEST)) - (set (match_operand:VNx16BI 0 "register_operand" "=Upa") + (set (match_operand:VNx16BI 0 "registe
[gcc r15-1039] AArch64: add new tuning param and attribute for enabling conditional early clobber
https://gcc.gnu.org/g:35f17c680ca650f8658994f857358e5a529c0b93 commit r15-1039-g35f17c680ca650f8658994f857358e5a529c0b93 Author: Tamar Christina Date: Wed Jun 5 19:31:11 2024 +0100 AArch64: add new tuning param and attribute for enabling conditional early clobber This adds a new tuning parameter AARCH64_EXTRA_TUNE_AVOID_PRED_RMW for AArch64 to allow us to conditionally enable the early clobber alternatives based on the tuning models. gcc/ChangeLog: * config/aarch64/aarch64-tuning-flags.def (AVOID_PRED_RMW): New. * config/aarch64/aarch64.h (TARGET_SVE_PRED_CLOBBER): New. * config/aarch64/aarch64.md (pred_clobber): New. (arch_enabled): Use it. Diff: --- gcc/config/aarch64/aarch64-tuning-flags.def | 4 gcc/config/aarch64/aarch64.h| 5 + gcc/config/aarch64/aarch64.md | 18 -- 3 files changed, 25 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def index d5bcaebce77..a9f48f5d3d4 100644 --- a/gcc/config/aarch64/aarch64-tuning-flags.def +++ b/gcc/config/aarch64/aarch64-tuning-flags.def @@ -48,4 +48,8 @@ AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA) AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma", FULLY_PIPELINED_FMA) +/* Enable is the target prefers to use a fresh register for predicate outputs + rather than re-use an input predicate register. */ +AARCH64_EXTRA_TUNING_OPTION ("avoid_pred_rmw", AVOID_PRED_RMW) + #undef AARCH64_EXTRA_TUNING_OPTION diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index bbf11faaf4b..0997b82dbc0 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -495,6 +495,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = AARCH64_FL_SM_OFF; enabled through +gcs. */ #define TARGET_GCS (AARCH64_ISA_GCS) +/* Prefer different predicate registers for the output of a predicated + operation over re-using an existing input predicate. */ +#define TARGET_SVE_PRED_CLOBBER (TARGET_SVE \ +&& (aarch64_tune_params.extra_tuning_flags \ +& AARCH64_EXTRA_TUNE_AVOID_PRED_RMW)) /* Standard register usage. */ diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 9dff2d7a2b0..389a1906e23 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -445,6 +445,10 @@ ;; target-independent code. (define_attr "is_call" "no,yes" (const_string "no")) +;; Indicates whether we want to enable the pattern with an optional early +;; clobber for SVE predicates. +(define_attr "pred_clobber" "any,no,yes" (const_string "any")) + ;; [For compatibility with Arm in pipeline models] ;; Attribute that specifies whether or not the instruction touches fp ;; registers. @@ -460,7 +464,17 @@ (define_attr "arch_enabled" "no,yes" (if_then_else -(ior +(and + (ior + (and + (eq_attr "pred_clobber" "no") + (match_test "!TARGET_SVE_PRED_CLOBBER")) + (and + (eq_attr "pred_clobber" "yes") + (match_test "TARGET_SVE_PRED_CLOBBER")) + (eq_attr "pred_clobber" "any")) + + (ior (eq_attr "arch" "any") (and (eq_attr "arch" "rcpc8_4") @@ -488,7 +502,7 @@ (match_test "TARGET_SVE")) (and (eq_attr "arch" "sme") -(match_test "TARGET_SME"))) +(match_test "TARGET_SME" (const_string "yes") (const_string "no")))
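Restating the arch_enabled gating added above as a tiny self-checking truth table (the enum and helper are mine and mirror the RTL attribute logic rather than reproduce it; the result is additionally ANDed with the pre-existing arch tests): an alternative tagged pred_clobber "yes" is only live when the tuning flag is active, "no" only when it is not, and "any" always.

enum class pred_clobber { any, no, yes };

constexpr bool
alternative_enabled (pred_clobber attr, bool target_sve_pred_clobber)
{
  return attr == pred_clobber::any
         || (attr == pred_clobber::yes && target_sve_pred_clobber)
         || (attr == pred_clobber::no && !target_sve_pred_clobber);
}

static_assert (alternative_enabled (pred_clobber::yes, true), "tuned cores");
static_assert (!alternative_enabled (pred_clobber::yes, false), "generic tuning");
static_assert (alternative_enabled (pred_clobber::any, false), "untagged patterns");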
[gcc r15-1040] AArch64: add new alternative with early clobber to patterns
https://gcc.gnu.org/g:2de3bbde1ebea8689f3596967769f66bf903458e commit r15-1040-g2de3bbde1ebea8689f3596967769f66bf903458e Author: Tamar Christina Date: Wed Jun 5 19:31:39 2024 +0100 AArch64: add new alternative with early clobber to patterns This patch adds new alternatives to the patterns which are affected. The new alternatives with the conditional early clobbers are added before the normal ones in order for LRA to prefer them in the event that we have enough free registers to accommodate them. In case register pressure is too high the normal alternatives will be preferred before a reload is considered as we rather have the tie than a spill. Tests are in the next patch. gcc/ChangeLog: * config/aarch64/aarch64-sve.md (and3, @aarch64_pred__z, *3_cc, *3_ptest, aarch64_pred__z, *3_cc, *3_ptest, aarch64_pred__z, *3_cc, *3_ptest, @aarch64_pred_cmp, *cmp_cc, *cmp_ptest, @aarch64_pred_cmp_wide, *aarch64_pred_cmp_wide_cc, *aarch64_pred_cmp_wide_ptest, @aarch64_brk, *aarch64_brk_cc, *aarch64_brk_ptest, @aarch64_brk, *aarch64_brk_cc, *aarch64_brk_ptest, aarch64_rdffr_z, *aarch64_rdffr_z_ptest, *aarch64_rdffr_ptest, *aarch64_rdffr_z_cc, *aarch64_rdffr_cc): Add new early clobber alternative. * config/aarch64/aarch64-sve2.md (@aarch64_pred_): Likewise. Diff: --- gcc/config/aarch64/aarch64-sve.md | 178 + gcc/config/aarch64/aarch64-sve2.md | 6 +- 2 files changed, 124 insertions(+), 60 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index ca4d435e705..d902bce62fd 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -1161,8 +1161,10 @@ (reg:VNx16BI FFRT_REGNUM) (match_operand:VNx16BI 1 "register_operand")))] "TARGET_SVE && TARGET_NON_STREAMING" - {@ [ cons: =0, 1 ] - [ Upa , Upa ] rdffr\t%0.b, %1/z + {@ [ cons: =0, 1 ; attrs: pred_clobber ] + [ &Upa, Upa ; yes ] rdffr\t%0.b, %1/z + [ ?Upa, 0Upa; yes ] ^ + [ Upa , Upa ; no ] ^ } ) @@ -1179,8 +1181,10 @@ UNSPEC_PTEST)) (clobber (match_scratch:VNx16BI 0))] "TARGET_SVE && TARGET_NON_STREAMING" - {@ [ cons: =0, 1 ] - [ Upa , Upa ] rdffrs\t%0.b, %1/z + {@ [ cons: =0, 1 ; attrs: pred_clobber ] + [ &Upa, Upa ; yes ] rdffrs\t%0.b, %1/z + [ ?Upa, 0Upa; yes ] ^ + [ Upa , Upa ; no ] ^ } ) @@ -1195,8 +1199,10 @@ UNSPEC_PTEST)) (clobber (match_scratch:VNx16BI 0))] "TARGET_SVE && TARGET_NON_STREAMING" - {@ [ cons: =0, 1 ] - [ Upa , Upa ] rdffrs\t%0.b, %1/z + {@ [ cons: =0, 1 ; attrs: pred_clobber ] + [ &Upa, Upa ; yes ] rdffrs\t%0.b, %1/z + [ ?Upa, 0Upa; yes ] ^ + [ Upa , Upa ; no ] ^ } ) @@ -1216,8 +1222,10 @@ (reg:VNx16BI FFRT_REGNUM) (match_dup 1)))] "TARGET_SVE && TARGET_NON_STREAMING" - {@ [ cons: =0, 1 ] - [ Upa , Upa ] rdffrs\t%0.b, %1/z + {@ [ cons: =0, 1 ; attrs: pred_clobber ] + [ &Upa, Upa ; yes ] rdffrs\t%0.b, %1/z + [ ?Upa, 0Upa; yes ] ^ + [ Upa , Upa ; no ] ^ } ) @@ -1233,8 +1241,10 @@ (set (match_operand:VNx16BI 0 "register_operand") (reg:VNx16BI FFRT_REGNUM))] "TARGET_SVE && TARGET_NON_STREAMING" - {@ [ cons: =0, 1 ] - [ Upa , Upa ] rdffrs\t%0.b, %1/z + {@ [ cons: =0, 1 ; attrs: pred_clobber ] + [ &Upa, Upa ; yes ] rdffrs\t%0.b, %1/z + [ ?Upa, 0Upa; yes ] ^ + [ Upa , Upa ; no ] ^ } ) @@ -6651,8 +6661,10 @@ (and:PRED_ALL (match_operand:PRED_ALL 1 "register_operand") (match_operand:PRED_ALL 2 "register_operand")))] "TARGET_SVE" - {@ [ cons: =0, 1 , 2 ] - [ Upa , Upa, Upa ] and\t%0.b, %1/z, %2.b, %2.b + {@ [ cons: =0, 1 , 2 ; attrs: pred_clobber ] + [ &Upa, Upa , Upa ; yes ] and\t%0.b, %1/z, %2.b, %2.b + [ ?Upa, 0Upa, 0Upa; yes ] ^ + [ Upa , Upa , Upa ; no ] ^ } ) @@ -6679,8 
+6691,10 @@ (match_operand:PRED_ALL 3 "register_operand")) (match_operand:PRED_ALL 1 "register_operand")))] "TARGET_SVE" - {@ [ cons: =0, 1 , 2 , 3 ] - [ Upa , Upa, Upa, Upa ] \t%0.b, %1/z, %2.b, %3.b + {@ [ cons: =0, 1 , 2 , 3 ; attrs: pred_clobber ] + [ &Upa, Upa , Upa , Upa ; yes ] \t%0.b, %1/z, %2.b, %3.b + [ ?Upa, 0Upa, 0Upa, 0Upa; yes
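For readers less used to the constraint shorthand in these new alternatives: the '&' in the first alternative marks the output as earlyclobber, so it may not share a register with any input; the '?' in the second alternative mildly disparages it, and the '0' in its input constraints ties that input to operand 0, i.e. re-uses the output register; the final alternative is the original form, kept for the non-tuned case. Listing them in that order lets LRA take the earlyclobber form when enough predicate registers are free and fall back to the tie before it would ever consider a reload, which is the trade-off described in the commit message above.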
[gcc r15-1041] AArch64: enable new predicate tuning for Neoverse cores.
https://gcc.gnu.org/g:3eb9f6eab9802d5ae65ead6b1f2ae6fe0833e06e commit r15-1041-g3eb9f6eab9802d5ae65ead6b1f2ae6fe0833e06e Author: Tamar Christina Date: Wed Jun 5 19:32:16 2024 +0100 AArch64: enable new predicate tuning for Neoverse cores. This enables the new tuning flag for Neoverse V1, Neoverse V2 and Neoverse N2. It is kept off for generic codegen. Note the reason for the +sve even though they are in aarch64-sve.exp is if the testsuite is ran with a forced SVE off option, e.g. -march=armv8-a+nosve then the intrinsics end up being disabled because the -march is preferred over the -mcpu even though the -mcpu comes later. This prevents the tests from failing in such runs. gcc/ChangeLog: * config/aarch64/tuning_models/neoversen2.h (neoversen2_tunings): Add AARCH64_EXTRA_TUNE_AVOID_PRED_RMW. * config/aarch64/tuning_models/neoversev1.h (neoversev1_tunings): Add AARCH64_EXTRA_TUNE_AVOID_PRED_RMW. * config/aarch64/tuning_models/neoversev2.h (neoversev2_tunings): Add AARCH64_EXTRA_TUNE_AVOID_PRED_RMW. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/pred_clobber_1.c: New test. * gcc.target/aarch64/sve/pred_clobber_2.c: New test. * gcc.target/aarch64/sve/pred_clobber_3.c: New test. * gcc.target/aarch64/sve/pred_clobber_4.c: New test. Diff: --- gcc/config/aarch64/tuning_models/neoversen2.h | 3 ++- gcc/config/aarch64/tuning_models/neoversev1.h | 3 ++- gcc/config/aarch64/tuning_models/neoversev2.h | 3 ++- .../gcc.target/aarch64/sve/pred_clobber_1.c| 22 + .../gcc.target/aarch64/sve/pred_clobber_2.c| 22 + .../gcc.target/aarch64/sve/pred_clobber_3.c| 23 ++ .../gcc.target/aarch64/sve/pred_clobber_4.c| 22 + 7 files changed, 95 insertions(+), 3 deletions(-) diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h index 7e799bbe762..be9a48ac3ad 100644 --- a/gcc/config/aarch64/tuning_models/neoversen2.h +++ b/gcc/config/aarch64/tuning_models/neoversen2.h @@ -236,7 +236,8 @@ static const struct tune_params neoversen2_tunings = (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS - | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),/* tune_flags. */ + | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT + | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ &generic_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */ diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h b/gcc/config/aarch64/tuning_models/neoversev1.h index 9363f2ad98a..0fc41ce6a41 100644 --- a/gcc/config/aarch64/tuning_models/neoversev1.h +++ b/gcc/config/aarch64/tuning_models/neoversev1.h @@ -227,7 +227,8 @@ static const struct tune_params neoversev1_tunings = (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT - | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags. */ + | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND + | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ &generic_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS/* stp_policy_model. 
*/ diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h index bc01ed767c9..f76e4ef358f 100644 --- a/gcc/config/aarch64/tuning_models/neoversev2.h +++ b/gcc/config/aarch64/tuning_models/neoversev2.h @@ -236,7 +236,8 @@ static const struct tune_params neoversev2_tunings = (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS - | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),/* tune_flags. */ + | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT + | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ &generic_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c new file mode 100644 index 000..25129e8d6f2 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=neoverse-n2" } */ +/* { dg-final { check-function-bodies "**" "" } } */ + +#pragma GCC target "+sve" + +#include + +extern void use(svbool_t); + +/* +** foo: +** ... +** ptrue p([1-3]).b, all +** cmplo p0.h, p\1/z, z0.h, z[0-9]+.h +** ...
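For reference, a minimal compilable sketch of the kind of code the new tuning flag affects (this is my own illustration, not taken from the commit; consume() is a hypothetical external used only so the predicate result stays live, and it must be built with SVE enabled):

#include <arm_sve.h>

extern void consume (svbool_t);

/* An unsigned compare producing a predicate result; with
   -mcpu=neoverse-n2 and AARCH64_EXTRA_TUNE_AVOID_PRED_RMW the compiler
   is expected to prefer writing the result to a fresh predicate
   register rather than read-modify-writing the governing one.  */
void
demo (svuint16_t a, svuint16_t b)
{
  svbool_t pg = svptrue_b16 ();
  consume (svcmplt_u16 (pg, a, b));
}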
[gcc r15-1071] AArch64: correct constraint on Upl early clobber alternatives
https://gcc.gnu.org/g:afe85f8e22a703280b17c701f3490d89337f674a commit r15-1071-gafe85f8e22a703280b17c701f3490d89337f674a Author: Tamar Christina Date: Thu Jun 6 14:35:48 2024 +0100 AArch64: correct constraint on Upl early clobber alternatives I made an oversight in the previous patch, where I added a ?Upa alternative to the Upl cases. This causes it to create the tie between the larger register file rather than the constrained one. This fixes the affected patterns. gcc/ChangeLog: * config/aarch64/aarch64-sve.md (@aarch64_pred_cmp, *cmp_cc, *cmp_ptest, @aarch64_pred_cmp_wide, *aarch64_pred_cmp_wide_cc, *aarch64_pred_cmp_wide_ptest): Fix Upl tie alternative. * config/aarch64/aarch64-sve2.md (@aarch64_pred_): Fix Upl tie alternative. Diff: --- gcc/config/aarch64/aarch64-sve.md | 64 +++--- gcc/config/aarch64/aarch64-sve2.md | 2 +- 2 files changed, 33 insertions(+), 33 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index d902bce62fd..d69db34016a 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -8134,13 +8134,13 @@ UNSPEC_PRED_Z)) (clobber (reg:CC_NZC CC_REGNUM))] "TARGET_SVE" - {@ [ cons: =0 , 1 , 3 , 4; attrs: pred_clobber ] - [ &Upa , Upl , w , ; yes ] cmp\t%0., %1/z, %3., #%4 - [ ?Upa , 0Upl, w , ; yes ] ^ - [ Upa , Upl , w , ; no ] ^ - [ &Upa , Upl , w , w; yes ] cmp\t%0., %1/z, %3., %4. - [ ?Upa , 0Upl, w , w; yes ] ^ - [ Upa , Upl , w , w; no ] ^ + {@ [ cons: =0 , 1 , 3 , 4; attrs: pred_clobber ] + [ &Upa , Upl, w , ; yes ] cmp\t%0., %1/z, %3., #%4 + [ ?Upl , 0 , w , ; yes ] ^ + [ Upa , Upl, w , ; no ] ^ + [ &Upa , Upl, w , w; yes ] cmp\t%0., %1/z, %3., %4. + [ ?Upl , 0 , w , w; yes ] ^ + [ Upa , Upl, w , w; no ] ^ } ) @@ -8170,13 +8170,13 @@ UNSPEC_PRED_Z))] "TARGET_SVE && aarch64_sve_same_pred_for_ptest_p (&operands[4], &operands[6])" - {@ [ cons: =0 , 1, 2 , 3; attrs: pred_clobber ] - [ &Upa , Upl , w , ; yes ] cmp\t%0., %1/z, %2., #%3 - [ ?Upa , 0Upl, w , ; yes ] ^ - [ Upa , Upl , w , ; no ] ^ - [ &Upa , Upl , w , w; yes ] cmp\t%0., %1/z, %2., %3. - [ ?Upa , 0Upl, w , w; yes ] ^ - [ Upa , Upl , w , w; no ] ^ + {@ [ cons: =0 , 1 , 2 , 3; attrs: pred_clobber ] + [ &Upa , Upl, w , ; yes ] cmp\t%0., %1/z, %2., #%3 + [ ?Upl , 0 , w , ; yes ] ^ + [ Upa , Upl, w , ; no ] ^ + [ &Upa , Upl, w , w; yes ] cmp\t%0., %1/z, %2., %3. + [ ?Upl , 0 , w , w; yes ] ^ + [ Upa , Upl, w , w; no ] ^ } "&& !rtx_equal_p (operands[4], operands[6])" { @@ -8205,12 +8205,12 @@ "TARGET_SVE && aarch64_sve_same_pred_for_ptest_p (&operands[4], &operands[6])" {@ [ cons: =0, 1, 2 , 3; attrs: pred_clobber ] - [ &Upa, Upl , w , ; yes ] cmp\t%0., %1/z, %2., #%3 - [ ?Upa, 0Upl, w , ; yes ] ^ - [ Upa , Upl , w , ; no ] ^ - [ &Upa, Upl , w , w; yes ] cmp\t%0., %1/z, %2., %3. - [ ?Upa, 0Upl, w , w; yes ] ^ - [ Upa , Upl , w , w; no ] ^ + [ &Upa, Upl, w , ; yes ] cmp\t%0., %1/z, %2., #%3 + [ ?Upl, 0 , w , ; yes ] ^ + [ Upa , Upl, w , ; no ] ^ + [ &Upa, Upl, w , w; yes ] cmp\t%0., %1/z, %2., %3. + [ ?Upl, 0 , w , w; yes ] ^ + [ Upa , Upl, w , w; no ] ^ } "&& !rtx_equal_p (operands[4], operands[6])" { @@ -8263,10 +8263,10 @@ UNSPEC_PRED_Z)) (clobber (reg:CC_NZC CC_REGNUM))] "TARGET_SVE" - {@ [ cons: =0, 1, 2, 3, 4; attrs: pred_clobber ] - [ &Upa, Upl , , w, w; yes ] cmp\t%0., %1/z, %3., %4.d - [ ?Upa, 0Upl, , w, w; yes ] ^ - [ Upa , Upl , , w, w; no ] ^ + {@ [ cons: =0, 1 , 2, 3, 4; attrs: pred_clobber ] + [ &Upa
[gcc r15-4324] middle-end: support SLP early break
https://gcc.gnu.org/g:accb85345edb91368221fd07b74e74df427b7de0 commit r15-4324-gaccb85345edb91368221fd07b74e74df427b7de0 Author: Tamar Christina Date: Mon Oct 14 11:58:59 2024 +0100 middle-end: support SLP early break This patch introduces feature parity for early break int the SLP only vectorizer. The approach taken here is to treat the early exits as root statements for an SLP tree. This means that we don't need any changes to build_slp to support gconds. Codegen for the gcond itself now has to be done out of line but the body of the SLP blocks itself is simply driven by SLP scheduling. There is a slight awkwardness in having re-used vectorizable_early_exit for both SLP and non-SLP but I've documented the differences and when I did try to refactor it it wasn't really worth it given that this is a temporary state anyway. This version is restricted to lane = 1, as such we can re-use the existing move_early_break function instead of having to do safety update through scheduling. I have a branch where I'm working on that but lane > 1 is out of scope for GCC 15 anyway. The only reason I will try to get moving through scheduling done as a stretch goal is so we get epilogue vectorization back for early break. The example: unsigned test4(unsigned x) { unsigned ret = 0; for (int i = 0; i < N; i++) { vect_b[i] = x + i; if (vect_a[i]*2 != x) break; vect_a[i] = x; } return ret; } builds the following SLP instance for early break: note: Analyzing vectorizable control flow: if (patt_6 != 0) note: Starting SLP discovery for note: patt_6 = _4 != x_9(D); note: starting SLP discovery for node 0x63abc80 note: Build SLP for patt_6 = _4 != x_9(D); note: precomputed vectype: vector(4) note: nunits = 4 note: vect_is_simple_use: operand x_9(D), type of def: external note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0x _3 * 2, type of def: internal note: starting SLP discovery for node 0x63abdc0 note: Build SLP for _4 = _3 * 2; note: precomputed vectype: vector(4) unsigned int note: nunits = 4 note: vect_is_simple_use: operand # vect_aD.4416[i_15], type of def: internal note: vect_is_simple_use: operand 2, type of def: constant note: starting SLP discovery for node 0x63abe60 note: Build SLP for _3 = vect_a[i_15]; note: precomputed vectype: vector(4) unsigned int note: nunits = 4 note: SLP discovery for node 0x63abe60 succeeded note: SLP discovery for node 0x63abdc0 succeeded note: SLP discovery for node 0x63abc80 succeeded note: SLP size 3 vs. limit 10. note: Final SLP tree for instance 0x6474190: note: node 0x63abc80 (max_nunits=4, refcnt=2) vector(4) note: op template: patt_6 = _4 != x_9(D); note: stmt 0 patt_6 = _4 != x_9(D); note: children 0x63abd20 0x63abdc0 note: node (external) 0x63abd20 (max_nunits=1, refcnt=1) note: { x_9(D) } note: node 0x63abdc0 (max_nunits=4, refcnt=2) vector(4) unsigned int note: op template: _4 = _3 * 2; note: stmt 0 _4 = _3 * 2; note: children 0x63abe60 0x63abf00 note: node 0x63abe60 (max_nunits=4, refcnt=2) vector(4) unsigned int note: op template: _3 = vect_a[i_15]; note: stmt 0 _3 = vect_a[i_15]; note: load permutation { 0 } note: node (constant) 0x63abf00 (max_nunits=1, refcnt=1) note: { 2 } and during codegen: note: -->vectorizing SLP node starting from: patt_6 = _4 != x_9(D); note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0x _3 * 2, type of def: internal note: add new stmt: mask_patt_6.18_58 = _53 != vect__4.17_57; note:=== vectorizable_early_exit === note:transform early-exit. 
note: vectorizing stmts using SLP. note: Vectorizing SLP tree: note: node 0x63abfa0 (max_nunits=4, refcnt=1) vector(4) int note: op template: i_12 = i_15 + 1; note: stmt 0 i_12 = i_15 + 1; note: children 0x63aba00 0x63ac040 note: node 0x63aba00 (max_nunits=4, refcnt=2) vector(4) int note: op template: i_15 = PHI note: [l] stmt 0 i_15 = PHI note: children (nil) (nil) note: node (constant) 0x63ac040 (max_nunits=1, refcnt=1) vector(4) int note: { 1 } gcc/ChangeLog: * tree-vect-loop.cc (vect_analyze_loop_2): Handle SLP trees with no children. * tree-vectorizer.h (enum slp_instance_kind): Add slp_inst_kind_gcond. (LOOP_VINFO_EARLY_BREAKS_LIVE_IVS): New. (vectorizable
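The test4 snippet quoted above is not self-contained; a compilable variant, with the array and N definitions filled in as assumptions (they are not part of the commit), looks like this:

#define N 1024
unsigned vect_a[N];
unsigned vect_b[N];

/* Early-break loop with a single lane per iteration: the gcond
   (vect_a[i] * 2 != x) becomes the root statement of its own SLP
   instance, as described in the commit message.  */
unsigned
test4 (unsigned x)
{
  unsigned ret = 0;
  for (int i = 0; i < N; i++)
    {
      vect_b[i] = x + i;
      if (vect_a[i] * 2 != x)
        break;
      vect_a[i] = x;
    }
  return ret;
}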
[gcc r15-4353] AArch64: re-enable memory access costing after SLP change.
https://gcc.gnu.org/g:a1540bb843fd1a3e87f50d3f713386eaae454d1c commit r15-4353-ga1540bb843fd1a3e87f50d3f713386eaae454d1c Author: Tamar Christina Date: Tue Oct 15 11:22:26 2024 +0100 AArch64: re-enable memory access costing after SLP change. While chasing down a costing difference between SLP and non-SLP for memory access costing I noticed that at some point the SLP and non-SLP costing have diverged. It used to be we only supported LOAD_LANES in SLP and so the non-SLP costing was working fine. But with the change to SLP only we now lost costing. It looks like the vectorizer for non-SLP stores the VMAT type in STMT_VINFO_MEMORY_ACCESS_TYPE on the stmt_info, but for SLP it stores it in SLP_TREE_MEMORY_ACCESS_TYPE which is on the SLP node itself. While my first attempt of a patch was to just also store the VMAT in the stmt_info https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665295.html Richi pointed out that this goes wrong when the same access is used Hybrid. And so we have to do a backend specific fix. To help out other backends this also introduces a generic helper function suggested by Richi in that patch (I hope that's ok.. I didn't want to split out just the helper.) This successfully restores VMAT based costing in the new SLP only world. gcc/ChangeLog: * tree-vectorizer.h (vect_mem_access_type): New. * config/aarch64/aarch64.cc (aarch64_ld234_st234_vectors): Use it. (aarch64_detect_vector_stmt_subtype): Likewise. (aarch64_adjust_stmt_cost): Likewise. (aarch64_vector_costs::count_ops): Likewise. (aarch64_vector_costs::add_stmt_cost): Make SLP node named. Diff: --- gcc/config/aarch64/aarch64.cc | 54 +++ gcc/tree-vectorizer.h | 12 ++ 2 files changed, 41 insertions(+), 25 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 102680a0efca..5770491b30ce 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -16278,7 +16278,7 @@ public: private: void record_potential_advsimd_unrolling (loop_vec_info); void analyze_loop_vinfo (loop_vec_info); - void count_ops (unsigned int, vect_cost_for_stmt, stmt_vec_info, + void count_ops (unsigned int, vect_cost_for_stmt, stmt_vec_info, slp_tree, aarch64_vec_op_count *); fractional_cost adjust_body_cost_sve (const aarch64_vec_op_count *, fractional_cost, unsigned int, @@ -16595,11 +16595,13 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost, } } -/* Return true if an access of kind KIND for STMT_INFO represents one - vector of an LD[234] or ST[234] operation. Return the total number of - vectors (2, 3 or 4) if so, otherwise return a value outside that range. */ +/* Return true if an access of kind KIND for STMT_INFO (or NODE if SLP) + represents one vector of an LD[234] or ST[234] operation. Return the total + number of vectors (2, 3 or 4) if so, otherwise return a value outside that + range. 
*/ static int -aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info) +aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info, +slp_tree node) { if ((kind == vector_load || kind == unaligned_load @@ -16609,7 +16611,7 @@ aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info) { stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info); if (stmt_info - && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_LOAD_STORE_LANES) + && vect_mem_access_type (stmt_info, node) == VMAT_LOAD_STORE_LANES) return DR_GROUP_SIZE (stmt_info); } return 0; @@ -16847,14 +16849,15 @@ aarch64_detect_scalar_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind, } /* STMT_COST is the cost calculated by aarch64_builtin_vectorization_cost - for the vectorized form of STMT_INFO, which has cost kind KIND and which - when vectorized would operate on vector type VECTYPE. Try to subdivide - the target-independent categorization provided by KIND to get a more - accurate cost. WHERE specifies where the cost associated with KIND - occurs. */ + for the vectorized form of STMT_INFO possibly using SLP node NODE, which has + cost kind KIND and which when vectorized would operate on vector type + VECTYPE. Try to subdivide the target-independent categorization provided by + KIND to get a more accurate cost. WHERE specifies where the cost associated + with KIND occurs. */ static fractional_cost aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind, - stmt_vec_info stmt_info, tree vectype, + stmt_vec_info stmt_info, slp_tree
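As a hedged illustration (my own example, not from the patch) of the access kind whose costing this restores: an interleaved access with group size 3, which AArch64 typically vectorizes with LD3/ST3, i.e. VMAT_LOAD_STORE_LANES.

/* Group of three interleaved loads and stores per iteration; on AArch64
   this commonly becomes LD3/ST3, the case aarch64_ld234_st234_vectors
   is trying to detect when costing.  */
void
scale_rgb (unsigned char *rgb, int n, int f)
{
  for (int i = 0; i < n; i++)
    {
      rgb[3 * i + 0] = rgb[3 * i + 0] * f >> 8;
      rgb[3 * i + 1] = rgb[3 * i + 1] * f >> 8;
      rgb[3 * i + 2] = rgb[3 * i + 2] * f >> 8;
    }
}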
[gcc r15-4460] AArch64: support encoding integer immediates using floating point moves
https://gcc.gnu.org/g:87dc6b1992e7ee02e7a4a81c568754198c0f61f5 commit r15-4460-g87dc6b1992e7ee02e7a4a81c568754198c0f61f5 Author: Tamar Christina Date: Fri Oct 18 09:43:45 2024 +0100 AArch64: support encoding integer immediates using floating point moves This patch extends our immediate SIMD generation cases to support generating integer immediates using floating point operation if the integer immediate maps to an exact FP value. As an example: uint32x4_t f1() { return vdupq_n_u32(0x3f80); } currently generates: f1: adrpx0, .LC0 ldr q0, [x0, #:lo12:.LC0] ret i.e. a load, but with this change: f1: fmovv0.4s, 1.0e+0 ret Such immediates are common in e.g. our Math routines in glibc because they are created to extract or mark part of an FP immediate as masks. gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_sve_valid_immediate, aarch64_simd_valid_immediate): Refactor accepting modes and values. (aarch64_float_const_representable_p): Refactor and extract FP checks into ... (aarch64_real_float_const_representable_p): ...This and fix fail fallback from real_to_integer. (aarch64_advsimd_valid_immediate): Use it. gcc/testsuite/ChangeLog: * gcc.target/aarch64/const_create_using_fmov.c: New test. Diff: --- gcc/config/aarch64/aarch64.cc | 282 +++-- .../gcc.target/aarch64/const_create_using_fmov.c | 87 +++ 2 files changed, 241 insertions(+), 128 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 5770491b30ce..e65b24e2ad6a 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -22899,19 +22899,19 @@ aarch64_advsimd_valid_immediate_hs (unsigned int val32, return false; } -/* Return true if replicating VAL64 is a valid immediate for the +/* Return true if replicating VAL64 with mode MODE is a valid immediate for the Advanced SIMD operation described by WHICH. If INFO is nonnull, use it to describe valid immediates. */ static bool aarch64_advsimd_valid_immediate (unsigned HOST_WIDE_INT val64, +scalar_int_mode mode, simd_immediate_info *info, enum simd_immediate_check which) { unsigned int val32 = val64 & 0x; - unsigned int val16 = val64 & 0x; unsigned int val8 = val64 & 0xff; - if (val32 == (val64 >> 32)) + if (mode != DImode) { if ((which & AARCH64_CHECK_ORR) != 0 && aarch64_advsimd_valid_immediate_hs (val32, info, which, @@ -22924,9 +22924,7 @@ aarch64_advsimd_valid_immediate (unsigned HOST_WIDE_INT val64, return true; /* Try using a replicated byte. */ - if (which == AARCH64_CHECK_MOV - && val16 == (val32 >> 16) - && val8 == (val16 >> 8)) + if (which == AARCH64_CHECK_MOV && mode == QImode) { if (info) *info = simd_immediate_info (QImode, val8); @@ -22954,28 +22952,15 @@ aarch64_advsimd_valid_immediate (unsigned HOST_WIDE_INT val64, return false; } -/* Return true if replicating VAL64 gives a valid immediate for an SVE MOV - instruction. If INFO is nonnull, use it to describe valid immediates. */ +/* Return true if replicating IVAL with MODE gives a valid immediate for an SVE + MOV instruction. If INFO is nonnull, use it to describe valid + immediates. 
*/ static bool -aarch64_sve_valid_immediate (unsigned HOST_WIDE_INT val64, +aarch64_sve_valid_immediate (unsigned HOST_WIDE_INT ival, scalar_int_mode mode, simd_immediate_info *info) { - scalar_int_mode mode = DImode; - unsigned int val32 = val64 & 0x; - if (val32 == (val64 >> 32)) -{ - mode = SImode; - unsigned int val16 = val32 & 0x; - if (val16 == (val32 >> 16)) - { - mode = HImode; - unsigned int val8 = val16 & 0xff; - if (val8 == (val16 >> 8)) - mode = QImode; - } -} - HOST_WIDE_INT val = trunc_int_for_mode (val64, mode); + HOST_WIDE_INT val = trunc_int_for_mode (ival, mode); if (IN_RANGE (val, -0x80, 0x7f)) { /* DUP with no shift. */ @@ -22990,7 +22975,7 @@ aarch64_sve_valid_immediate (unsigned HOST_WIDE_INT val64, *info = simd_immediate_info (mode, val); return true; } - if (aarch64_bitmask_imm (val64, mode)) + if (aarch64_bitmask_imm (ival, mode)) { /* DUPM. */ if (info) @@ -23071,6 +23056,91 @@ aarch64_sve_pred_valid_immediate (rtx x, simd_immediate_info *info) return false; } +/* We can only represent floating point constants which will fit in + "quarter-precision" values. These values are characterised by +
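The constant in the f1 example above appears truncated by the digest formatting; assuming the intended value is 0x3f800000, the transformation is sound because that bit pattern is exactly the IEEE-754 binary32 encoding of 1.0, which this small standalone check (my own, not part of the commit) confirms:

#include <stdint.h>
#include <string.h>

int
main (void)
{
  /* 0x3f800000 is the binary32 encoding of 1.0f, so duplicating the
     float 1.0 across the .4s lanes (fmov v0.4s, 1.0e+0) reproduces the
     integer constant from vdupq_n_u32 bit-for-bit.  */
  float one = 1.0f;
  uint32_t bits;
  memcpy (&bits, &one, sizeof bits);
  return bits == 0x3f800000u ? 0 : 1;
}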
[gcc r15-4461] AArch64: use movi d0, #0 to clear SVE registers instead of mov z0.d, #0
https://gcc.gnu.org/g:453d3d90c374d3bb329f1431b7dfb8d0510a88b9 commit r15-4461-g453d3d90c374d3bb329f1431b7dfb8d0510a88b9 Author: Tamar Christina Date: Fri Oct 18 09:44:15 2024 +0100 AArch64: use movi d0, #0 to clear SVE registers instead of mov z0.d, #0 This patch changes SVE to use Adv. SIMD movi 0 to clear SVE registers when not in SVE streaming mode, as the Neoverse Software Optimization Guides indicate that SVE mov #0 is not a zero-cost move. When in streaming mode we continue to use SVE's mov to clear the registers. Tests have already been updated. gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_output_sve_mov_immediate): Use fmov for SVE zeros. Diff: --- gcc/config/aarch64/aarch64.cc | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index e65b24e2ad6a..3ab550acc7cd 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -25516,8 +25516,11 @@ aarch64_output_sve_mov_immediate (rtx const_vector) } } - snprintf (templ, sizeof (templ), "mov\t%%0.%c, #" HOST_WIDE_INT_PRINT_DEC, - element_char, INTVAL (info.u.mov.value)); + if (info.u.mov.value == const0_rtx && TARGET_NON_STREAMING) +snprintf (templ, sizeof (templ), "movi\t%%d0, #0"); + else +snprintf (templ, sizeof (templ), "mov\t%%0.%c, #" HOST_WIDE_INT_PRINT_DEC, + element_char, INTVAL (info.u.mov.value)); return templ; }
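As a hedged illustration of the user-visible effect (my own example, not from the commit), consider clearing an SVE vector through the ACLE, compiled with SVE enabled (e.g. -march=armv8.2-a+sve):

#include <arm_sve.h>

/* Returns an all-zero SVE vector; with this change, non-streaming code
   is expected to materialize the zero with the Adv. SIMD form
   "movi d0, #0" rather than the SVE "mov z0.s, #0".  */
svint32_t
zero_vec (void)
{
  return svdup_n_s32 (0);
}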
[gcc r15-4463] middle-end: Fix GSI for gcond root [PR117140]
https://gcc.gnu.org/g:51291ad0f1f89a81de917110af96e019dcd5690c commit r15-4463-g51291ad0f1f89a81de917110af96e019dcd5690c Author: Tamar Christina Date: Fri Oct 18 10:37:28 2024 +0100 middle-end: Fix GSI for gcond root [PR117140] When finding the gsi to use for code of the root statements we should use the one of the original statement rather than the gcond which may be inside a pattern. Without this the emitted instructions may be discarded later. gcc/ChangeLog: PR tree-optimization/117140 * tree-vect-slp.cc (vectorize_slp_instance_root_stmt): Use gsi from original statement. gcc/testsuite/ChangeLog: PR tree-optimization/117140 * gcc.dg/vect/vect-early-break_129-pr117140.c: New test. Diff: --- .../gcc.dg/vect/vect-early-break_129-pr117140.c| 94 ++ gcc/tree-vect-slp.cc | 2 +- 2 files changed, 95 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_129-pr117140.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_129-pr117140.c new file mode 100644 index ..eec7f8db40c7 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_129-pr117140.c @@ -0,0 +1,94 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +typedef signed char int8_t; +typedef short int int16_t; +typedef int int32_t; +typedef long long int int64_t; +typedef unsigned char uint8_t; +typedef short unsigned int uint16_t; +typedef unsigned int uint32_t; +typedef long long unsigned int uint64_t; + +void __attribute__ ((noinline, noclone)) +test_1_TYPE1_uint32_t (uint16_t *__restrict f, uint32_t *__restrict d, + uint16_t x, uint16_t x2, uint32_t y, int n) +{ +for (int i = 0; i < n; ++i) +{ +f[i * 2 + 0] = x; +f[i * 2 + 1] = x2; +d[i] = y; +} +} + +void __attribute__ ((noinline, noclone)) +test_1_TYPE1_int64_t (int32_t *__restrict f, int64_t *__restrict d, int32_t x, + int32_t x2, int64_t y, int n) +{ +for (int i = 0; i < n; ++i) +{ +f[i * 2 + 0] = x; +f[i * 2 + 1] = x2; +d[i] = y; +} +} + +int +main (void) +{ +// This part is necessary for ice to appear though running it by itself does not trigger an ICE +int n_3_TYPE1_uint32_t = 32; +uint16_t x_3_uint16_t = 233; +uint16_t x2_3_uint16_t = 78; +uint32_t y_3_uint32_t = 1234; +uint16_t f_3_uint16_t[33 * 2 + 1] = { 0} ; +uint32_t d_3_uint32_t[33] = { 0} ; +test_1_TYPE1_uint32_t (f_3_uint16_t, d_3_uint32_t, x_3_uint16_t, x2_3_uint16_t, y_3_uint32_t, n_3_TYPE1_uint32_t); +for (int i = 0; +i < n_3_TYPE1_uint32_t; +++i) { +if (f_3_uint16_t[i * 2 + 0] != x_3_uint16_t) __builtin_abort (); +if (f_3_uint16_t[i * 2 + 1] != x2_3_uint16_t) __builtin_abort (); +if (d_3_uint32_t[i] != y_3_uint32_t) __builtin_abort (); +} +for (int i = n_3_TYPE1_uint32_t; +i < n_3_TYPE1_uint32_t + 1; +++i) { +if (f_3_uint16_t[i * 2 + 0] != 0) __builtin_abort (); +if (f_3_uint16_t[i * 2 + 1] != 0) __builtin_abort (); +if (d_3_uint32_t[i] != 0) __builtin_abort (); +} +// If ran without the above section, a different ice appears. 
see below +int n_3_TYPE1_int64_t = 32; +int32_t x_3_int32_t = 233; +int32_t x2_3_int32_t = 78; +int64_t y_3_int64_t = 1234; +int32_t f_3_int32_t[33 * 2 + 1] = { 0 }; +int64_t d_3_int64_t[33] = { 0 }; +test_1_TYPE1_int64_t (f_3_int32_t, d_3_int64_t, x_3_int32_t, x2_3_int32_t, + y_3_int64_t, n_3_TYPE1_int64_t); +for (int i = 0; i < n_3_TYPE1_int64_t; ++i) +{ +if (f_3_int32_t[i * 2 + 0] != x_3_int32_t) +__builtin_abort (); +if (f_3_int32_t[i * 2 + 1] != x2_3_int32_t) +__builtin_abort (); +if (d_3_int64_t[i] != y_3_int64_t) +__builtin_abort (); +} + +for (int i = n_3_TYPE1_int64_t; i < n_3_TYPE1_int64_t + 1; ++i) +{ +if (f_3_int32_t[i * 2 + 0] != 0) +__builtin_abort (); +if (f_3_int32_t[i * 2 + 1] != 0) +__builtin_abort (); +if (d_3_int64_t[i] != 0) +__builtin_abort (); +} + +return 0; +} diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index d35c2ea02dce..9276662fa0f1 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -11167,7 +11167,7 @@ vectorize_slp_instance_root_stmt (vec_info *vinfo, slp_tree node, slp_instance i can't support lane
[gcc r15-4459] AArch64: update testsuite to account for new zero moves
https://gcc.gnu.org/g:fc3507927768c3df425a0b5c0e4051eb8bb1ccf0 commit r15-4459-gfc3507927768c3df425a0b5c0e4051eb8bb1ccf0 Author: Tamar Christina Date: Fri Oct 18 09:42:46 2024 +0100 AArch64: update testsuite to account for new zero moves The patch series will adjust how zeros are created. In principal it doesn't matter the exact lane size a zero gets created on but this makes the tests a bit fragile. This preparation patch will update the testsuite to accept multiple variants of ways to create vector zeros to accept both the current syntax and the one being transitioned to in the series. gcc/testsuite/ChangeLog: * gcc.target/aarch64/ldp_stp_18.c: Update zero regexpr. * gcc.target/aarch64/memset-corner-cases.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_bf16.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_f16.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_f32.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_f64.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_s16.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_s32.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_s64.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_s8.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_u16.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_u32.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_u64.c: Likewise. * gcc.target/aarch64/sme/acle-asm/revd_u8.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acge_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acge_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acge_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acgt_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acgt_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acgt_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acle_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acle_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/acle_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/aclt_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/aclt_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/aclt_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/bic_s8.c: Likewise. * gcc.target/aarch64/sve/acle/asm/bic_u8.c: Likewise. * gcc.target/aarch64/sve/acle/asm/cmpuo_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/cmpuo_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/cmpuo_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_f16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_f32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_f64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_s16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_s32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_s64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_s8.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_u16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_u32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_u64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/dup_u8.c: Likewise. * gcc.target/aarch64/sve/const_fold_div_1.c: Likewise. * gcc.target/aarch64/sve/const_fold_mul_1.c: Likewise. * gcc.target/aarch64/sve/dup_imm_1.c: Likewise. * gcc.target/aarch64/sve/fdup_1.c: Likewise. * gcc.target/aarch64/sve/fold_div_zero.c: Likewise. * gcc.target/aarch64/sve/fold_mul_zero.c: Likewise. * gcc.target/aarch64/sve/pcs/args_2.c: Likewise. * gcc.target/aarch64/sve/pcs/args_3.c: Likewise. * gcc.target/aarch64/sve/pcs/args_4.c: Likewise. * gcc.target/aarch64/vect-fmovd-zero.c: Likewise. 
Diff: --- gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c | 2 +- .../gcc.target/aarch64/memset-corner-cases.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_bf16.c| 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_f16.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_f32.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_f64.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_s16.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_s32.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_s64.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_s8.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_u16.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_u32.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_u64.c | 2 +- .../gcc.target/aarch64/sme/acle-asm/revd_u8.c | 2 +- .../gc
[gcc r15-4462] middle-end: Fix VEC_PERM_EXPR lowering since relaxation of vector sizes
https://gcc.gnu.org/g:55f898008ec8235897cf56c89f5599c3ec1bc963 commit r15-4462-g55f898008ec8235897cf56c89f5599c3ec1bc963 Author: Tamar Christina Date: Fri Oct 18 10:36:19 2024 +0100 middle-end: Fix VEC_PERM_EXPR lowering since relaxation of vector sizes In GCC 14 VEC_PERM_EXPR was relaxed to be able to permute to a 2x larger vector than the size of the input vectors. However various passes and transformations were not updated to account for this. I have patches in these area that I will be upstreaming with individual patches that expose them. This one is that vectlower tries to lower based on the size of the input vectors rather than the size of the output. As a consequence it creates an invalid vector of half the size. Luckily we ICE because the resulting nunits doesn't match the vector size. gcc/ChangeLog: * tree-vect-generic.cc (lower_vec_perm): Use output vector size instead of input vector when determining output nunits. gcc/testsuite/ChangeLog: * gcc.dg/vec-perm-lower.c: New test. Diff: --- gcc/testsuite/gcc.dg/vec-perm-lower.c | 16 gcc/tree-vect-generic.cc | 7 --- 2 files changed, 20 insertions(+), 3 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vec-perm-lower.c b/gcc/testsuite/gcc.dg/vec-perm-lower.c new file mode 100644 index ..da738fbeed80 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vec-perm-lower.c @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-options "-fgimple -O2" } */ + +typedef char v8qi __attribute__ ((vector_size (8))); +typedef char v16qi __attribute__ ((vector_size (16))); + +v16qi __GIMPLE (ssa) +foo (v8qi a, v8qi b) +{ + v16qi _5; + + __BB(2): + _5 = __VEC_PERM (a, b, _Literal (unsigned char [[gnu::vector_size(16)]]) { _Literal (unsigned char) 0, _Literal (unsigned char) 16, _Literal (unsigned char) 1, _Literal (unsigned char) 17, _Literal (unsigned char) 2, _Literal (unsigned char) 18, _Literal (unsigned char) 3, _Literal (unsigned char) 19, _Literal (unsigned char) 4, _Literal (unsigned char) 20, _Literal (unsigned char) 5, _Literal (unsigned char) 21, _Literal (unsigned char) 6, _Literal (unsigned char) 22, _Literal (unsigned char) 7, _Literal (unsigned char) 23 }); + return _5; + +} diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc index 3041fb8fcf23..f86f7eabb255 100644 --- a/gcc/tree-vect-generic.cc +++ b/gcc/tree-vect-generic.cc @@ -1500,6 +1500,7 @@ lower_vec_perm (gimple_stmt_iterator *gsi) tree mask = gimple_assign_rhs3 (stmt); tree vec0 = gimple_assign_rhs1 (stmt); tree vec1 = gimple_assign_rhs2 (stmt); + tree res_vect_type = TREE_TYPE (gimple_assign_lhs (stmt)); tree vect_type = TREE_TYPE (vec0); tree mask_type = TREE_TYPE (mask); tree vect_elt_type = TREE_TYPE (vect_type); @@ -1512,7 +1513,7 @@ lower_vec_perm (gimple_stmt_iterator *gsi) location_t loc = gimple_location (gsi_stmt (*gsi)); unsigned i; - if (!TYPE_VECTOR_SUBPARTS (vect_type).is_constant (&elements)) + if (!TYPE_VECTOR_SUBPARTS (res_vect_type).is_constant (&elements)) return; if (TREE_CODE (mask) == SSA_NAME) @@ -1672,9 +1673,9 @@ lower_vec_perm (gimple_stmt_iterator *gsi) } if (constant_p) -constr = build_vector_from_ctor (vect_type, v); +constr = build_vector_from_ctor (res_vect_type, v); else -constr = build_constructor (vect_type, v); +constr = build_constructor (res_vect_type, v); gimple_assign_set_rhs_from_tree (gsi, constr); update_stmt (gsi_stmt (*gsi)); }
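The new testcase uses the GIMPLE front end; a plain-C analogue of the same shape (my own sketch, relying on __builtin_shufflevector, available in GCC 12 and later) is a permute whose result vector is twice as wide as either input:

typedef char v8qi __attribute__ ((vector_size (8)));
typedef char v16qi __attribute__ ((vector_size (16)));

/* Two 8-byte inputs, one 16-byte result: indices 0-7 select from a,
   indices 8-15 select from b, interleaving the two vectors.  */
v16qi
interleave (v8qi a, v8qi b)
{
  return __builtin_shufflevector (a, b, 0, 8, 1, 9, 2, 10, 3, 11,
                                  4, 12, 5, 13, 6, 14, 7, 15);
}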
[gcc r15-4326] AArch64: rename the SVE2 psel intrinsics to psel_lane [PR116371]
https://gcc.gnu.org/g:306834b7f74ab61160f205e04f5bf35b71f9ec52 commit r15-4326-g306834b7f74ab61160f205e04f5bf35b71f9ec52 Author: Tamar Christina Date: Mon Oct 14 13:58:09 2024 +0100 AArch64: rename the SVE2 psel intrinsics to psel_lane [PR116371] The psel intrinsics. similar to the pext, should be name psel_lane. This corrects the naming. gcc/ChangeLog: PR target/116371 * config/aarch64/aarch64-sve-builtins-sve2.cc (class svpsel_impl): Renamed to ... (class svpsel_lane_impl): ... This and adjust initialization. * config/aarch64/aarch64-sve-builtins-sve2.def (svpsel): Renamed to ... (svpsel_lane): ... This. * config/aarch64/aarch64-sve-builtins-sve2.h (svpsel): Renamed to svpsel_lane. gcc/testsuite/ChangeLog: PR target/116371 * gcc.target/aarch64/sme2/acle-asm/psel_b16.c, gcc.target/aarch64/sme2/acle-asm/psel_b32.c, gcc.target/aarch64/sme2/acle-asm/psel_b64.c, gcc.target/aarch64/sme2/acle-asm/psel_b8.c, gcc.target/aarch64/sme2/acle-asm/psel_c16.c, gcc.target/aarch64/sme2/acle-asm/psel_c32.c, gcc.target/aarch64/sme2/acle-asm/psel_c64.c, gcc.target/aarch64/sme2/acle-asm/psel_c8.c: Renamed to * gcc.target/aarch64/sme2/acle-asm/psel_lane_b16.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b32.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b64.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b8.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c16.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c32.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c64.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c8.c: ... These. Diff: --- gcc/config/aarch64/aarch64-sve-builtins-sve2.cc| 4 +- gcc/config/aarch64/aarch64-sve-builtins-sve2.def | 2 +- gcc/config/aarch64/aarch64-sve-builtins-sve2.h | 2 +- .../gcc.target/aarch64/sme2/acle-asm/psel_b16.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_b32.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_b64.c| 80 --- .../gcc.target/aarch64/sme2/acle-asm/psel_b8.c | 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c16.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c32.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c64.c| 80 --- .../gcc.target/aarch64/sme2/acle-asm/psel_c8.c | 89 -- .../aarch64/sme2/acle-asm/psel_lane_b16.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_b32.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_b64.c | 80 +++ .../aarch64/sme2/acle-asm/psel_lane_b8.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c16.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c32.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c64.c | 80 +++ .../aarch64/sme2/acle-asm/psel_lane_c8.c | 89 ++ 19 files changed, 698 insertions(+), 698 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc index 146a5459930f..6a20a613f832 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc @@ -234,7 +234,7 @@ public: } }; -class svpsel_impl : public function_base +class svpsel_lane_impl : public function_base { public: rtx @@ -625,7 +625,7 @@ FUNCTION (svpmullb, unspec_based_function, (-1, UNSPEC_PMULLB, -1)) FUNCTION (svpmullb_pair, unspec_based_function, (-1, UNSPEC_PMULLB_PAIR, -1)) FUNCTION (svpmullt, unspec_based_function, (-1, UNSPEC_PMULLT, -1)) FUNCTION (svpmullt_pair, unspec_based_function, (-1, UNSPEC_PMULLT_PAIR, -1)) -FUNCTION (svpsel, svpsel_impl,) +FUNCTION (svpsel_lane, svpsel_lane_impl,) FUNCTION (svqabs, rtx_code_function, (SS_ABS, UNKNOWN, UNKNOWN)) FUNCTION (svqcadd, svqcadd_impl,) FUNCTION (svqcvt, integer_conversion, (UNSPEC_SQCVT, UNSPEC_SQCVTU, diff --git 
a/gcc/config/aarch64/aarch64-sve-builtins-sve2.def b/gcc/config/aarch64/aarch64-sve-builtins-sve2.def index 4543402f836f..318dfff06f0d 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.def +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.def @@ -235,7 +235,7 @@ DEF_SVE_FUNCTION (svsm4ekey, binary, s_unsigned, none) | AARCH64_FL_SME \ | AARCH64_FL_SM_ON) DEF_SVE_FUNCTION (svclamp, clamp, all_integer, none) -DEF_SVE_FUNCTION (svpsel, select_pred, all_pred_count, none) +DEF_SVE_FUNCTION (svpsel_lane, select_pred, all_pred_count, none) DEF_SVE_FUNCTION (svre
[gcc r15-4327] simplify-rtx: Fix incorrect folding of shift and AND [PR117012]
https://gcc.gnu.org/g:be966baa353dfcc20b76b5a5586ab2494bb0a735 commit r15-4327-gbe966baa353dfcc20b76b5a5586ab2494bb0a735 Author: Tamar Christina Date: Mon Oct 14 14:00:25 2024 +0100 simplify-rtx: Fix incorrect folding of shift and AND [PR117012] The optimization added in r15-1047-g7876cde25cbd2f is using the wrong operation to check for uniform constant vectors. The author intended to check that all the lanes in the vector are the same and so used CONST_VECTOR_DUPLICATE_P. However this only checks that the vector is created from a pattern duplication, but doesn't say how many pattern alternatives make up the duplication. Normally one would need to check this separately or use const_vec_duplicate_p. Without this the optimization incorrectly triggers. gcc/ChangeLog: PR rtl-optimization/117012 * simplify-rtx.cc (simplify_context::simplify_binary_operation_1): Use const_vec_duplicate_p instead of CONST_VECTOR_DUPLICATE_P. gcc/testsuite/ChangeLog: PR rtl-optimization/117012 * gcc.target/aarch64/pr117012.c: New test. Diff: --- gcc/simplify-rtx.cc | 4 ++-- gcc/testsuite/gcc.target/aarch64/pr117012.c | 16 2 files changed, 18 insertions(+), 2 deletions(-) diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index dc0d192dd218..4d024ec523b1 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -4088,10 +4088,10 @@ simplify_context::simplify_binary_operation_1 (rtx_code code, if (VECTOR_MODE_P (mode) && GET_CODE (op0) == ASHIFTRT && (CONST_INT_P (XEXP (op0, 1)) || (GET_CODE (XEXP (op0, 1)) == CONST_VECTOR - && CONST_VECTOR_DUPLICATE_P (XEXP (op0, 1)) + && const_vec_duplicate_p (XEXP (op0, 1)) && CONST_INT_P (XVECEXP (XEXP (op0, 1), 0, 0 && GET_CODE (op1) == CONST_VECTOR - && CONST_VECTOR_DUPLICATE_P (op1) + && const_vec_duplicate_p (op1) && CONST_INT_P (XVECEXP (op1, 0, 0))) { unsigned HOST_WIDE_INT shift_count diff --git a/gcc/testsuite/gcc.target/aarch64/pr117012.c b/gcc/testsuite/gcc.target/aarch64/pr117012.c new file mode 100644 index ..537c0fa566c6 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr117012.c @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +#define vector16 __attribute__((vector_size(16))) + +vector16 unsigned char +g (vector16 unsigned char a) +{ + vector16 signed char b = (vector16 signed char)a; + b = b >> 7; + vector16 unsigned char c = (vector16 unsigned char)b; + vector16 unsigned char d = { 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0 }; + return c & d; +} + +/* { dg-final { scan-assembler-times {and\tv[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b} 1 } } */
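A hedged, self-contained runtime variant of the testcase above (my own, not part of the commit): the mask d is a pattern duplicate but not a uniform vector, so with the bogus fold the result could change in the lanes where d is 0.

#include <stdlib.h>

#define vector16 __attribute__((vector_size(16)))

/* Same shape as pr117012.c: arithmetic shift producing a sign mask,
   then an AND with a non-uniform (but pattern-duplicated) constant.  */
__attribute__((noipa)) vector16 unsigned char
g (vector16 unsigned char a)
{
  vector16 signed char b = (vector16 signed char)a;
  b = b >> 7;
  vector16 unsigned char c = (vector16 unsigned char)b;
  vector16 unsigned char d = { 1, 1, 0, 0, 0, 0, 0, 0,
                               1, 1, 0, 0, 0, 0, 0, 0 };
  return c & d;
}

int
main (void)
{
  vector16 unsigned char a;
  for (int i = 0; i < 16; i++)
    a[i] = 0x80;                    /* every lane has the sign bit set */
  vector16 unsigned char r = g (a);
  for (int i = 0; i < 16; i++)
    if (r[i] != ((i % 8) < 2))      /* expect the 1,1,0,...,0 pattern */
      abort ();
  return 0;
}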
[gcc r15-4328] middle-end: copy STMT_VINFO_STRIDED_P when DR is replaced [PR116956]
https://gcc.gnu.org/g:ec3d3ea60a55f25a743a037adda7d10d03ca73b2 commit r15-4328-gec3d3ea60a55f25a743a037adda7d10d03ca73b2 Author: Tamar Christina Date: Mon Oct 14 14:01:24 2024 +0100 middle-end: copy STMT_VINFO_STRIDED_P when DR is replaced [PR116956] When move_dr copies a DR from one statement to another, it seems we've forgotten to copy the STMT_VINFO_STRIDED_P flag. This leaves the new DR in a broken state where it has a non-constant stride but isn't marked as strided. This causes the ICE in the PR, because dataref analysis fails during epilogue vectorization: there is an assumption in place that while costing may fail for epilogue vectorization, DR analysis cannot fail if it succeeded for the main loop. gcc/ChangeLog: PR tree-optimization/116956 * tree-vectorizer.cc (vec_info::move_dr): Copy STMT_VINFO_STRIDED_P. gcc/testsuite/ChangeLog: PR tree-optimization/116956 * gfortran.dg/vect/pr116956.f90: New test. Diff: --- gcc/testsuite/gfortran.dg/vect/pr116956.f90 | 11 +++ gcc/tree-vectorizer.cc | 2 ++ 2 files changed, 13 insertions(+) diff --git a/gcc/testsuite/gfortran.dg/vect/pr116956.f90 b/gcc/testsuite/gfortran.dg/vect/pr116956.f90 new file mode 100644 index ..3ce4d1ab7927 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/vect/pr116956.f90 @@ -0,0 +1,11 @@ +! { dg-do compile } +! { dg-require-effective-target vect_int } +! { dg-additional-options "-mcpu=neoverse-v2 -Ofast" { target aarch64*-*-* } } + +SUBROUTINE nesting_offl_init(u, v, mask) + IMPLICIT NONE + real :: u(:) + real :: v(:) + integer :: mask(:) + u = MERGE( u, v, BTEST (mask, 1) ) +END SUBROUTINE nesting_offl_init diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc index fed12c41f9cb..0c471c5580d3 100644 --- a/gcc/tree-vectorizer.cc +++ b/gcc/tree-vectorizer.cc @@ -610,6 +610,8 @@ vec_info::move_dr (stmt_vec_info new_stmt_info, stmt_vec_info old_stmt_info) = STMT_VINFO_DR_WRT_VEC_LOOP (old_stmt_info); STMT_VINFO_GATHER_SCATTER_P (new_stmt_info) = STMT_VINFO_GATHER_SCATTER_P (old_stmt_info); + STMT_VINFO_STRIDED_P (new_stmt_info) += STMT_VINFO_STRIDED_P (old_stmt_info); } /* Permanently remove the statement described by STMT_INFO from the
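A rough C analogue of the Fortran testcase (my own sketch, not from the PR): the assumed-shape arrays give accesses whose stride is only known at run time, here modelled with an explicit stride parameter, combined with a select that the vectorizer recognizes via a pattern statement.

/* Runtime-stride accesses plus a bit-test driven select, mirroring
   u = MERGE(u, v, BTEST(mask, 1)); the stride is not a compile-time
   constant, so the DRs must stay marked strided when moved.  */
void
merge_strided (float *u, float *v, int *mask, long n, long stride)
{
  for (long i = 0; i < n; i++)
    u[i * stride] = (mask[i * stride] & 2) ? u[i * stride] : v[i * stride];
}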
[gcc r14-10909] AArch64: backport Neoverse and Cortex CPU definitions
https://gcc.gnu.org/g:05d54bcdc5395a9d3df36c8b640579a0558c89f0 commit r14-10909-g05d54bcdc5395a9d3df36c8b640579a0558c89f0 Author: Tamar Christina Date: Fri Nov 8 18:12:32 2024 + AArch64: backport Neoverse and Cortex CPU definitions This is a conservative backport of a few core definitions backporting only the core definitions and mapping them to their closest cost model that exist on the branches. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (cortex-a725, cortex-x925, neoverse-n3, neoverse-v3, neoverse-v3ae): New. * config/aarch64/aarch64-tune.md: Regenerate * doc/invoke.texi: Document them. Diff: --- gcc/config/aarch64/aarch64-cores.def | 6 ++ gcc/config/aarch64/aarch64-tune.md | 2 +- gcc/doc/invoke.texi | 10 ++ 3 files changed, 13 insertions(+), 5 deletions(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 1ab09ea5f720..a919ab7d8a5a 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -179,6 +179,7 @@ AARCH64_CORE("cortex-a710", cortexa710, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, AARCH64_CORE("cortex-a715", cortexa715, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd4d, -1) AARCH64_CORE("cortex-a720", cortexa720, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-a725", cortexa725, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd87, -1) AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd48, -1) @@ -186,11 +187,16 @@ AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8M AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-x925", cortexx925, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd85, -1) + AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1) AARCH64_CORE("cobalt-100", cobalt100, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1) +AARCH64_CORE("neoverse-n3", neoversen3, cortexa57, V9_2A, (SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd8e, -1) AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, -1) +AARCH64_CORE("neoverse-v3", neoversev3, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE), neoversev2, 0x41, 0xd84, -1) +AARCH64_CORE("neoverse-v3ae", neoversev3ae, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE), neoversev2, 0x41, 0xd83, -1) AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md index 06e8680607bd..35b27ddb8831 100644 --- a/gcc/config/aarch64/aarch64-tune.md +++ b/gcc/config/aarch64/aarch64-tune.md @@ -1,5 +1,5 @@ ;; -*- buffer-read-only: t -*- ;; Generated automatically by gentune.sh from aarch64-cores.def (define_attr "tune" - 
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,fujitsu_monaka,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,demeter,generic,generic_armv8_a,generic_armv9_a" + "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,fujitsu_monaka,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cort
[gcc r15-4802] middle-end: Lower all gconds during vector pattern matching [PR117176]
https://gcc.gnu.org/g:d2f9159cfe7ea904e6476cabefea0c6ac9532e29 commit r15-4802-gd2f9159cfe7ea904e6476cabefea0c6ac9532e29 Author: Tamar Christina Date: Thu Oct 31 12:50:23 2024 + middle-end: Lower all gconds during vector pattern matching [PR117176] I have been taking a look at boolean handling once more in the vectorizer. There are two situations to consider: 1. when the booleans being created are created from comparing data inputs, then for the resulting vector boolean we need to know the vector type and the precision. In this case, when we have an operation such as NOT on the data element, this has to be lowered to XOR because the truncation to the vector precision needs to be explicit. 2. when the boolean being created comes from another boolean operation, then we don't need to lower NOT, as the precision doesn't change. We don't do any lowering for these (as denoted in check_bool_pattern) and instead the precision is copied from the element feeding the boolean statement during VF analysis. For early break gcond lowering, in order to correctly handle the second scenario above, we punted the lowering of VECT_SCALAR_BOOLEAN_TYPE_P comparisons that were already in the right shape. e.g. e != 0 where e is a boolean does not need any lowering. The issue however is that the statement feeding e may need to be lowered in the case where it's a data expression. This patch changes a bit how we do the lowering. We now always emit an additional compare. e.g. if the input is: if (e != 0) where e is a boolean, we would punt on this before, but now we generate f = e != 0; if (f != 0) We then use the same infrastructure as recog_bool to ask it to lower f, and in doing so handle any boolean conversions that need to be lowered. Because we now guarantee that f is an internal def we can also simplify the SLP building code. When e is a boolean, the precision we build for f needs to reflect the precision of the operation feeding e. To get this value we use integer_type_for_mask the same way recog_bool does, and if it's defined (e.g. we have a data conversion somewhere) we pass that precision on instead. This gets us the correct VF on the newly lowered boolean expressions. gcc/ChangeLog: PR tree-optimization/117176 * tree-vect-patterns.cc (vect_recog_gcond_pattern): Lower all gconds. * tree-vect-slp.cc (vect_analyze_slp): No longer check for in vect def. gcc/testsuite/ChangeLog: PR tree-optimization/117176 * gcc.dg/vect/vect-early-break_130-pr117176.c: New test.
Diff: --- .../gcc.dg/vect/vect-early-break_130-pr117176.c| 21 gcc/tree-vect-patterns.cc | 19 ++- gcc/tree-vect-slp.cc | 39 +- 3 files changed, 40 insertions(+), 39 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_130-pr117176.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_130-pr117176.c new file mode 100644 index ..841dcce284dd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_130-pr117176.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +struct ColorSpace { + int componentCt; +}; + +struct Psnr { + double psnr[3]; +}; + +int f(struct Psnr psnr, struct ColorSpace colorSpace) { + int i, hitsTarget = 1; + + for (i = 1; i < colorSpace.componentCt && hitsTarget; ++i) +hitsTarget = !(psnr.psnr[i] < 1); + + return hitsTarget; +} diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index 945e7d2dc45d..a708234304fe 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -5426,17 +5426,19 @@ vect_recog_gcond_pattern (vec_info *vinfo, if (VECTOR_TYPE_P (scalar_type)) return NULL; - if (code == NE_EXPR - && zerop (rhs) - && VECT_SCALAR_BOOLEAN_TYPE_P (scalar_type)) -return NULL; + /* If the input is a boolean then try to figure out the precision that the + vector type should use. We cannot use the scalar precision as this would + later mismatch. This is similar to what recog_bool does. */ + if (VECT_SCALAR_BOOLEAN_TYPE_P (scalar_type)) +{ + if (tree stype = integer_type_for_mask (lhs, vinfo)) + scalar_type = stype; +} - tree vecitype = get_vectype_for_scalar_type (vinfo, scalar_type); - if (vecitype == NULL_TREE) + tree vectype = get_mask_type_for_scalar_type (vinfo, scalar_type); + if (vectype == NULL_TREE) return NULL; - tree vectype = truth_type_for (vecitype); - tree new_lhs = vect_recog_temp_ssa_var (boolean_type_node, NULL
[gcc r15-3792] middle-end: Insert invariant instructions before the gsi [PR116812]
https://gcc.gnu.org/g:09892448ebd8c396a26b2c09ba71f1e5a8dc42d7 commit r15-3792-g09892448ebd8c396a26b2c09ba71f1e5a8dc42d7 Author: Tamar Christina Date: Mon Sep 23 11:45:43 2024 +0100 middle-end: Insert invariant instructions before the gsi [PR116812] The new invariant statements should be inserted before the current statement and not after. This goes fine 99% of the time but when the current statement is a gcond the control flow gets corrupted. gcc/ChangeLog: PR tree-optimization/116812 * tree-vect-slp.cc (vect_slp_region): Fix insertion. gcc/testsuite/ChangeLog: PR tree-optimization/116812 * gcc.dg/vect/pr116812.c: New test. Diff: --- gcc/testsuite/gcc.dg/vect/pr116812.c | 17 + gcc/tree-vect-slp.cc | 6 ++ 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr116812.c b/gcc/testsuite/gcc.dg/vect/pr116812.c new file mode 100644 index ..3e83c13d94bd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr116812.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O2 -fno-tree-dce -fno-tree-dse" } */ + +int a, b, c, d, e, f[2], g, h; +int k(int j) { return 2 >> a ? 2 >> a : a; } +int main() { + int i; + for (; g; g = k(d = 0)) +; + if (a) +b && h; + for (e = 0; e < 2; e++) +c = d & 1 ? d : 0; + for (i = 0; i < 2; i++) +f[i] = 0; + return 0; +} diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 600987dd6e5d..7161492f5114 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -9168,10 +9168,8 @@ vect_slp_region (vec bbs, vec datarefs, dump_printf_loc (MSG_NOTE, vect_location, "-->generating invariant statements\n"); - gimple_stmt_iterator gsi; - gsi = gsi_after_labels (bb_vinfo->bbs[0]); - gsi_insert_seq_after (&gsi, bb_vinfo->inv_pattern_def_seq, - GSI_CONTINUE_LINKING); + bb_vinfo->insert_seq_on_entry (NULL, +bb_vinfo->inv_pattern_def_seq); } } else
[gcc r15-3767] aarch64: Take into account when VF is higher than known scalar iters
https://gcc.gnu.org/g:e84e5d034124c6733d3b36d8623c56090d4d17f7 commit r15-3767-ge84e5d034124c6733d3b36d8623c56090d4d17f7 Author: Tamar Christina Date: Sun Sep 22 13:34:10 2024 +0100 aarch64: Take into account when VF is higher than known scalar iters Consider low overhead loops like: void foo (char *restrict a, int *restrict b, int *restrict c, int n) { for (int i = 0; i < 9; i++) { int res = c[i]; int t = b[i]; if (a[i] != 0) res = t; c[i] = res; } } For such loops we use latency only costing since the loop bounds is known and small. The current costing however does not consider the case where niters < VF. So when comparing the scalar vs vector costs it doesn't keep in mind that the scalar code can't perform VF iterations. This makes it overestimate the cost for the scalar loop and we incorrectly vectorize. This patch takes the minimum of the VF and niters in such cases. Before the patch we generate: note: Original vector body cost = 46 note: Vector loop iterates at most 1 times note: Scalar issue estimate: note:load operations = 2 note:store operations = 1 note:general operations = 1 note:reduction latency = 0 note:estimated min cycles per iteration = 1.00 note:estimated cycles per vector iteration (for VF 32) = 32.00 note: SVE issue estimate: note:load operations = 5 note:store operations = 4 note:general operations = 11 note:predicate operations = 12 note:reduction latency = 0 note:estimated min cycles per iteration without predication = 5.50 note:estimated min cycles per iteration for predication = 12.00 note:estimated min cycles per iteration = 12.00 note: Low iteration count, so using pure latency costs note: Cost model analysis: vs after: note: Original vector body cost = 46 note: Known loop bounds, capping VF to 9 for analysis note: Vector loop iterates at most 1 times note: Scalar issue estimate: note:load operations = 2 note:store operations = 1 note:general operations = 1 note:reduction latency = 0 note:estimated min cycles per iteration = 1.00 note:estimated cycles per vector iteration (for VF 9) = 9.00 note: SVE issue estimate: note:load operations = 5 note:store operations = 4 note:general operations = 11 note:predicate operations = 12 note:reduction latency = 0 note:estimated min cycles per iteration without predication = 5.50 note:estimated min cycles per iteration for predication = 12.00 note:estimated min cycles per iteration = 12.00 note: Increasing body cost to 1472 because the scalar code could issue within the limit imposed by predicate operations note: Low iteration count, so using pure latency costs note: Cost model analysis: gcc/ChangeLog: * config/aarch64/aarch64.cc (adjust_body_cost): Cap VF for low iteration loops. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/asrdiv_4.c: Update bounds. * gcc.target/aarch64/sve/cond_asrd_2.c: Likewise. * gcc.target/aarch64/sve/cond_uxt_6.c: Likewise. * gcc.target/aarch64/sve/cond_uxt_7.c: Likewise. * gcc.target/aarch64/sve/cond_uxt_8.c: Likewise. * gcc.target/aarch64/sve/miniloop_1.c: Likewise. * gcc.target/aarch64/sve/spill_6.c: Likewise. * gcc.target/aarch64/sve/sve_iters_low_1.c: New test. * gcc.target/aarch64/sve/sve_iters_low_2.c: New test. 
Diff: --- gcc/config/aarch64/aarch64.cc| 13 + gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c | 12 ++-- gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c | 12 ++-- gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c| 8 gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c| 8 gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c| 8 gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c| 2 +- gcc/testsuite/gcc.target/aarch64/sve/spill_6.c | 8 .../gcc.target/aarch64/sve/sve_iters_low_1.c | 17 + .../gcc.target/aarch64/sve/sve_iters_low_2.c | 20 10 files changed, 79 insertions(+), 29 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 92763d403c75..68913beaee20 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -17565,6 +17565,19 @@ adjust_body_cost (loop_vec_info loop_vinfo, dump_printf_loc (MSG_NOTE, vect_location, "Origina
[gcc r15-3768] middle-end: lower COND_EXPR into gimple form in vect_recog_bool_pattern
https://gcc.gnu.org/g:4150bcd205ebb60b949224758c05012c0dfab7a7 commit r15-3768-g4150bcd205ebb60b949224758c05012c0dfab7a7 Author: Tamar Christina Date: Sun Sep 22 13:38:49 2024 +0100 middle-end: lower COND_EXPR into gimple form in vect_recog_bool_pattern Currently the vectorizer cheats when lowering COND_EXPR during bool recog. In the cases where the conditional is loop invariant or non-boolean it instead converts the operation back into GENERIC and hides much of the operation from the analysis part of the vectorizer. i.e. a ? b : c is transformed into a != 0 ? b : c; however, by doing so we can't perform any optimization on the mask as the masks aren't explicit until quite late during codegen. To fix this, this patch lowers booleans earlier and so ensures that we are always in GIMPLE. For when the value is a loop invariant boolean we have to generate an additional conversion from bool to the integer mask form. This is done by creating a loop invariant a ? -1 : 0 with the target mask precision and then doing a normal != 0 comparison on that. To support this the patch also adds the ability to create, during pattern matching, a loop invariant pattern that won't be seen by the vectorizer and will instead be materialized inside the loop preheader in the case of loops, or in the case of BB vectorization it materializes it in the first BB in the region. gcc/ChangeLog: * tree-vect-patterns.cc (append_inv_pattern_def_seq): New. (vect_recog_bool_pattern): Lower COND_EXPRs. * tree-vect-slp.cc (vect_slp_region): Materialize loop invariant statements. * tree-vect-loop.cc (vect_transform_loop): Likewise. * tree-vect-stmts.cc (vectorizable_comparison_1): Remove VECT_SCALAR_BOOLEAN_TYPE_P handling for vectype. * tree-vectorizer.cc (vec_info::vec_info): Initialize inv_pattern_def_seq. * tree-vectorizer.h (LOOP_VINFO_INV_PATTERN_DEF_SEQ): New. (class vec_info): Add inv_pattern_def_seq. gcc/testsuite/ChangeLog: * gcc.dg/vect/bb-slp-conditional_store_1.c: New test. * gcc.dg/vect/vect-conditional_store_5.c: New test. * gcc.dg/vect/vect-conditional_store_6.c: New test.
Diff: --- .../gcc.dg/vect/bb-slp-conditional_store_1.c | 15 + .../gcc.dg/vect/vect-conditional_store_5.c | 28 .../gcc.dg/vect/vect-conditional_store_6.c | 24 + gcc/tree-vect-loop.cc | 12 +++ gcc/tree-vect-patterns.cc | 39 -- gcc/tree-vect-slp.cc | 14 gcc/tree-vect-stmts.cc | 6 +--- gcc/tree-vectorizer.cc | 3 +- gcc/tree-vectorizer.h | 7 9 files changed, 139 insertions(+), 9 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-conditional_store_1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-conditional_store_1.c new file mode 100644 index ..650a3bfbfb1d --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-conditional_store_1.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_float } */ + +/* { dg-additional-options "-mavx2" { target avx2 } } */ +/* { dg-additional-options "-march=armv9-a" { target aarch64-*-* } } */ + +void foo3 (float *restrict a, int *restrict c) +{ +#pragma GCC unroll 8 + for (int i = 0; i < 8; i++) +c[i] = a[i] > 1.0; +} + +/* { dg-final { scan-tree-dump "vectorized using SLP" "slp1" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_5.c b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_5.c new file mode 100644 index ..37d60fa76351 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_5.c @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_masked_store } */ + +/* { dg-additional-options "-mavx2" { target avx2 } } */ +/* { dg-additional-options "-march=armv9-a" { target aarch64-*-* } } */ + +#include + +void foo3 (float *restrict a, int *restrict b, int *restrict c, int n, int stride) +{ + if (stride <= 1) +return; + + bool ai = a[0]; + + for (int i = 0; i < n; i++) +{ + int res = c[i]; + int t = b[i+stride]; + if (ai) +t = res; + c[i] = t; +} +} + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ +/* { dg-final { scan-tree-dump-not "VEC_COND_EXPR " "vect" { target aarch64-*-* } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_6.c b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_6.c new file mode 100644 index ..5e1aedf3726b --- /
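In source-level terms, the lowering of a loop-invariant boolean condition described above has roughly the shape of the C fragment below. This is only an illustration of the before/after form under that assumption: the real transformation operates on GIMPLE pattern statements, and the -1/0 invariant is materialized in the loop preheader (or in the first BB of the region for BB SLP).

#include <stdbool.h>
#include <stdint.h>

/* Illustrative hand-lowered version; the 32-bit mask precision is chosen
   here purely for the example.  */
void
cond_store_lowered (int *restrict c, const int *restrict b, bool ai, int n)
{
  /* The loop-invariant boolean is turned into an integer mask source once,
     conceptually in the loop preheader: a ? -1 : 0.  */
  int32_t inv = ai ? -1 : 0;

  for (int i = 0; i < n; i++)
    {
      int res = c[i];
      int t = b[i];
      /* The condition is now an explicit "!= 0" comparison on an integer,
         which bool recog can convert into a proper vector mask.  */
      if (inv != 0)
        t = res;
      c[i] = t;
    }
}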
[gcc r15-3800] aarch64: store signing key and signing method in DWARF _Unwind_FrameState
https://gcc.gnu.org/g:f531673917e4f80ad51eda0d806f0479c501a907 commit r15-3800-gf531673917e4f80ad51eda0d806f0479c501a907 Author: Matthieu Longo Date: Mon Sep 23 15:03:30 2024 +0100 aarch64: store signing key and signing method in DWARF _Unwind_FrameState This patch is only a refactoring of the existing implementation of PAuth and return-address signing. The existing behavior is preserved. _Unwind_FrameState already contains several pieces of CIE and FDE information (see the attributes below the comment "The information we care about from the CIE/FDE" in libgcc/unwind-dw2.h). The patch aims at moving the information from the DWARF CIE (signing key stored in the augmentation string) and FDE (the used signing method) into _Unwind_FrameState alongside the already-stored CIE and FDE information. Note: this information has to be saved in frame_state_reg_info instead of _Unwind_FrameState as it needs to be savable by DW_CFA_remember_state and restorable by DW_CFA_restore_state, which both rely on the attribute "prev". This new information in _Unwind_FrameState simplifies the look-up of the signing key when the return address is demangled. It also allows future signing methods to be easily added. _Unwind_FrameState is not a part of the public API of libunwind, so the change is backward compatible. A new architecture-specific handler MD_ARCH_EXTENSION_FRAME_INIT allows resetting values (if needed) in the frame state and unwind context before changing the frame state to the caller context. A new architecture-specific handler MD_ARCH_EXTENSION_CIE_AUG_HANDLER isolates the architecture-specific augmentation strings in the AArch64 backend, and allows other architectures to reuse augmentation strings that would have clashed with AArch64 DWARF extensions. The aarch64_demangle_return_addr, DW_CFA_AARCH64_negate_ra_state and DW_CFA_val_expression cases in libgcc/unwind-dw2-execute_cfa.h were documented to clarify where the value of the RA state register is stored (FS and CONTEXT respectively). libgcc/ChangeLog: * config/aarch64/aarch64-unwind.h (AARCH64_DWARF_RA_STATE_MASK): The mask for RA state register. (aarch64_ra_signing_method_t): The diversifiers used to sign a function's return address. (aarch64_pointer_auth_key): The key used to sign a function's return address. (aarch64_cie_signed_with_b_key): Deleted as the signing key is available now in _Unwind_FrameState. (MD_ARCH_EXTENSION_CIE_AUG_HANDLER): New CIE augmentation string handler for architecture extensions. (MD_ARCH_EXTENSION_FRAME_INIT): New architecture-extension initialization routine for DWARF frame state and context before execution of DWARF instructions. (aarch64_context_ra_state_get): Read RA state register from CONTEXT. (aarch64_ra_state_get): Read RA state register from FS. (aarch64_ra_state_set): Write RA state register into FS. (aarch64_ra_state_toggle): Toggle RA state register in FS. (aarch64_cie_aug_handler): Handle AArch64 augmentation strings. (aarch64_arch_extension_frame_init): Initialize defaults for the signing key (PAUTH_KEY_A), and RA state register (RA_no_signing). (aarch64_demangle_return_addr): Rely on the frame registers and the signing_key attribute in _Unwind_FrameState. * unwind-dw2-execute_cfa.h: Use the right alias DW_CFA_AARCH64_negate_ra_state for __aarch64__ instead of DW_CFA_GNU_window_save. (DW_CFA_AARCH64_negate_ra_state): Save the signing method in RA state register. Toggle RA state register without resetting 'how' to REG_UNSAVED.
* unwind-dw2.c: (extract_cie_info): Save the signing key in the current _Unwind_FrameState while parsing the augmentation data. (uw_frame_state_for): Reset some attributes related to architecture extensions in _Unwind_FrameState. (uw_update_context): Move authentication code to AArch64 unwinding. * unwind-dw2.h (enum register_rule): Give a name to the existing enum for the register rules, and replace 'unsigned char' by 'enum register_rule' to facilitate debugging in GDB. (_Unwind_FrameState): Add a new architecture-extension attribute to store the signing key. Diff: --- libgcc/config/aarch64/aarch64-unwind.h | 145 +++-- libgcc/unwind-dw2-execute_cfa.h| 26 +++--- libgcc/unwind-dw2.c| 19 +++-- libgcc/unwind-dw2.h| 17 +++- 4 files changed, 159 insertions(+), 48 deletions(-) diff --git a/libgcc/config/a
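As a rough data-layout picture of what the commit describes (simplified stand-in declarations, not the real libgcc headers), the signing key and the RA state now live next to the per-register rules, so DW_CFA_remember_state/DW_CFA_restore_state can save and restore them through the usual "prev" chain:

/* Simplified stand-ins; the real declarations are in
   libgcc/config/aarch64/aarch64-unwind.h and libgcc/unwind-dw2.h.  */

typedef enum
{
  sketch_ra_no_signing = 0,
  sketch_ra_signing_sp = 1   /* return address signed against SP (paciasp) */
} sketch_ra_signing_method_t;

typedef enum
{
  sketch_pauth_key_a,
  sketch_pauth_key_b
} sketch_pointer_auth_key;

struct sketch_frame_state_reg_info
{
  /* ... the existing per-register rules and CFA information ... */

  /* New in this refactoring: which key the CIE's augmentation string
     selected, and how (or whether) the return address is currently signed.
     Keeping these here rather than in the outer _Unwind_FrameState is what
     lets DW_CFA_remember_state / DW_CFA_restore_state preserve them.  */
  sketch_pointer_auth_key signing_key;
  sketch_ra_signing_method_t ra_state;

  struct sketch_frame_state_reg_info *prev;
};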
[gcc r15-3802] libgcc: hide CIE and FDE data for DWARF architecture extensions behind a handler.
https://gcc.gnu.org/g:bdf41d627c13bc5f0dc676991f4513daa9d9ae36 commit r15-3802-gbdf41d627c13bc5f0dc676991f4513daa9d9ae36 Author: Matthieu Longo Date: Mon Sep 23 15:03:37 2024 +0100 libgcc: hide CIE and FDE data for DWARF architecture extensions behind a handler. This patch provides a new handler MD_ARCH_FRAME_STATE_T to hide an architecture-specific structure containing CIE and FDE data related to DWARF architecture extensions. Hiding the architecture-specific attributes behind a handler has the following benefits: 1. isolating those data from the generic ones in _Unwind_FrameState 2. avoiding casts to custom types. 3. preserving typing information when debugging with GDB, and so facilitating their printing. This approach required to add a new header md-unwind-def.h included at the top of libgcc/unwind-dw2.h, and redirecting to the corresponding architecture header via a symbolic link. An obvious drawback is the increase in complexity with macros, and headers. It also caused a split of architecture definitions between md-unwind-def.h (types definitions used in unwind-dw2.h) and md-unwind.h (local types definitions and handlers implementations). The naming of md-unwind.h with .h extension is a bit misleading as the file is only included in the middle of unwind-dw2.c. Changing this naming would require modification of others backends, which I prefered to abstain from. Overall the benefits are worth the added complexity from my perspective. libgcc/ChangeLog: * Makefile.in: New target for symbolic link to md-unwind-def.h * config.host: New parameter md_unwind_def_header. Set it to aarch64/aarch64-unwind-def.h for AArch64 targets, or no-unwind.h by default. * config/aarch64/aarch64-unwind.h (aarch64_pointer_auth_key): Move to aarch64-unwind-def.h (aarch64_cie_aug_handler): Update. (aarch64_arch_extension_frame_init): Update. (aarch64_demangle_return_addr): Update. * configure.ac: New substitute variable md_unwind_def_header. * unwind-dw2.h (defined): MD_ARCH_FRAME_STATE_T. * config/aarch64/aarch64-unwind-def.h: New file. * configure: Regenerate. * config/no-unwind.h: Updated comment Diff: --- libgcc/Makefile.in | 6 - libgcc/config.host | 13 -- libgcc/config/aarch64/aarch64-unwind-def.h | 41 ++ libgcc/config/aarch64/aarch64-unwind.h | 14 -- libgcc/config/no-unwind.h | 3 ++- libgcc/configure | 2 ++ libgcc/configure.ac| 1 + libgcc/unwind-dw2.h| 6 +++-- 8 files changed, 71 insertions(+), 15 deletions(-) diff --git a/libgcc/Makefile.in b/libgcc/Makefile.in index 0e46e9ef7686..ffc45f212672 100644 --- a/libgcc/Makefile.in +++ b/libgcc/Makefile.in @@ -47,6 +47,7 @@ with_aix_soname = @with_aix_soname@ solaris_ld_v2_maps = @solaris_ld_v2_maps@ enable_execute_stack = @enable_execute_stack@ unwind_header = @unwind_header@ +md_unwind_def_header = @md_unwind_def_header@ md_unwind_header = @md_unwind_header@ sfp_machine_header = @sfp_machine_header@ thread_header = @thread_header@ @@ -358,13 +359,16 @@ SHLIBUNWIND_INSTALL = # Create links to files specified in config.host. 
-LIBGCC_LINKS = enable-execute-stack.c unwind.h md-unwind-support.h \ +LIBGCC_LINKS = enable-execute-stack.c \ + unwind.h md-unwind-def.h md-unwind-support.h \ sfp-machine.h gthr-default.h enable-execute-stack.c: $(srcdir)/$(enable_execute_stack) -$(LN_S) $< $@ unwind.h: $(srcdir)/$(unwind_header) -$(LN_S) $< $@ +md-unwind-def.h: $(srcdir)/config/$(md_unwind_def_header) + -$(LN_S) $< $@ md-unwind-support.h: $(srcdir)/config/$(md_unwind_header) -$(LN_S) $< $@ sfp-machine.h: $(srcdir)/config/$(sfp_machine_header) diff --git a/libgcc/config.host b/libgcc/config.host index 4fb4205478a8..5c6b656531ff 100644 --- a/libgcc/config.host +++ b/libgcc/config.host @@ -51,8 +51,10 @@ # If either is set, EXTRA_PARTS and # EXTRA_MULTILIB_PARTS inherited from the GCC # subdirectory will be ignored. -# md_unwind_headerThe name of a header file defining -# MD_FALLBACK_FRAME_STATE_FOR. +# md_unwind_def_header The name of a header file defining architecture +# -specific frame information types for unwinding. +# md_unwind_headerThe name of a header file defining architecture +# -specific handlers used in the unwinder. # sfp_machine_header The name of a sfp-machine.h header file for soft-fp. # Defaults to "$cpu_type/sfp-machine.h" if it exists,
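The mechanism is essentially "let each target contribute a typed member to the generic frame state through a per-target header". A minimal sketch of the pattern with invented names (the real contents live in libgcc/config/aarch64/aarch64-unwind-def.h and libgcc/unwind-dw2.h):

/* --- what an architecture's md-unwind-def.h could provide (invented) --- */
struct sketch_arch_frame_state
{
  int signing_key;   /* CIE data for the architecture extension */
  int ra_state;      /* FDE data for the architecture extension */
};
#define MD_ARCH_FRAME_STATE_T struct sketch_arch_frame_state

/* --- how the generic unwinder header could consume it (sketch) --- */
typedef struct
{
  /* ... generic CIE/FDE fields ... */
#ifdef MD_ARCH_FRAME_STATE_T
  /* Typed and visible to GDB; no casts to opaque storage needed.  */
  MD_ARCH_FRAME_STATE_T arch_ext;
#endif
} sketch_frame_state;

Targets without DWARF extensions get the default no-unwind.h link via config.host, which presumably leaves MD_ARCH_FRAME_STATE_T undefined so the extra member is never declared.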
[gcc r15-3801] aarch64: skip copy of RA state register into target context
https://gcc.gnu.org/g:ba3e597681b640f6f9a676ec4f6cd3ca3878cefc commit r15-3801-gba3e597681b640f6f9a676ec4f6cd3ca3878cefc Author: Matthieu Longo Date: Mon Sep 23 15:03:35 2024 +0100 aarch64: skip copy of RA state register into target context The RA state register is local to a frame, so it should not be copied to the target frame during the context installation. This patch adds a new backend handler that check whether a register needs to be skipped or not before its installation. libgcc/ChangeLog: * config/aarch64/aarch64-unwind.h (MD_FRAME_LOCAL_REGISTER_P): new handler checking whether a register from the current context needs to be skipped before installation into the target context. (aarch64_frame_local_register): Likewise. * unwind-dw2.c (uw_install_context_1): use MD_FRAME_LOCAL_REGISTER_P. Diff: --- libgcc/config/aarch64/aarch64-unwind.h | 11 +++ libgcc/unwind-dw2.c| 5 + 2 files changed, 16 insertions(+) diff --git a/libgcc/config/aarch64/aarch64-unwind.h b/libgcc/config/aarch64/aarch64-unwind.h index 94ea5891b4eb..52bfd5409798 100644 --- a/libgcc/config/aarch64/aarch64-unwind.h +++ b/libgcc/config/aarch64/aarch64-unwind.h @@ -53,6 +53,9 @@ typedef enum { #define MD_DEMANGLE_RETURN_ADDR(context, fs, addr) \ aarch64_demangle_return_addr (context, fs, addr) +#define MD_FRAME_LOCAL_REGISTER_P(reg) \ + aarch64_frame_local_register (reg) + static inline aarch64_ra_signing_method_t aarch64_context_ra_state_get (struct _Unwind_Context *context) { @@ -127,6 +130,14 @@ aarch64_arch_extension_frame_init (struct _Unwind_Context *context ATTRIBUTE_UNU aarch64_fs_ra_state_set (fs, aarch64_ra_no_signing); } +/* Before copying the current context to the target context, check whether + the register is local to this context and should not be forwarded. */ +static inline bool +aarch64_frame_local_register(long reg) +{ + return (reg == AARCH64_DWARF_REGNUM_RA_STATE); +} + /* Do AArch64 private extraction on ADDR_WORD based on context info CONTEXT and unwind frame info FS. If ADDR_WORD is signed, we do address authentication on it using CFA of current frame. diff --git a/libgcc/unwind-dw2.c b/libgcc/unwind-dw2.c index 40d64c0c0a39..5f33f80670ac 100644 --- a/libgcc/unwind-dw2.c +++ b/libgcc/unwind-dw2.c @@ -1423,6 +1423,11 @@ uw_install_context_1 (struct _Unwind_Context *current, void *c = (void *) (_Unwind_Internal_Ptr) current->reg[i]; void *t = (void *) (_Unwind_Internal_Ptr)target->reg[i]; +#ifdef MD_FRAME_LOCAL_REGISTER_P + if (MD_FRAME_LOCAL_REGISTER_P (i)) + continue; +#endif + gcc_assert (current->by_value[i] == 0); if (target->by_value[i] && c) {
[gcc r15-3804] dwarf2: add hooks for architecture-specific CFIs
https://gcc.gnu.org/g:9e1c71bab50d51a1a8ec1a75080ffde6ca3d854c commit r15-3804-g9e1c71bab50d51a1a8ec1a75080ffde6ca3d854c Author: Matthieu Longo Date: Mon Sep 23 15:34:57 2024 +0100 dwarf2: add hooks for architecture-specific CFIs Architecture-specific CFI directives are currently declared and processed among other architecture-independent CFI directives in the gcc/dwarf2* files. This approach creates confusion, specifically in the case of DWARF instructions in the vendor space that use the same instruction code. Such a clash currently happens between DW_CFA_GNU_window_save (used on SPARC) and DW_CFA_AARCH64_negate_ra_state (used on AArch64), both having the same instruction code 0x2d. AArch64 compilers then generate a SPARC CFI directive (.cfi_window_save) instead of .cfi_negate_ra_state, contrary to what is expected in [DWARF for the Arm 64-bit Architecture (AArch64)](https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst). This refactoring does not completely solve the problem, but improves the situation by moving some of the processing of those directives (more specifically their output in the assembly) to the backend via 2 target hooks: - DW_CFI_OPRND1_DESC: parse the first operand of the directive (if any). - OUTPUT_CFI_DIRECTIVE: output the CFI directive as a string. Additionally, this patch also contains a renaming of an enum used for return address mangling on AArch64. gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_output_cfi_directive): New hook for CFI directives. (aarch64_dw_cfi_oprnd1_desc): Same. (TARGET_OUTPUT_CFI_DIRECTIVE): Hook for output_cfi_directive. (TARGET_DW_CFI_OPRND1_DESC): Hook for dw_cfi_oprnd1_desc. * config/sparc/sparc.cc (sparc_output_cfi_directive): New hook for CFI directives. (sparc_dw_cfi_oprnd1_desc): Same. (TARGET_OUTPUT_CFI_DIRECTIVE): Hook for output_cfi_directive. (TARGET_DW_CFI_OPRND1_DESC): Hook for dw_cfi_oprnd1_desc. * coretypes.h (struct dw_cfi_node): Forward declaration of CFI type from gcc/dwarf2out.h. (enum dw_cfi_oprnd_type): Same. (enum dwarf_call_frame_info): Same. * doc/tm.texi: Regenerated from doc/tm.texi.in. * doc/tm.texi.in: Add doc for new target hooks. * dwarf2cfi.cc (struct dw_cfi_row): Update the description for window_save and ra_mangled. (dwarf2out_frame_debug_cfa_negate_ra_state): Use AArch64 CFI directive instead of the SPARC one. (change_cfi_row): Use the right CFI directive's name for RA mangling. (output_cfi): Remove explicit architecture-specific CFI directive DW_CFA_GNU_window_save that falls into default case. (output_cfi_directive): Use target hook as default. * dwarf2out.cc (dw_cfi_oprnd1_desc): Use target hook as default. * dwarf2out.h (enum dw_cfi_oprnd_type): specify underlying type of enum to allow forward declaration. (dw_cfi_oprnd1_desc): Call target hook. (output_cfi_directive): Use dw_cfi_ref instead of struct dw_cfi_node *. * hooks.cc (hook_bool_dwcfi_dwcfioprndtyperef_false): New. (hook_bool_FILEptr_dwcfiptr_false): New. * hooks.h (hook_bool_dwcfi_dwcfioprndtyperef_false): New. (hook_bool_FILEptr_dwcfiptr_false): New. * target.def: Documentation for new hooks. include/ChangeLog: * dwarf2.h (enum dwarf_call_frame_info): specify underlying type of enum to allow forward declaration. libffi/ChangeLog: * include/ffi_cfi.h (cfi_negate_ra_state): Declare AArch64 cfi directive. libgcc/ChangeLog: * config/aarch64/aarch64-asm.h (PACIASP): Replace SPARC CFI directive by AArch64 one. (AUTIASP): Same. libitm/ChangeLog: * config/aarch64/sjlj.S: Replace SPARC CFI directive by AArch64 one.
gcc/testsuite/ChangeLog: * g++.target/aarch64/pr94515-1.C: Replace SPARC CFI directive by AArch64 one. * g++.target/aarch64/pr94515-2.C: Same. Diff: --- gcc/config/aarch64/aarch64.cc| 33 ++ gcc/config/sparc/sparc.cc| 35 gcc/coretypes.h | 6 + gcc/doc/tm.texi | 16 - gcc/doc/tm.texi.in | 5 +++- gcc/dwarf2cfi.cc | 31 gcc/dwarf2out.cc | 13 +++ gcc/dwarf2o
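The diff above is truncated, but the default-hook names added to hooks.cc (hook_bool_FILEptr_dwcfiptr_false and hook_bool_dwcfi_dwcfioprndtyperef_false) suggest the rough shape of a backend implementation. The sketch below is inferred from those names and the commit message only, with simplified stand-in types; the actual aarch64.cc and sparc.cc hooks will differ in detail.

#include <stdio.h>
#include <stdbool.h>

/* Simplified stand-ins for GCC's CFI types -- inferred, not verbatim.  */
enum sketch_cfi_opc { SKETCH_DW_CFA_AARCH64_negate_ra_state = 0x2d };
struct sketch_dw_cfi_node { enum sketch_cfi_opc dw_cfi_opc; };
enum sketch_dw_cfi_oprnd_type { sketch_dw_cfi_oprnd_unused };

/* OUTPUT_CFI_DIRECTIVE-style hook: print the directive and return true if
   this target owns it, false to fall back to the generic handling.  */
static bool
sketch_output_cfi_directive (FILE *f, const struct sketch_dw_cfi_node *cfi)
{
  if (cfi->dw_cfi_opc == SKETCH_DW_CFA_AARCH64_negate_ra_state)
    {
      fprintf (f, "\t.cfi_negate_ra_state\n");
      return true;
    }
  return false;
}

/* DW_CFI_OPRND1_DESC-style hook: describe the first operand, if any.  */
static bool
sketch_dw_cfi_oprnd1_desc (enum sketch_cfi_opc opc,
                           enum sketch_dw_cfi_oprnd_type *desc)
{
  if (opc == SKETCH_DW_CFA_AARCH64_negate_ra_state)
    {
      *desc = sketch_dw_cfi_oprnd_unused;
      return true;
    }
  return false;
}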
[gcc r15-3803] Rename REG_CFA_TOGGLE_RA_MANGLE to REG_CFA_NEGATE_RA_STATE
https://gcc.gnu.org/g:4068096fbf5aef65883a7492f4940cea85b39f40 commit r15-3803-g4068096fbf5aef65883a7492f4940cea85b39f40 Author: Matthieu Longo Date: Mon Sep 23 15:31:18 2024 +0100 Rename REG_CFA_TOGGLE_RA_MANGLE to REG_CFA_NEGATE_RA_STATE The current name REG_CFA_TOGGLE_RA_MANGLE is not representative of what it really is, i.e. a register to represent several states, not only a binary one. Same for dwarf2out_frame_debug_cfa_toggle_ra_mangle. gcc/ChangeLog: * combine-stack-adj.cc (no_unhandled_cfa): Rename. * config/aarch64/aarch64.cc (aarch64_expand_prologue): Rename. (aarch64_expand_epilogue): Rename. * dwarf2cfi.cc (dwarf2out_frame_debug_cfa_toggle_ra_mangle): Rename this... (dwarf2out_frame_debug_cfa_negate_ra_state): To this. (dwarf2out_frame_debug): Rename. * reg-notes.def (REG_CFA_NOTE): Rename REG_CFA_TOGGLE_RA_MANGLE. Diff: --- gcc/combine-stack-adj.cc | 2 +- gcc/config/aarch64/aarch64.cc | 4 ++-- gcc/dwarf2cfi.cc | 8 gcc/reg-notes.def | 8 4 files changed, 11 insertions(+), 11 deletions(-) diff --git a/gcc/combine-stack-adj.cc b/gcc/combine-stack-adj.cc index 2da9bf2bc1ef..367d3b66b749 100644 --- a/gcc/combine-stack-adj.cc +++ b/gcc/combine-stack-adj.cc @@ -212,7 +212,7 @@ no_unhandled_cfa (rtx_insn *insn) case REG_CFA_SET_VDRAP: case REG_CFA_WINDOW_SAVE: case REG_CFA_FLUSH_QUEUE: - case REG_CFA_TOGGLE_RA_MANGLE: + case REG_CFA_NEGATE_RA_STATE: return false; } diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 68913beaee20..e41431d56ac4 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -9612,7 +9612,7 @@ aarch64_expand_prologue (void) default: gcc_unreachable (); } - add_reg_note (insn, REG_CFA_TOGGLE_RA_MANGLE, const0_rtx); + add_reg_note (insn, REG_CFA_NEGATE_RA_STATE, const0_rtx); RTX_FRAME_RELATED_P (insn) = 1; } @@ -10033,7 +10033,7 @@ aarch64_expand_epilogue (rtx_call_insn *sibcall) default: gcc_unreachable (); } - add_reg_note (insn, REG_CFA_TOGGLE_RA_MANGLE, const0_rtx); + add_reg_note (insn, REG_CFA_NEGATE_RA_STATE, const0_rtx); RTX_FRAME_RELATED_P (insn) = 1; } diff --git a/gcc/dwarf2cfi.cc b/gcc/dwarf2cfi.cc index 1231b5bb5f05..4ad9acbd6fd6 100644 --- a/gcc/dwarf2cfi.cc +++ b/gcc/dwarf2cfi.cc @@ -1547,13 +1547,13 @@ dwarf2out_frame_debug_cfa_window_save (void) cur_row->window_save = true; } -/* A subroutine of dwarf2out_frame_debug, process a REG_CFA_TOGGLE_RA_MANGLE. +/* A subroutine of dwarf2out_frame_debug, process a REG_CFA_NEGATE_RA_STATE. Note: DW_CFA_GNU_window_save dwarf opcode is reused for toggling RA mangle state, this is a target specific operation on AArch64 and can only be used on other targets if they don't use the window save operation otherwise. */ static void -dwarf2out_frame_debug_cfa_toggle_ra_mangle (void) +dwarf2out_frame_debug_cfa_negate_ra_state (void) { dw_cfi_ref cfi = new_cfi (); @@ -2341,8 +2341,8 @@ dwarf2out_frame_debug (rtx_insn *insn) handled_one = true; break; - case REG_CFA_TOGGLE_RA_MANGLE: - dwarf2out_frame_debug_cfa_toggle_ra_mangle (); + case REG_CFA_NEGATE_RA_STATE: + dwarf2out_frame_debug_cfa_negate_ra_state (); handled_one = true; break; diff --git a/gcc/reg-notes.def b/gcc/reg-notes.def index 5b878fb2a1cd..ddcf16b68be5 100644 --- a/gcc/reg-notes.def +++ b/gcc/reg-notes.def @@ -180,10 +180,10 @@ REG_CFA_NOTE (CFA_WINDOW_SAVE) the rest of the compiler as a CALL_INSN. */ REG_CFA_NOTE (CFA_FLUSH_QUEUE) -/* Attached to insns that are RTX_FRAME_RELATED_P, toggling the mangling status - of return address. Currently it's only used by AArch64. The argument is - ignored. 
*/ -REG_CFA_NOTE (CFA_TOGGLE_RA_MANGLE) +/* Attached to insns that are RTX_FRAME_RELATED_P, indicating an authentication + of the return address. Currently it's only used by AArch64. + The argument is ignored. */ +REG_CFA_NOTE (CFA_NEGATE_RA_STATE) /* Indicates what exception region an INSN belongs in. This is used to indicate what region to which a call may throw. REGION 0
[gcc r15-3805] aarch64 testsuite: explain expectations for pr94515* tests
https://gcc.gnu.org/g:fb475d3f25943beffac8e9c0c78247bad75287a1 commit r15-3805-gfb475d3f25943beffac8e9c0c78247bad75287a1 Author: Matthieu Longo Date: Mon Sep 23 15:35:02 2024 +0100 aarch64 testsuite: explain expectections for pr94515* tests gcc/testsuite/ChangeLog: * g++.target/aarch64/pr94515-1.C: Improve test documentation. * g++.target/aarch64/pr94515-2.C: Same. Diff: --- gcc/testsuite/g++.target/aarch64/pr94515-1.C | 8 ++ gcc/testsuite/g++.target/aarch64/pr94515-2.C | 39 +++- 2 files changed, 41 insertions(+), 6 deletions(-) diff --git a/gcc/testsuite/g++.target/aarch64/pr94515-1.C b/gcc/testsuite/g++.target/aarch64/pr94515-1.C index d5c114a83a82..359039e17536 100644 --- a/gcc/testsuite/g++.target/aarch64/pr94515-1.C +++ b/gcc/testsuite/g++.target/aarch64/pr94515-1.C @@ -15,12 +15,20 @@ void unwind (void) __attribute__((noinline, noipa, target("branch-protection=pac-ret"))) int test (int z) { + // paciasp -> cfi_negate_ra_state: RA_no_signing -> RA_signing_SP if (z) { asm volatile ("":::"x20","x21"); unwind (); +// autiasp -> cfi_negate_ra_state: RA_signing_SP -> RA_no_signing return 1; } else { +// 2nd cfi_negate_ra_state because the CFI directives are processed linearily. +// At this point, the unwinder would believe that the address is not signed +// due to the previous return. That's why the compiler has to emit second +// cfi_negate_ra_state to mean that the return address is still signed. +// cfi_negate_ra_state: RA_no_signing -> RA_signing_SP unwind (); +// autiasp -> cfi_negate_ra_state: RA_signing_SP -> RA_no_signing return 2; } } diff --git a/gcc/testsuite/g++.target/aarch64/pr94515-2.C b/gcc/testsuite/g++.target/aarch64/pr94515-2.C index f4abeed4..bdb65411a080 100644 --- a/gcc/testsuite/g++.target/aarch64/pr94515-2.C +++ b/gcc/testsuite/g++.target/aarch64/pr94515-2.C @@ -6,6 +6,7 @@ volatile int zero = 0; int global = 0; +/* This is a leaf function, so no .cfi_negate_ra_state directive is expected. */ __attribute__((noinline)) int bar(void) { @@ -13,29 +14,55 @@ int bar(void) return 0; } +/* This function does not return normally, so the address is signed but no + * authentication code is emitted. It means that only one CFI directive is + * supposed to be emitted at signing time. */ __attribute__((noinline, noreturn)) void unwind (void) { throw 42; } +/* This function has several return instructions, and alternates different RA + * states. 4 .cfi_negate_ra_state and a .cfi_remember_state/.cfi_restore_state + * should be emitted. + * + * Expected layout: + * A: path to return 0 without assignment to global + * B: global=y + branch back into A + * C: return 2 + * D: unwind + * Which gives with return pointer authentication: + * A: sign -> authenticate [2 negate_ra_states + remember_state for B] + * B: signed [restore_state] + * C: unsigned [negate_ra_state] + * D: signed [negate_ra_state] + */ __attribute__((noinline, noipa)) int test(int x) { - if (x==1) return 2; /* This return path may not use the stack. */ + // This return path may not use the stack. This means that the return address + // won't be signed. + if (x==1) return 2; + + // All the return paths of the code below must have RA mangle state set, and + // the return address must be signed. int y = bar(); if (y > global) global=y; - if (y==3) unwind(); /* This return path must have RA mangle state set. */ - return 0; + if (y==3) unwind(); // authentication of the return address is not required. + return 0; // authentication of the return address is required. } +/* This function requires only 2 .cfi_negate_ra_state. 
*/ int main () { + // paciasp -> cfi_negate_ra_state: RA_no_signing -> RA_signing_SP try { test (zero); -__builtin_abort (); +__builtin_abort (); // authentication of the return address is not required. } catch (...) { +// autiasp -> cfi_negate_ra_state: RA_signing_SP -> RA_no_signing return 0; } - __builtin_abort (); -} + __builtin_abort (); // authentication of the return address is not required. +} \ No newline at end of file
[gcc r15-3806] dwarf2: store the RA state in CFI row
https://gcc.gnu.org/g:2b7971448f122317ed012586f9f73ccc0537deb2 commit r15-3806-g2b7971448f122317ed012586f9f73ccc0537deb2 Author: Matthieu Longo Date: Mon Sep 23 15:35:07 2024 +0100 dwarf2: store the RA state in CFI row On AArch64, the RA state informs the unwinder whether the return address is mangled and how, or not. This information is encoded in a boolean in the CFI row. This binary approach prevents from expressing more complex configuration, as it is the case with PAuth_LR introduced in Armv9.5-A. This patch addresses this limitation by replacing the boolean by an enum. gcc/ChangeLog: * dwarf2cfi.cc (struct dw_cfi_row): Declare a new enum type to replace ra_mangled. (cfi_row_equal_p): Use ra_state instead of ra_mangled. (dwarf2out_frame_debug_cfa_negate_ra_state): Same. (change_cfi_row): Same. Diff: --- gcc/dwarf2cfi.cc | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/gcc/dwarf2cfi.cc b/gcc/dwarf2cfi.cc index f8d19d524299..1b94185a4966 100644 --- a/gcc/dwarf2cfi.cc +++ b/gcc/dwarf2cfi.cc @@ -57,6 +57,15 @@ along with GCC; see the file COPYING3. If not see #define DEFAULT_INCOMING_FRAME_SP_OFFSET INCOMING_FRAME_SP_OFFSET #endif + +/* Signing method used for return address authentication. + (AArch64 extension) */ +typedef enum +{ + ra_no_signing = 0x0, + ra_signing_sp = 0x1, +} ra_signing_method_t; + /* A collected description of an entire row of the abstract CFI table. */ struct GTY(()) dw_cfi_row { @@ -74,8 +83,8 @@ struct GTY(()) dw_cfi_row bool window_save; /* AArch64 extension for DW_CFA_AARCH64_negate_ra_state. - True if the return address is in a mangled state. */ - bool ra_mangled; + Enum which stores the return address state. */ + ra_signing_method_t ra_state; }; /* The caller's ORIG_REG is saved in SAVED_IN_REG. */ @@ -857,7 +866,7 @@ cfi_row_equal_p (dw_cfi_row *a, dw_cfi_row *b) if (a->window_save != b->window_save) return false; - if (a->ra_mangled != b->ra_mangled) + if (a->ra_state != b->ra_state) return false; return true; @@ -1554,8 +1563,11 @@ dwarf2out_frame_debug_cfa_negate_ra_state (void) { dw_cfi_ref cfi = new_cfi (); cfi->dw_cfi_opc = DW_CFA_AARCH64_negate_ra_state; + cur_row->ra_state += (cur_row->ra_state == ra_no_signing + ? ra_signing_sp + : ra_no_signing); add_cfi (cfi); - cur_row->ra_mangled = !cur_row->ra_mangled; } /* Record call frame debugging information for an expression EXPR, @@ -2412,12 +2424,12 @@ change_cfi_row (dw_cfi_row *old_row, dw_cfi_row *new_row) { dw_cfi_ref cfi = new_cfi (); - gcc_assert (!old_row->ra_mangled && !new_row->ra_mangled); + gcc_assert (!old_row->ra_state && !new_row->ra_state); cfi->dw_cfi_opc = DW_CFA_GNU_window_save; add_cfi (cfi); } - if (old_row->ra_mangled != new_row->ra_mangled) + if (old_row->ra_state != new_row->ra_state) { dw_cfi_ref cfi = new_cfi ();
[gcc r15-3738] testsuite: Update commandline for PR116628.c to use neoverse-v2 [PR116628]
https://gcc.gnu.org/g:0189ab205aa86b8e67ae982294f0fe58aa9c4774 commit r15-3738-g0189ab205aa86b8e67ae982294f0fe58aa9c4774 Author: Tamar Christina Date: Fri Sep 20 17:01:39 2024 +0100 testsuite: Update commandline for PR116628.c to use neoverse-v2 [PR116628] The testcase for this test needs Neoverse V2 to be used since, due to costing, the other cost models don't pick this particular SVE mode. Committed as obvious. Thanks, Tamar gcc/testsuite/ChangeLog: PR tree-optimization/116628 * gcc.dg/vect/pr116628.c: Update cmdline. Diff: --- gcc/testsuite/gcc.dg/vect/pr116628.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr116628.c b/gcc/testsuite/gcc.dg/vect/pr116628.c index 4068c657ac55..a38ffb33365a 100644 --- a/gcc/testsuite/gcc.dg/vect/pr116628.c +++ b/gcc/testsuite/gcc.dg/vect/pr116628.c @@ -1,7 +1,7 @@ /* { dg-do compile } */ /* { dg-require-effective-target vect_float } */ /* { dg-require-effective-target vect_masked_store } */ -/* { dg-additional-options "-Ofast -march=armv9-a" { target aarch64-*-* } } */ +/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" { target aarch64-*-* } } */ typedef float c; c a[2000], b[0];
[gcc r15-3739] AArch64: Define VECTOR_STORE_FLAG_VALUE.
https://gcc.gnu.org/g:33cb400b2e7266e65030869254366217e51494aa commit r15-3739-g33cb400b2e7266e65030869254366217e51494aa Author: Tamar Christina Date: Fri Sep 20 17:03:54 2024 +0100 AArch64: Define VECTOR_STORE_FLAG_VALUE. This defines VECTOR_STORE_FLAG_VALUE to CONST1_RTX for AArch64 so we simplify vector comparisons in AArch64. With this enabled res: moviv0.4s, 0 cmeqv0.4s, v0.4s, v0.4s ret is simplified to: res: mvniv0.4s, 0 ret gcc/ChangeLog: * config/aarch64/aarch64.h (VECTOR_STORE_FLAG_VALUE): New. gcc/testsuite/ChangeLog: * gcc.dg/rtl/aarch64/vector-eq.c: New test. Diff: --- gcc/config/aarch64/aarch64.h | 10 ++ gcc/testsuite/gcc.dg/rtl/aarch64/vector-eq.c | 29 2 files changed, 39 insertions(+) diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 2dfb999bea53..a99e7bb6c477 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -156,6 +156,16 @@ #define PCC_BITFIELD_TYPE_MATTERS 1 +/* Use the same RTL truth representation for vector elements as we do + for scalars. This maintains the property that a comparison like + eq:V4SI is a composition of 4 individual eq:SIs, just like plus:V4SI + is a composition of 4 individual plus:SIs. + + This means that Advanced SIMD comparisons are represented in RTL as + (neg (op ...)). */ + +#define VECTOR_STORE_FLAG_VALUE(MODE) CONST1_RTX (GET_MODE_INNER (MODE)) + #ifndef USED_FOR_TARGET /* Define an enum of all features (ISA modes, architectures and extensions). diff --git a/gcc/testsuite/gcc.dg/rtl/aarch64/vector-eq.c b/gcc/testsuite/gcc.dg/rtl/aarch64/vector-eq.c new file mode 100644 index ..8e0d7773620c --- /dev/null +++ b/gcc/testsuite/gcc.dg/rtl/aarch64/vector-eq.c @@ -0,0 +1,29 @@ +/* { dg-do compile { target aarch64-*-* } } */ +/* { dg-additional-options "-O2" } */ +/* { dg-final { check-function-bodies "**" "" "" } } */ + +/* +** foo: +** mvniv0.4s, 0 +** ret +*/ +__Uint32x4_t __RTL (startwith ("vregs")) foo (void) +{ +(function "foo" + (insn-chain +(block 2 + (edge-from entry (flags "FALLTHRU")) + (cnote 1 [bb 2] NOTE_INSN_BASIC_BLOCK) + (cnote 2 NOTE_INSN_FUNCTION_BEG) + (cinsn 3 (set (reg:V4SI <0>) (const_vector:V4SI [(const_int 0) (const_int 0) (const_int 0) (const_int 0)]))) + (cinsn 4 (set (reg:V4SI <1>) (reg:V4SI <0>))) + (cinsn 5 (set (reg:V4SI <2>) + (neg:V4SI (eq:V4SI (reg:V4SI <0>) (reg:V4SI <1>) + (cinsn 6 (set (reg:V4SI v0) (reg:V4SI <2>))) + (edge-to exit (flags "FALLTHRU")) +) + ) + (crtl (return_rtx (reg/i:V4SI v0))) +) +} +
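Conceptually the simplification applies to self-comparisons like the C intrinsics below. Note the committed test is written directly in RTL (__RTL startwith "vregs") precisely to pin down the RTL-level form, so this C version is only a loose illustration and may be folded earlier in practice.

#include <arm_neon.h>

/* Comparing a vector register with itself yields an all-true mask in every
   lane; with VECTOR_STORE_FLAG_VALUE defined, the (neg (eq ...)) form can be
   simplified to a single all-ones constant (mvni v0.4s, 0).  */
uint32x4_t
all_true_mask (void)
{
  uint32x4_t zero = vdupq_n_u32 (0);
  return vceqq_u32 (zero, zero);
}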
[gcc r15-3959] middle-end: check explicitly for external or constants when checking for loop invariant [PR116817]
https://gcc.gnu.org/g:87905f63a6521eef1f38082e2368e18c637ef092 commit r15-3959-g87905f63a6521eef1f38082e2368e18c637ef092 Author: Tamar Christina Date: Mon Sep 30 13:06:24 2024 +0100 middle-end: check explicitly for external or constants when checking for loop invariant [PR116817] The previous check for whether a value was external was checking !vect_get_internal_def (vinfo, var), but this of course isn't completely right as they could be reductions etc. This changes the check to just explicitly look at externals and constants. Note that reductions remain unhandled here, but we don't support codegen of boolean reductions today anyway. So at the time we do, this would have to be handled as well in lowering. gcc/ChangeLog: PR tree-optimization/116817 * tree-vect-patterns.cc (vect_recog_bool_pattern): Check for const or externals. gcc/testsuite/ChangeLog: PR tree-optimization/116817 * g++.dg/vect/pr116817.cc: New test. Diff: --- gcc/testsuite/g++.dg/vect/pr116817.cc | 16 gcc/tree-vect-patterns.cc | 5 - 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/g++.dg/vect/pr116817.cc b/gcc/testsuite/g++.dg/vect/pr116817.cc new file mode 100644 index ..7e28982fb138 --- /dev/null +++ b/gcc/testsuite/g++.dg/vect/pr116817.cc @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O3" } */ + +int main_ulData0; +unsigned *main_pSrcBuffer; +int main(void) { + int iSrc = 0; + bool bData0; + for (; iSrc < 4; iSrc++) { +if (bData0) + main_pSrcBuffer[iSrc] = main_ulData0; +else + main_pSrcBuffer[iSrc] = 0; +bData0 = !bData0; + } +} diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index e7e877dd2adb..b174ff1e705c 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -6062,12 +6062,15 @@ vect_recog_bool_pattern (vec_info *vinfo, if (get_vectype_for_scalar_type (vinfo, type) == NULL_TREE) return NULL; + enum vect_def_type dt; if (check_bool_pattern (var, vinfo, bool_stmts)) var = adjust_bool_stmts (vinfo, bool_stmts, type, stmt_vinfo); else if (integer_type_for_mask (var, vinfo)) return NULL; else if (TREE_CODE (TREE_TYPE (var)) == BOOLEAN_TYPE - && !vect_get_internal_def (vinfo, var)) + && vect_is_simple_use (var, vinfo, &dt) + && (dt == vect_external_def + || dt == vect_constant_def)) { /* If the condition is already a boolean then manually convert it to a mask of the given integer type but don't set a vectype. */
[gcc r14-10893] AArch64: rename the SVE2 psel intrinsics to psel_lane [PR116371]
https://gcc.gnu.org/g:97640e9632697b9f0ab31e4022d24d360d1ea2c9 commit r14-10893-g97640e9632697b9f0ab31e4022d24d360d1ea2c9 Author: Tamar Christina Date: Mon Oct 14 13:58:09 2024 +0100 AArch64: rename the SVE2 psel intrinsics to psel_lane [PR116371] The psel intrinsics. similar to the pext, should be name psel_lane. This corrects the naming. gcc/ChangeLog: PR target/116371 * config/aarch64/aarch64-sve-builtins-sve2.cc (class svpsel_impl): Renamed to ... (class svpsel_lane_impl): ... This and adjust initialization. * config/aarch64/aarch64-sve-builtins-sve2.def (svpsel): Renamed to ... (svpsel_lane): ... This. * config/aarch64/aarch64-sve-builtins-sve2.h (svpsel): Renamed to svpsel_lane. gcc/testsuite/ChangeLog: PR target/116371 * gcc.target/aarch64/sme2/acle-asm/psel_b16.c, gcc.target/aarch64/sme2/acle-asm/psel_b32.c, gcc.target/aarch64/sme2/acle-asm/psel_b64.c, gcc.target/aarch64/sme2/acle-asm/psel_b8.c, gcc.target/aarch64/sme2/acle-asm/psel_c16.c, gcc.target/aarch64/sme2/acle-asm/psel_c32.c, gcc.target/aarch64/sme2/acle-asm/psel_c64.c, gcc.target/aarch64/sme2/acle-asm/psel_c8.c: Renamed to * gcc.target/aarch64/sme2/acle-asm/psel_lane_b16.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b32.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b64.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_b8.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c16.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c32.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c64.c, gcc.target/aarch64/sme2/acle-asm/psel_lane_c8.c: ... These. (cherry picked from commit 306834b7f74ab61160f205e04f5bf35b71f9ec52) Diff: --- gcc/config/aarch64/aarch64-sve-builtins-sve2.cc| 4 +- gcc/config/aarch64/aarch64-sve-builtins-sve2.def | 2 +- gcc/config/aarch64/aarch64-sve-builtins-sve2.h | 2 +- .../gcc.target/aarch64/sme2/acle-asm/psel_b16.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_b32.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_b64.c| 80 --- .../gcc.target/aarch64/sme2/acle-asm/psel_b8.c | 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c16.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c32.c| 89 -- .../gcc.target/aarch64/sme2/acle-asm/psel_c64.c| 80 --- .../gcc.target/aarch64/sme2/acle-asm/psel_c8.c | 89 -- .../aarch64/sme2/acle-asm/psel_lane_b16.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_b32.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_b64.c | 80 +++ .../aarch64/sme2/acle-asm/psel_lane_b8.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c16.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c32.c | 89 ++ .../aarch64/sme2/acle-asm/psel_lane_c64.c | 80 +++ .../aarch64/sme2/acle-asm/psel_lane_c8.c | 89 ++ 19 files changed, 698 insertions(+), 698 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc index 4f25cc680282..06d4d22fc0b2 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc @@ -234,7 +234,7 @@ public: } }; -class svpsel_impl : public function_base +class svpsel_lane_impl : public function_base { public: rtx @@ -625,7 +625,7 @@ FUNCTION (svpmullb, unspec_based_function, (-1, UNSPEC_PMULLB, -1)) FUNCTION (svpmullb_pair, unspec_based_function, (-1, UNSPEC_PMULLB_PAIR, -1)) FUNCTION (svpmullt, unspec_based_function, (-1, UNSPEC_PMULLT, -1)) FUNCTION (svpmullt_pair, unspec_based_function, (-1, UNSPEC_PMULLT_PAIR, -1)) -FUNCTION (svpsel, svpsel_impl,) +FUNCTION (svpsel_lane, svpsel_lane_impl,) FUNCTION (svqabs, rtx_code_function, (SS_ABS, UNKNOWN, UNKNOWN)) FUNCTION (svqcadd, svqcadd_impl,) FUNCTION 
(svqcvt, integer_conversion, (UNSPEC_SQCVT, UNSPEC_SQCVTU, diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.def b/gcc/config/aarch64/aarch64-sve-builtins-sve2.def index 4366925a9711..ef677a74020b 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.def +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.def @@ -235,7 +235,7 @@ DEF_SVE_FUNCTION (svsm4ekey, binary, s_unsigned, none) | AARCH64_FL_SME \ | AARCH64_FL_SM_ON) DEF_SVE_FUNCTION (svclamp, clamp, all_integer, none) -DEF_SVE_FUNCTION (svpsel, select_pred, all_pred_count, none) +DEF_SVE_FU
[gcc r15-5791] AArch64: Suppress default options when march or mcpu used is not affected by it.
https://gcc.gnu.org/g:5b0e4ed3081e6648460661ff5013e9f03e318505 commit r15-5791-g5b0e4ed3081e6648460661ff5013e9f03e318505 Author: Tamar Christina Date: Fri Nov 29 13:01:11 2024 + AArch64: Suppress default options when march or mcpu used is not affected by it. This patch makes it so that when you use any of the Cortex-A53 errata workarounds but have specified an -march or -mcpu we know is not affected by it that we suppress the errata workaround. This is a driver only patch as the linker invocation needs to be changed as well. The linker and cc SPECs are different because for the linker we didn't seem to add an inversion flag for the option. That said, it's also not possible to configure the linker with it on by default. So not passing the flag is sufficient to turn it off. For the compilers however we have an inversion flag using -mno-, which is needed to disable the workarounds when the compiler has been configured with it by default. In case it's unclear how the patch does what it does (it took me a while to figure out the syntax): * Early matching will replace any -march=native or -mcpu=native with their expanded forms and erases the native arguments from the buffer. * Due to the above if we ensure we handle the new code after this erasure then we only have to handle the expanded form. * The expanded form needs to handle -march=+extensions and -mcpu=+extensions and so we can't use normal string matching but instead use strstr with a custom driver function that's common between native and non-native builds. * For the compilers we output -mno- and for the linker we just erase the --fix- option. * The extra internal matching, e.g. the duplicate match of mcpu inside: mcpu=*:%{%:is_local_not_armv8_base(%{mcpu=*:%*}) is so we can extract the glob using %* because the outer match would otherwise reset at the %{. The reason for the outer glob at all is to skip the block early if no matches are found. The workaround has the effect of suppressing certain inlining and multiply-add formation which leads to about ~1% SPECCPU 2017 Intrate regression on modern cores. This patch is needed because most distros configure GCC with the workaround enabled by default. Expected output: > gcc -mcpu=neoverse-v1 -mfix-cortex-a53-835769 -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-mfix" | wc -l 0 > gcc -mfix-cortex-a53-835769 -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-mfix" | wc -l 5 > gcc -mfix-cortex-a53-835769 -march=armv8-a -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-mfix" | wc -l 5 > gcc -mfix-cortex-a53-835769 -march=armv8.1-a -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-mfix" | wc -l 0 > gcc -mfix-cortex-a53-835769 -march=armv8.1-a -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-\-fix" | wc -l 0 > gcc -mfix-cortex-a53-835769 -march=armv8-a -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-\-fix" | wc -l 1 > -gcc -mfix-cortex-a53-835769 -xc - -O3 -o - < /dev/null -### 2>&1 | grep "\-\-fix" | wc -l 1 gcc/ChangeLog: * config/aarch64/aarch64-errata.h (TARGET_SUPPRESS_OPT_SPEC, TARGET_TURN_OFF_OPT_SPEC, CA53_ERR_835769_COMPILE_SPEC, CA53_ERR_843419_COMPILE_SPEC): New. (CA53_ERR_835769_SPEC, CA53_ERR_843419_SPEC): Use them. * config/aarch64/aarch64-elf-raw.h (CC1_SPEC, CC1PLUS_SPEC): Add AARCH64_ERRATA_COMPILE_SPEC. * config/aarch64/aarch64-freebsd.h (CC1_SPEC, CC1PLUS_SPEC): Likewise. * config/aarch64/aarch64-gnu.h (CC1_SPEC, CC1PLUS_SPEC): Likewise. * config/aarch64/aarch64-linux.h (CC1_SPEC, CC1PLUS_SPEC): Likewise. * config/aarch64/aarch64-netbsd.h (CC1_SPEC, CC1PLUS_SPEC): Likewise. 
* common/config/aarch64/aarch64-common.cc (is_host_cpu_not_armv8_base): New. * config/aarch64/driver-aarch64.cc: Remove extra newline * config/aarch64/aarch64.h (is_host_cpu_not_armv8_base): New. (MCPU_TO_MARCH_SPEC_FUNCTIONS): Add is_local_not_armv8_base. (EXTRA_SPEC_FUNCTIONS): Add is_local_cpu_armv8_base. * doc/invoke.texi: Document it. gcc/testsuite/ChangeLog: * gcc.target/aarch64/cpunative/info_30: New test. * gcc.target/aarch64/cpunative/info_31: New test. * gcc.target/aarch64/cpunative/info_32: New test. * gcc.target/aarch64/cpunative/info_33: New test. * gcc.target/aarch64/cpunative/native_cpu_30.c: New test. * gcc.target/aarch64/cpunative/native_cpu_31.c: New test. * gcc.target/aarch64/cpunative/native_cpu_32.c: New test. * gcc.target/aarch64/cpunative/native_cpu_33.c: New test.
[gcc r15-5585] middle-end:For multiplication try swapping operands when matching complex multiply [PR116463]
https://gcc.gnu.org/g:a9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8 commit r15-5585-ga9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8 Author: Tamar Christina Date: Thu Nov 21 15:10:24 2024 + middle-end:For multiplication try swapping operands when matching complex multiply [PR116463] This commit fixes the failures of complex.exp=fast-math-complex-mls-*.c on the GCC 14 branch and some of the ones on the master. The current matching just looks for one order for multiplication and was relying on canonicalization to always give the right order because of the TWO_OPERANDS. However when it comes to the multiplication trying only one order is a bit fragile as they can be flipped. The failing tests on the branch are: void fms180snd(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { for (int i = 0; i < N; i++) c[i] -= a[i] * (b[i] * I * I); } void fms180fst(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { for (int i = 0; i < N; i++) c[i] -= (a[i] * I * I) * b[i]; } The issue is just a small difference in commutative operations. we look for {R,R} * {R,I} but found {R,I} * {R,R}. Since the DF analysis is cached, we should be able to swap operands and retry for multiply cheaply. There is a constraint being checked by vect_validate_multiplication for the data flow of the operands feeding the multiplications. So e.g. between the nodes: note: node 0x4d1d210 (max_nunits=2, refcnt=3) vector(2) double note: op template: _27 = _10 * _25; note: stmt 0 _27 = _10 * _25; note: stmt 1 _29 = _11 * _25; note: node 0x4d1d060 (max_nunits=2, refcnt=2) vector(2) double note: op template: _26 = _11 * _24; note: stmt 0 _26 = _11 * _24; note: stmt 1 _28 = _10 * _24; we require the lanes to come from the same source which vect_validate_multiplication checks. As such it doesn't make sense to flip them individually because that would invalidate the earlier linear_loads_p checks which have validated that the arguments all come from the same datarefs. This patch thus flips the operands in unison to still maintain this invariant, but also honor the commutative nature of multiplication. gcc/ChangeLog: PR tree-optimization/116463 * tree-vect-slp-patterns.cc (complex_mul_pattern::matches, complex_fms_pattern::matches): Try swapping operands on multiply. Diff: --- gcc/tree-vect-slp-patterns.cc | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/gcc/tree-vect-slp-patterns.cc b/gcc/tree-vect-slp-patterns.cc index d62682be43c9..2535d46db3e8 100644 --- a/gcc/tree-vect-slp-patterns.cc +++ b/gcc/tree-vect-slp-patterns.cc @@ -1076,7 +1076,15 @@ complex_mul_pattern::matches (complex_operation_t op, enum _conj_status status; if (!vect_validate_multiplication (perm_cache, compat_cache, left_op, right_op, false, &status)) -return IFN_LAST; +{ + /* Try swapping the order and re-trying since multiplication is +commutative. */ + std::swap (left_op[0], left_op[1]); + std::swap (right_op[0], right_op[1]); + if (!vect_validate_multiplication (perm_cache, compat_cache, left_op, +right_op, false, &status)) + return IFN_LAST; +} if (status == CONJ_NONE) { @@ -1293,7 +1301,15 @@ complex_fms_pattern::matches (complex_operation_t op, enum _conj_status status; if (!vect_validate_multiplication (perm_cache, compat_cache, right_op, left_op, true, &status)) -return IFN_LAST; +{ + /* Try swapping the order and re-trying since multiplication is +commutative. 
*/ + std::swap (left_op[0], left_op[1]); + std::swap (right_op[0], right_op[1]); + if (!vect_validate_multiplication (perm_cache, compat_cache, right_op, +left_op, true, &status)) + return IFN_LAST; +} if (status == CONJ_NONE) ifn = IFN_COMPLEX_FMS;
[gcc r15-5745] middle-end: rework vectorizable_store to iterate over single index [PR117557]
https://gcc.gnu.org/g:1b3bff737b2d5a7dc0d5977b77200c785fc53f01 commit r15-5745-g1b3bff737b2d5a7dc0d5977b77200c785fc53f01 Author: Tamar Christina Date: Thu Nov 28 10:23:14 2024 + middle-end: rework vectorizable_store to iterate over single index [PR117557] The testcase #include #include #define N 8 #define L 8 void f(const uint8_t * restrict seq1, const uint8_t *idx, uint8_t *seq_out) { for (int i = 0; i < L; ++i) { uint8_t h = idx[i]; memcpy((void *)&seq_out[i * N], (const void *)&seq1[h * N / 2], N / 2); } } compiled at -O3 -mcpu=neoverse-n1+sve miscompiles to: ld1wz31.s, p3/z, [x23, z29.s, sxtw] ld1wz29.s, p7/z, [x23, z30.s, sxtw] st1wz29.s, p7, [x24, z12.s, sxtw] st1wz31.s, p7, [x24, z12.s, sxtw] rather than ld1wz31.s, p3/z, [x23, z29.s, sxtw] ld1wz29.s, p7/z, [x23, z30.s, sxtw] st1wz29.s, p7, [x24, z12.s, sxtw] addvl x3, x24, #2 st1wz31.s, p3, [x3, z12.s, sxtw] Where two things go wrong, the wrong mask is used and the address pointers to the stores are wrong. This issue is happening because the codegen loop in vectorizable_store is a nested loop where in the outer loop we iterate over ncopies and in the inner loop we loop over vec_num. For SLP ncopies == 1 and vec_num == SLP_NUM_STMS, but the loop mask is determined by only the outerloop index and the pointer address is only updated in the outer loop. As such for SLP we always use the same predicate and the same memory location. This patch flattens the two loops and instead iterates over ncopies * vec_num and simplified the indexing. This does not fully fix the gcc_r miscompile error in SPECCPU 2017 as the error moves somewhere else. I will look at that next but fixes some other libraries that also started failing. gcc/ChangeLog: PR tree-optimization/117557 * tree-vect-stmts.cc (vectorizable_store): Flatten the ncopies and vec_num loops. gcc/testsuite/ChangeLog: PR tree-optimization/117557 * gcc.target/aarch64/pr117557.c: New test. Diff: --- gcc/testsuite/gcc.target/aarch64/pr117557.c | 29 ++ gcc/tree-vect-stmts.cc | 504 ++-- 2 files changed, 281 insertions(+), 252 deletions(-) diff --git a/gcc/testsuite/gcc.target/aarch64/pr117557.c b/gcc/testsuite/gcc.target/aarch64/pr117557.c new file mode 100644 index ..80b3fde41109 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr117557.c @@ -0,0 +1,29 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -mcpu=neoverse-n1+sve -fdump-tree-vect" } */ +/* { dg-final { check-function-bodies "**" "" } } */ + +#include +#include + +#define N 8 +#define L 8 + +/* +**f: +** ... +** ld1wz[0-9]+.s, p([0-9]+)/z, \[x[0-9]+, z[0-9]+.s, sxtw\] +** ld1wz[0-9]+.s, p([0-9]+)/z, \[x[0-9]+, z[0-9]+.s, sxtw\] +** st1wz[0-9]+.s, p\1, \[x[0-9]+, z[0-9]+.s, sxtw\] +** incbx([0-9]+), all, mul #2 +** st1wz[0-9]+.s, p\2, \[x\3, z[0-9]+.s, sxtw\] +** ret +** ... 
+*/ +void f(const uint8_t * restrict seq1, + const uint8_t *idx, uint8_t *seq_out) { + for (int i = 0; i < L; ++i) { +uint8_t h = idx[i]; +memcpy((void *)&seq_out[i * N], (const void *)&seq1[h * N / 2], N / 2); + } +} + diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index c2d5818b2786..4759c274f3cc 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -9228,7 +9228,8 @@ vectorizable_store (vec_info *vinfo, gcc_assert (!grouped_store); auto_vec vec_offsets; unsigned int inside_cost = 0, prologue_cost = 0; - for (j = 0; j < ncopies; j++) + int num_stmts = ncopies * vec_num; + for (j = 0; j < num_stmts; j++) { gimple *new_stmt; if (j == 0) @@ -9246,14 +9247,14 @@ vectorizable_store (vec_info *vinfo, vect_get_slp_defs (op_node, gvec_oprnds[0]); else vect_get_vec_defs_for_operand (vinfo, first_stmt_info, - ncopies, op, gvec_oprnds[0]); + num_stmts, op, gvec_oprnds[0]); if (mask) { if (slp_node) vect_get_slp_defs (mask_node, &vec_masks); else vect_get_vec_defs_for_operand (vinfo, stmt_info, - ncopies, + num_stmts, mask, &vec_masks, mask_vectype);
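The structural fix is easier to see stripped of the vectorizer data structures: with SLP, ncopies is 1 and vec_num is the number of SLP statements, so any state derived only from the outer index (the loop mask, the store address) stays constant across the group. A generic sketch of the flattening, with illustrative names only, not vectorizable_store itself:

/* Illustrative only -- not vectorizable_store.  */
static void
emit_group_stores (unsigned ncopies, unsigned vec_num)
{
  unsigned num_stmts = ncopies * vec_num;
  for (unsigned j = 0; j < num_stmts; j++)
    {
      /* Before the fix, the mask choice and the pointer bump were driven by
         the outer loop over ncopies only, so with SLP (ncopies == 1) every
         statement in the group reused the same mask and the same address.
         With a single flattened index both are refreshed per statement.  */
      unsigned mask_index = j;
      unsigned pointer_bump = j;
      (void) mask_index;
      (void) pointer_bump;
      /* ... emit one vector store for statement j here ... */
    }
}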
[gcc r14-11053] middle-end:For multiplication try swapping operands when matching complex multiply [PR116463]
https://gcc.gnu.org/g:f01f01f0ebf8f5207096cb9650354210d890fe0d commit r14-11053-gf01f01f0ebf8f5207096cb9650354210d890fe0d Author: Tamar Christina Date: Thu Nov 21 15:10:24 2024 + middle-end:For multiplication try swapping operands when matching complex multiply [PR116463] This commit fixes the failures of complex.exp=fast-math-complex-mls-*.c on the GCC 14 branch and some of the ones on the master. The current matching just looks for one order for multiplication and was relying on canonicalization to always give the right order because of the TWO_OPERANDS. However when it comes to the multiplication trying only one order is a bit fragile as they can be flipped. The failing tests on the branch are: void fms180snd(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { for (int i = 0; i < N; i++) c[i] -= a[i] * (b[i] * I * I); } void fms180fst(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { for (int i = 0; i < N; i++) c[i] -= (a[i] * I * I) * b[i]; } The issue is just a small difference in commutative operations. we look for {R,R} * {R,I} but found {R,I} * {R,R}. Since the DF analysis is cached, we should be able to swap operands and retry for multiply cheaply. There is a constraint being checked by vect_validate_multiplication for the data flow of the operands feeding the multiplications. So e.g. between the nodes: note: node 0x4d1d210 (max_nunits=2, refcnt=3) vector(2) double note: op template: _27 = _10 * _25; note: stmt 0 _27 = _10 * _25; note: stmt 1 _29 = _11 * _25; note: node 0x4d1d060 (max_nunits=2, refcnt=2) vector(2) double note: op template: _26 = _11 * _24; note: stmt 0 _26 = _11 * _24; note: stmt 1 _28 = _10 * _24; we require the lanes to come from the same source which vect_validate_multiplication checks. As such it doesn't make sense to flip them individually because that would invalidate the earlier linear_loads_p checks which have validated that the arguments all come from the same datarefs. This patch thus flips the operands in unison to still maintain this invariant, but also honor the commutative nature of multiplication. gcc/ChangeLog: PR tree-optimization/116463 * tree-vect-slp-patterns.cc (complex_mul_pattern::matches, complex_fms_pattern::matches): Try swapping operands on multiply. (cherry picked from commit a9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8) Diff: --- gcc/tree-vect-slp-patterns.cc | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/gcc/tree-vect-slp-patterns.cc b/gcc/tree-vect-slp-patterns.cc index 4a582ec9512e..3bb283a3b5b4 100644 --- a/gcc/tree-vect-slp-patterns.cc +++ b/gcc/tree-vect-slp-patterns.cc @@ -1069,7 +1069,15 @@ complex_mul_pattern::matches (complex_operation_t op, enum _conj_status status; if (!vect_validate_multiplication (perm_cache, compat_cache, left_op, right_op, false, &status)) -return IFN_LAST; +{ + /* Try swapping the order and re-trying since multiplication is +commutative. */ + std::swap (left_op[0], left_op[1]); + std::swap (right_op[0], right_op[1]); + if (!vect_validate_multiplication (perm_cache, compat_cache, left_op, +right_op, false, &status)) + return IFN_LAST; +} if (status == CONJ_NONE) { @@ -1286,7 +1294,15 @@ complex_fms_pattern::matches (complex_operation_t op, enum _conj_status status; if (!vect_validate_multiplication (perm_cache, compat_cache, right_op, left_op, true, &status)) -return IFN_LAST; +{ + /* Try swapping the order and re-trying since multiplication is +commutative. 
*/ + std::swap (left_op[0], left_op[1]); + std::swap (right_op[0], right_op[1]); + if (!vect_validate_multiplication (perm_cache, compat_cache, right_op, +left_op, true, &status)) + return IFN_LAST; +} if (status == CONJ_NONE) ifn = IFN_COMPLEX_FMS;
[gcc r15-6654] cfgexpand: Factor out getting the stack decl index
https://gcc.gnu.org/g:4b1a2878ba3241ec5c0a1bf05ff47bfcd09c3711 commit r15-6654-g4b1a2878ba3241ec5c0a1bf05ff47bfcd09c3711 Author: Andrew Pinski Date: Fri Nov 15 20:22:02 2024 -0800 cfgexpand: Factor out getting the stack decl index This is the first patch in improving this code. Since there are a few places which get the index and they check the same thing, let's factor that out into one function. Bootstrapped and tested on x86_64-linux-gnu. gcc/ChangeLog: * cfgexpand.cc (INVALID_STACK_INDEX): New defined. (decl_stack_index): New function. (visit_op): Use decl_stack_index. (visit_conflict): Likewise. (add_scope_conflicts_1): Likewise. Signed-off-by: Andrew Pinski Diff: --- gcc/cfgexpand.cc | 62 +--- 1 file changed, 37 insertions(+), 25 deletions(-) diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc index abab385293a5..cdebb00cd792 100644 --- a/gcc/cfgexpand.cc +++ b/gcc/cfgexpand.cc @@ -337,6 +337,8 @@ static unsigned stack_vars_alloc; static unsigned stack_vars_num; static hash_map *decl_to_stack_part; +#define INVALID_STACK_INDEX ((unsigned)-1) + /* Conflict bitmaps go on this obstack. This allows us to destroy all of them in one big sweep. */ static bitmap_obstack stack_var_bitmap_obstack; @@ -525,6 +527,27 @@ stack_var_conflict_p (unsigned x, unsigned y) return bitmap_bit_p (a->conflicts, y); } +/* Returns the DECL's index into the stack_vars array. + If the DECL does not exist return INVALID_STACK_INDEX. */ +static unsigned +decl_stack_index (tree decl) +{ + if (!decl) +return INVALID_STACK_INDEX; + if (!DECL_P (decl)) +return INVALID_STACK_INDEX; + if (DECL_RTL_IF_SET (decl) != pc_rtx) +return INVALID_STACK_INDEX; + unsigned *v = decl_to_stack_part->get (decl); + if (!v) +return INVALID_STACK_INDEX; + + unsigned indx = *v; + gcc_checking_assert (indx != INVALID_STACK_INDEX); + gcc_checking_assert (indx < stack_vars_num); + return indx; +} + /* Callback for walk_stmt_ops. If OP is a decl touched by add_stack_var enter its partition number into bitmap DATA. */ @@ -533,14 +556,9 @@ visit_op (gimple *, tree op, tree, void *data) { bitmap active = (bitmap)data; op = get_base_address (op); - if (op - && DECL_P (op) - && DECL_RTL_IF_SET (op) == pc_rtx) -{ - unsigned *v = decl_to_stack_part->get (op); - if (v) - bitmap_set_bit (active, *v); -} + unsigned idx = decl_stack_index (op); + if (idx != INVALID_STACK_INDEX) +bitmap_set_bit (active, idx); return false; } @@ -553,20 +571,15 @@ visit_conflict (gimple *, tree op, tree, void *data) { bitmap active = (bitmap)data; op = get_base_address (op); - if (op - && DECL_P (op) - && DECL_RTL_IF_SET (op) == pc_rtx) + unsigned num = decl_stack_index (op); + if (num != INVALID_STACK_INDEX + && bitmap_set_bit (active, num)) { - unsigned *v = decl_to_stack_part->get (op); - if (v && bitmap_set_bit (active, *v)) - { - unsigned num = *v; - bitmap_iterator bi; - unsigned i; - gcc_assert (num < stack_vars_num); - EXECUTE_IF_SET_IN_BITMAP (active, 0, i, bi) - add_stack_var_conflict (num, i); - } + bitmap_iterator bi; + unsigned i; + gcc_assert (num < stack_vars_num); + EXECUTE_IF_SET_IN_BITMAP (active, 0, i, bi) + add_stack_var_conflict (num, i); } return false; } @@ -638,15 +651,14 @@ add_scope_conflicts_1 (basic_block bb, bitmap work, bool for_conflict) if (gimple_clobber_p (stmt)) { tree lhs = gimple_assign_lhs (stmt); - unsigned *v; /* Handle only plain var clobbers. Nested functions lowering and C++ front-end inserts clobbers which are not just plain variables. 
*/ if (!VAR_P (lhs)) continue; - if (DECL_RTL_IF_SET (lhs) == pc_rtx - && (v = decl_to_stack_part->get (lhs))) - bitmap_clear_bit (work, *v); + unsigned indx = decl_stack_index (lhs); + if (indx != INVALID_STACK_INDEX) + bitmap_clear_bit (work, indx); } else if (!is_gimple_debug (stmt)) {
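The refactor above replaces three copies of the same "look up, validate, dereference" sequence with one helper that reports failure through a sentinel index. A minimal standalone analogue of that shape, with illustrative names rather than GCC's internals:

```
// Standalone analogue of the decl_stack_index() refactor: call sites that
// used to repeat "look up, check, dereference" share one helper that
// returns a sentinel on failure.  Names here are illustrative only.
#include <cstdio>
#include <string>
#include <unordered_map>

static const unsigned INVALID_INDEX = (unsigned) -1;
static std::unordered_map<std::string, unsigned> decl_to_part;

/* Return the index recorded for DECL, or INVALID_INDEX if none exists.  */
static unsigned
lookup_index (const std::string &decl)
{
  auto it = decl_to_part.find (decl);
  if (it == decl_to_part.end ())
    return INVALID_INDEX;
  return it->second;
}

int
main ()
{
  decl_to_part["a"] = 0;
  decl_to_part["b"] = 3;

  const char *names[] = { "a", "b", "c" };
  for (const char *name : names)
    {
      unsigned idx = lookup_index (name);
      if (idx != INVALID_INDEX)
        std::printf ("%s -> %u\n", name, idx);
      else
        std::printf ("%s -> <no stack slot>\n", name);
    }
  return 0;
}
```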
[gcc r15-6656] cfgexpand: Handle integral vector types and constructors for scope conflicts [PR105769]
https://gcc.gnu.org/g:4f4722b0722ec343df70e5ec5fd9d5c682ff8149 commit r15-6656-g4f4722b0722ec343df70e5ec5fd9d5c682ff8149 Author: Andrew Pinski Date: Fri Nov 15 20:22:04 2024 -0800 cfgexpand: Handle integral vector types and constructors for scope conflicts [PR105769] This is an expansion of the last patch to also track pointers via vector types and the constructor that are used with vector types. In this case we had: ``` _15 = (long unsigned int) &bias; _10 = (long unsigned int) &cov_jn; _12 = {_10, _15}; ... MEM[(struct vec *)&cov_jn] ={v} {CLOBBER(bob)}; bias ={v} {CLOBBER(bob)}; MEM[(struct function *)&D.6156] ={v} {CLOBBER(bob)}; ... MEM [(void *)&D.6172 + 32B] = _12; MEM[(struct function *)&D.6157] ={v} {CLOBBER(bob)}; ``` Anyways tracking the pointers via vector types to say they are alive at the point where the store of the vector happens fixes the bug by saying it is alive at the same time as another variable is alive. Bootstrapped and tested on x86_64-linux-gnu. PR tree-optimization/105769 gcc/ChangeLog: * cfgexpand.cc (vars_ssa_cache::operator()): For constructors walk over the elements. gcc/testsuite/ChangeLog: * g++.dg/torture/pr105769-1.C: New test. Signed-off-by: Andrew Pinski Diff: --- gcc/cfgexpand.cc | 20 +++-- gcc/testsuite/g++.dg/torture/pr105769-1.C | 67 +++ 2 files changed, 83 insertions(+), 4 deletions(-) diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc index f6c9f7755a4c..2b27076658fd 100644 --- a/gcc/cfgexpand.cc +++ b/gcc/cfgexpand.cc @@ -728,7 +728,7 @@ vars_ssa_cache::operator() (tree name) gcc_assert (TREE_CODE (name) == SSA_NAME); if (!POINTER_TYPE_P (TREE_TYPE (name)) - && !INTEGRAL_TYPE_P (TREE_TYPE (name))) + && !ANY_INTEGRAL_TYPE_P (TREE_TYPE (name))) return empty; if (exists (name)) @@ -758,7 +758,7 @@ vars_ssa_cache::operator() (tree name) continue; if (!POINTER_TYPE_P (TREE_TYPE (use)) - && !INTEGRAL_TYPE_P (TREE_TYPE (use))) + && !ANY_INTEGRAL_TYPE_P (TREE_TYPE (use))) continue; /* Mark the old ssa name needs to be update from the use. */ @@ -772,10 +772,22 @@ vars_ssa_cache::operator() (tree name) so we don't go into an infinite loop for some phi nodes with loops. */ create (use); + gimple *g = SSA_NAME_DEF_STMT (use); + + /* CONSTRUCTOR here is always a vector initialization, +walk each element too. */ + if (gimple_assign_single_p (g) + && TREE_CODE (gimple_assign_rhs1 (g)) == CONSTRUCTOR) + { + tree ctr = gimple_assign_rhs1 (g); + unsigned i; + tree elm; + FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (ctr), i, elm) + work_list.safe_push (std::make_pair (elm, use)); + } /* For assignments, walk each operand for possible addresses. For PHI nodes, walk each argument. */ - gimple *g = SSA_NAME_DEF_STMT (use); - if (gassign *a = dyn_cast (g)) + else if (gassign *a = dyn_cast (g)) { /* operand 0 is the lhs. */ for (unsigned i = 1; i < gimple_num_ops (g); i++) diff --git a/gcc/testsuite/g++.dg/torture/pr105769-1.C b/gcc/testsuite/g++.dg/torture/pr105769-1.C new file mode 100644 index ..3fe973656b84 --- /dev/null +++ b/gcc/testsuite/g++.dg/torture/pr105769-1.C @@ -0,0 +1,67 @@ +// { dg-do run } + +// PR tree-optimization/105769 + +// The partitioning code would incorrectly have bias +// and a temporary in the same partitioning because +// it was thought bias was not alive when those were alive +// do to vectorization of a store of pointers (that included bias). 
+ +#include + +template +struct vec { + T dat[n]; + vec() {} + explicit vec(const T& x) { for(size_t i = 0; i < n; i++) dat[i] = x; } + T& operator [](size_t i) { return dat[i]; } + const T& operator [](size_t i) const { return dat[i]; } +}; + +template +using mat = vec>; +template +using sq_mat = mat; +using map_t = std::function; +template +using est_t = std::function; +template using est2_t = std::function; +map_t id_map() { return [](size_t j) -> size_t { return j; }; } + +template +est2_t jacknife(const est_t> est, sq_mat& cov, vec& bias) { + return [est, &cov, &bias](map_t map) -> void + { +bias = est(map); +for(size_t i = 0; i < n; i++) +{ + bias[i].print(); +} + }; +} + +template +void print_cov_ratio() { + sq_mat<2, T> cov_jn; + vec<2, T> bias; + jacknife<2, T>([](map_t map) -> vec<2, T> { vec<2, T> retv; retv[0] = 1; retv[1] = 1; return retv; }, cov_jn, bias)(id_map()); +} +struct ab { + long long unsigned a; + short unsigned b; + double operator()() { return a; } + ab& operator=(double rhs) { a = rhs; return *this; } + void print(); +};
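The key point of the fix is that a vector CONSTRUCTOR on the right-hand side of a store has to be walked element by element, otherwise addresses packed into it (here &cov_jn and &bias) never reach the liveness bitmaps. A toy sketch of that walk, using a made-up expression node rather than GCC's tree/gimple types:

```
// Toy sketch (not GCC code) of the fix: when the stored value is a vector
// CONSTRUCTOR, every element must be walked too, otherwise addresses packed
// into the vector are missed and the variables look dead at the store.
#include <cstdio>
#include <string>
#include <vector>

struct node
{
  enum { ADDR, CTOR } kind;
  std::string var;          // for ADDR: the variable whose address is taken
  std::vector<node> elems;  // for CTOR: the packed elements
};

/* Collect every variable whose address is reachable from N.  */
static void
collect_addresses (const node &n, std::vector<std::string> &out)
{
  if (n.kind == node::ADDR)
    out.push_back (n.var);
  else
    for (const node &e : n.elems)   // the fix: walk constructor elements
      collect_addresses (e, out);
}

int
main ()
{
  //   _10 = &cov_jn;  _15 = &bias;  _12 = { _10, _15 };
  node ctor = { node::CTOR, "", { { node::ADDR, "cov_jn", {} },
                                  { node::ADDR, "bias", {} } } };
  std::vector<std::string> live;
  collect_addresses (ctor, live);
  for (const std::string &v : live)
    std::printf ("address of %s reaches the store\n", v.c_str ());
  return 0;
}
```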
[gcc r15-6655] cfgexpand: Rewrite add_scope_conflicts_2 to use cache and look back further [PR111422]
https://gcc.gnu.org/g:0014a858a14b825818d6b557c3d5193f85790bde commit r15-6655-g0014a858a14b825818d6b557c3d5193f85790bde Author: Andrew Pinski Date: Fri Nov 15 20:22:03 2024 -0800 cfgexpand: Rewrite add_scope_conflicts_2 to use cache and look back further [PR111422] After fixing loop-im to do the correct overflow rewriting for pointer types too. We end up with code like: ``` _9 = (unsigned long) &g; _84 = _9 + 18446744073709551615; _11 = _42 + _84; _44 = (signed char *) _11; ... *_44 = 10; g ={v} {CLOBBER(eos)}; ... n[0] = &f; *_44 = 8; g ={v} {CLOBBER(eos)}; ``` Which was not being recongized by the scope conflicts code. This was because it only handled one level walk backs rather than multiple ones. This fixes the issue by having a cache which records all references to addresses of stack variables. Unlike the previous patch, this only records and looks at addresses of stack variables. The cache uses a bitmap and uses the index as the bit to look at. PR middle-end/117426 PR middle-end/111422 gcc/ChangeLog: * cfgexpand.cc (struct vars_ssa_cache): New class. (vars_ssa_cache::vars_ssa_cache): New constructor. (vars_ssa_cache::~vars_ssa_cache): New deconstructor. (vars_ssa_cache::create): New method. (vars_ssa_cache::exists): New method. (vars_ssa_cache::add_one): New method. (vars_ssa_cache::update): New method. (vars_ssa_cache::dump): New method. (add_scope_conflicts_2): Factor mostly out to vars_ssa_cache::operator(). New cache argument. Walk the bitmap cache for the stack variables addresses. (vars_ssa_cache::operator()): New method factored out from add_scope_conflicts_2. Rewrite to be a full walk of all operands and use a worklist. (add_scope_conflicts_1): Add cache new argument for the addr cache. Just call add_scope_conflicts_2 for the phi result instead of calling for the uses and don't call walk_stmt_load_store_addr_ops for phis. Update call to add_scope_conflicts_2 to add cache argument. (add_scope_conflicts): Add cache argument and update calls to add_scope_conflicts_1. gcc/testsuite/ChangeLog: * gcc.dg/torture/pr117426-1.c: New test. Signed-off-by: Andrew Pinski Diff: --- gcc/cfgexpand.cc | 292 ++ gcc/testsuite/gcc.dg/torture/pr117426-1.c | 53 ++ 2 files changed, 308 insertions(+), 37 deletions(-) diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc index cdebb00cd792..f6c9f7755a4c 100644 --- a/gcc/cfgexpand.cc +++ b/gcc/cfgexpand.cc @@ -584,35 +584,243 @@ visit_conflict (gimple *, tree op, tree, void *data) return false; } -/* Helper function for add_scope_conflicts_1. For USE on - a stmt, if it is a SSA_NAME and in its SSA_NAME_DEF_STMT is known to be - based on some ADDR_EXPR, invoke VISIT on that ADDR_EXPR. */ +/* A cache for ssa name to address of stack variables. + When taking into account if a ssa name refers to an + address of a stack variable, we need to walk the + expressions backwards to find the addresses. This + cache is there so we don't need to walk the expressions + all the time. */ +struct vars_ssa_cache +{ +private: + /* Currently an entry is a bitmap of all of the known stack variables + addresses that are referenced by the ssa name. + When the bitmap is the nullptr, then there is no cache. + Currently only empty bitmaps are shared. + The reason for why empty cache is not just a null is so we know the + cache for an entry is filled in. 
*/ + struct entry + { +bitmap bmap = nullptr; + }; + entry *vars_ssa_caches; +public: -static inline void -add_scope_conflicts_2 (tree use, bitmap work, - walk_stmt_load_store_addr_fn visit) + vars_ssa_cache(); + ~vars_ssa_cache(); + const_bitmap operator() (tree name); + void dump (FILE *file); + +private: + /* Can't copy. */ + vars_ssa_cache(const vars_ssa_cache&) = delete; + vars_ssa_cache(vars_ssa_cache&&) = delete; + + /* The shared empty bitmap. */ + bitmap empty; + + /* Unshare the index, currently only need + to unshare if the entry was empty. */ + void unshare(int indx) + { +if (vars_ssa_caches[indx].bmap == empty) + vars_ssa_caches[indx].bmap = BITMAP_ALLOC (&stack_var_bitmap_obstack); + } + void create (tree); + bool exists (tree use); + void add_one (tree old_name, unsigned); + bool update (tree old_name, tree use); +}; + +/* Constructor of the cache, create the cache array. */ +vars_ssa_cache::vars_ssa_cache () +{ + vars_ssa_caches = new entry[num_ssa_names]{}; + + /* Create the shared empty bitmap too. */ + empty = BITMAP_ALLOC (&stack_var_bitmap_
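A standalone sketch of the caching scheme the comment describes, using invented names: every computed entry initially points at one shared empty set, which doubles as the "already filled in" marker, and an entry is unshared only when a stack-variable index actually has to be added to it:

```
// Standalone sketch of the shared-empty cache scheme (not the GCC
// implementation): entries start out pointing at one shared empty set,
// which marks "computed" without allocating; unshare only on first add.
#include <cassert>
#include <memory>
#include <set>
#include <vector>

struct ssa_addr_cache
{
  std::shared_ptr<std::set<unsigned>> empty
    = std::make_shared<std::set<unsigned>> ();
  std::vector<std::shared_ptr<std::set<unsigned>>> entries;

  explicit ssa_addr_cache (unsigned num_names) : entries (num_names) {}

  bool exists (unsigned name) const { return entries[name] != nullptr; }

  /* Mark NAME as computed; initially it references no stack variable.  */
  void create (unsigned name) { entries[name] = empty; }

  /* Record that NAME refers to stack variable VAR, unsharing if needed.  */
  void add_one (unsigned name, unsigned var)
  {
    if (entries[name] == empty)
      entries[name] = std::make_shared<std::set<unsigned>> ();
    entries[name]->insert (var);
  }
};

int
main ()
{
  ssa_addr_cache cache (4);
  cache.create (0);              // _0 references no stack variable
  cache.create (1);
  cache.add_one (1, 7);          // _1 refers to stack variable 7
  assert (cache.exists (1) && !cache.exists (2));
  assert (cache.entries[0]->empty () && cache.entries[1]->count (7));
  return 0;
}
```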
[gcc r15-6657] perform affine fold to unsigned on non address expressions. [PR114932]
https://gcc.gnu.org/g:405c99c17210a58df1a237219e773e689f17 commit r15-6657-g405c99c17210a58df1a237219e773e689f17 Author: Tamar Christina Date: Mon Jan 6 17:52:14 2025 + perform affine fold to unsigned on non address expressions. [PR114932] When the patch for PR114074 was applied we saw a good boost in exchange2. This boost was partially caused by a simplification of the addressing modes. With the patch applied IV opts saw the following form for the base addressing; Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36) vs what we normally get: Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4 This is because the patch promoted multiplies where one operand is a constant from a signed multiply to an unsigned one, to attempt to fold away the constant. This patch attempts the same but due to the various problems with SCEV and niters not being able to analyze the resulting forms (i.e. PR114322) we can't do it during SCEV or in the general form like in fold-const like extract_muldiv attempts. Instead this applies the simplification during IVopts initialization when we create the IV. This allows IV opts to see the simplified form without influencing the rest of the compiler. as mentioned in PR114074 it would be good to fix the missed optimization in the other passes so we can perform this in general. The reason this has a big impact on Fortran code is that Fortran doesn't seem to have unsigned integer types. As such all it's addressing are created with signed types and folding does not happen on them due to the possible overflow. concretely on AArch64 this changes the results from generation: mov x27, -108 mov x24, -72 mov x23, -36 add x21, x1, x0, lsl 2 add x19, x20, x22 .L5: add x0, x22, x19 add x19, x19, 324 ldr d1, [x0, x27] add v1.2s, v1.2s, v15.2s str d1, [x20, 216] ldr d0, [x0, x24] add v0.2s, v0.2s, v15.2s str d0, [x20, 252] ldr d31, [x0, x23] add v31.2s, v31.2s, v15.2s str d31, [x20, 288] bl digits_20_ cmp x21, x19 bne .L5 into: .L5: ldr d1, [x19, -108] add v1.2s, v1.2s, v15.2s str d1, [x20, 216] ldr d0, [x19, -72] add v0.2s, v0.2s, v15.2s str d0, [x20, 252] ldr d31, [x19, -36] add x19, x19, 324 add v31.2s, v31.2s, v15.2s str d31, [x20, 288] bl digits_20_ cmp x21, x19 bne .L5 The two patches together results in a 10% performance increase in exchange2 in SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in compile time. There's also a 5% performance improvement in fotonik3d and similar reduction in binary size. The patch folds every IV to unsigned to canonicalize them. At the end of the pass we match.pd will then remove unneeded conversions. Note that we cannot force everything to unsigned, IVops requires that array address expressions remain as such. Folding them results in them becoming pointer expressions for which some optimizations in IVopts do not run. gcc/ChangeLog: PR tree-optimization/114932 * tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold. gcc/testsuite/ChangeLog: PR tree-optimization/114932 * gcc.dg/tree-ssa/pr64705.c: Update dump file scan. * gcc.target/i386/pr115462.c: The testcase shares 3 IVs which calculates the same thing but with a slightly different increment offset. The test checks for 3 complex addressing loads, one for each IV. But with this change they now all share one IV. That is the loop now only has one complex addressing. This is ultimately driven by the backend costing and the current costing says this is preferred so updating the testcase. 
* gfortran.dg/addressing-modes_1.f90: New test. Diff: --- gcc/testsuite/gcc.dg/tree-ssa/pr64705.c | 2 +- gcc/testsuite/gcc.target/i386/pr115462.c | 2 +- gcc/testsuite/gfortran.dg/addressing-modes_1.f90 | 37 gcc/tree-ssa-loop-ivopts.cc | 20 ++--- 4 files changed, 49 insertions(+), 12 deletions(-) diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c index fd24e38a53e9..3c9c2e5deed1 100644
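A quick numerical check of the arithmetic behind the fold: the original signed base ((long)l0 * 81 + 9) * 4 and the folded unsigned base (unsigned long)l0 * 324 + 36 produce the same byte offset once both are reduced modulo 2^64, which is why the rewrite is safe to apply when the IV is created:

```
// Check that the two base expressions from the commit message agree
// bit-for-bit once evaluated in the unsigned (wrapping) domain.
#include <cassert>
#include <cstdint>

int
main ()
{
  for (std::int64_t l0 = -1000; l0 <= 1000; ++l0)
    {
      std::uint64_t folded = (std::uint64_t) l0 * 324 + 36;
      std::uint64_t orig = (std::uint64_t) (l0 * 81 + 9) * 4;
      assert (folded == orig);
    }
  return 0;
}
```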
[gcc r15-6597] AArch64: Implement four and eight chunk VLA concats [PR118272]
https://gcc.gnu.org/g:830bead4859cd00da87e1304ba249cf0d3eb5a5a commit r15-6597-g830bead4859cd00da87e1304ba249cf0d3eb5a5a Author: Tamar Christina Date: Mon Jan 6 09:24:36 2025 + AArch64: Implement four and eight chunk VLA concats [PR118272] The following testcase #pragma GCC target ("+sve") extern char __attribute__ ((simd, const)) fn3 (int, short); void test_fn3 (float *a, float *b, double *c, int n) { for (int i = 0; i < n; ++i) a[i] = fn3 (b[i], c[i]); } at -Ofast ICEs because my previous patch only added support for combining 2 partial SVE vectors into a bigger vector. However There can also 4 and 8 piece subvectors. This patch fixes this by implementing the missing expansions. gcc/ChangeLog: PR target/96342 PR target/118272 * config/aarch64/aarch64-sve.md (vec_init, vec_initvnx16qivnx2qi): New. * config/aarch64/aarch64.cc (aarch64_sve_expand_vector_init_subvector): Rewrite to support any arbitrary combinations. * config/aarch64/iterators.md (SVE_NO2E): Update to use SVE_NO4E (SVE_NO2E, Vquad): New. gcc/testsuite/ChangeLog: PR target/96342 PR target/118272 * gcc.target/aarch64/vect-simd-clone-3.c: New test. Diff: --- gcc/config/aarch64/aarch64-sve.md | 23 gcc/config/aarch64/aarch64.cc | 42 +- gcc/config/aarch64/iterators.md| 12 +-- .../gcc.target/aarch64/vect-simd-clone-3.c | 27 ++ 4 files changed, 93 insertions(+), 11 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index 6b65d4eae2f2..ba4b4d904c77 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -2839,6 +2839,7 @@ } ) +;; Vector constructor combining two half vectors { a, b } (define_expand "vec_init" [(match_operand:SVE_NO2E 0 "register_operand") (match_operand 1 "")] @@ -2849,6 +2850,28 @@ } ) +;; Vector constructor combining four quad vectors { a, b, c, d } +(define_expand "vec_init" + [(match_operand:SVE_NO4E 0 "register_operand") + (match_operand 1 "")] + "TARGET_SVE" + { +aarch64_sve_expand_vector_init_subvector (operands[0], operands[1]); +DONE; + } +) + +;; Vector constructor combining eight vectors { a, b, c, d, ... } +(define_expand "vec_initvnx16qivnx2qi" + [(match_operand:VNx16QI 0 "register_operand") + (match_operand 1 "")] + "TARGET_SVE" + { +aarch64_sve_expand_vector_init_subvector (operands[0], operands[1]); +DONE; + } +) + ;; Shift an SVE vector left and insert a scalar into element 0. (define_insn "vec_shl_insert_" [(set (match_operand:SVE_FULL 0 "register_operand") diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 916a00ce3325..9e69bc744499 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -24879,18 +24879,42 @@ aarch64_sve_expand_vector_init_subvector (rtx target, rtx vals) machine_mode mode = GET_MODE (target); int nelts = XVECLEN (vals, 0); - gcc_assert (nelts == 2); + gcc_assert (nelts % 2 == 0); - rtx arg0 = XVECEXP (vals, 0, 0); - rtx arg1 = XVECEXP (vals, 0, 1); - - /* If we have two elements and are concatting vector. */ - machine_mode elem_mode = GET_MODE (arg0); + /* We have to be concatting vector. */ + machine_mode elem_mode = GET_MODE (XVECEXP (vals, 0, 0)); gcc_assert (VECTOR_MODE_P (elem_mode)); - arg0 = force_reg (elem_mode, arg0); - arg1 = force_reg (elem_mode, arg1); - emit_insn (gen_aarch64_pack_partial (mode, target, arg0, arg1)); + auto_vec worklist; + machine_mode wider_mode = elem_mode; + + for (int i = 0; i < nelts; i++) +worklist.safe_push (force_reg (elem_mode, XVECEXP (vals, 0, i))); + + /* Keep widening pairwise to have maximum throughput. 
*/ + while (nelts >= 2) +{ + wider_mode + = related_vector_mode (wider_mode, GET_MODE_INNER (wider_mode), + GET_MODE_NUNITS (wider_mode) * 2).require (); + + for (int i = 0; i < nelts; i += 2) + { + rtx arg0 = worklist[i]; + rtx arg1 = worklist[i+1]; + gcc_assert (GET_MODE (arg0) == GET_MODE (arg1)); + + rtx tmp = gen_reg_rtx (wider_mode); + emit_insn (gen_aarch64_pack_partial (wider_mode, tmp, arg0, arg1)); + worklist[i / 2] = tmp; + } + + nelts /= 2; +} + + gcc_assert (wider_mode == mode); + emit_move_insn (target, worklist[0]); + return; } diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md index 7c9bc89d0ddd..ff0f34dd0430 100644 --- a/gcc/config/aarch64/iterators.md +++ b/gcc/config/aarch64/iterators.md @@ -140,9 +140,12 @@ ;; VQ without 2 element modes. (define_mode_iterator VQ_NO2E [V16QI V8HI V
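The widening loop above combines adjacent subvectors level by level, halving the worklist each round so the packs form a balanced tree rather than a serial chain. A standalone model of that ordering, using strings in place of RTL registers:

```
// Model of the pairwise-widening strategy: combine adjacent subvectors
// level by level, halving the element count each round, so packs at the
// same level are independent.  Strings stand in for RTL here.
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

int
main ()
{
  std::vector<std::string> worklist = { "a", "b", "c", "d",
                                        "e", "f", "g", "h" };
  std::size_t nelts = worklist.size ();

  while (nelts >= 2)
    {
      for (std::size_t i = 0; i < nelts; i += 2)
        {
          // One "pack" per pair; pairs at the same level can issue
          // in parallel, giving the maximum-throughput tree shape.
          worklist[i / 2] = "(" + worklist[i] + " " + worklist[i + 1] + ")";
          std::printf ("pack %s\n", worklist[i / 2].c_str ());
        }
      nelts /= 2;
    }

  std::printf ("result: %s\n", worklist[0].c_str ());
  return 0;
}
```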
[gcc r15-7453] testsuite: Fix two testisms on x86 after PFA [PR118754]
https://gcc.gnu.org/g:aaf5f5027d3f29c6c0d836753dddac16ba94a49a commit r15-7453-gaaf5f5027d3f29c6c0d836753dddac16ba94a49a Author: Tamar Christina Date: Mon Feb 10 09:32:29 2025 + testsuite: Fix two testisms on x86 after PFA [PR118754] These two tests now vectorize the result finding loop with PFA and so the number of loops checked fails. This fixes them by adding #pragma GCC novector to the testcases. gcc/testsuite/ChangeLog: PR testsuite/118754 * gcc.dg/vect/vect-tail-nomask-1.c: Add novector. * gcc.target/i386/pr106010-8c.c: Likewise. Diff: --- gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c | 2 ++ gcc/testsuite/gcc.target/i386/pr106010-8c.c| 1 + 2 files changed, 3 insertions(+) diff --git a/gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c b/gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c index ee9ab2e9d910..116a7aefca6c 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c +++ b/gcc/testsuite/gcc.dg/vect/vect-tail-nomask-1.c @@ -72,6 +72,7 @@ run_test () init_data (a, b, c, SIZE); test_citer (a, b, c); +#pragma GCC novector for (i = 0; i < SIZE; i++) if (c[i] != a[i] + b[i]) __builtin_abort (); @@ -80,6 +81,7 @@ run_test () init_data (a, b, c, SIZE); test_viter (a, b, c, SIZE); +#pragma GCC novector for (i = 0; i < SIZE; i++) if (c[i] != a[i] + b[i]) __builtin_abort (); diff --git a/gcc/testsuite/gcc.target/i386/pr106010-8c.c b/gcc/testsuite/gcc.target/i386/pr106010-8c.c index 61ae131829dc..76a3b3cbb628 100644 --- a/gcc/testsuite/gcc.target/i386/pr106010-8c.c +++ b/gcc/testsuite/gcc.target/i386/pr106010-8c.c @@ -30,6 +30,7 @@ do_test (void) __builtin_memset (ph_dst, 0, 2 * N * sizeof (_Float16)); foo_ph (ph_dst); +#pragma GCC novector for (int i = 0; i != N; i++) { if (ph_dst[i] != ph_src)
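The pattern of the fix, shown as a self-contained example rather than the actual testcases: the compute loop is left free to vectorize, while the verification loop is pinned to scalar code with the pragma (this needs a GCC recent enough to know #pragma GCC novector), so the loop counts expected by the dump scans stay stable:

```
// Minimal example of the pattern used in the fix: vectorizable compute
// loop, scalar verification loop guarded by #pragma GCC novector.
#include <cstdlib>

#define N 1024
int a[N], b[N], c[N];

int
main ()
{
  for (int i = 0; i < N; i++)
    {
      a[i] = i;
      b[i] = 2 * i;
    }

  for (int i = 0; i < N; i++)   /* candidate for vectorization */
    c[i] = a[i] + b[i];

#pragma GCC novector
  for (int i = 0; i < N; i++)   /* verification loop, kept scalar */
    if (c[i] != 3 * i)
      std::abort ();
  return 0;
}
```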
[gcc r15-7395] middle-end: Remove unused internal function after IVopts cleanup [PR118756]
https://gcc.gnu.org/g:8d19fbb2be487f19ed1c48699e17cafe19520525 commit r15-7395-g8d19fbb2be487f19ed1c48699e17cafe19520525 Author: Tamar Christina Date: Thu Feb 6 17:46:52 2025 + middle-end: Remove unused internal function after IVopts cleanup [PR118756] It seems that after my IVopts patches the function contain_complex_addr_expr became unused and clang is rightfully complaining about it. This removes the unused internal function. gcc/ChangeLog: PR tree-optimization/118756 * tree-ssa-loop-ivopts.cc (contain_complex_addr_expr): Remove. Diff: --- gcc/tree-ssa-loop-ivopts.cc | 28 1 file changed, 28 deletions(-) diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc index 989321137df9..e37b24062f73 100644 --- a/gcc/tree-ssa-loop-ivopts.cc +++ b/gcc/tree-ssa-loop-ivopts.cc @@ -1149,34 +1149,6 @@ determine_base_object (struct ivopts_data *data, tree expr) return obj; } -/* Return true if address expression with non-DECL_P operand appears - in EXPR. */ - -static bool -contain_complex_addr_expr (tree expr) -{ - bool res = false; - - STRIP_NOPS (expr); - switch (TREE_CODE (expr)) -{ -case POINTER_PLUS_EXPR: -case PLUS_EXPR: -case MINUS_EXPR: - res |= contain_complex_addr_expr (TREE_OPERAND (expr, 0)); - res |= contain_complex_addr_expr (TREE_OPERAND (expr, 1)); - break; - -case ADDR_EXPR: - return (!DECL_P (TREE_OPERAND (expr, 0))); - -default: - return false; -} - - return res; -} - /* Allocates an induction variable with given initial value BASE and step STEP for loop LOOP. NO_OVERFLOW implies the iv doesn't overflow. */
[gcc r13-9373] AArch64: Fix GCC 13 backport of big.Little CPU detection [PR118800]
https://gcc.gnu.org/g:fa5aedd841105329b2f65cb0ff418cb4427f255e commit r13-9373-gfa5aedd841105329b2f65cb0ff418cb4427f255e Author: Tamar Christina Date: Wed Feb 12 10:38:21 2025 + AArch64: Fix GCC 13 backport of big.Little CPU detection [PR118800] On the GCC-13 branch the backport caused a failure due to the branch not having generic-armv8-a and also it still treating the generic cpu special. This made it return NULL when trying to find the default CPU. In GCC 13 we still had multiple structures with the same information and in this case aarch64_cpu_data was missing the generic CPU which is in all_cores. This corrects it by using "generc" instead and also adding it to aarch64_cpu_data. gcc/ChangeLog: PR target/118800 * config/aarch64/driver-aarch64.cc (DEFAULT_CPU): Use generic instead of generic-armv8-a. (aarch64_cpu_data): Add generic. gcc/testsuite/ChangeLog: PR target/118800 * gcc.target/aarch64/cpunative/native_cpu_34.c: Update order. Diff: --- gcc/config/aarch64/driver-aarch64.cc | 3 ++- gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_34.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/driver-aarch64.cc b/gcc/config/aarch64/driver-aarch64.cc index ff4660f469cd..acc44536629e 100644 --- a/gcc/config/aarch64/driver-aarch64.cc +++ b/gcc/config/aarch64/driver-aarch64.cc @@ -60,7 +60,7 @@ struct aarch64_core_data #define ALL_VARIANTS ((unsigned)-1) /* Default architecture to use if -mcpu=native did not detect a known CPU. */ #define DEFAULT_ARCH "8A" -#define DEFAULT_CPU "generic-armv8-a" +#define DEFAULT_CPU "generic" #define AARCH64_CORE(CORE_NAME, CORE_IDENT, SCHED, ARCH, FLAGS, COSTS, IMP, PART, VARIANT) \ { CORE_NAME, #ARCH, IMP, PART, VARIANT, feature_deps::cpu_##CORE_IDENT }, @@ -68,6 +68,7 @@ struct aarch64_core_data static CONSTEXPR const aarch64_core_data aarch64_cpu_data[] = { #include "aarch64-cores.def" + { "generic", "armv8-a", 0, 0, ALL_VARIANTS, 0}, { NULL, NULL, INVALID_IMP, INVALID_CORE, ALL_VARIANTS, 0 } }; diff --git a/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_34.c b/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_34.c index 168140002a0f..d2ff8156d8fc 100644 --- a/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_34.c +++ b/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_34.c @@ -7,6 +7,6 @@ int main() return 0; } -/* { dg-final { scan-assembler {\.arch armv8-a\+dotprod\+crc\+crypto\+sve2\n} } } */ +/* { dg-final { scan-assembler {\.arch armv8-a\+crc\+dotprod\+crypto\+sve2\n} } } */ /* Test a normal looking procinfo. */
[gcc r15-6109] middle-end: Add initial support for poly_int64 BIT_FIELD_REF in expand pass [PR96342]
https://gcc.gnu.org/g:b6242bd122757ec6c75c73a4921f24a9a382b090 commit r15-6109-gb6242bd122757ec6c75c73a4921f24a9a382b090 Author: Victor Do Nascimento Date: Wed Dec 11 12:00:58 2024 + middle-end: Add initial support for poly_int64 BIT_FIELD_REF in expand pass [PR96342] While `poly_int64' has been the default representation of bitfield size and offset for some time, there was a lack of support for the use of non-constant `poly_int64' values for those values throughout the compiler, limiting the applicability of the BIT_FIELD_REF rtl expression for variable length vectors, such as those used by SVE. This patch starts work on extending the functionality of relevant functions in the expand pass such as to enable their use by the compiler for such vectors. gcc/ChangeLog: PR target/96342 * expr.cc (store_constructor): Enable poly_{u}int64 type usage. (get_inner_reference): Ditto. Co-authored-by: Tamar Christina Diff: --- gcc/expr.cc | 29 +++-- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/gcc/expr.cc b/gcc/expr.cc index 88fa56cb299d..babf00f34dcf 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -7901,15 +7901,14 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, { unsigned HOST_WIDE_INT idx; constructor_elt *ce; - int i; bool need_to_clear; insn_code icode = CODE_FOR_nothing; tree elt; tree elttype = TREE_TYPE (type); int elt_size = vector_element_bits (type); machine_mode eltmode = TYPE_MODE (elttype); - HOST_WIDE_INT bitsize; - HOST_WIDE_INT bitpos; + poly_int64 bitsize; + poly_int64 bitpos; rtvec vector = NULL; poly_uint64 n_elts; unsigned HOST_WIDE_INT const_n_elts; @@ -8006,7 +8005,7 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, ? TREE_TYPE (CONSTRUCTOR_ELT (exp, 0)->value) : elttype); if (VECTOR_TYPE_P (val_type)) - bitsize = tree_to_uhwi (TYPE_SIZE (val_type)); + bitsize = tree_to_poly_uint64 (TYPE_SIZE (val_type)); else bitsize = elt_size; @@ -8019,12 +8018,12 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, need_to_clear = true; else { - unsigned HOST_WIDE_INT count = 0, zero_count = 0; + poly_uint64 count = 0, zero_count = 0; tree value; FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (exp), idx, value) { - int n_elts_here = bitsize / elt_size; + poly_int64 n_elts_here = exact_div (bitsize, elt_size); count += n_elts_here; if (mostly_zeros_p (value)) zero_count += n_elts_here; @@ -8033,7 +8032,7 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, /* Clear the entire vector first if there are any missing elements, or if the incidence of zero elements is >= 75%. */ need_to_clear = (maybe_lt (count, n_elts) -|| 4 * zero_count >= 3 * count); +|| maybe_gt (4 * zero_count, 3 * count)); } if (need_to_clear && maybe_gt (size, 0) && !vector) @@ -8060,9 +8059,13 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, /* Store each element of the constructor into the corresponding element of TARGET, determined by counting the elements. 
*/ - for (idx = 0, i = 0; -vec_safe_iterate (CONSTRUCTOR_ELTS (exp), idx, &ce); -idx++, i += bitsize / elt_size) + HOST_WIDE_INT chunk_size = 0; + bool chunk_multiple_p = constant_multiple_p (bitsize, elt_size, +&chunk_size); + gcc_assert (chunk_multiple_p || vec_vec_init_p); + + for (idx = 0; vec_safe_iterate (CONSTRUCTOR_ELTS (exp), idx, &ce); +idx++) { HOST_WIDE_INT eltpos; tree value = ce->value; @@ -8073,7 +8076,7 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, if (ce->index) eltpos = tree_to_uhwi (ce->index); else - eltpos = i; + eltpos = idx * chunk_size; if (vector) { @@ -8461,10 +8464,8 @@ get_inner_reference (tree exp, poly_int64 *pbitsize, if (size_tree != 0) { - if (! tree_fits_uhwi_p (size_tree)) + if (!poly_int_tree_p (size_tree, pbitsize)) mode = BLKmode, *pbitsize = -1; - else - *pbitsize = tree_to_uhwi (size_tree); } *preversep = reverse_storage_order_for_component_p (exp);
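To illustrate what the constant_multiple_p check in the new loop buys, here is a toy model (not GCC's poly_int) of sizes of the form c0 + c1*N, where N is the runtime vector scale: such a size yields a usable element count only when it is the same constant multiple of the element size for every N:

```
// Toy model of "constant multiple" for runtime-scaled sizes c0 + c1*N.
#include <cassert>
#include <cstdint>

struct poly
{
  std::int64_t c0, c1;   // value = c0 + c1 * N
};

/* If A == M * B for a compile-time constant M, store M and return true.  */
static bool
constant_multiple (poly a, poly b, std::int64_t *m)
{
  if (b.c0 == 0 && b.c1 == 0)
    return false;
  // Try the ratio suggested by whichever coefficient of B is nonzero.
  std::int64_t cand = b.c1 ? a.c1 / b.c1 : a.c0 / b.c0;
  if (a.c0 == cand * b.c0 && a.c1 == cand * b.c1)
    {
      *m = cand;
      return true;
    }
  return false;
}

int
main ()
{
  poly bitsize = { 0, 32 };   // e.g. a variable-length subvector: 32*N bits
  poly eltsize = { 0, 8 };    // one variable-length element: 8*N bits
  std::int64_t m;
  assert (constant_multiple (bitsize, eltsize, &m) && m == 4);

  poly fixed = { 64, 0 };     // a fixed 64-bit size is not a constant
  assert (!constant_multiple (fixed, eltsize, &m));   // multiple of 8*N
  return 0;
}
```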
[gcc r15-6108] middle-end: add vec_init support for variable length subvector concatenation. [PR96342]
https://gcc.gnu.org/g:d069eb91d5696a8642bd5fc44a6d47fd7f74d18b commit r15-6108-gd069eb91d5696a8642bd5fc44a6d47fd7f74d18b Author: Victor Do Nascimento Date: Wed Dec 11 12:00:09 2024 + middle-end: add vec_init support for variable length subvector concatenation. [PR96342] For architectures where the vector-length is a compile-time variable, rather representing a runtime constant, as is the case with SVE it is perfectly reasonable that such vector be made up of two (or more) subvector components of a compatible sub-length variable. One example of this would be the concatenation of two VNx4QI vectors into a single VNx8QI vector. This patch adds initial support for the enablement of this feature in the middle-end, removing the `.is_constant()' constraint on the vector's number of elements, instead making the constant no. of elements the multiple of the number of subvectors (which must then also be of variable length, such that their polynomial ratio then results in a compile-time constant) required to fill the vector. gcc/ChangeLog: PR target/96342 * expr.cc (store_constructor): add support for variable-length vectors. Co-authored-by: Tamar Christina Diff: --- gcc/expr.cc | 38 +++--- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/gcc/expr.cc b/gcc/expr.cc index 980ac415cfc7..88fa56cb299d 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -7966,12 +7966,9 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, n_elts = TYPE_VECTOR_SUBPARTS (type); if (REG_P (target) - && VECTOR_MODE_P (mode) - && n_elts.is_constant (&const_n_elts)) + && VECTOR_MODE_P (mode)) { - machine_mode emode = eltmode; - bool vector_typed_elts_p = false; - + const_n_elts = 0; if (CONSTRUCTOR_NELTS (exp) && (TREE_CODE (TREE_TYPE (CONSTRUCTOR_ELT (exp, 0)->value)) == VECTOR_TYPE)) @@ -7980,23 +7977,26 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, gcc_assert (known_eq (CONSTRUCTOR_NELTS (exp) * TYPE_VECTOR_SUBPARTS (etype), n_elts)); - emode = TYPE_MODE (etype); - vector_typed_elts_p = true; + + icode = convert_optab_handler (vec_init_optab, mode, + TYPE_MODE (etype)); + const_n_elts = CONSTRUCTOR_NELTS (exp); + vec_vec_init_p = icode != CODE_FOR_nothing; } - icode = convert_optab_handler (vec_init_optab, mode, emode); - if (icode != CODE_FOR_nothing) + else if (exact_div (n_elts, GET_MODE_NUNITS (eltmode)) + .is_constant (&const_n_elts)) { - unsigned int n = const_n_elts; - - if (vector_typed_elts_p) - { - n = CONSTRUCTOR_NELTS (exp); - vec_vec_init_p = true; - } - vector = rtvec_alloc (n); - for (unsigned int k = 0; k < n; k++) - RTVEC_ELT (vector, k) = CONST0_RTX (emode); + /* For a non-const type vector, we check it is made up of + similarly non-const type vectors. */ + icode = convert_optab_handler (vec_init_optab, mode, eltmode); } + + if (const_n_elts && icode != CODE_FOR_nothing) + { + vector = rtvec_alloc (const_n_elts); + for (unsigned int k = 0; k < const_n_elts; k++) + RTVEC_ELT (vector, k) = CONST0_RTX (eltmode); + } } /* Compute the size of the elements in the CTOR. It differs
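A sketch of the observation behind removing the .is_constant() restriction: even when the subvector length is only known at run time, the number of subvectors in the constructor is a compile-time constant, so a fixed sequence of concatenations still works. Runtime-sized std::vector pieces stand in for the VNx4QI halves here:

```
// The piece count is a compile-time constant even when each piece's
// length is only known at run time.
#include <cassert>
#include <cstddef>
#include <vector>

/* Concatenate a fixed number of equally sized runtime-length pieces.  */
template <std::size_t PIECES>
static std::vector<int>
concat (const std::vector<int> (&sub)[PIECES])
{
  std::vector<int> out;
  for (std::size_t k = 0; k < PIECES; k++)          // PIECES is constant
    out.insert (out.end (), sub[k].begin (), sub[k].end ());
  return out;
}

int
main ()
{
  std::size_t n = 4;                                // runtime vector length
  std::vector<int> sub[2] = { std::vector<int> (n, 1),
                              std::vector<int> (n, 2) };
  std::vector<int> full = concat (sub);
  assert (full.size () == 2 * n);                   // two halves, one vector
  return 0;
}
```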
[gcc r15-6104] middle-end: refactor type to be explicit in operand_equal_p [PR114932]
https://gcc.gnu.org/g:3c32575e5b6370270d38a80a7fa8eaa144e083d0 commit r15-6104-g3c32575e5b6370270d38a80a7fa8eaa144e083d0 Author: Tamar Christina Date: Wed Dec 11 11:45:36 2024 + middle-end: refactor type to be explicit in operand_equal_p [PR114932] This is a refactoring with no expected behavioral change. The goal with this is to make the type of the expressions being used explicit. I did not change all the recursive calls to operand_equal_p () to recurse directly to the new function but instead this goes through the top level call which re-extracts the types. This was done because in most of the cases where we recurse type == arg. The second patch makes use of this new flexibility to implement an overload of operand_equal_p which checks for equality under two's complement. gcc/ChangeLog: PR tree-optimization/114932 * fold-const.cc (operand_compare::operand_equal_p): Split into one that takes explicit type parameters and use that in public one. * fold-const.h (class operand_compare): Add operand_equal_p private overload. Diff: --- gcc/fold-const.cc | 99 --- gcc/fold-const.h | 6 2 files changed, 57 insertions(+), 48 deletions(-) diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc index af2851ec0919..33dc3a731e45 100644 --- a/gcc/fold-const.cc +++ b/gcc/fold-const.cc @@ -3168,6 +3168,17 @@ combine_comparisons (location_t loc, bool operand_compare::operand_equal_p (const_tree arg0, const_tree arg1, unsigned int flags) +{ + return operand_equal_p (TREE_TYPE (arg0), arg0, TREE_TYPE (arg1), arg1, flags); +} + +/* The same as operand_equal_p however the type of ARG0 and ARG1 are assumed to be + the TYPE0 and TYPE1 respectively. */ + +bool +operand_compare::operand_equal_p (tree type0, const_tree arg0, + tree type1, const_tree arg1, + unsigned int flags) { bool r; if (verify_hash_value (arg0, arg1, flags, &r)) @@ -3178,25 +3189,25 @@ operand_compare::operand_equal_p (const_tree arg0, const_tree arg1, /* If either is ERROR_MARK, they aren't equal. */ if (TREE_CODE (arg0) == ERROR_MARK || TREE_CODE (arg1) == ERROR_MARK - || TREE_TYPE (arg0) == error_mark_node - || TREE_TYPE (arg1) == error_mark_node) + || type0 == error_mark_node + || type1 == error_mark_node) return false; /* Similar, if either does not have a type (like a template id), they aren't equal. */ - if (!TREE_TYPE (arg0) || !TREE_TYPE (arg1)) + if (!type0 || !type1) return false; /* Bitwise identity makes no sense if the values have different layouts. */ if ((flags & OEP_BITWISE) - && !tree_nop_conversion_p (TREE_TYPE (arg0), TREE_TYPE (arg1))) + && !tree_nop_conversion_p (type0, type1)) return false; /* We cannot consider pointers to different address space equal. */ - if (POINTER_TYPE_P (TREE_TYPE (arg0)) - && POINTER_TYPE_P (TREE_TYPE (arg1)) - && (TYPE_ADDR_SPACE (TREE_TYPE (TREE_TYPE (arg0))) - != TYPE_ADDR_SPACE (TREE_TYPE (TREE_TYPE (arg1) + if (POINTER_TYPE_P (type0) + && POINTER_TYPE_P (type1) + && (TYPE_ADDR_SPACE (TREE_TYPE (type0)) + != TYPE_ADDR_SPACE (TREE_TYPE (type1 return false; /* Check equality of integer constants before bailing out due to @@ -3216,19 +3227,20 @@ operand_compare::operand_equal_p (const_tree arg0, const_tree arg1, because they may change the signedness of the arguments. As pointers strictly don't have a signedness, require either two pointers or two non-pointers as well. 
*/ - if (TYPE_UNSIGNED (TREE_TYPE (arg0)) != TYPE_UNSIGNED (TREE_TYPE (arg1)) - || POINTER_TYPE_P (TREE_TYPE (arg0)) -!= POINTER_TYPE_P (TREE_TYPE (arg1))) + if (TYPE_UNSIGNED (type0) != TYPE_UNSIGNED (type1) + || POINTER_TYPE_P (type0) != POINTER_TYPE_P (type1)) return false; /* If both types don't have the same precision, then it is not safe to strip NOPs. */ - if (element_precision (TREE_TYPE (arg0)) - != element_precision (TREE_TYPE (arg1))) + if (element_precision (type0) != element_precision (type1)) return false; STRIP_NOPS (arg0); STRIP_NOPS (arg1); + + type0 = TREE_TYPE (arg0); + type1 = TREE_TYPE (arg1); } #if 0 /* FIXME: Fortran FE currently produce ADDR_EXPR of NOP_EXPR. Enable the @@ -3287,9 +3299,9 @@ operand_compare::operand_equal_p (const_tree arg0, const_tree arg1, /* When not checking adddresses, this is needed for conversions and for COMPONENT_REF. Might as well play it safe and always test this. */ - if (TREE_CODE (TREE_TYPE (arg0)) == ERROR_MARK - || TREE_CODE (TREE_TYPE (arg1)) == ERROR_MARK -
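The shape of the refactor, with illustrative names rather than GCC's: the public entry point keeps its signature and re-derives the types itself, while the work moves into a private overload that takes both types explicitly, so a later caller can substitute types that differ from the operands' own:

```
// Illustrative shape of the refactor: a public comparison that delegates
// to a private overload taking the two types explicitly.
#include <cassert>
#include <string>

struct expr
{
  std::string type;
  long value;
};

class comparer
{
public:
  bool equal (const expr &a, const expr &b) const
  {
    return equal (a.type, a, b.type, b);          // old behaviour preserved
  }

private:
  /* Compare A and B as if they had types TYPE0 and TYPE1 respectively.  */
  bool equal (const std::string &type0, const expr &a,
              const std::string &type1, const expr &b) const
  {
    return type0 == type1 && a.value == b.value;
  }
};

int
main ()
{
  comparer cmp;
  expr x = { "int", 42 }, y = { "int", 42 }, z = { "unsigned", 42 };
  assert (cmp.equal (x, y));
  assert (!cmp.equal (x, z));     // the follow-up patch relaxes this case
  return 0;
}
```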
[gcc r15-6105] middle-end: use two's complement equality when comparing IVs during candidate selection [PR114932]
https://gcc.gnu.org/g:9403b035befe3537c343f7430e321468c0f2c28b commit r15-6105-g9403b035befe3537c343f7430e321468c0f2c28b Author: Tamar Christina Date: Wed Dec 11 11:47:49 2024 + middle-end: use two's complement equality when comparing IVs during candidate selection [PR114932] IVOPTS normally uses affine trees to perform comparisons between different IVs, but these seem to have been missing in two key spots and instead normal tree equivalencies used. In some cases where we have a two-complements equivalence but not a strict signedness equivalencies we end up generating both a signed and unsigned IV for the same candidate. This patch implements a new OEP flag called OEP_ASSUME_WRAPV. This flag will check if the operands would produce the same bit values after the computations even if the final sign is different. This happens quite a lot with Fortran but can also happen in C because this came code is unable to figure out when one expression is a multiple of another. As an example in the attached testcase we get: Initial set of candidates: cost: 24 (complexity 3) reg_cost: 9 cand_cost: 15 cand_group_cost: 0 (complexity 3) candidates: 1, 6, 8 group:0 --> iv_cand:6, cost=(0,1) group:1 --> iv_cand:1, cost=(0,0) group:2 --> iv_cand:8, cost=(0,1) group:3 --> iv_cand:8, cost=(0,1) invariant variables: 6 invariant expressions: 1, 2 : inv_expr 1: stride.3_27 * 4 inv_expr 2: (unsigned long) stride.3_27 * 4 These end up being used in the same group: Group 1: cand costcompl. inv.expr. inv.vars 1 0 0 NIL;6 2 0 0 NIL;6 3 0 0 NIL;6 which ends up with IV opts picking the signed and unsigned IVs: Improved to: cost: 24 (complexity 3) reg_cost: 9 cand_cost: 15 cand_group_cost: 0 (complexity 3) candidates: 1, 6, 8 group:0 --> iv_cand:6, cost=(0,1) group:1 --> iv_cand:1, cost=(0,0) group:2 --> iv_cand:8, cost=(0,1) group:3 --> iv_cand:8, cost=(0,1) invariant variables: 6 invariant expressions: 1, 2 and so generates the same IV as both signed and unsigned: ;; basic block 21, loop depth 3, count 214748368 (estimated locally, freq 58.2545), maybe hot ;;prev block 28, next block 31, flags: (NEW, REACHABLE, VISITED) ;;pred: 28 [always] count:23622320 (estimated locally, freq 6.4080) (FALLTHRU,EXECUTABLE) ;;25 [always] count:191126046 (estimated locally, freq 51.8465) (FALLTHRU,DFS_BACK,EXECUTABLE) # .MEM_66 = PHI <.MEM_34(28), .MEM_22(25)> # ivtmp.22_41 = PHI <0(28), ivtmp.22_82(25)> # ivtmp.26_51 = PHI # ivtmp.28_90 = PHI ... ;; basic block 24, loop depth 3, count 214748366 (estimated locally, freq 58.2545), maybe hot ;;prev block 22, next block 25, flags: (NEW, REACHABLE, VISITED)' ;;pred: 22 [always] count:95443719 (estimated locally, freq 25.8909) (FALLTHRU) ;;21 [33.3% (guessed)] count:71582790 (estimated locally, freq 19.4182) (TRUE_VALUE,EXECUTABLE) ;;31 [33.3% (guessed)] count:47721860 (estimated locally, freq 12.9455) (TRUE_VALUE,EXECUTABLE) # .MEM_22 = PHI <.MEM_44(22), .MEM_31(21), .MEM_79(31)> ivtmp.22_82 = ivtmp.22_41 + 1; ivtmp.26_72 = ivtmp.26_51 + _80; ivtmp.28_98 = ivtmp.28_90 + _39; These two IVs are always used as unsigned, so IV ops generates: _73 = stride.3_27 * 4; _80 = (unsigned long) _73; _54 = (unsigned long) stride.3_27; _39 = _54 * 4; Which means that in e.g. exchange2 we generate a lot of duplicate code. This is because candidate 6 and 8 are equivalent under two's complement but have different signs. This patch changes it so that if you have two IVs that are affine equivalent to just pick one over the other. 
IV already has code for this, so the patch just uses affine trees instead of tree for the check. With it we get: : inv_expr 1: stride.3_27 * 4 : Group 0: cand costcompl. inv.expr. inv.vars 5 0 2 NIL;NIL; 6 0 3 NIL;NIL; Group 1: cand costcompl. inv.expr. inv.vars 1 0 0 NIL;6 2 0 0 NIL;6 3 0 0 NIL;6 4 0 0 NIL;6 Initial set of candidates: cost: 16 (complexity 3) reg_cost: 6 cand_cost: 10 cand_group_cost: 0 (complexity 3) candidates: 1, 6 group:0 --> iv_cand:6, cost=(0,3) group:1 --> iv_cand:1, cost=(0,0) invariant variables: 6 invariant expressions:
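A numerical check of the equivalence the patch exploits: the two invariant expressions from the dump, stride * 4 evaluated signed and then widened versus (unsigned long)stride * 4, have identical bit patterns for every stride under two's complement, which is why carrying both IV candidates is redundant:

```
// Check the two's complement equivalence of the two invariant expressions.
#include <cassert>
#include <cstdint>

int
main ()
{
  for (std::int64_t stride = -100000; stride <= 100000; stride += 7)
    {
      std::uint64_t signed_then_cast = (std::uint64_t) (stride * 4);
      std::uint64_t unsigned_mul = (std::uint64_t) stride * 4;
      assert (signed_then_cast == unsigned_mul);
    }
  return 0;
}
```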
[gcc r15-6107] middle-end: Fix mask length arg in call to vect_get_loop_mask [PR96342]
https://gcc.gnu.org/g:240cbd2f26c0f1c1f83cfc3b69cc0271b56172e2 commit r15-6107-g240cbd2f26c0f1c1f83cfc3b69cc0271b56172e2 Author: Victor Do Nascimento Date: Wed Dec 11 11:58:55 2024 + middle-end: Fix mask length arg in call to vect_get_loop_mask [PR96342] When issuing multiple calls to a simdclone in a vectorized loop, TYPE_VECTOR_SUBPARTS(vectype) gives the incorrect number when compared to the TYPE_VECTOR_SUBPARTS result we get from the mask type derived from the relevant `rgroup_controls' entry within `vect_get_loop_mask'. By passing `masktype' instead, we are able to get the correct number of vector subparts and thu eliminate the ICE in the call to `vect_get_loop_mask' when the data type for which we retrieve the mask is wider than the one used when defining the mask at mask registration time. gcc/ChangeLog: PR target/96342 * tree-vect-stmts.cc (vectorizable_simd_clone_call): s/vectype/masktype/ in call to vect_get_loop_mask. Diff: --- gcc/tree-vect-stmts.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 497a31322acc..be1139a423c8 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -4964,7 +4964,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info, { vec_loop_masks *loop_masks = &LOOP_VINFO_MASKS (loop_vinfo); mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks, -ncopies, vectype, j); +ncopies, masktype, j); } else mask = vect_build_all_ones_mask (vinfo, stmt_info, masktype);
[gcc r15-6106] middle-end: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE [PR96342]
https://gcc.gnu.org/g:561ef7c8477ba58ea64de259af9c2d0870afd9d4 commit r15-6106-g561ef7c8477ba58ea64de259af9c2d0870afd9d4 Author: Andre Vieira Date: Wed Dec 11 11:50:22 2024 + middle-end: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE [PR96342] This patch adds stmt_vec_info to TARGET_SIMD_CLONE_USABLE to make sure the target can reject a simd_clone based on the vector mode it is using. This is needed because for VLS SVE vectorization the vectorizer accepts Advanced SIMD simd clones when vectorizing using SVE types because the simdlens might match. This will cause type errors later on. Other targets do not currently need to use this argument. gcc/ChangeLog: PR target/96342 * target.def (TARGET_SIMD_CLONE_USABLE): Add argument. * tree-vect-stmts.cc (vectorizable_simd_clone_call): Pass stmt_info to call TARGET_SIMD_CLONE_USABLE. * config/aarch64/aarch64.cc (aarch64_simd_clone_usable): Add argument and use it to reject the use of SVE simd clones with Advanced SIMD modes. * config/gcn/gcn.cc (gcn_simd_clone_usable): Add unused argument. * config/i386/i386.cc (ix86_simd_clone_usable): Likewise. * doc/tm.texi: Regenerate Co-authored-by: Victor Do Nascimento Co-authored-by: Tamar Christina Diff: --- gcc/config/aarch64/aarch64.cc | 4 ++-- gcc/config/gcn/gcn.cc | 3 ++- gcc/config/i386/i386.cc | 2 +- gcc/doc/tm.texi | 8 gcc/target.def| 8 gcc/tree-vect-stmts.cc| 9 - 6 files changed, 21 insertions(+), 13 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 4d1b3cca0c42..77a2a6bfa3a3 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -29490,12 +29490,12 @@ aarch64_simd_clone_adjust (struct cgraph_node *node) /* Implement TARGET_SIMD_CLONE_USABLE. */ static int -aarch64_simd_clone_usable (struct cgraph_node *node) +aarch64_simd_clone_usable (struct cgraph_node *node, machine_mode vector_mode) { switch (node->simdclone->vecsize_mangle) { case 'n': - if (!TARGET_SIMD) + if (!TARGET_SIMD || aarch64_sve_mode_p (vector_mode)) return -1; return 0; default: diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc index d017f22d1bc4..634171a0a93b 100644 --- a/gcc/config/gcn/gcn.cc +++ b/gcc/config/gcn/gcn.cc @@ -5653,7 +5653,8 @@ gcn_simd_clone_adjust (struct cgraph_node *ARG_UNUSED (node)) /* Implement TARGET_SIMD_CLONE_USABLE. */ static int -gcn_simd_clone_usable (struct cgraph_node *ARG_UNUSED (node)) +gcn_simd_clone_usable (struct cgraph_node *ARG_UNUSED (node), + machine_mode ARG_UNUSED (vector_mode)) { /* We don't need to do anything here because gcn_simd_clone_compute_vecsize_and_simdlen currently only returns one diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 62f758b32ef5..ca763e1eb334 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -25721,7 +25721,7 @@ ix86_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, slightly less desirable, etc.). */ static int -ix86_simd_clone_usable (struct cgraph_node *node) +ix86_simd_clone_usable (struct cgraph_node *node, machine_mode) { switch (node->simdclone->vecsize_mangle) { diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 7e8e02e3f423..d7170f452068 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -6531,11 +6531,11 @@ This hook should add implicit @code{attribute(target("..."))} attribute to SIMD clone @var{node} if needed. 
@end deftypefn -@deftypefn {Target Hook} int TARGET_SIMD_CLONE_USABLE (struct cgraph_node *@var{}) +@deftypefn {Target Hook} int TARGET_SIMD_CLONE_USABLE (struct cgraph_node *@var{}, @var{machine_mode}) This hook should return -1 if SIMD clone @var{node} shouldn't be used -in vectorized loops in current function, or non-negative number if it is -usable. In that case, the smaller the number is, the more desirable it is -to use it. +in vectorized loops in current function with @var{vector_mode}, or +non-negative number if it is usable. In that case, the smaller the number +is, the more desirable it is to use it. @end deftypefn @deftypefn {Target Hook} int TARGET_SIMT_VF (void) diff --git a/gcc/target.def b/gcc/target.def index 5ee33bf0cf91..8cf29c57eaee 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -1645,10 +1645,10 @@ void, (struct cgraph_node *), NULL) DEFHOOK (usable, "This hook should return -1 if SIMD clone @var{node} shouldn't be used\n\ -in vectorized loops in current function, or non-negative number if it is\n\ -usable. In that case, the smaller the number is, the more desirable it is\n\ -to use it.", -int, (struct cgraph_node *), NULL) +in vectorized loops in current function with @var{vector_mode}, or\n\ +non-ne
[gcc r15-6262] arm: fix bootstrap after MVE changes
https://gcc.gnu.org/g:7b5599dbd75fe1ee7d861d4cfc6ea655a126bef3 commit r15-6262-g7b5599dbd75fe1ee7d861d4cfc6ea655a126bef3 Author: Tamar Christina Date: Sun Dec 15 13:21:44 2024 + arm: fix bootstrap after MVE changes The recent commits for MVE on Saturday have broken armhf bootstrap due to a -Werror false positive: inlined from 'virtual rtx_def* {anonymous}::vstrq_scatter_base_impl::expand(arm_mve::function_expander&) const' at /gcc/config/arm/arm-mve-builtins-base.cc:352:17: ./genrtl.h:38:16: error: 'new_base' may be used uninitialized [-Werror=maybe-uninitialized] 38 | XEXP (rt, 1) = arg1; /gcc/config/arm/arm-mve-builtins-base.cc: In member function 'virtual rtx_def* {anonymous}::vstrq_scatter_base_impl::expand(arm_mve::function_expander&) const': /gcc/config/arm/arm-mve-builtins-base.cc:311:26: note: 'new_base' was declared here 311 | rtx insns, base_ptr, new_base; | ^~~~ In function 'rtx_def* init_rtx_fmt_ee(rtx, machine_mode, rtx, rtx)', inlined from 'rtx_def* gen_rtx_fmt_ee_stat(rtx_code, machine_mode, rtx, rtx)' at ./genrtl.h:50:26, inlined from 'virtual rtx_def* {anonymous}::vldrq_gather_base_impl::expand(arm_mve::function_expander&) const' at /gcc/config/arm/arm-mve-builtins-base.cc:527:17: ./genrtl.h:38:16: error: 'new_base' may be used uninitialized [-Werror=maybe-uninitialized] 38 | XEXP (rt, 1) = arg1; /gcc/config/arm/arm-mve-builtins-base.cc: In member function 'virtual rtx_def* {anonymous}::vldrq_gather_base_impl::expand(arm_mve::function_expander&) const': /gcc/config/arm/arm-mve-builtins-base.cc:486:26: note: 'new_base' was declared here 486 | rtx insns, base_ptr, new_base; To fix it I just initialize the value. gcc/ChangeLog: * config/arm/arm-mve-builtins-base.cc (expand): Initialize new_base. Diff: --- gcc/config/arm/arm-mve-builtins-base.cc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/gcc/config/arm/arm-mve-builtins-base.cc b/gcc/config/arm/arm-mve-builtins-base.cc index 723004b53d7b..ef3c504b1b30 100644 --- a/gcc/config/arm/arm-mve-builtins-base.cc +++ b/gcc/config/arm/arm-mve-builtins-base.cc @@ -308,7 +308,7 @@ public: rtx expand (function_expander &e) const override { insn_code icode; -rtx insns, base_ptr, new_base; +rtx insns, base_ptr, new_base = NULL_RTX; machine_mode base_mode; if ((e.mode_suffix_id != MODE_none) @@ -483,7 +483,7 @@ public: rtx expand (function_expander &e) const override { insn_code icode; -rtx insns, base_ptr, new_base; +rtx insns, base_ptr, new_base = NULL_RTX; machine_mode base_mode; if ((e.mode_suffix_id != MODE_none)
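The generic shape of this kind of fix, as a hypothetical example rather than the MVE builtin code: the variable is only assigned on paths the compiler cannot always correlate with its later uses, so it is given a harmless initial value to keep -Wmaybe-uninitialized quiet under -Werror:

```
// Hypothetical example of the warning pattern and the fix: initialize the
// conditionally-assigned variable so -Wmaybe-uninitialized stays quiet.
#include <cstdio>

static int *
expand (bool has_offset, int &storage)
{
  int *new_base = nullptr;      // explicit init silences the false positive
  if (has_offset)
    new_base = &storage;
  // ... later code only uses new_base when it was actually set.
  return has_offset ? new_base : &storage;
}

int
main ()
{
  int slot = 5;
  std::printf ("%d\n", *expand (true, slot));
  std::printf ("%d\n", *expand (false, slot));
  return 0;
}
```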
[gcc r15-6217] AArch64: Set L1 data cache size according to size on CPUs
https://gcc.gnu.org/g:6a5a1b8175e07ff578204476cd5d8a071cbc commit r15-6217-g6a5a1b8175e07ff578204476cd5d8a071cbc Author: Tamar Christina Date: Fri Dec 13 11:20:18 2024 + AArch64: Set L1 data cache size according to size on CPUs This sets the L1 data cache size for some cores based on their size in their Technical Reference Manuals. Today the port minimum is 256 bytes as explained in commit g:9a99559a478111f7fbeec29bd78344df7651c707, however like Neoverse V2 most cores actually define the L1 cache size as 64-bytes. The generic Armv9-A model was already changed in g:f000cb8cbc58b23a91c84d47d69481904981a1d9 and this change follows suite for a few other cores based on their TRMs. This results in less memory pressure when running on large core count machines. gcc/ChangeLog: * config/aarch64/tuning_models/cortexx925.h: Set L1 cache size to 64b. * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. * config/aarch64/tuning_models/neoversen1.h: Likewise. * config/aarch64/tuning_models/neoversen2.h: Likewise. * config/aarch64/tuning_models/neoversen3.h: Likewise. * config/aarch64/tuning_models/neoversev1.h: Likewise. * config/aarch64/tuning_models/neoversev2.h: Likewise. (neoversev2_prefetch_tune): Removed. * config/aarch64/tuning_models/neoversev3.h: Likewise. * config/aarch64/tuning_models/neoversev3ae.h: Likewise. Diff: --- gcc/config/aarch64/tuning_models/cortexx925.h | 2 +- gcc/config/aarch64/tuning_models/neoverse512tvb.h | 2 +- gcc/config/aarch64/tuning_models/neoversen1.h | 2 +- gcc/config/aarch64/tuning_models/neoversen2.h | 2 +- gcc/config/aarch64/tuning_models/neoversen3.h | 2 +- gcc/config/aarch64/tuning_models/neoversev1.h | 2 +- gcc/config/aarch64/tuning_models/neoversev2.h | 15 +-- gcc/config/aarch64/tuning_models/neoversev3.h | 2 +- gcc/config/aarch64/tuning_models/neoversev3ae.h | 2 +- 9 files changed, 9 insertions(+), 22 deletions(-) diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h index ef4c7d1a8323..5ebaf66e986c 100644 --- a/gcc/config/aarch64/tuning_models/cortexx925.h +++ b/gcc/config/aarch64/tuning_models/cortexx925.h @@ -224,7 +224,7 @@ static const struct tune_params cortexx925_tunings = | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ - &generic_prefetch_tune, + &generic_armv9a_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */ }; diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h index f72505918f3a..007f987154c4 100644 --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h @@ -158,7 +158,7 @@ static const struct tune_params neoverse512tvb_tunings = (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),/* tune_flags. */ - &generic_prefetch_tune, + &generic_armv9a_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversen1.h b/gcc/config/aarch64/tuning_models/neoversen1.h index 3079eb2d9ec3..14b9ac9a734d 100644 --- a/gcc/config/aarch64/tuning_models/neoversen1.h +++ b/gcc/config/aarch64/tuning_models/neoversen1.h @@ -52,7 +52,7 @@ static const struct tune_params neoversen1_tunings = 0, /* max_case_values. 
*/ tune_params::AUTOPREFETCHER_WEAK,/* autoprefetcher_model. */ (AARCH64_EXTRA_TUNE_BASE), /* tune_flags. */ - &generic_prefetch_tune, + &generic_armv9a_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS/* stp_policy_model. */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h index 141c994df381..32560d2f5f88 100644 --- a/gcc/config/aarch64/tuning_models/neoversen2.h +++ b/gcc/config/aarch64/tuning_models/neoversen2.h @@ -222,7 +222,7 @@ static const struct tune_params neoversen2_tunings = | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ - &generic_prefetch_tune, + &generic_armv9a_prefetch_tune, AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h b/gcc/config/aarch64/tuning_models/neoversen3.h index b3e31
[gcc r15-6216] AArch64: Add CMP+CSEL and CMP+CSET for cores that support it
https://gcc.gnu.org/g:4a9427f75b9f5dfbd9edd0ec8e0a07f868754b65 commit r15-6216-g4a9427f75b9f5dfbd9edd0ec8e0a07f868754b65 Author: Tamar Christina Date: Fri Dec 13 11:17:55 2024 + AArch64: Add CMP+CSEL and CMP+CSET for cores that support it GCC 15 added two new fusions CMP+CSEL and CMP+CSET. This patch enables them for cores that support based on their Software Optimization Guides and generically on Armv9-A. Even if a core does not support it there's no negative performance impact. gcc/ChangeLog: * config/aarch64/aarch64-fusion-pairs.def (AARCH64_FUSE_NEOVERSE_BASE): New. * config/aarch64/tuning_models/neoverse512tvb.h: Use it. * config/aarch64/tuning_models/neoversen2.h: Use it. * config/aarch64/tuning_models/neoversen3.h: Use it. * config/aarch64/tuning_models/neoversev1.h: Use it. * config/aarch64/tuning_models/neoversev2.h: Use it. * config/aarch64/tuning_models/neoversev3.h: Use it. * config/aarch64/tuning_models/neoversev3ae.h: Use it. * config/aarch64/tuning_models/cortexx925.h: Add fusions. * config/aarch64/tuning_models/generic_armv9_a.h: Add fusions. Diff: --- gcc/config/aarch64/aarch64-fusion-pairs.def| 4 gcc/config/aarch64/tuning_models/cortexx925.h | 4 +++- gcc/config/aarch64/tuning_models/generic_armv9_a.h | 4 +++- gcc/config/aarch64/tuning_models/neoverse512tvb.h | 2 +- gcc/config/aarch64/tuning_models/neoversen2.h | 2 +- gcc/config/aarch64/tuning_models/neoversen3.h | 2 +- gcc/config/aarch64/tuning_models/neoversev1.h | 2 +- gcc/config/aarch64/tuning_models/neoversev2.h | 2 +- gcc/config/aarch64/tuning_models/neoversev3.h | 2 +- gcc/config/aarch64/tuning_models/neoversev3ae.h| 2 +- 10 files changed, 17 insertions(+), 9 deletions(-) diff --git a/gcc/config/aarch64/aarch64-fusion-pairs.def b/gcc/config/aarch64/aarch64-fusion-pairs.def index f8413ab0c802..0123430d988b 100644 --- a/gcc/config/aarch64/aarch64-fusion-pairs.def +++ b/gcc/config/aarch64/aarch64-fusion-pairs.def @@ -45,4 +45,8 @@ AARCH64_FUSION_PAIR ("cmp+cset", CMP_CSET) /* Baseline fusion settings suitable for all cores. */ #define AARCH64_FUSE_BASE (AARCH64_FUSE_CMP_BRANCH | AARCH64_FUSE_AES_AESMC) +/* Baseline fusion settings suitable for all Neoverse cores. */ +#define AARCH64_FUSE_NEOVERSE_BASE (AARCH64_FUSE_BASE | AARCH64_FUSE_CMP_CSEL \ + | AARCH64_FUSE_CMP_CSET) + #define AARCH64_FUSE_MOVK (AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_MOVK_MOVK) diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h index b2ff716157a4..ef4c7d1a8323 100644 --- a/gcc/config/aarch64/tuning_models/cortexx925.h +++ b/gcc/config/aarch64/tuning_models/cortexx925.h @@ -205,7 +205,9 @@ static const struct tune_params cortexx925_tunings = 2 /* store_pred. */ }, /* memmov_cost. */ 10, /* issue_rate */ - AARCH64_FUSE_BASE, /* fusible_ops */ + (AARCH64_FUSE_BASE + | AARCH64_FUSE_CMP_CSEL + | AARCH64_FUSE_CMP_CSET), /* fusible_ops */ "32:16", /* function_align. */ "4", /* jump_align. */ "32:16", /* loop_align. */ diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h index a05a9ab92a27..785e00946bc4 100644 --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h @@ -236,7 +236,9 @@ static const struct tune_params generic_armv9_a_tunings = 1 /* store_pred. */ }, /* memmov_cost. */ 3, /* issue_rate */ - AARCH64_FUSE_BASE, /* fusible_ops */ + (AARCH64_FUSE_BASE + | AARCH64_FUSE_CMP_CSEL + | AARCH64_FUSE_CMP_CSET), /* fusible_ops */ "32:16", /* function_align. */ "4", /* jump_align. 
*/ "32:16", /* loop_align. */ diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h index c407b89a22f1..f72505918f3a 100644 --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h @@ -143,7 +143,7 @@ static const struct tune_params neoverse512tvb_tunings = 1 /* store_pred. */ }, /* memmov_cost. */ 3, /* issue_rate */ - AARCH64_FUSE_BASE, /* fusible_ops */ + AARCH64_FUSE_NEOVERSE_BASE, /* fusible_ops */ "32:16", /* function_align. */ "4", /* jump_align. */ "32:16", /* loop_align. */ diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h index fd5f8f373705..141c994df381 100644 --- a/gcc/config/aarch64/tuning_models/neoversen2.h +++ b/gcc/config/aarch64/tuning_models/neoversen2.h @@ -205,7 +205,7 @@ static const struct tune_params neoversen2_tunings = 1 /* store_pred. */ }, /* memmov_cost. */ 5, /* issue_rate */
[gcc r15-6391] AArch64: Disable `omp declare variant' tests for aarch64 [PR96342]
https://gcc.gnu.org/g:6ecb365d4c3f36eaf684c38fc5d9008a1409c725 commit r15-6391-g6ecb365d4c3f36eaf684c38fc5d9008a1409c725 Author: Tamar Christina Date: Fri Dec 20 14:25:50 2024 + AArch64: Disable `omp declare variant' tests for aarch64 [PR96342] These tests are x86 specific and shouldn't be run for aarch64. gcc/testsuite/ChangeLog: PR target/96342 * c-c++-common/gomp/declare-variant-14.c: Make i?86 and x86_64 target only test. * gfortran.dg/gomp/declare-variant-14.f90: Likewise. Diff: --- gcc/testsuite/c-c++-common/gomp/declare-variant-14.c | 13 + gcc/testsuite/gfortran.dg/gomp/declare-variant-14.f90 | 11 --- 2 files changed, 9 insertions(+), 15 deletions(-) diff --git a/gcc/testsuite/c-c++-common/gomp/declare-variant-14.c b/gcc/testsuite/c-c++-common/gomp/declare-variant-14.c index e3668893afe3..8a6bf09d3cf6 100644 --- a/gcc/testsuite/c-c++-common/gomp/declare-variant-14.c +++ b/gcc/testsuite/c-c++-common/gomp/declare-variant-14.c @@ -1,6 +1,5 @@ -/* { dg-do compile { target vect_simd_clones } } */ -/* { dg-additional-options "-fdump-tree-gimple -fdump-tree-optimized" } */ -/* { dg-additional-options "-mno-sse3" { target { i?86-*-* x86_64-*-* } } } */ +/* { dg-do compile { target { { i?86-*-* x86_64-*-* } && vect_simd_clones } } } */ +/* { dg-additional-options "-mno-sse3 -fdump-tree-gimple -fdump-tree-optimized" } */ int f01 (int); int f02 (int); @@ -15,15 +14,13 @@ int test1 (int x) { /* At gimplification time, we can't decide yet which function to call. */ - /* { dg-final { scan-tree-dump-times "f04 \\\(x" 2 "gimple" { target { !aarch64*-*-* } } } } */ + /* { dg-final { scan-tree-dump-times "f04 \\\(x" 2 "gimple" } } */ /* After simd clones are created, the original non-clone test1 shall call f03 (score 6), the sse2/avx/avx2 clones too, but avx512f clones shall call f01 with score 8. */ /* { dg-final { scan-tree-dump-not "f04 \\\(x" "optimized" } } */ - /* { dg-final { scan-tree-dump-times "f03 \\\(x" 14 "optimized" { target { !aarch64*-*-* } } } } */ - /* { dg-final { scan-tree-dump-times "f03 \\\(x" 10 "optimized" { target { aarch64*-*-* } } } } */ - /* { dg-final { scan-tree-dump-times "f01 \\\(x" 4 "optimized" { target { !aarch64*-*-* } } } } */ - /* { dg-final { scan-tree-dump-times "f01 \\\(x" 0 "optimized" { target { aarch64*-*-* } } } } */ + /* { dg-final { scan-tree-dump-times "f03 \\\(x" 14 "optimized" } } */ + /* { dg-final { scan-tree-dump-times "f01 \\\(x" 4 "optimized" } } */ int a = f04 (x); int b = f04 (x); return a + b; diff --git a/gcc/testsuite/gfortran.dg/gomp/declare-variant-14.f90 b/gcc/testsuite/gfortran.dg/gomp/declare-variant-14.f90 index 6319df0558f3..e154d93d73a5 100644 --- a/gcc/testsuite/gfortran.dg/gomp/declare-variant-14.f90 +++ b/gcc/testsuite/gfortran.dg/gomp/declare-variant-14.f90 @@ -1,6 +1,5 @@ -! { dg-do compile { target vect_simd_clones } } -! { dg-additional-options "-O0 -fdump-tree-gimple -fdump-tree-optimized" } -! { dg-additional-options "-mno-sse3" { target { i?86-*-* x86_64-*-* } } } +! { dg-do compile { target { { i?86-*-* x86_64-*-* } && vect_simd_clones } } } */ +! { dg-additional-options "-mno-sse3 -O0 -fdump-tree-gimple -fdump-tree-optimized" } module main implicit none @@ -40,10 +39,8 @@ contains ! call f03 (score 6), the sse2/avx/avx2 clones too, but avx512f clones ! shall call f01 with score 8. ! { dg-final { scan-tree-dump-not "f04 \\\(x" "optimized" } } -! { dg-final { scan-tree-dump-times "f03 \\\(x" 14 "optimized" { target { !aarch64*-*-* } } } } -! 
{ dg-final { scan-tree-dump-times "f03 \\\(x" 6 "optimized" { target { aarch64*-*-* } } } } -! { dg-final { scan-tree-dump-times "f01 \\\(x" 4 "optimized" { target { !aarch64*-*-* } } } } -! { dg-final { scan-tree-dump-times "f01 \\\(x" 0 "optimized" { target { aarch64*-*-* } } } } +! { dg-final { scan-tree-dump-times "f03 \\\(x" 14 "optimized" } } +! { dg-final { scan-tree-dump-times "f01 \\\(x" 4 "optimized" } } a = f04 (x) b = f04 (x) test1 = a + b
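For context, the test is inherently x86-specific because its context selectors score x86 ISA traits. A simplified sketch of the shape used in declare-variant-14.c (not a verbatim quote; selector details trimmed):

int f01 (int);   /* chosen when avx512f is selectable (highest score) */
int f03 (int);   /* chosen for the sse2/sse3/sse4.2 cases */

#pragma omp declare variant (f01) match (device={isa("avx512f")})
#pragma omp declare variant (f03) match (device={isa(sse2,"sse3")})
int f04 (int);   /* base function whose calls the dump scans count */

On aarch64 none of these selectors can match, so the expected dump counts have no meaning there and restricting the test to i?86/x86_64 is cleaner than carrying aarch64-specific counts.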
[gcc r15-6392] AArch64: Add SVE support for simd clones [PR96342]
https://gcc.gnu.org/g:d7d3dfe7a2a26e370805ddf835bfd00c51d32f1b commit r15-6392-gd7d3dfe7a2a26e370805ddf835bfd00c51d32f1b Author: Tamar Christina Date: Fri Dec 20 14:27:25 2024 + AArch64: Add SVE support for simd clones [PR96342] This patch finalizes adding support for the generation of SVE simd clones when no simdlen is provided, following the ABI rules where the widest data type determines the minimum amount of elements in a length agnostic vector. gcc/ChangeLog: PR target/96342 * config/aarch64/aarch64-protos.h (add_sve_type_attribute): Declare. * config/aarch64/aarch64-sve-builtins.cc (add_sve_type_attribute): Make visibility global and support use for non_acle types. * config/aarch64/aarch64.cc (aarch64_simd_clone_compute_vecsize_and_simdlen): Create VLA simd clone when no simdlen is provided, according to ABI rules. (simd_clone_adjust_sve_vector_type): New helper function. (aarch64_simd_clone_adjust): Add '+sve' attribute to SVE simd clones and modify types to use SVE types. * omp-simd-clone.cc (simd_clone_mangle): Print 'x' for VLA simdlen. (simd_clone_adjust): Adapt safelen check to be compatible with VLA simdlen. gcc/testsuite/ChangeLog: PR target/96342 * gcc.target/aarch64/declare-simd-2.c: Add SVE clone scan. * gcc.target/aarch64/vect-simd-clone-1.c: New test. * g++.target/aarch64/vect-simd-clone-1.C: New test. Co-authored-by: Victor Do Nascimento Co-authored-by: Tamar Christina Diff: --- gcc/config/aarch64/aarch64-protos.h| 2 + gcc/config/aarch64/aarch64-sve-builtins.cc | 9 +- gcc/config/aarch64/aarch64.cc | 175 + gcc/omp-simd-clone.cc | 13 +- .../g++.target/aarch64/vect-simd-clone-1.C | 88 +++ gcc/testsuite/gcc.target/aarch64/declare-simd-2.c | 1 + .../gcc.target/aarch64/vect-simd-clone-1.c | 89 +++ 7 files changed, 342 insertions(+), 35 deletions(-) diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h index bd17486e9128..7ab1316cf568 100644 --- a/gcc/config/aarch64/aarch64-protos.h +++ b/gcc/config/aarch64/aarch64-protos.h @@ -1151,6 +1151,8 @@ namespace aarch64_sve { #ifdef GCC_TARGET_H bool verify_type_context (location_t, type_context_kind, const_tree, bool); #endif + void add_sve_type_attribute (tree, unsigned int, unsigned int, + const char *, const char *); } extern void aarch64_split_combinev16qi (rtx operands[3]); diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc index 5acc56f99c65..e93c3a78e6d6 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc @@ -1032,15 +1032,18 @@ static GTY(()) hash_map *overload_names[2]; /* Record that TYPE is an ABI-defined SVE type that contains NUM_ZR SVE vectors and NUM_PR SVE predicates. MANGLED_NAME, if nonnull, is the ABI-defined - mangling of the type. ACLE_NAME is the name of the type. */ -static void + mangling of the type. mangling of the type. ACLE_NAME is the + name of the type, or null if does not provide the type. */ +void add_sve_type_attribute (tree type, unsigned int num_zr, unsigned int num_pr, const char *mangled_name, const char *acle_name) { tree mangled_name_tree = (mangled_name ? get_identifier (mangled_name) : NULL_TREE); + tree acle_name_tree += (acle_name ? 
get_identifier (acle_name) : NULL_TREE); - tree value = tree_cons (NULL_TREE, get_identifier (acle_name), NULL_TREE); + tree value = tree_cons (NULL_TREE, acle_name_tree, NULL_TREE); value = tree_cons (NULL_TREE, mangled_name_tree, value); value = tree_cons (NULL_TREE, size_int (num_pr), value); value = tree_cons (NULL_TREE, size_int (num_zr), value); diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 77a2a6bfa3a3..de4c0a078391 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -29323,7 +29323,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, int num, bool explicit_p) { tree t, ret_type; - unsigned int nds_elt_bits; + unsigned int nds_elt_bits, wds_elt_bits; unsigned HOST_WIDE_INT const_simdlen; if (!TARGET_SIMD) @@ -29368,10 +29368,14 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, if (TREE_CODE (ret_type) != VOID_TYPE) { nds_elt_bits = lane_size (SIMD_CLONE_ARG_TYPE_VECTOR, ret_type); + wds_elt_bits = nds_elt_bits; vec_elts.safe_push (std::make_pair (ret_type, nds_elt_bits)); } else -nds_elt_bits = POINT
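A hedged illustration of the effect (the function name is made up; the expected mangling mirrors the vect-simd-clone-4.c test added later in this series): when no simdlen clause is given, GCC can now emit a vector-length-agnostic SVE clone alongside the fixed-length Advanced SIMD clones, and the simdlen position of its mangled name becomes the letter 'x':

__attribute__ ((simd, const)) char fn (short x);

/* Illustrative expected clone symbol: _ZGVsMxv_fn
   's' = SVE ISA, 'M' = masked, 'x' = length-agnostic simdlen,
   'v' = one vector parameter.  Per the ABI rule described above, the
   minimum number of lanes is derived from the widest data type used
   by the clone.  */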
[gcc r15-6393] AArch64: Implement vector concat of partial SVE vectors [PR96342]
https://gcc.gnu.org/g:89b2c7dc96c4944c306131b665a4738a8a99413e commit r15-6393-g89b2c7dc96c4944c306131b665a4738a8a99413e Author: Tamar Christina Date: Fri Dec 20 14:34:32 2024 + AArch64: Implement vector concat of partial SVE vectors [PR96342] This patch adds support for vector constructor from two partial SVE vectors into a full SVE vector. It also implements support for the standard vec_init obtab to do this. gcc/ChangeLog: PR target/96342 * config/aarch64/aarch64-protos.h (aarch64_sve_expand_vector_init_subvector): New. * config/aarch64/aarch64-sve.md (vec_init): New. (@aarch64_pack_partial): New. * config/aarch64/aarch64.cc (aarch64_sve_expand_vector_init_subvector): New. * config/aarch64/iterators.md (SVE_NO2E): New. (VHALF, Vhalf): Add SVE partial vectors. gcc/testsuite/ChangeLog: PR target/96342 * gcc.target/aarch64/vect-simd-clone-2.c: New test. Diff: --- gcc/config/aarch64/aarch64-protos.h| 1 + gcc/config/aarch64/aarch64-sve.md | 23 + gcc/config/aarch64/aarch64.cc | 24 ++ gcc/config/aarch64/iterators.md| 20 -- .../gcc.target/aarch64/vect-simd-clone-2.c | 13 5 files changed, 79 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h index 7ab1316cf568..18764e407c13 100644 --- a/gcc/config/aarch64/aarch64-protos.h +++ b/gcc/config/aarch64/aarch64-protos.h @@ -1028,6 +1028,7 @@ rtx aarch64_replace_reg_mode (rtx, machine_mode); void aarch64_split_sve_subreg_move (rtx, rtx, rtx); void aarch64_expand_prologue (void); void aarch64_expand_vector_init (rtx, rtx); +void aarch64_sve_expand_vector_init_subvector (rtx, rtx); void aarch64_sve_expand_vector_init (rtx, rtx); void aarch64_init_cumulative_args (CUMULATIVE_ARGS *, const_tree, rtx, const_tree, unsigned, bool = false); diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index a72ca2a500d3..6659bb4fcab3 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -2839,6 +2839,16 @@ } ) +(define_expand "vec_init" + [(match_operand:SVE_NO2E 0 "register_operand") + (match_operand 1 "")] + "TARGET_SVE" + { +aarch64_sve_expand_vector_init_subvector (operands[0], operands[1]); +DONE; + } +) + ;; Shift an SVE vector left and insert a scalar into element 0. (define_insn "vec_shl_insert_" [(set (match_operand:SVE_FULL 0 "register_operand") @@ -9289,6 +9299,19 @@ "uzp1\t%0., %1., %2." ) +;; Integer partial pack packing two partial SVE types into a single full SVE +;; type of the same element type. Use UZP1 on the wider type, which discards +;; the high part of each wide element. This allows to concat SVE partial types +;; into a wider vector. +(define_insn "@aarch64_pack_partial" + [(set (match_operand:SVE_NO2E 0 "register_operand" "=w") + (vec_concat:SVE_NO2E + (match_operand: 1 "register_operand" "w") + (match_operand: 2 "register_operand" "w")))] + "TARGET_SVE" + "uzp1\t%0., %1., %2." +) + ;; - ;; [INT<-INT] Unpacks ;; - diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index de4c0a078391..41cc2eeec9a4 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -24870,6 +24870,30 @@ aarch64_sve_expand_vector_init (rtx target, rtx vals) aarch64_sve_expand_vector_init_insert_elems (target, v, nelts); } +/* Initialize register TARGET from the two vector subelements in PARALLEL + rtx VALS. 
*/ + +void +aarch64_sve_expand_vector_init_subvector (rtx target, rtx vals) +{ + machine_mode mode = GET_MODE (target); + int nelts = XVECLEN (vals, 0); + + gcc_assert (nelts == 2); + + rtx arg0 = XVECEXP (vals, 0, 0); + rtx arg1 = XVECEXP (vals, 0, 1); + + /* If we have two elements and are concatting vector. */ + machine_mode elem_mode = GET_MODE (arg0); + gcc_assert (VECTOR_MODE_P (elem_mode)); + + arg0 = force_reg (elem_mode, arg0); + arg1 = force_reg (elem_mode, arg1); + emit_insn (gen_aarch64_pack_partial (mode, target, arg0, arg1)); + return; +} + /* Check whether VALUE is a vector constant in which every element is either a power of 2 or a negated power of 2. If so, return a constant vector of log2s, and flip CODE between PLUS and MINUS diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md index 89c72b24aeb7..34200b05a3ab 100644 --- a/gcc/config/aarch64/iterators.md +++ b/gcc/config/aarch64/iterators.md @@ -140,6 +140,10 @@ ;; VQ without 2 element modes. (define_
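To see why a single UZP1 implements the concatenation, here is a hedged scalar model in plain C (not from the patch; it assumes the usual little-endian layout in which a partial SVE vector keeps its payload in the low half of each wide container, i.e. at the even narrow-element indices):

/* Model of "uzp1 out.<T>, a.<T>, b.<T>" for vectors of n narrow elements:
   the even-indexed elements of a fill the low half of out and the
   even-indexed elements of b fill the high half, which is exactly the
   payloads of the two partial vectors laid end to end.  */
void
uzp1_model (unsigned out[], const unsigned a[], const unsigned b[], int n)
{
  for (int i = 0; i < n / 2; i++)
    out[i] = a[2 * i];
  for (int i = 0; i < n / 2; i++)
    out[n / 2 + i] = b[2 * i];
}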
[gcc r15-5565] middle-end: Pass along SLP node when costing vector loads/stores
https://gcc.gnu.org/g:dbc38dd9e96a9995298da2478041bdbbf247c479 commit r15-5565-gdbc38dd9e96a9995298da2478041bdbbf247c479 Author: Tamar Christina Date: Thu Nov 21 12:49:35 2024 + middle-end: Pass along SLP node when costing vector loads/stores With the support to SLP only we now pass the VMAT through the SLP node, however the majority of the costing calls inside vectorizable_load and vectorizable_store do no pass the SLP node along. Due to this the backend costing never sees the VMAT for these cases anymore. Additionally the helper around record_stmt_cost when both SLP and stmt_vinfo are passed would only pass the SLP node along. However the SLP node doesn't contain all the info available in the stmt_vinfo and we'd have to go through the SLP_TREE_REPRESENTATIVE anyway. As such I changed the function to just Always pass both along. Unlike the VMAT changes, I don't believe there to be a correctness issue here but would minimize the number of churn in the backend costing until vectorizer costing as a whole is revisited in GCC 16. These changes re-enable the cost model on AArch64 and also correctly find the VMATs on loads and stores fixing testcases such as sve_iters_low_2.c. gcc/ChangeLog: * tree-vect-data-refs.cc (vect_get_data_access_cost): Pass NULL for SLP node. * tree-vect-stmts.cc (record_stmt_cost): Expose. (vect_get_store_cost, vect_get_load_cost): Extend with SLP node. (vectorizable_store, vectorizable_load): Pass SLP node to all costing. * tree-vectorizer.h (record_stmt_cost): Always pass both SLP node and stmt_vinfo to costing. (vect_get_load_cost, vect_get_store_cost): Extend with SLP node. Diff: --- gcc/tree-vect-data-refs.cc | 12 ++--- gcc/tree-vect-stmts.cc | 109 + gcc/tree-vectorizer.h | 16 +++ 3 files changed, 76 insertions(+), 61 deletions(-) diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc index a32343c0022b..35c946ab2d4e 100644 --- a/gcc/tree-vect-data-refs.cc +++ b/gcc/tree-vect-data-refs.cc @@ -1729,12 +1729,14 @@ vect_get_data_access_cost (vec_info *vinfo, dr_vec_info *dr_info, ncopies = vect_get_num_copies (loop_vinfo, STMT_VINFO_VECTYPE (stmt_info)); if (DR_IS_READ (dr_info->dr)) -vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme, - misalignment, true, inside_cost, - outside_cost, prologue_cost_vec, body_cost_vec, false); +vect_get_load_cost (vinfo, stmt_info, NULL, ncopies, + alignment_support_scheme, misalignment, true, + inside_cost, outside_cost, prologue_cost_vec, + body_cost_vec, false); else -vect_get_store_cost (vinfo,stmt_info, ncopies, alignment_support_scheme, -misalignment, inside_cost, body_cost_vec); +vect_get_store_cost (vinfo,stmt_info, NULL, ncopies, +alignment_support_scheme, misalignment, inside_cost, +body_cost_vec); if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 75973c77236e..e500902a8be9 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -93,7 +93,7 @@ stmt_in_inner_loop_p (vec_info *vinfo, class _stmt_vec_info *stmt_info) target model or by saving it in a vector for later processing. Return a preliminary estimate of the statement's cost. */ -static unsigned +unsigned record_stmt_cost (stmt_vector_for_cost *body_cost_vec, int count, enum vect_cost_for_stmt kind, stmt_vec_info stmt_info, slp_tree node, @@ -1008,8 +1008,8 @@ cfun_returns (tree decl) /* Calculate cost of DR's memory access. 
*/ void -vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, int ncopies, -dr_alignment_support alignment_support_scheme, +vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, slp_tree slp_node, +int ncopies, dr_alignment_support alignment_support_scheme, int misalignment, unsigned int *inside_cost, stmt_vector_for_cost *body_cost_vec) @@ -1019,7 +1019,7 @@ vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, int ncopies, case dr_aligned: { *inside_cost += record_stmt_cost (body_cost_vec, ncopies, - vector_store, stmt_info, 0, + vector_store, stmt_info, slp_node, 0, vect_body); if (dump_enabled_p ()) @@ -1032,7 +1032,7 @@ vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, int ncopies, { /* Here, we assign an additi
[gcc r15-6752] AArch64: Fix costing of emulated gathers/scatters [PR118188]
https://gcc.gnu.org/g:08b6e875c6b1b52c6e98f4a2e37124bf8c6a6ccb commit r15-6752-g08b6e875c6b1b52c6e98f4a2e37124bf8c6a6ccb Author: Tamar Christina Date: Thu Jan 9 21:31:05 2025 + AArch64: Fix costing of emulated gathers/scatters [PR118188] When a target does not support gathers and scatters the vectorizer tries to emulate these using scalar loads/stores and a reconstruction of vectors from scalar. The loads are still marked with VMAT_GATHER_SCATTER to indicate that they are gather/scatters, however the vectorizer also asks the target to cost the instruction that generates the indexes for the emulated instructions. This is done by asking the target to cost vec_to_scalar and vec_construct with a stmt_vinfo being the VMAT_GATHER_SCATTER. Since Adv. SIMD does not have an LD1 variant that takes an Adv. SIMD Scalar element the operation is lowered entirely into a sequence of GPR loads to create the x registers for the indexes. At the moment however we don't cost these, and so the vectorizer things that when it emulates the instructions that it's much cheaper than using an actual gather/scatter with SVE. Consider: #define iterations 10 #define LEN_1D 32000 float a[LEN_1D], b[LEN_1D]; float s4115 (int *ip) { float sum = 0.; for (int i = 0; i < LEN_1D; i++) { sum += a[i] * b[ip[i]]; } return sum; } which before this patch with -mcpu= generates: .L2: add x3, x0, x1 ldrsw x4, [x0, x1] ldrsw x6, [x3, 4] ldpsw x3, x5, [x3, 8] ldr s1, [x2, x4, lsl 2] ldr s30, [x2, x6, lsl 2] ldr s31, [x2, x5, lsl 2] ldr s29, [x2, x3, lsl 2] uzp1v30.2s, v30.2s, v31.2s ldr q31, [x7, x1] add x1, x1, 16 uzp1v1.2s, v1.2s, v29.2s zip1v30.4s, v1.4s, v30.4s fmlav0.4s, v31.4s, v30.4s cmp x1, x8 bne .L2 but during costing: a[i_18] 1 times vector_load costs 4 in body *_4 1 times unaligned_load (misalign -1) costs 4 in body b[_5] 4 times vec_to_scalar costs 32 in body b[_5] 4 times scalar_load costs 16 in body b[_5] 1 times vec_construct costs 3 in body _1 * _6 1 times vector_stmt costs 2 in body _7 + sum_16 1 times scalar_to_vec costs 4 in prologue _7 + sum_16 1 times vector_stmt costs 2 in epilogue _7 + sum_16 1 times vec_to_scalar costs 4 in epilogue _7 + sum_16 1 times vector_stmt costs 2 in body Here we see that the latency for the vec_to_scalar is very high. We know the intermediate vector isn't usable by the target ISA and will always be elided. However these latencies need to remain high because when costing gather/scatters IFNs we still pass the nunits of the type along. In other words, the vectorizer is still costing vector gather/scatters as scalar load/stores. Lowering the cost for the emulated gathers would result in emulation being seemingly cheaper. So while the emulated costs are very high, they need to be higher than those for the IFN costing. i.e. the vectorizer generates: vect__5.9_8 = MEM [(intD.7 *)vectp_ip.7_14]; _35 = BIT_FIELD_REF ; _36 = (sizetype) _35; _37 = _36 * 4; _38 = _34 + _37; _39 = (voidD.55 *) _38; # VUSE <.MEM_10(D)> _40 = MEM[(floatD.32 *)_39]; which after IVopts is: _63 = &MEM [(int *)ip_11(D) + ivtmp.19_27 * 1]; _47 = BIT_FIELD_REF [(int *)_63], 32, 64>; _41 = BIT_FIELD_REF [(int *)_63], 32, 32>; _35 = BIT_FIELD_REF [(int *)_63], 32, 0>; _53 = BIT_FIELD_REF [(int *)_63], 32, 96>; Which we correctly lower in RTL to individual loads to avoid the repeated umov. 
As such, we should cost the vec_to_scalar as GPR loads and also do so for the throughput, which at the moment we cost as: note: Vector issue estimate: note:load operations = 6 note:store operations = 0 note:general operations = 6 note:reduction latency = 2 note:estimated min cycles per iteration = 2.00 Which means 3 loads for the GPR indexes are missing, making it seem like the emulated loop has a much lower cycles per iter than it actually does since the bottleneck on the load units is not modelled. But worse, because the vectorizer costs gather/scatter IFNs as scalar load/stores the number of loads required for an SVE gather is always much higher than the equivalent emulated variant. gcc/ChangeLog: PR target/118188 * config/aarch64/aarch64.cc (aarch64_vector_costs::count_ops): Adjust throughput of emu
[gcc r14-11199] AArch64: correct Cortex-X4 MIDR
https://gcc.gnu.org/g:26f78a4249b051c7755a44ba1ab1743f4133b0c2 commit r14-11199-g26f78a4249b051c7755a44ba1ab1743f4133b0c2 Author: Tamar Christina Date: Fri Jan 10 21:33:57 2025 + AArch64: correct Cortex-X4 MIDR The Parts Num field for the MIDR for Cortex-X4 is wrong. It's currently the parts number for a Cortex-A720 (which does have the right number). The correct number can be found in the Cortex-X4 Technical Reference Manual [1] on page 382 in Issue Number 5. [1] https://developer.arm.com/documentation/102484/latest/ gcc/ChangeLog: * config/aarch64/aarch64-cores.def (AARCH64_CORE): Fix cortex-x4 parts num. Diff: --- gcc/config/aarch64/aarch64-cores.def | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index a919ab7d8a5a..b1eaf5512b57 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -185,7 +185,7 @@ AARCH64_CORE("cortex-x2", cortexx2, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8M AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd4e, -1) -AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1) +AARCH64_CORE("cortex-x4", cortexx4, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd82, -1) AARCH64_CORE("cortex-x925", cortexx925, cortexa57, V9_2A, (SVE2_BITPERM, MEMTAG, PROFILE), neoversen2, 0x41, 0xd85, -1)
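For reference, -mcpu=native matches the implementer and part fields that the Linux kernel exposes in /proc/cpuinfo. After this fix a Cortex-X4 core is identified from lines of the following shape (illustrative excerpt, other fields omitted):

CPU implementer : 0x41
CPU part        : 0xd82

With the old table value (0xd81, the Cortex-A720 part number) a real Cortex-X4 reporting 0xd82 matched no entry at all and fell back to the generic handling.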
[gcc r15-7094] aarch64: Drop ILP32 from default elf multilibs after deprecation
https://gcc.gnu.org/g:9fd190c70976638eb8ae239f09d9f73da26d3021 commit r15-7094-g9fd190c70976638eb8ae239f09d9f73da26d3021 Author: Tamar Christina Date: Tue Jan 21 10:27:13 2025 + aarch64: Drop ILP32 from default elf multilibs after deprecation Following the deprecation of ILP32 *-elf builds fail now due to -Werror on the deprecation warning. This is because on embedded builds ILP32 is part of the default multilib. This patch removed it from the default target as the build would fail anyway. gcc/ChangeLog: * config.gcc (aarch64-*-elf): Drop ILP32 from default multilibs. Diff: --- gcc/config.gcc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/config.gcc b/gcc/config.gcc index c0e66a26f953..6f9f7313e132 100644 --- a/gcc/config.gcc +++ b/gcc/config.gcc @@ -1210,7 +1210,7 @@ aarch64*-*-elf | aarch64*-*-fuchsia* | aarch64*-*-rtems*) esac aarch64_multilibs="${with_multilib_list}" if test "$aarch64_multilibs" = "default"; then - aarch64_multilibs="lp64,ilp32" + aarch64_multilibs="lp64" fi aarch64_multilibs=`echo $aarch64_multilibs | sed -e 's/,/ /g'` for aarch64_multilib in ${aarch64_multilibs}; do
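For builds that still need the old behaviour, the multilib set can be requested explicitly at configure time (hedged example; the build directory layout and target triplet are illustrative, the option is the with_multilib_list knob read in the hunk above):

../gcc/configure --target=aarch64-none-elf --with-multilib-list=lp64,ilp32 ...

In other words the change only makes --with-multilib-list=lp64 the effective default; asking for ilp32 again is still possible but will likely run into the same -Werror'd deprecation warning that motivated the change.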
[gcc r15-7018] AArch64: Use standard names for saturating arithmetic
https://gcc.gnu.org/g:aa361611490947eb228e5b625a3f0f23ff647dbd commit r15-7018-gaa361611490947eb228e5b625a3f0f23ff647dbd Author: Akram Ahmad Date: Fri Jan 17 17:43:49 2025 + AArch64: Use standard names for saturating arithmetic This renames the existing {s,u}q{add,sub} instructions to use the standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and IFN_SAT_SUB. The NEON intrinsics for saturating arithmetic and their corresponding builtins are changed to use these standard names too. Using the standard names for the instructions causes 32 and 64-bit unsigned scalar saturating arithmetic to use the NEON instructions, resulting in an additional (and inefficient) FMOV to be generated when the original operands are in GP registers. This patch therefore also restores the original behaviour of using the adds/subs instructions in this circumstance. Additional tests are written for the scalar and Adv. SIMD cases to ensure that the correct instructions are used. The NEON intrinsics are already tested elsewhere. gcc/ChangeLog: * config/aarch64/aarch64-builtins.cc: Expand iterators. * config/aarch64/aarch64-simd-builtins.def: Use standard names * config/aarch64/aarch64-simd.md: Use standard names, split insn definitions on signedness of operator and type of operands. * config/aarch64/arm_neon.h: Use standard builtin names. * config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to simplify splitting of insn for unsigned scalar arithmetic. gcc/testsuite/ChangeLog: * gcc.target/aarch64/scalar_intrinsics.c: Update testcases. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc: Template file for unsigned vector saturating arithmetic tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c: 8-bit vector type tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c: 16-bit vector type tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c: 32-bit vector type tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c: 64-bit vector type tests. * gcc.target/aarch64/saturating_arithmetic.inc: Template file for scalar saturating arithmetic tests. * gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests. * gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests. * gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests. * gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests. 
Co-authored-by: Tamar Christina Diff: --- gcc/config/aarch64/aarch64-builtins.cc | 12 + gcc/config/aarch64/aarch64-simd-builtins.def | 8 +- gcc/config/aarch64/aarch64-simd.md | 207 +++- gcc/config/aarch64/arm_neon.h | 96 gcc/config/aarch64/iterators.md| 4 + .../saturating_arithmetic_autovect.inc | 58 + .../saturating_arithmetic_autovect_1.c | 79 ++ .../saturating_arithmetic_autovect_2.c | 79 ++ .../saturating_arithmetic_autovect_3.c | 75 ++ .../saturating_arithmetic_autovect_4.c | 77 ++ .../aarch64/saturating-arithmetic-signed.c | 270 + .../gcc.target/aarch64/saturating_arithmetic.inc | 39 +++ .../gcc.target/aarch64/saturating_arithmetic_1.c | 36 +++ .../gcc.target/aarch64/saturating_arithmetic_2.c | 36 +++ .../gcc.target/aarch64/saturating_arithmetic_3.c | 30 +++ .../gcc.target/aarch64/saturating_arithmetic_4.c | 30 +++ .../gcc.target/aarch64/scalar_intrinsics.c | 32 +-- 17 files changed, 1096 insertions(+), 72 deletions(-) diff --git a/gcc/config/aarch64/aarch64-builtins.cc b/gcc/config/aarch64/aarch64-builtins.cc index 86eebc168859..6d5479c2e449 100644 --- a/gcc/config/aarch64/aarch64-builtins.cc +++ b/gcc/config/aarch64/aarch64-builtins.cc @@ -5039,6 +5039,18 @@ aarch64_general_gimple_fold_builtin (unsigned int fcode, gcall *stmt, new_stmt = gimple_build_assign (gimple_call_lhs (stmt), LSHIFT_EXPR, args[0], args[1]); break; + /* lower saturating add/sub neon builtins to gimple. */ + BUILTIN_VSDQ_I (BINOP, ssadd, 3, DEFAULT) + BUILTIN_VSDQ_I (BINOPU, usadd, 3, DEFAULT) + new_stmt = gimple_build_call_internal (IFN_SAT_ADD, 2, args[0], args[1]); + gimple_call_set_lhs (new_stmt, gimple_call_lhs (stmt)); + break; + BUILTIN_VSDQ_I (BINOP, sssub, 3, DEFAULT) + BUILTIN_VSDQ_I (BINOPU, ussub, 3, DEFAULT) + new_stmt = gimple_build_call_intern
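For reference, the C idiom these standard names serve (the same shape as the saturating_arithmetic.inc templates elsewhere in this series): GCC matches it to IFN_SAT_ADD/IFN_SAT_SUB and, with this patch, expands it through the us{add,sub}<mode>3 and ss{add,sub}<mode>3 patterns, typically adds+csinv or subs+csel for scalar GPR operands and uqadd/uqsub for vector code:

#include <stdint.h>

uint32_t sat_addu (uint32_t a, uint32_t b)
{
  uint32_t sum = a + b;
  return sum < a ? UINT32_MAX : sum;   /* clamp to UINT32_MAX on overflow */
}

uint32_t sat_subu (uint32_t a, uint32_t b)
{
  uint32_t dif = a - b;
  return dif > a ? 0 : dif;            /* clamp to 0 on underflow */
}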
[gcc r15-7015] Revert "AArch64: Use standard names for SVE saturating arithmetic"
https://gcc.gnu.org/g:8787f63de6e51bc43f86bb08c8a5f4a370246a90 commit r15-7015-g8787f63de6e51bc43f86bb08c8a5f4a370246a90 Author: Tamar Christina Date: Sat Jan 18 11:12:35 2025 + Revert "AArch64: Use standard names for SVE saturating arithmetic" This reverts commit 26b2d9f27ca24f0705641a85f29d179fa0600869. Diff: --- gcc/config/aarch64/aarch64-sve.md | 4 +- .../aarch64/sve/saturating_arithmetic.inc | 68 -- .../aarch64/sve/saturating_arithmetic_1.c | 60 --- .../aarch64/sve/saturating_arithmetic_2.c | 60 --- .../aarch64/sve/saturating_arithmetic_3.c | 62 .../aarch64/sve/saturating_arithmetic_4.c | 62 6 files changed, 2 insertions(+), 314 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index e975286a0190..ba4b4d904c77 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -4449,7 +4449,7 @@ ;; - ;; Unpredicated saturating signed addition and subtraction. -(define_insn "s3" +(define_insn "@aarch64_sve_" [(set (match_operand:SVE_FULL_I 0 "register_operand") (SBINQOPS:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand") @@ -4465,7 +4465,7 @@ ) ;; Unpredicated saturating unsigned addition and subtraction. -(define_insn "s3" +(define_insn "@aarch64_sve_" [(set (match_operand:SVE_FULL_I 0 "register_operand") (UBINQOPS:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand") diff --git a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc deleted file mode 100644 index 0b3ebbcb0d6f.. --- a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc +++ /dev/null @@ -1,68 +0,0 @@ -/* Template file for vector saturating arithmetic validation. - - This file defines saturating addition and subtraction functions for a given - scalar type, testing the auto-vectorization of these two operators. This - type, along with the corresponding minimum and maximum values for that type, - must be defined by any test file which includes this template file. */ - -#ifndef SAT_ARIT_AUTOVEC_INC -#define SAT_ARIT_AUTOVEC_INC - -#include -#include - -#ifndef UT -#define UT uint32_t -#define UMAX UINT_MAX -#define UMIN 0 -#endif - -void uaddq (UT *out, UT *a, UT *b, int n) -{ - for (int i = 0; i < n; i++) -{ - UT sum = a[i] + b[i]; - out[i] = sum < a[i] ? UMAX : sum; -} -} - -void uaddq2 (UT *out, UT *a, UT *b, int n) -{ - for (int i = 0; i < n; i++) -{ - UT sum; - if (!__builtin_add_overflow(a[i], b[i], &sum)) - out[i] = sum; - else - out[i] = UMAX; -} -} - -void uaddq_imm (UT *out, UT *a, int n) -{ - for (int i = 0; i < n; i++) -{ - UT sum = a[i] + 50; - out[i] = sum < a[i] ? UMAX : sum; -} -} - -void usubq (UT *out, UT *a, UT *b, int n) -{ - for (int i = 0; i < n; i++) -{ - UT sum = a[i] - b[i]; - out[i] = sum > a[i] ? UMIN : sum; -} -} - -void usubq_imm (UT *out, UT *a, int n) -{ - for (int i = 0; i < n; i++) -{ - UT sum = a[i] - 50; - out[i] = sum > a[i] ? UMIN : sum; -} -} - -#endif \ No newline at end of file diff --git a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c deleted file mode 100644 index 6936e9a27044.. --- a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c +++ /dev/null @@ -1,60 +0,0 @@ -/* { dg-do compile { target { aarch64*-*-* } } } */ -/* { dg-options "-O2 --save-temps -ftree-vectorize" } */ -/* { dg-final { check-function-bodies "**" "" "" } } */ - -/* -** uaddq: -** ... 
-** ld1b\tz([0-9]+)\.b, .* -** ld1b\tz([0-9]+)\.b, .* -** uqadd\tz\2.b, z\1\.b, z\2\.b -** ... -** ldr\tb([0-9]+), .* -** ldr\tb([0-9]+), .* -** uqadd\tb\4, b\3, b\4 -** ... -*/ -/* -** uaddq2: -** ... -** ld1b\tz([0-9]+)\.b, .* -** ld1b\tz([0-9]+)\.b, .* -** uqadd\tz\2.b, z\1\.b, z\2\.b -** ... -** ldr\tb([0-9]+), .* -** ldr\tb([0-9]+), .* -** uqadd\tb\4, b\3, b\4 -** ... -*/ -/* -** uaddq_imm: -** ... -** ld1b\tz([0-9]+)\.b, .* -** uqadd\tz\1.b, z\1\.b, #50 -** ... -** movi\tv([0-9]+)\.8b, 0x32 -** ... -** ldr\tb([0-9]+), .* -** uqadd\tb\3, b\3, b\2 -** ... -*/ -/* -** usubq: { xfail *-*-* } -** ... -** ld1b\tz([0-9]+)\.b, .* -** ld1b\tz([0-9]+)\.b, .* -** uqsub\tz\2.b, z\1\.b, z\2\.b -** ... -** ldr\tb([0-9]+), .* -** ldr\tb([0-9]+), .* -** uqsub\tb\4, b\3, b\4 -** ... -*/ - -#include - -#define UT unsigned char -#define UMAX UCHAR_MAX -#define UMIN 0 - -#include "saturating_arithmetic.inc" \ No newline at end of file diff --git a/gcc/testsuite/gcc
[gcc r15-7016] Revert "AArch64: Use standard names for saturating arithmetic"
https://gcc.gnu.org/g:1775a7280a230776927897147f1b07964cf5cfc7 commit r15-7016-g1775a7280a230776927897147f1b07964cf5cfc7 Author: Tamar Christina Date: Sat Jan 18 11:12:38 2025 + Revert "AArch64: Use standard names for saturating arithmetic" This reverts commit 5f5833a4107ddfbcd87651bf140151de043f4c36. Diff: --- gcc/config/aarch64/aarch64-builtins.cc | 12 - gcc/config/aarch64/aarch64-simd-builtins.def | 8 +- gcc/config/aarch64/aarch64-simd.md | 207 +--- gcc/config/aarch64/arm_neon.h | 96 gcc/config/aarch64/iterators.md| 4 - .../saturating_arithmetic_autovect.inc | 58 - .../saturating_arithmetic_autovect_1.c | 79 -- .../saturating_arithmetic_autovect_2.c | 79 -- .../saturating_arithmetic_autovect_3.c | 75 -- .../saturating_arithmetic_autovect_4.c | 77 -- .../aarch64/saturating-arithmetic-signed.c | 270 - .../gcc.target/aarch64/saturating_arithmetic.inc | 39 --- .../gcc.target/aarch64/saturating_arithmetic_1.c | 36 --- .../gcc.target/aarch64/saturating_arithmetic_2.c | 36 --- .../gcc.target/aarch64/saturating_arithmetic_3.c | 30 --- .../gcc.target/aarch64/saturating_arithmetic_4.c | 30 --- .../gcc.target/aarch64/scalar_intrinsics.c | 32 +-- 17 files changed, 72 insertions(+), 1096 deletions(-) diff --git a/gcc/config/aarch64/aarch64-builtins.cc b/gcc/config/aarch64/aarch64-builtins.cc index 6d5479c2e449..86eebc168859 100644 --- a/gcc/config/aarch64/aarch64-builtins.cc +++ b/gcc/config/aarch64/aarch64-builtins.cc @@ -5039,18 +5039,6 @@ aarch64_general_gimple_fold_builtin (unsigned int fcode, gcall *stmt, new_stmt = gimple_build_assign (gimple_call_lhs (stmt), LSHIFT_EXPR, args[0], args[1]); break; - /* lower saturating add/sub neon builtins to gimple. */ - BUILTIN_VSDQ_I (BINOP, ssadd, 3, DEFAULT) - BUILTIN_VSDQ_I (BINOPU, usadd, 3, DEFAULT) - new_stmt = gimple_build_call_internal (IFN_SAT_ADD, 2, args[0], args[1]); - gimple_call_set_lhs (new_stmt, gimple_call_lhs (stmt)); - break; - BUILTIN_VSDQ_I (BINOP, sssub, 3, DEFAULT) - BUILTIN_VSDQ_I (BINOPU, ussub, 3, DEFAULT) - new_stmt = gimple_build_call_internal (IFN_SAT_SUB, 2, args[0], args[1]); - gimple_call_set_lhs (new_stmt, gimple_call_lhs (stmt)); - break; - BUILTIN_VSDQ_I_DI (BINOP, sshl, 0, DEFAULT) BUILTIN_VSDQ_I_DI (BINOP_UUS, ushl, 0, DEFAULT) { diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def index 6cc45b18a723..286272a33118 100644 --- a/gcc/config/aarch64/aarch64-simd-builtins.def +++ b/gcc/config/aarch64/aarch64-simd-builtins.def @@ -71,10 +71,10 @@ BUILTIN_VSDQ_I (BINOP, sqrshl, 0, DEFAULT) BUILTIN_VSDQ_I (BINOP_UUS, uqrshl, 0, DEFAULT) /* Implemented by aarch64_. */ - BUILTIN_VSDQ_I (BINOP, ssadd, 3, DEFAULT) - BUILTIN_VSDQ_I (BINOPU, usadd, 3, DEFAULT) - BUILTIN_VSDQ_I (BINOP, sssub, 3, DEFAULT) - BUILTIN_VSDQ_I (BINOPU, ussub, 3, DEFAULT) + BUILTIN_VSDQ_I (BINOP, sqadd, 0, DEFAULT) + BUILTIN_VSDQ_I (BINOPU, uqadd, 0, DEFAULT) + BUILTIN_VSDQ_I (BINOP, sqsub, 0, DEFAULT) + BUILTIN_VSDQ_I (BINOPU, uqsub, 0, DEFAULT) /* Implemented by aarch64_qadd. 
*/ BUILTIN_VSDQ_I (BINOP_SSU, suqadd, 0, DEFAULT) BUILTIN_VSDQ_I (BINOP_UUS, usqadd, 0, DEFAULT) diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index e2afe87e5130..eeb626f129a8 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -5162,214 +5162,15 @@ ) ;; q -(define_insn "s3" - [(set (match_operand:VSDQ_I_QI_HI 0 "register_operand" "=w") - (BINQOPS:VSDQ_I_QI_HI - (match_operand:VSDQ_I_QI_HI 1 "register_operand" "w") - (match_operand:VSDQ_I_QI_HI 2 "register_operand" "w")))] +(define_insn "aarch64_q" + [(set (match_operand:VSDQ_I 0 "register_operand" "=w") + (BINQOPS:VSDQ_I (match_operand:VSDQ_I 1 "register_operand" "w") + (match_operand:VSDQ_I 2 "register_operand" "w")))] "TARGET_SIMD" "q\\t%0, %1, %2" [(set_attr "type" "neon_q")] ) -(define_expand "s3" - [(parallel -[(set (match_operand:GPI 0 "register_operand") - (SBINQOPS:GPI (match_operand:GPI 1 "register_operand") - (match_operand:GPI 2 "aarch64_plus_operand"))) -(clobber (scratch:GPI)) -(clobber (reg:CC CC_REGNUM))])] -) - -;; Introducing a temporary GP reg allows signed saturating arithmetic with GPR -;; operands to be calculated without the use of costly transfers to and from FP -;; registers. For example, saturating addition usually uses three FMOVs: -;; -;; fmov d0, x0 -;; fmov d1, x1 -;; sqadd d0, d0, d1 -;; fmov x0, d0 -;; -;
[gcc r15-7017] AArch64: Use standard names for SVE saturating arithmetic
https://gcc.gnu.org/g:8f8ca83f2f6f165c4060ee1fc18ed3c74571ab7a commit r15-7017-g8f8ca83f2f6f165c4060ee1fc18ed3c74571ab7a Author: Akram Ahmad Date: Fri Jan 17 17:44:23 2025 + AArch64: Use standard names for SVE saturating arithmetic Rename the existing SVE unpredicated saturating arithmetic instructions to use standard names which are used by IFN_SAT_ADD and IFN_SAT_SUB. gcc/ChangeLog: * config/aarch64/aarch64-sve.md: Rename insns gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/saturating_arithmetic.inc: Template file for auto-vectorizer tests. * gcc.target/aarch64/sve/saturating_arithmetic_1.c: Instantiate 8-bit vector tests. * gcc.target/aarch64/sve/saturating_arithmetic_2.c: Instantiate 16-bit vector tests. * gcc.target/aarch64/sve/saturating_arithmetic_3.c: Instantiate 32-bit vector tests. * gcc.target/aarch64/sve/saturating_arithmetic_4.c: Instantiate 64-bit vector tests. Diff: --- gcc/config/aarch64/aarch64-sve.md | 4 +- .../aarch64/sve/saturating_arithmetic.inc | 68 ++ .../aarch64/sve/saturating_arithmetic_1.c | 60 +++ .../aarch64/sve/saturating_arithmetic_2.c | 60 +++ .../aarch64/sve/saturating_arithmetic_3.c | 62 .../aarch64/sve/saturating_arithmetic_4.c | 62 6 files changed, 314 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index ba4b4d904c77..e975286a0190 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -4449,7 +4449,7 @@ ;; - ;; Unpredicated saturating signed addition and subtraction. -(define_insn "@aarch64_sve_" +(define_insn "s3" [(set (match_operand:SVE_FULL_I 0 "register_operand") (SBINQOPS:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand") @@ -4465,7 +4465,7 @@ ) ;; Unpredicated saturating unsigned addition and subtraction. -(define_insn "@aarch64_sve_" +(define_insn "s3" [(set (match_operand:SVE_FULL_I 0 "register_operand") (UBINQOPS:SVE_FULL_I (match_operand:SVE_FULL_I 1 "register_operand") diff --git a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc new file mode 100644 index ..0b3ebbcb0d6f --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic.inc @@ -0,0 +1,68 @@ +/* Template file for vector saturating arithmetic validation. + + This file defines saturating addition and subtraction functions for a given + scalar type, testing the auto-vectorization of these two operators. This + type, along with the corresponding minimum and maximum values for that type, + must be defined by any test file which includes this template file. */ + +#ifndef SAT_ARIT_AUTOVEC_INC +#define SAT_ARIT_AUTOVEC_INC + +#include +#include + +#ifndef UT +#define UT uint32_t +#define UMAX UINT_MAX +#define UMIN 0 +#endif + +void uaddq (UT *out, UT *a, UT *b, int n) +{ + for (int i = 0; i < n; i++) +{ + UT sum = a[i] + b[i]; + out[i] = sum < a[i] ? UMAX : sum; +} +} + +void uaddq2 (UT *out, UT *a, UT *b, int n) +{ + for (int i = 0; i < n; i++) +{ + UT sum; + if (!__builtin_add_overflow(a[i], b[i], &sum)) + out[i] = sum; + else + out[i] = UMAX; +} +} + +void uaddq_imm (UT *out, UT *a, int n) +{ + for (int i = 0; i < n; i++) +{ + UT sum = a[i] + 50; + out[i] = sum < a[i] ? UMAX : sum; +} +} + +void usubq (UT *out, UT *a, UT *b, int n) +{ + for (int i = 0; i < n; i++) +{ + UT sum = a[i] - b[i]; + out[i] = sum > a[i] ? UMIN : sum; +} +} + +void usubq_imm (UT *out, UT *a, int n) +{ + for (int i = 0; i < n; i++) +{ + UT sum = a[i] - 50; + out[i] = sum > a[i] ? 
UMIN : sum; +} +} + +#endif \ No newline at end of file diff --git a/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c new file mode 100644 index ..6936e9a27044 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/saturating_arithmetic_1.c @@ -0,0 +1,60 @@ +/* { dg-do compile { target { aarch64*-*-* } } } */ +/* { dg-options "-O2 --save-temps -ftree-vectorize" } */ +/* { dg-final { check-function-bodies "**" "" "" } } */ + +/* +** uaddq: +** ... +** ld1b\tz([0-9]+)\.b, .* +** ld1b\tz([0-9]+)\.b, .* +** uqadd\tz\2.b, z\1\.b, z\2\.b +** ... +** ldr\tb([0-9]+), .* +** ldr\tb([0-9]+), .* +** uqadd\tb\4, b\3, b\4 +** ... +*/ +/* +** uaddq2: +** ... +** ld1b\tz([0-9]+)\.b, .* +** ld1b\tz([0-9]+)\.b, .* +** uqadd\tz\2.b, z\1\.b,
[gcc r13-9351] AArch64: have -mcpu=native detect architecture extensions for unknown non-homogenous systems [PR113257]
https://gcc.gnu.org/g:eb45b829bb3fb658aa34a340264dee9755d34e69 commit r13-9351-geb45b829bb3fb658aa34a340264dee9755d34e69 Author: Tamar Christina Date: Thu Jan 16 19:25:26 2025 + AArch64: have -mcpu=native detect architecture extensions for unknown non-homogenous systems [PR113257] in g:e91a17fe39c39e98cebe6e1cbc8064ee6846a3a7 we added the ability for -mcpu=native on unknown CPUs to still enable architecture extensions. This has worked great but was only added for homogenous systems. However the same thing works for big.LITTLE as in such system the cores must have the same extensions otherwise it doesn't fundamentally work. i.e. task migration from one core to the other wouldn't work. This extends the same handling to non-homogenous systems. gcc/ChangeLog: PR target/113257 * config/aarch64/driver-aarch64.cc (get_cpu_from_id, DEFAULT_CPU): New. (host_detect_local_cpu): Use it. gcc/testsuite/ChangeLog: PR target/113257 * gcc.target/aarch64/cpunative/info_34: New test. * gcc.target/aarch64/cpunative/native_cpu_34.c: New test. * gcc.target/aarch64/cpunative/info_35: New test. * gcc.target/aarch64/cpunative/native_cpu_35.c: New test. Co-authored-by: Richard Sandiford (cherry picked from commit 1ff85affe46623fe1a970de95887df22f4da9d16) Diff: --- gcc/config/aarch64/driver-aarch64.cc | 52 -- gcc/testsuite/gcc.target/aarch64/cpunative/info_34 | 18 gcc/testsuite/gcc.target/aarch64/cpunative/info_35 | 18 .../gcc.target/aarch64/cpunative/native_cpu_34.c | 12 + .../gcc.target/aarch64/cpunative/native_cpu_35.c | 13 ++ 5 files changed, 99 insertions(+), 14 deletions(-) diff --git a/gcc/config/aarch64/driver-aarch64.cc b/gcc/config/aarch64/driver-aarch64.cc index 8e318892b10a..ff4660f469cd 100644 --- a/gcc/config/aarch64/driver-aarch64.cc +++ b/gcc/config/aarch64/driver-aarch64.cc @@ -60,6 +60,7 @@ struct aarch64_core_data #define ALL_VARIANTS ((unsigned)-1) /* Default architecture to use if -mcpu=native did not detect a known CPU. */ #define DEFAULT_ARCH "8A" +#define DEFAULT_CPU "generic-armv8-a" #define AARCH64_CORE(CORE_NAME, CORE_IDENT, SCHED, ARCH, FLAGS, COSTS, IMP, PART, VARIANT) \ { CORE_NAME, #ARCH, IMP, PART, VARIANT, feature_deps::cpu_##CORE_IDENT }, @@ -106,6 +107,19 @@ get_arch_from_id (const char* id) return NULL; } +/* Return an aarch64_core_data for the cpu described + by ID, or NULL if ID describes something we don't know about. */ + +static const aarch64_core_data * +get_cpu_from_id (const char* name) +{ + for (unsigned i = 0; aarch64_cpu_data[i].name != NULL; i++) +if (strcmp (name, aarch64_cpu_data[i].name) == 0) + return &aarch64_cpu_data[i]; + + return NULL; +} + /* Check wether the CORE array is the same as the big.LITTLE BL_CORE. For an example CORE={0xd08, 0xd03} and BL_CORE=AARCH64_BIG_LITTLE (0xd08, 0xd03) will return true. */ @@ -394,18 +408,11 @@ host_detect_local_cpu (int argc, const char **argv) || variants[0] == aarch64_cpu_data[i].variant)) break; - if (aarch64_cpu_data[i].name == NULL) + if (arch) { - auto arch_info = get_arch_from_id (DEFAULT_ARCH); - - gcc_assert (arch_info); - - res = concat ("-march=", arch_info->name, NULL); - default_flags = arch_info->flags; - } - else if (arch) - { - const char *arch_id = aarch64_cpu_data[i].arch; + const char *arch_id = (aarch64_cpu_data[i].name +? aarch64_cpu_data[i].arch +: DEFAULT_ARCH); auto arch_info = get_arch_from_id (arch_id); /* We got some arch indentifier that's not in aarch64-arches.def? 
*/ @@ -415,12 +422,15 @@ host_detect_local_cpu (int argc, const char **argv) res = concat ("-march=", arch_info->name, NULL); default_flags = arch_info->flags; } - else + else if (cpu || aarch64_cpu_data[i].name) { - default_flags = aarch64_cpu_data[i].flags; + auto cpu_info = (aarch64_cpu_data[i].name + ? &aarch64_cpu_data[i] + : get_cpu_from_id (DEFAULT_CPU)); + default_flags = cpu_info->flags; res = concat ("-m", cpu ? "cpu" : "tune", "=", - aarch64_cpu_data[i].name, + cpu_info->name, NULL); } } @@ -440,6 +450,20 @@ host_detect_local_cpu (int argc, const char **argv) break; } } + + /* On big.LITTLE if we find any unknown CPUs we can still pick arch +features as the cores should have the same features. So just pick +the feature flags from any of the cpus.
[gcc r13-9352] AArch64: don't override march to assembler with mcpu if march is specified [PR110901]
https://gcc.gnu.org/g:57a9595f05efe2839a39e711c6cf3ce21ca1ff33 commit r13-9352-g57a9595f05efe2839a39e711c6cf3ce21ca1ff33 Author: Tamar Christina Date: Thu Jan 16 19:23:50 2025 + AArch64: don't override march to assembler with mcpu if march is specified [PR110901] When both -mcpu and -march are specified, the value of -march wins out. This is done correctly for the calls to cc1 and for the assembler directives we put out in assembly files. However in the call to as we don't do this and instead use the arch from the cpu. This leads to a situation that GCC cannot reliably be used to compile assembly files which don't have a .arch directive. This is quite common with .S files which use macros to selectively enable codepath based on what the preprocessor sees. The fix is to change MCPU_TO_MARCH_SPEC to not override the march if an march is already specified. gcc/ChangeLog: PR target/110901 * config/aarch64/aarch64.h (MCPU_TO_MARCH_SPEC): Don't override if march is set. gcc/testsuite/ChangeLog: PR target/110901 * gcc.target/aarch64/options_set_29.c: New test. (cherry picked from commit 773beeaafb0ea31bd4e308b64781731d64b571ce) Diff: --- gcc/config/aarch64/aarch64.h | 2 +- gcc/testsuite/gcc.target/aarch64/options_set_29.c | 11 +++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 996a261334a6..77e40c17e354 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -1233,7 +1233,7 @@ extern const char *host_detect_local_cpu (int argc, const char **argv); CONFIG_TUNE_SPEC #define MCPU_TO_MARCH_SPEC \ - " %{mcpu=*:-march=%:rewrite_mcpu(%{mcpu=*:%*})}" + "%{!march=*:%{mcpu=*:-march=%:rewrite_mcpu(%{mcpu=*:%*})}}" extern const char *aarch64_rewrite_mcpu (int argc, const char **argv); #define MCPU_TO_MARCH_SPEC_FUNCTIONS \ diff --git a/gcc/testsuite/gcc.target/aarch64/options_set_29.c b/gcc/testsuite/gcc.target/aarch64/options_set_29.c new file mode 100644 index ..0a68550951ce --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/options_set_29.c @@ -0,0 +1,11 @@ +/* { dg-do assemble } */ +/* { dg-additional-options "-march=armv8.2-a+sve -mcpu=cortex-a72 -O1 -w -###" } */ + +int main () +{ + return 0; +} + +/* { dg-message "-march=armv8-a\+crc" "no arch from cpu" { xfail *-*-* } 0 } */ +/* { dg-message "-march=armv8\\.2-a\\+sve" "using only sve" { target *-*-* } 0 } */ +/* { dg-excess-errors "" } */
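A hedged illustration of the failure mode (the file name and the exact macro test are assumptions; the flags mirror the options_set_29.c test above). Consider a .S file with no .arch directive that gates code on a feature macro visible to the preprocessor:

/* sve_bits.S - deliberately no .arch directive */
#ifdef __ARM_FEATURE_SVE
        ptrue   p0.b
#endif

When built as gcc -march=armv8.2-a+sve -mcpu=cortex-a72 -c sve_bits.S, the preprocessor sees SVE enabled and keeps the instruction, but previously the assembler was driven with the architecture implied by cortex-a72 (armv8-a+crc) and rejected it. With MCPU_TO_MARCH_SPEC fixed, the explicit -march=armv8.2-a+sve now reaches the assembler too.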
[gcc r14-11255] AArch64: don't override march to assembler with mcpu if march is specified [PR110901]
https://gcc.gnu.org/g:f8daec2ad9a20c31a98efb4602080e1e5d0c19fe commit r14-11255-gf8daec2ad9a20c31a98efb4602080e1e5d0c19fe Author: Tamar Christina Date: Thu Jan 16 19:23:50 2025 + AArch64: don't override march to assembler with mcpu if march is specified [PR110901] When both -mcpu and -march are specified, the value of -march wins out. This is done correctly for the calls to cc1 and for the assembler directives we put out in assembly files. However in the call to as we don't do this and instead use the arch from the cpu. This leads to a situation that GCC cannot reliably be used to compile assembly files which don't have a .arch directive. This is quite common with .S files which use macros to selectively enable codepath based on what the preprocessor sees. The fix is to change MCPU_TO_MARCH_SPEC to not override the march if an march is already specified. gcc/ChangeLog: PR target/110901 * config/aarch64/aarch64.h (MCPU_TO_MARCH_SPEC): Don't override if march is set. gcc/testsuite/ChangeLog: PR target/110901 * gcc.target/aarch64/options_set_29.c: New test. (cherry picked from commit 773beeaafb0ea31bd4e308b64781731d64b571ce) Diff: --- gcc/config/aarch64/aarch64.h | 2 +- gcc/testsuite/gcc.target/aarch64/options_set_29.c | 11 +++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 4fa1dfc79065..fe02a02a57b3 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -1448,7 +1448,7 @@ extern const char *host_detect_local_cpu (int argc, const char **argv); CONFIG_TUNE_SPEC #define MCPU_TO_MARCH_SPEC \ - " %{mcpu=*:-march=%:rewrite_mcpu(%{mcpu=*:%*})}" + "%{!march=*:%{mcpu=*:-march=%:rewrite_mcpu(%{mcpu=*:%*})}}" extern const char *aarch64_rewrite_mcpu (int argc, const char **argv); #define MCPU_TO_MARCH_SPEC_FUNCTIONS \ diff --git a/gcc/testsuite/gcc.target/aarch64/options_set_29.c b/gcc/testsuite/gcc.target/aarch64/options_set_29.c new file mode 100644 index ..0a68550951ce --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/options_set_29.c @@ -0,0 +1,11 @@ +/* { dg-do assemble } */ +/* { dg-additional-options "-march=armv8.2-a+sve -mcpu=cortex-a72 -O1 -w -###" } */ + +int main () +{ + return 0; +} + +/* { dg-message "-march=armv8-a\+crc" "no arch from cpu" { xfail *-*-* } 0 } */ +/* { dg-message "-march=armv8\\.2-a\\+sve" "using only sve" { target *-*-* } 0 } */ +/* { dg-excess-errors "" } */
[gcc r14-11254] AArch64: have -mcpu=native detect architecture extensions for unknown non-homogenous systems [PR113257]
https://gcc.gnu.org/g:7c6fde4bac6c20e0b04c3feb820abe5ce0e48d9b commit r14-11254-g7c6fde4bac6c20e0b04c3feb820abe5ce0e48d9b Author: Tamar Christina Date: Thu Jan 16 19:25:26 2025 + AArch64: have -mcpu=native detect architecture extensions for unknown non-homogenous systems [PR113257] in g:e91a17fe39c39e98cebe6e1cbc8064ee6846a3a7 we added the ability for -mcpu=native on unknown CPUs to still enable architecture extensions. This has worked great but was only added for homogenous systems. However the same thing works for big.LITTLE as in such system the cores must have the same extensions otherwise it doesn't fundamentally work. i.e. task migration from one core to the other wouldn't work. This extends the same handling to non-homogenous systems. gcc/ChangeLog: PR target/113257 * config/aarch64/driver-aarch64.cc (get_cpu_from_id, DEFAULT_CPU): New. (host_detect_local_cpu): Use it. gcc/testsuite/ChangeLog: PR target/113257 * gcc.target/aarch64/cpunative/info_34: New test. * gcc.target/aarch64/cpunative/native_cpu_34.c: New test. * gcc.target/aarch64/cpunative/info_35: New test. * gcc.target/aarch64/cpunative/native_cpu_35.c: New test. Co-authored-by: Richard Sandiford (cherry picked from commit 1ff85affe46623fe1a970de95887df22f4da9d16) Diff: --- gcc/config/aarch64/driver-aarch64.cc | 52 -- gcc/testsuite/gcc.target/aarch64/cpunative/info_34 | 18 gcc/testsuite/gcc.target/aarch64/cpunative/info_35 | 18 .../gcc.target/aarch64/cpunative/native_cpu_34.c | 12 + .../gcc.target/aarch64/cpunative/native_cpu_35.c | 13 ++ 5 files changed, 99 insertions(+), 14 deletions(-) diff --git a/gcc/config/aarch64/driver-aarch64.cc b/gcc/config/aarch64/driver-aarch64.cc index b620351e5720..fa0c57e60749 100644 --- a/gcc/config/aarch64/driver-aarch64.cc +++ b/gcc/config/aarch64/driver-aarch64.cc @@ -60,6 +60,7 @@ struct aarch64_core_data #define ALL_VARIANTS ((unsigned)-1) /* Default architecture to use if -mcpu=native did not detect a known CPU. */ #define DEFAULT_ARCH "8A" +#define DEFAULT_CPU "generic-armv8-a" #define AARCH64_CORE(CORE_NAME, CORE_IDENT, SCHED, ARCH, FLAGS, COSTS, IMP, PART, VARIANT) \ { CORE_NAME, #ARCH, IMP, PART, VARIANT, feature_deps::cpu_##CORE_IDENT }, @@ -106,6 +107,19 @@ get_arch_from_id (const char* id) return NULL; } +/* Return an aarch64_core_data for the cpu described + by ID, or NULL if ID describes something we don't know about. */ + +static const aarch64_core_data * +get_cpu_from_id (const char* name) +{ + for (unsigned i = 0; aarch64_cpu_data[i].name != NULL; i++) +if (strcmp (name, aarch64_cpu_data[i].name) == 0) + return &aarch64_cpu_data[i]; + + return NULL; +} + /* Check wether the CORE array is the same as the big.LITTLE BL_CORE. For an example CORE={0xd08, 0xd03} and BL_CORE=AARCH64_BIG_LITTLE (0xd08, 0xd03) will return true. */ @@ -399,18 +413,11 @@ host_detect_local_cpu (int argc, const char **argv) || variants[0] == aarch64_cpu_data[i].variant)) break; - if (aarch64_cpu_data[i].name == NULL) + if (arch) { - auto arch_info = get_arch_from_id (DEFAULT_ARCH); - - gcc_assert (arch_info); - - res = concat ("-march=", arch_info->name, NULL); - default_flags = arch_info->flags; - } - else if (arch) - { - const char *arch_id = aarch64_cpu_data[i].arch; + const char *arch_id = (aarch64_cpu_data[i].name +? aarch64_cpu_data[i].arch +: DEFAULT_ARCH); auto arch_info = get_arch_from_id (arch_id); /* We got some arch indentifier that's not in aarch64-arches.def? 
*/ @@ -420,12 +427,15 @@ host_detect_local_cpu (int argc, const char **argv) res = concat ("-march=", arch_info->name, NULL); default_flags = arch_info->flags; } - else + else if (cpu || aarch64_cpu_data[i].name) { - default_flags = aarch64_cpu_data[i].flags; + auto cpu_info = (aarch64_cpu_data[i].name + ? &aarch64_cpu_data[i] + : get_cpu_from_id (DEFAULT_CPU)); + default_flags = cpu_info->flags; res = concat ("-m", cpu ? "cpu" : "tune", "=", - aarch64_cpu_data[i].name, + cpu_info->name, NULL); } } @@ -445,6 +455,20 @@ host_detect_local_cpu (int argc, const char **argv) break; } } + + /* On big.LITTLE if we find any unknown CPUs we can still pick arch +features as the cores should have the same features. So just pick +the feature flags from any of the cpus
[gcc r15-7095] middle-end: use ncopies both when registering and reading masks [PR118273]
https://gcc.gnu.org/g:1dd79f44dfb64b441f3d6c64e7f909d73441bd05 commit r15-7095-g1dd79f44dfb64b441f3d6c64e7f909d73441bd05 Author: Tamar Christina Date: Tue Jan 21 10:29:08 2025 + middle-end: use ncopies both when registering and reading masks [PR118273] When registering masks for SIMD clone we end up using nmasks instead of nvectors where nmasks seems to compute the number of input masks required for the call given the current simdlen. This is however wrong as vect_record_loop_mask wants to know how many masks you want to create from the given vectype. i.e. which level of rgroups to create. This ends up mismatching with vect_get_loop_mask which uses nvectors and if the return type is narrower than the input types there will be a mismatch which causes us to try to read from the given rgroup. It only happens to work if the function had an additional argument that's wider or if all elements and return types are the same size. This fixes it by using nvectors during registration as well, which has already taken into account SLP and VF. gcc/ChangeLog: PR middle-end/118273 * tree-vect-stmts.cc (vectorizable_simd_clone_call): Use nvectors when doing mask registrations. gcc/testsuite/ChangeLog: PR middle-end/118273 * gcc.target/aarch64/vect-simd-clone-4.c: New test. Diff: --- gcc/testsuite/gcc.target/aarch64/vect-simd-clone-4.c | 15 +++ gcc/tree-vect-stmts.cc | 11 +++ 2 files changed, 18 insertions(+), 8 deletions(-) diff --git a/gcc/testsuite/gcc.target/aarch64/vect-simd-clone-4.c b/gcc/testsuite/gcc.target/aarch64/vect-simd-clone-4.c new file mode 100644 index ..9b52af703933 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/vect-simd-clone-4.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-std=c99" } */ +/* { dg-additional-options "-O3 -march=armv8-a" } */ + +#pragma GCC target ("+sve") + +extern char __attribute__ ((simd, const)) fn3 (short); +void test_fn3 (float *a, float *b, double *c, int n) +{ + for (int i = 0; i < n; ++i) +a[i] = fn3 (c[i]); +} + +/* { dg-final { scan-assembler {\s+_ZGVsMxv_fn3\n} } } */ + diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 833029fcb001..21fb5cf5bd47 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -4561,14 +4561,9 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info, case SIMD_CLONE_ARG_TYPE_MASK: if (loop_vinfo && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) - { - unsigned nmasks - = exact_div (ncopies * bestn->simdclone->simdlen, -TYPE_VECTOR_SUBPARTS (vectype)).to_constant (); - vect_record_loop_mask (loop_vinfo, -&LOOP_VINFO_MASKS (loop_vinfo), -nmasks, vectype, op); - } + vect_record_loop_mask (loop_vinfo, + &LOOP_VINFO_MASKS (loop_vinfo), + ncopies, vectype, op); break; }