[PATCH] PR58669: does not detect all cpu cores/threads
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58669 Testing: $ /usr/lib/jvm/icedtea-6/bin/java TestProcessors Processors: 8 $ /usr/lib/jvm/gcj-jdk/bin/java -version java version "1.5.0" gij (GNU libgcj) version 4.8.1 $ /usr/lib/jvm/gcj-jdk/bin/java TestProcessors Processors: 1 $ /home/andrew/build/gcj/bin/gij -version java version "1.5.0" gij (GNU libgcj) version 4.9.0 20131013 (experimental) [trunk revision 203508] $ /home/andrew/build/gcj/bin/gij TestProcessors Processors: 8 ChangeLog: 2013-10-12 Andrew John Hughes * java/lang/natRuntime.cc: (availableProcessors()): Implement. Fixes PR gcc/58669. Ok for trunk and 4.8? -- Andrew :) Free Java Software Engineer Red Hat, Inc. (http://www.redhat.com) PGP Key: 248BDC07 (https://keys.indymedia.org/) Fingerprint = EC5A 1F5E C0AD 1D15 8F1F 8F91 3B96 A578 248B DC07 Index: libjava/java/lang/natRuntime.cc === --- libjava/java/lang/natRuntime.cc (revision 203508) +++ libjava/java/lang/natRuntime.cc (working copy) @@ -48,6 +48,10 @@ #include #endif +#ifdef HAVE_UNISTD_H +#include <unistd.h> +#endif + #ifdef USE_LTDL @@ -303,8 +307,15 @@ jint java::lang::Runtime::availableProcessors (void) { - // FIXME: find the real value. - return 1; + long procs = -1; + +#ifdef HAVE_UNISTD_H + procs = sysconf(_SC_NPROCESSORS_ONLN); +#endif + + if (procs == -1) +return 1; + return (jint) procs; } jstring
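For anyone who wants to try the underlying query outside of libgcj, here is a minimal standalone C sketch (not part of the patch) of the same sysconf-based probe, with the same fall-back to 1:

#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  long procs = -1;
#ifdef _SC_NPROCESSORS_ONLN
  procs = sysconf (_SC_NPROCESSORS_ONLN);
#endif
  /* Mirror the patch: report 1 if the value is unavailable.  */
  printf ("Processors: %ld\n", procs == -1 ? 1 : procs);
  return 0;
}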
[COMMITTED/13] Fix PR 110386: backprop vs ABSU_EXPR
From: Andrew Pinski The issue here is that when backprop tries to go and strip sign ops, it skips over ABSU_EXPR, but ABSU_EXPR not only does an ABS, it also changes the type to unsigned. Since strip_sign_op_1 is only supposed to strip off sign changing operands and not ones that change types, removing ABSU_EXPR here is correct. We don't handle nop conversions, so this doesn't cause any missed optimizations either. Committed to the GCC 13 branch after bootstrapping and testing on x86_64-linux-gnu with no regressions. PR tree-optimization/110386 gcc/ChangeLog: * gimple-ssa-backprop.cc (strip_sign_op_1): Remove ABSU_EXPR. gcc/testsuite/ChangeLog: * gcc.c-torture/compile/pr110386-1.c: New test. * gcc.c-torture/compile/pr110386-2.c: New test. (cherry picked from commit 2bbac12ea7bd8a3eef5382e1b13f6019df4ec03f) --- gcc/gimple-ssa-backprop.cc | 1 - gcc/testsuite/gcc.c-torture/compile/pr110386-1.c | 9 + gcc/testsuite/gcc.c-torture/compile/pr110386-2.c | 11 +++ 3 files changed, 20 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr110386-1.c create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr110386-2.c diff --git a/gcc/gimple-ssa-backprop.cc b/gcc/gimple-ssa-backprop.cc index 65a65590017..dcb15ed4f61 100644 --- a/gcc/gimple-ssa-backprop.cc +++ b/gcc/gimple-ssa-backprop.cc @@ -694,7 +694,6 @@ strip_sign_op_1 (tree rhs) switch (gimple_assign_rhs_code (assign)) { case ABS_EXPR: - case ABSU_EXPR: case NEGATE_EXPR: return gimple_assign_rhs1 (assign); diff --git a/gcc/testsuite/gcc.c-torture/compile/pr110386-1.c b/gcc/testsuite/gcc.c-torture/compile/pr110386-1.c new file mode 100644 index 000..4fcc977ad16 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/compile/pr110386-1.c @@ -0,0 +1,9 @@ + +int f(int a) +{ +int c = c < 0 ? c : -c; +c = -c; +unsigned b = c; +unsigned t = b*a; +return t*t; +} diff --git a/gcc/testsuite/gcc.c-torture/compile/pr110386-2.c b/gcc/testsuite/gcc.c-torture/compile/pr110386-2.c new file mode 100644 index 000..c60e1b6994b --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/compile/pr110386-2.c @@ -0,0 +1,11 @@ +/* { dg-do compile { target i?86-*-* x86_64-*-* } } */ +/* { dg-options "-mavx" } */ + +#include <immintrin.h> + +__m128i do_stuff(__m128i XMM0) { + __m128i ABS0 = _mm_abs_epi32(XMM0); + __m128i MUL0 = _mm_mullo_epi32(ABS0, XMM0); + __m128i MUL1 = _mm_mullo_epi32(MUL0, MUL0); + return MUL1; +} -- 2.39.3
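A hedged illustration (not from the patch) of why the two tree codes differ for this transform: ABS_EXPR is a pure sign operation, while an ABSU_EXPR result lives in the unsigned type, so stripping it would also drop the signed-to-unsigned conversion:

/* ABS: only the sign changes, and the sign-insensitive use (b * b)
   lets backprop strip it.  */
int
use_abs (int a)
{
  int b = a < 0 ? -a : a;      /* ABS_EXPR */
  return b * b;
}

/* ABSU-like: the absolute value is produced directly in unsigned int,
   so replacing b with a would change the type, not just the sign.  */
unsigned int
use_absu (int a)
{
  unsigned int b = a < 0 ? -(unsigned int) a : (unsigned int) a;
  return b * b;
}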
[COMMITTED/13] Fix PR 111331: wrong code for `a > 28 ? MIN : 29`
From: Andrew Pinski The problem here is that after r6-7425-ga9fee7cdc3c62d0e51730, the comparison used to decide whether the transformation could be done was using the wrong value. Instead of checking that the inner bound was LE (for MIN; GE for MAX) the outer value, it was comparing the bound against the value used in the condition, which was wrong. Committed to the GCC 13 branch after bootstrapping and testing on x86_64-linux-gnu. gcc/ChangeLog: PR tree-optimization/111331 * tree-ssa-phiopt.cc (minmax_replacement): Fix the LE/GE comparison for the `(a CMP CST1) ? max : a` optimization. gcc/testsuite/ChangeLog: PR tree-optimization/111331 * gcc.c-torture/execute/pr111331-1.c: New test. * gcc.c-torture/execute/pr111331-2.c: New test. * gcc.c-torture/execute/pr111331-3.c: New test. (cherry picked from commit 30e6ee074588bacefd2dfe745b188bb20c81fe5e) --- .../gcc.c-torture/execute/pr111331-1.c| 17 + .../gcc.c-torture/execute/pr111331-2.c| 19 +++ .../gcc.c-torture/execute/pr111331-3.c| 15 +++ gcc/tree-ssa-phiopt.cc| 8 4 files changed, 55 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr111331-1.c create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr111331-2.c create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr111331-3.c diff --git a/gcc/testsuite/gcc.c-torture/execute/pr111331-1.c b/gcc/testsuite/gcc.c-torture/execute/pr111331-1.c new file mode 100644 index 000..4c7f4fdbaa9 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/execute/pr111331-1.c @@ -0,0 +1,17 @@ +int a; +int b; +int c(int d, int e, int f) { + if (d < e) +return e; + if (d > f) +return f; + return d; +} +int main() { + int g = -1; + a = c(b + 30, 29, g + 29); + volatile t = a; + if (t != 28) +__builtin_abort(); + return 0; +} diff --git a/gcc/testsuite/gcc.c-torture/execute/pr111331-2.c b/gcc/testsuite/gcc.c-torture/execute/pr111331-2.c new file mode 100644 index 000..5c677f2caa9 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/execute/pr111331-2.c @@ -0,0 +1,19 @@ + +int a; +int b; + +int main() { + int d = b+30; + { +int t; +if (d < 29) + t = 29; +else + t = (d > 28) ? 28 : d; +a = t; + } + volatile int t = a; + if (a != 28) +__builtin_abort(); + return 0; +} diff --git a/gcc/testsuite/gcc.c-torture/execute/pr111331-3.c b/gcc/testsuite/gcc.c-torture/execute/pr111331-3.c new file mode 100644 index 000..213d9bdd539 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/execute/pr111331-3.c @@ -0,0 +1,15 @@ +int a; +int b; + +int main() { + int d = b+30; + { +int t; +t = d < 29 ? 29 : ((d > 28) ? 28 : d); +a = t; + } + volatile int t = a; + if (a != 28) +__builtin_abort(); + return 0; +} diff --git a/gcc/tree-ssa-phiopt.cc b/gcc/tree-ssa-phiopt.cc index a7ab6ce4ad9..c3d78d1400b 100644 --- a/gcc/tree-ssa-phiopt.cc +++ b/gcc/tree-ssa-phiopt.cc @@ -2270,7 +2270,7 @@ minmax_replacement (basic_block cond_bb, basic_block middle_bb, basic_block alt_ /* We need BOUND <= LARGER. */ if (!integer_nonzerop (fold_build2 (LE_EXPR, boolean_type_node, - bound, larger))) + bound, arg_false))) return false; } else if (operand_equal_for_phi_arg_p (arg_false, smaller) @@ -2301,7 +2301,7 @@ minmax_replacement (basic_block cond_bb, basic_block middle_bb, basic_block alt_ /* We need BOUND >= SMALLER. */ if (!integer_nonzerop (fold_build2 (GE_EXPR, boolean_type_node, - bound, smaller))) + bound, arg_false))) return false; } else @@ -2341,7 +2341,7 @@ minmax_replacement (basic_block cond_bb, basic_block middle_bb, basic_block alt_ /* We need BOUND >= LARGER. 
*/ if (!integer_nonzerop (fold_build2 (GE_EXPR, boolean_type_node, - bound, larger))) + bound, arg_true))) return false; } else if (operand_equal_for_phi_arg_p (arg_true, smaller) @@ -2368,7 +2368,7 @@ minmax_replacement (basic_block cond_bb, basic_block middle_bb, basic_block alt_ /* We need BOUND <= SMALLER. */ if (!integer_nonzerop (fold_build2 (LE_EXPR, boolean_type_node, - bound, smaller))) + bound, arg_true))) return false; } else -- 2.39.3
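As a quick sanity check, hand-evaluating the first testcase (the file-scope globals are zero-initialized, so b == 0 and g == -1) shows why 28 is the expected value:

  a = c (b + 30, 29, g + 29)  ==  c (30, 29, 28)
      d < e ?  30 < 29  -> no
      d > f ?  30 > 28  -> yes, return f = 28

so the test aborts unless the MIN/MAX replacement preserves that result.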
[COMMITTED] Return TRUE only when a global value is updated.
set_range_info should return TRUE only when it sets a new value. It was currently returning true whenever it set a value, whether it was different or not. With this change, VRP no longer overwrites global ranges DOM has set. 2 testcases needed adjusting that were expecting VRP2 to set a range but turns out it was really being set in DOM2. Instead they check for the range in the final listing... Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed. Andrew From dae5de2a2353b928cc7099a78d88a40473abefd2 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Wed, 27 Sep 2023 12:34:16 -0400 Subject: [PATCH 1/5] Return TRUE only when a global value is updated. set_range_info should return TRUE only when it sets a new value. VRP no longer overwrites global ranges DOM has set. Check for ranges in the final listing. gcc/ * tree-ssanames.cc (set_range_info): Return true only if the current value changes. gcc/testsuite/ * gcc.dg/pr93917.c: Check for ranges in final optimized listing. * gcc.dg/tree-ssa/vrp-unreachable.c: Ditto. --- gcc/testsuite/gcc.dg/pr93917.c| 4 ++-- .../gcc.dg/tree-ssa/vrp-unreachable.c | 4 ++-- gcc/tree-ssanames.cc | 24 +-- 3 files changed, 16 insertions(+), 16 deletions(-) diff --git a/gcc/testsuite/gcc.dg/pr93917.c b/gcc/testsuite/gcc.dg/pr93917.c index f09e1c41ae8..f636b77f45d 100644 --- a/gcc/testsuite/gcc.dg/pr93917.c +++ b/gcc/testsuite/gcc.dg/pr93917.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-vrp2" } */ +/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-vrp2 -fdump-tree-optimized-alias" } */ void f3(int n); @@ -19,5 +19,5 @@ void f2(int*n) /* { dg-final { scan-tree-dump-times "Global Export.*0, \\+INF" 1 "vrp1" } } */ /* { dg-final { scan-tree-dump-times "__builtin_unreachable" 1 "vrp1" } } */ -/* { dg-final { scan-tree-dump-times "Global Export.*0, \\+INF" 1 "vrp2" } } */ /* { dg-final { scan-tree-dump-times "__builtin_unreachable" 0 "vrp2" } } */ +/* { dg-final { scan-tree-dump-times "0, \\+INF" 2 "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vrp-unreachable.c b/gcc/testsuite/gcc.dg/tree-ssa/vrp-unreachable.c index 5835dfc8dbc..4aad7f1be5d 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/vrp-unreachable.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/vrp-unreachable.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-vrp1-alias -fdump-tree-vrp2-alias" } */ +/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-vrp2 -fdump-tree-optimized-alias" } */ void dead (unsigned n); void alive (unsigned n); @@ -39,4 +39,4 @@ void func (unsigned n, unsigned m) /* { dg-final { scan-tree-dump-not "dead" "vrp1" } } */ /* { dg-final { scan-tree-dump-times "builtin_unreachable" 1 "vrp1" } } */ /* { dg-final { scan-tree-dump-not "builtin_unreachable" "vrp2" } } */ -/* { dg-final { scan-tree-dump-times "fff8 VALUE 0x0" 4 "vrp2" } } */ +/* { dg-final { scan-tree-dump-times "fff8 VALUE 0x0" 2 "optimized" } } */ diff --git a/gcc/tree-ssanames.cc b/gcc/tree-ssanames.cc index 23387b90fe3..1eae411ac1c 100644 --- a/gcc/tree-ssanames.cc +++ b/gcc/tree-ssanames.cc @@ -418,10 +418,17 @@ set_range_info (tree name, const vrange &r) if (r.undefined_p () || r.varying_p ()) return false; + // Pick up the current range, or VARYING if none. 
tree type = TREE_TYPE (name); + Value_Range tmp (type); + if (range_info_p (name)) +range_info_get_range (name, tmp); + else +tmp.set_varying (type); + if (POINTER_TYPE_P (type)) { - if (r.nonzero_p ()) + if (r.nonzero_p () && !tmp.nonzero_p ()) { set_ptr_nonnull (name); return true; @@ -429,18 +436,11 @@ set_range_info (tree name, const vrange &r) return false; } - /* If a global range already exists, incorporate it. */ - if (range_info_p (name)) -{ - Value_Range tmp (type); - range_info_get_range (name, tmp); - tmp.intersect (r); - if (tmp.undefined_p ()) - return false; + // If the result doesn't change, or is undefined, return false. + if (!tmp.intersect (r) || tmp.undefined_p ()) +return false; - return range_info_set_range (name, tmp); -} - return range_info_set_range (name, r); + return range_info_set_range (name, tmp); } /* Set nonnull attribute to pointer NAME. */ -- 2.41.0
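A hedged sketch of the caller-side contract this enables (the bookkeeping flag and surrounding code are hypothetical; set_range_info and the ranger query are existing interfaces):

  /* Re-registering an identical global range now returns false, so a
     caller can use the return value to track real changes only.  */
  Value_Range r (TREE_TYPE (name));
  if (ranger.range_of_expr (r, name, stmt) && set_range_info (name, r))
    globals_updated = true;   /* hypothetical "something changed" flag */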
[COMMITTED] Remove pass counting in VRP.
Pass counting in VRP is used to decide when to call early VRP, when to pass the flag that enables warnings, and which invocation is the final pass. If you try to add additional passes, this becomes quite fragile. This patch simply chooses the pass based on the data pointer passed in, and removes the pass counter. The first FULL VRP pass invokes the warning code, and the flag passed in now represents the FINAL pass of VRP. The global pass counter is gone; as it turns out, it wasn't working well with the JIT compiler, but that had gone undetected. (Thanks to dmalcolm for helping me sort out what was going on there) Bootstraps on x86_64-pc-linux-gnu with no regressions. Pushed. Andrew From 29abc475a360ad14d5f692945f2805fba1fdc679 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Thu, 28 Sep 2023 09:19:32 -0400 Subject: [PATCH 2/5] Remove pass counting in VRP. Rather than using a pass count to decide which parameters are passed to VRP, make it explicit. * passes.def (pass_vrp): Use parameter for final pass flag. * tree-vrp.cc (vrp_pass_num): Remove. (run_warning_pass): New. (pass_vrp::my_pass): Remove. (pass_vrp::final_p): New. (pass_vrp::set_pass_param): Set final_p param. (pass_vrp::execute): Choose specific pass based on data pointer. --- gcc/passes.def | 4 ++-- gcc/tree-vrp.cc | 26 +- 2 files changed, 19 insertions(+), 11 deletions(-) diff --git a/gcc/passes.def b/gcc/passes.def index 4110a472914..2bafd60bbfb 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -221,7 +221,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_fre, true /* may_iterate */); NEXT_PASS (pass_merge_phi); NEXT_PASS (pass_thread_jumps_full, /*first=*/true); - NEXT_PASS (pass_vrp, true /* warn_array_bounds_p */); + NEXT_PASS (pass_vrp, false /* final_p*/); NEXT_PASS (pass_dse); NEXT_PASS (pass_dce); /* pass_stdarg is always run and at this point we execute @@ -348,7 +348,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); NEXT_PASS (pass_strlen); NEXT_PASS (pass_thread_jumps_full, /*first=*/false); - NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */); + NEXT_PASS (pass_vrp, true /* final_p */); /* Run CCP to compute alignment and nonzero bits. */ NEXT_PASS (pass_ccp, true /* nonzero_p */); NEXT_PASS (pass_warn_restrict); diff --git a/gcc/tree-vrp.cc b/gcc/tree-vrp.cc index d7b194f5904..05266dfe34a 100644 --- a/gcc/tree-vrp.cc +++ b/gcc/tree-vrp.cc @@ -1120,36 +1120,44 @@ const pass_data pass_data_early_vrp = ( TODO_cleanup_cfg | TODO_update_ssa | TODO_verify_all ), }; -static int vrp_pass_num = 0; +static bool run_warning_pass = true; class pass_vrp : public gimple_opt_pass { public: pass_vrp (gcc::context *ctxt, const pass_data &data_) -: gimple_opt_pass (data_, ctxt), data (data_), warn_array_bounds_p (false), - my_pass (vrp_pass_num++) - {} +: gimple_opt_pass (data_, ctxt), data (data_), + warn_array_bounds_p (false), final_p (false) + { +// Only the first VRP pass should run warnings. +if (&data == &pass_data_vrp) + { + warn_array_bounds_p = run_warning_pass; + run_warning_pass = false; + } + } /* opt_pass methods: */ opt_pass * clone () final override { return new pass_vrp (m_ctxt, data); } void set_pass_param (unsigned int n, bool param) final override { gcc_assert (n == 0); - warn_array_bounds_p = param; + final_p = param; } bool gate (function *) final override { return flag_tree_vrp != 0; } unsigned int execute (function *fun) final override { - // Early VRP pass. 
- if (my_pass == 0) - return execute_ranger_vrp (fun, /*warn_array_bounds_p=*/false, false); + if (&data == &pass_data_early_vrp) + return execute_ranger_vrp (fun, /*warn_array_bounds_p=*/false, + /*final_p=*/false); - return execute_ranger_vrp (fun, warn_array_bounds_p, my_pass == 2); + return execute_ranger_vrp (fun, warn_array_bounds_p, final_p); } private: const pass_data &data; bool warn_array_bounds_p; - int my_pass; + bool final_p; }; // class pass_vrp const pass_data pass_data_assumptions = -- 2.41.0
Re: [COMMITTED] Return TRUE only when a global value is updated.
huh. thanks, I'll have a look. Andrew On 10/3/23 11:47, David Edelsohn wrote: This patch caused a bootstrap failure on AIX. during GIMPLE pass: evrp /nasfarm/edelsohn/src/src/libgcc/libgcc2.c: In function '__gcc_bcmp': /nasfarm/edelsohn/src/src/libgcc/libgcc2.c:2910:1: internal compiler error: in get_irange, at value-range-storage.cc:343 2910 | } | ^ 0x11b7f4b7 irange_storage::get_irange(irange&, tree_node*) const /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:343 0x11b7e7af vrange_storage::get_vrange(vrange&, tree_node*) const /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:178 0x139f3d77 range_info_get_range(tree_node const*, vrange&) /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:118 0x1134b463 set_range_info(tree_node*, vrange const&) /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:425 0x116a7333 gimple_ranger::register_inferred_ranges(gimple*) /nasfarm/edelsohn/src/src/gcc/gimple-range.cc:487 0x125cef27 rvrp_folder::fold_stmt(gimple_stmt_iterator*) /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1033 0x123dd063 substitute_and_fold_dom_walker::before_dom_children(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:876 0x1176cc43 dom_walker::walk(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/domwalk.cc:311 0x123dd733 substitute_and_fold_engine::substitute_and_fold(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:999 0x123d0f5f execute_ranger_vrp(function*, bool, bool) /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1062 0x123d14ef execute /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1142
Re: [COMMITTED] Return TRUE only when a global value is updated.
Give this a try.. I'm testing it here, but x86 doesn't seem to show it anyway for some reason :-P I think i needed to handle pointers special since SSA_NAMES handle pointer ranges different. Andrew On 10/3/23 11:47, David Edelsohn wrote: This patch caused a bootstrap failure on AIX. during GIMPLE pass: evrp /nasfarm/edelsohn/src/src/libgcc/libgcc2.c: In function '__gcc_bcmp': /nasfarm/edelsohn/src/src/libgcc/libgcc2.c:2910:1: internal compiler error: in get_irange, at value-range-storage.cc:343 2910 | } | ^ 0x11b7f4b7 irange_storage::get_irange(irange&, tree_node*) const /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:343 0x11b7e7af vrange_storage::get_vrange(vrange&, tree_node*) const /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:178 0x139f3d77 range_info_get_range(tree_node const*, vrange&) /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:118 0x1134b463 set_range_info(tree_node*, vrange const&) /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:425 0x116a7333 gimple_ranger::register_inferred_ranges(gimple*) /nasfarm/edelsohn/src/src/gcc/gimple-range.cc:487 0x125cef27 rvrp_folder::fold_stmt(gimple_stmt_iterator*) /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1033 0x123dd063 substitute_and_fold_dom_walker::before_dom_children(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:876 0x1176cc43 dom_walker::walk(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/domwalk.cc:311 0x123dd733 substitute_and_fold_engine::substitute_and_fold(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:999 0x123d0f5f execute_ranger_vrp(function*, bool, bool) /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1062 0x123d14ef execute /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1142 diff --git a/gcc/tree-ssanames.cc b/gcc/tree-ssanames.cc index 1eae411ac1c..1401f67c781 100644 --- a/gcc/tree-ssanames.cc +++ b/gcc/tree-ssanames.cc @@ -420,15 +420,11 @@ set_range_info (tree name, const vrange &r) // Pick up the current range, or VARYING if none. tree type = TREE_TYPE (name); - Value_Range tmp (type); - if (range_info_p (name)) -range_info_get_range (name, tmp); - else -tmp.set_varying (type); - if (POINTER_TYPE_P (type)) { - if (r.nonzero_p () && !tmp.nonzero_p ()) + struct ptr_info_def *pi = get_ptr_info (name); + // If R is nonnull and pi is not, set nonnull. + if (r.nonzero_p () && (!pi || !pi->pt.null)) { set_ptr_nonnull (name); return true; @@ -436,6 +432,11 @@ set_range_info (tree name, const vrange &r) return false; } + Value_Range tmp (type); + if (range_info_p (name)) +range_info_get_range (name, tmp); + else +tmp.set_varying (type); // If the result doesn't change, or is undefined, return false. if (!tmp.intersect (r) || tmp.undefined_p ()) return false;
Re: [COMMITTED] Return TRUE only when a global value is updated.
perfect. I'll check it in when my testrun is done. Thanks .. . and sorry :-) Andrew On 10/3/23 12:53, David Edelsohn wrote: AIX bootstrap is happier with the patch. Thanks, David On Tue, Oct 3, 2023 at 12:30 PM Andrew MacLeod wrote: Give this a try.. I'm testing it here, but x86 doesn't seem to show it anyway for some reason :-P I think i needed to handle pointers special since SSA_NAMES handle pointer ranges different. Andrew On 10/3/23 11:47, David Edelsohn wrote: > This patch caused a bootstrap failure on AIX. > > during GIMPLE pass: evrp > > /nasfarm/edelsohn/src/src/libgcc/libgcc2.c: In function '__gcc_bcmp': > > /nasfarm/edelsohn/src/src/libgcc/libgcc2.c:2910:1: internal compiler > error: in get_irange, at value-range-storage.cc:343 > > 2910 | } > > | ^ > > > 0x11b7f4b7 irange_storage::get_irange(irange&, tree_node*) const > > /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:343 > > 0x11b7e7af vrange_storage::get_vrange(vrange&, tree_node*) const > > /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:178 > > 0x139f3d77 range_info_get_range(tree_node const*, vrange&) > > /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:118 > > 0x1134b463 set_range_info(tree_node*, vrange const&) > > /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:425 > > 0x116a7333 gimple_ranger::register_inferred_ranges(gimple*) > > /nasfarm/edelsohn/src/src/gcc/gimple-range.cc:487 > > 0x125cef27 rvrp_folder::fold_stmt(gimple_stmt_iterator*) > > /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1033 > > 0x123dd063 > substitute_and_fold_dom_walker::before_dom_children(basic_block_def*) > > /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:876 > > 0x1176cc43 dom_walker::walk(basic_block_def*) > > /nasfarm/edelsohn/src/src/gcc/domwalk.cc:311 > > 0x123dd733 > substitute_and_fold_engine::substitute_and_fold(basic_block_def*) > > /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:999 > > 0x123d0f5f execute_ranger_vrp(function*, bool, bool) > > /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1062 > > 0x123d14ef execute > > /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1142 >
Re: [COMMITTED] Remove pass counting in VRP.
On 10/3/23 13:02, David Malcolm wrote: On Tue, 2023-10-03 at 10:32 -0400, Andrew MacLeod wrote: Pass counting in VRP is used to decide when to call early VRP, pass the flag to enable warnings, and when the final pass is. If you try to add additional passes, this becomes quite fragile. This patch simply chooses the pass based on the data pointer passed in, and remove the pass counter. The first FULL VRP pass invokes the warning code, and the flag passed in now represents the FINAL pass of VRP. There is no longer a global flag which, as it turns out, wasn't working well with the JIT compiler, but when undetected. (Thanks to dmalcolm for helping me sort out what was going on there) Bootstraps on x86_64-pc-linux-gnu with no regressions. Pushed. [CCing jit mailing list] I'm worried that this patch may have "papered over" an issue with libgccjit. Specifically: well, that isnt the patch that was checked in :-P Im not sure how the old version got into the commit note. Attached is the version checked in. commit 7eb5ce7f58ed4a48641e1786e4fdeb2f7fb8c5ff Author: Andrew MacLeod Date: Thu Sep 28 09:19:32 2023 -0400 Remove pass counting in VRP. Rather than using a pass count to decide which parameters are passed to VRP, makemit explicit. * passes.def (pass_vrp): Pass "final pass" flag as parameter. * tree-vrp.cc (vrp_pass_num): Remove. (pass_vrp::my_pass): Remove. (pass_vrp::pass_vrp): Add warn_p as a parameter. (pass_vrp::final_p): New. (pass_vrp::set_pass_param): Set final_p param. (pass_vrp::execute): Call execute_range_vrp with no conditions. (make_pass_vrp): Pass additional parameter. (make_pass_early_vrp): Ditto. diff --git a/gcc/passes.def b/gcc/passes.def index 4110a472914..2bafd60bbfb 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -221,7 +221,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_fre, true /* may_iterate */); NEXT_PASS (pass_merge_phi); NEXT_PASS (pass_thread_jumps_full, /*first=*/true); - NEXT_PASS (pass_vrp, true /* warn_array_bounds_p */); + NEXT_PASS (pass_vrp, false /* final_p*/); NEXT_PASS (pass_dse); NEXT_PASS (pass_dce); /* pass_stdarg is always run and at this point we execute @@ -348,7 +348,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); NEXT_PASS (pass_strlen); NEXT_PASS (pass_thread_jumps_full, /*first=*/false); - NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */); + NEXT_PASS (pass_vrp, true /* final_p */); /* Run CCP to compute alignment and nonzero bits. 
*/ NEXT_PASS (pass_ccp, true /* nonzero_p */); NEXT_PASS (pass_warn_restrict); diff --git a/gcc/tree-vrp.cc b/gcc/tree-vrp.cc index d7b194f5904..4f8c7745461 100644 --- a/gcc/tree-vrp.cc +++ b/gcc/tree-vrp.cc @@ -1120,36 +1120,32 @@ const pass_data pass_data_early_vrp = ( TODO_cleanup_cfg | TODO_update_ssa | TODO_verify_all ), }; -static int vrp_pass_num = 0; class pass_vrp : public gimple_opt_pass { public: - pass_vrp (gcc::context *ctxt, const pass_data &data_) -: gimple_opt_pass (data_, ctxt), data (data_), warn_array_bounds_p (false), - my_pass (vrp_pass_num++) - {} + pass_vrp (gcc::context *ctxt, const pass_data &data_, bool warn_p) +: gimple_opt_pass (data_, ctxt), data (data_), + warn_array_bounds_p (warn_p), final_p (false) +{ } /* opt_pass methods: */ - opt_pass * clone () final override { return new pass_vrp (m_ctxt, data); } + opt_pass * clone () final override +{ return new pass_vrp (m_ctxt, data, false); } void set_pass_param (unsigned int n, bool param) final override { gcc_assert (n == 0); - warn_array_bounds_p = param; + final_p = param; } bool gate (function *) final override { return flag_tree_vrp != 0; } unsigned int execute (function *fun) final override { - // Early VRP pass. - if (my_pass == 0) - return execute_ranger_vrp (fun, /*warn_array_bounds_p=*/false, false); - - return execute_ranger_vrp (fun, warn_array_bounds_p, my_pass == 2); + return execute_ranger_vrp (fun, warn_array_bounds_p, final_p); } private: const pass_data &data; bool warn_array_bounds_p; - int my_pass; + bool final_p; }; // class pass_vrp const pass_data pass_data_assumptions = @@ -1219,13 +1215,13 @@ public: gimple_opt_pass * make_pass_vrp (gcc::context *ctxt) { - return new pass_vrp (ctxt, pass_data_vrp); + return new pass_vrp (ctxt, pass_data_vrp, true); } gimple_opt_pass * make_pass_early_vrp (gcc::context *ctxt) { - return new pass_vrp (ctxt, pass_data_early_vrp); + return new pass_vrp (ctxt, pass_data_early_vrp, false); } gimple_opt_pass *
[COMMITTED] Don't use range_info_get_range for pointers.
Properly check for pointers instead of just using range_info_get_range. Bootstrapped on x86_64-pc-linux-gnu (and presumably AIX too :-) with no regressions. On 10/3/23 12:53, David Edelsohn wrote: AIX bootstrap is happier with the patch. Thanks, David commit d8808c37d29110872fa51b98e71aef9e160b4692 Author: Andrew MacLeod Date: Tue Oct 3 12:32:10 2023 -0400 Don't use range_info_get_range for pointers. Pointers only track null and nonnull, so we need to handle them specially. * tree-ssanames.cc (set_range_info): Use get_ptr_info for pointers rather than range_info_get_range. diff --git a/gcc/tree-ssanames.cc b/gcc/tree-ssanames.cc index 1eae411ac1c..0a32444fbdf 100644 --- a/gcc/tree-ssanames.cc +++ b/gcc/tree-ssanames.cc @@ -420,15 +420,11 @@ set_range_info (tree name, const vrange &r) // Pick up the current range, or VARYING if none. tree type = TREE_TYPE (name); - Value_Range tmp (type); - if (range_info_p (name)) -range_info_get_range (name, tmp); - else -tmp.set_varying (type); - if (POINTER_TYPE_P (type)) { - if (r.nonzero_p () && !tmp.nonzero_p ()) + struct ptr_info_def *pi = get_ptr_info (name); + // If R is nonnull and pi is not, set nonnull. + if (r.nonzero_p () && (!pi || pi->pt.null)) { set_ptr_nonnull (name); return true; @@ -436,6 +432,11 @@ set_range_info (tree name, const vrange &r) return false; } + Value_Range tmp (type); + if (range_info_p (name)) +range_info_get_range (name, tmp); + else +tmp.set_varying (type); // If the result doesn't change, or is undefined, return false. if (!tmp.intersect (r) || tmp.undefined_p ()) return false;
Re: [PATCH] ipa: Self-DCE of uses of removed call LHSs (PR 108007)
On Wed, Oct 4, 2023 at 5:08 PM Maciej W. Rozycki wrote: > > On Tue, 3 Oct 2023, Martin Jambor wrote: > > > > SSA graph may be deep so this may cause stack overflow, so I think we > > > should use worklist here (it is also easy to do). > > > > > > OK with that change. > > > Honza > > > > I have just committed the following after a bootstrap and testing on > > x86_64-linux. > > This has regressed the native `powerpc64le-linux-gnu' configuration, > which doesn't bootstrap here anymore: > > Comparing stages 2 and 3 > Bootstrap comparison failure! > powerpc64le-linux-gnu/libstdc++-v3/src/compatibility-ldbl.o differs > powerpc64le-linux-gnu/libstdc++-v3/src/.libs/compatibility-ldbl.o differs > > I have double-checked this is indeed the offending commit, the compiler > bootstraps just fine as at commit 7eb5ce7f58ed ("Remove pass counting in > VRP."). > > Shall I file a PR, or can you handle it regardless? Let me know if you > need anything from me. It is already filed as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111688 . Thanks, Andrew > > Maciej
Re: [PATCH]AArch64 Handle copysign (x, -1) expansion efficiently
_const_vec_duplicate > (operands[2])); > + if (-1 == real_to_integer (r0)) Likewise. > + { > + emit_insn (gen_ior3 (int_res, arg1, v_sign_bitmask)); > + emit_move_insn (operands[0], gen_lowpart (mode, int_res)); > + DONE; > + } > + } > + > +operands[2] = force_reg (mode, operands[2]); > +emit_insn (gen_and3 (sign, arg2, v_sign_bitmask)); > emit_insn (gen_and3 >(mant, arg1, > aarch64_simd_gen_const_vector_dup (mode, > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md > index > 24349ecdbbab875f21975f116732a9e53762d4c1..d6c581ad81615b4feb095391cbcf4f5b78fa72f1 > 100644 > --- a/gcc/config/aarch64/aarch64.md > +++ b/gcc/config/aarch64/aarch64.md > @@ -6940,12 +6940,25 @@ (define_expand "lrint2" > (define_expand "copysign3" >[(match_operand:GPF 0 "register_operand") > (match_operand:GPF 1 "register_operand") > - (match_operand:GPF 2 "register_operand")] > + (match_operand:GPF 2 "nonmemory_operand")] >"TARGET_SIMD" > { > - rtx bitmask = gen_reg_rtx (mode); > + machine_mode int_mode = mode; > + rtx bitmask = gen_reg_rtx (int_mode); >emit_move_insn (bitmask, GEN_INT (HOST_WIDE_INT_M1U > << (GET_MODE_BITSIZE (mode) - 1))); > + /* copysign (x, -1) should instead be expanded as orr with the sign > + bit. */ > + auto r0 = CONST_DOUBLE_REAL_VALUE (operands[2]); > + if (-1 == real_to_integer (r0)) Likewise. Thanks, Andrew > +{ > + emit_insn (gen_ior3 ( > + lowpart_subreg (int_mode, operands[0], mode), > + lowpart_subreg (int_mode, operands[1], mode), bitmask)); > + DONE; > +} > + > + operands[2] = force_reg (mode, operands[2]); >emit_insn (gen_copysign3_insn (operands[0], operands[1], operands[2], >bitmask)); >DONE; > > > > > --
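For reference, the special case being reviewed here rests on the fact that copysign (x, -1.0) only needs to force the sign bit on; a hedged standalone C sketch of that bit-level equivalence (not the patch's code):

#include <stdint.h>
#include <string.h>

/* copysign (x, -1.0), i.e. -fabs (x), expressed as an OR of the sign bit.  */
static double
copysign_neg1 (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);   /* view the double as raw bits */
  bits |= UINT64_C (1) << 63;        /* force the sign bit on */
  memcpy (&x, &bits, sizeof x);
  return x;
}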
[COMMITTED 2/3] Add a dom based ranger for fast VRP.
This patch adds a DOM based ranger that is intended to be used by a dom walk pass and provides basic ranges. It utilizes the new GORI edge API to find outgoing ranges on edges, and combines these with any ranges calculated during the walk up to this point. When a query is made for a range not defined in the current block, a quick dom walk is performed looking for a range either on a single-pred incoming edge or defined in the block. Its about twice the speed of current EVRP, and although there is a bit of room to improve both memory usage and speed, I'll leave that until I either get around to it or we elect to use it and it becomes more important. It also serves as a POC for anyone wanting to use the new GORI API to use edge ranges, as well as a potentially different fast VRP more similar to the old EVRP. This version performs more folding of PHI nodes as it has all the info on incoming edges, but at a slight cost, mostly memory. It does no relation processing as yet. It has been bootstrapped running right after EVRP, and as a replacement for EVRP, and since it uses existing machinery, should be reasonably solid. It is currently not invoked from anywhere. Pushed. Andrew From ad8cd713b4e489826e289551b8b8f8f708293a5b Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Fri, 28 Jul 2023 13:18:15 -0400 Subject: [PATCH 2/3] Add a dom based ranger for fast VRP. Provide a dominator based implementation of a range query. * gimple_range.cc (dom_ranger::dom_ranger): New. (dom_ranger::~dom_ranger): New. (dom_ranger::range_of_expr): New. (dom_ranger::edge_range): New. (dom_ranger::range_on_edge): New. (dom_ranger::range_in_bb): New. (dom_ranger::range_of_stmt): New. (dom_ranger::maybe_push_edge): New. (dom_ranger::pre_bb): New. (dom_ranger::post_bb): New. * gimple-range.h (class dom_ranger): New. --- gcc/gimple-range.cc | 300 gcc/gimple-range.h | 28 + 2 files changed, 328 insertions(+) diff --git a/gcc/gimple-range.cc b/gcc/gimple-range.cc index 13c3308d537..5e9bb397a20 100644 --- a/gcc/gimple-range.cc +++ b/gcc/gimple-range.cc @@ -928,3 +928,303 @@ assume_query::dump (FILE *f) } fprintf (f, "--\n"); } + +// --- + + +// Create a DOM based ranger for use by a DOM walk pass. + +dom_ranger::dom_ranger () : m_global (), m_out () +{ + m_freelist.create (0); + m_freelist.truncate (0); + m_e0.create (0); + m_e0.safe_grow_cleared (last_basic_block_for_fn (cfun)); + m_e1.create (0); + m_e1.safe_grow_cleared (last_basic_block_for_fn (cfun)); + m_pop_list = BITMAP_ALLOC (NULL); + if (dump_file && (param_ranger_debug & RANGER_DEBUG_TRACE)) +tracer.enable_trace (); +} + +// Dispose of a DOM ranger. + +dom_ranger::~dom_ranger () +{ + if (dump_file && (dump_flags & TDF_DETAILS)) +{ + fprintf (dump_file, "Non-varying global ranges:\n"); + fprintf (dump_file, "=:\n"); + m_global.dump (dump_file); +} + BITMAP_FREE (m_pop_list); + m_e1.release (); + m_e0.release (); + m_freelist.release (); +} + +// Implement range of EXPR on stmt S, and return it in R. +// Return false if no range can be calculated. 
+ +bool +dom_ranger::range_of_expr (vrange &r, tree expr, gimple *s) +{ + unsigned idx; + if (!gimple_range_ssa_p (expr)) +return get_tree_range (r, expr, s); + + if ((idx = tracer.header ("range_of_expr "))) +{ + print_generic_expr (dump_file, expr, TDF_SLIM); + if (s) + { + fprintf (dump_file, " at "); + print_gimple_stmt (dump_file, s, 0, TDF_SLIM); + } + else + fprintf (dump_file, "\n"); +} + + if (s) +range_in_bb (r, gimple_bb (s), expr); + else +m_global.range_of_expr (r, expr, s); + + if (idx) +tracer.trailer (idx, " ", true, expr, r); + return true; +} + + +// Return TRUE and the range if edge E has a range set for NAME in +// block E->src. + +bool +dom_ranger::edge_range (vrange &r, edge e, tree name) +{ + bool ret = false; + basic_block bb = e->src; + + // Check if BB has any outgoing ranges on edge E. + ssa_lazy_cache *out = NULL; + if (EDGE_SUCC (bb, 0) == e) +out = m_e0[bb->index]; + else if (EDGE_SUCC (bb, 1) == e) +out = m_e1[bb->index]; + + // If there is an edge vector and it has a range, pick it up. + if (out && out->has_range (name)) +ret = out->get_range (r, name); + + return ret; +} + + +// Return the range of EXPR on edge E in R. +// Return false if no range can be calculated. + +bool +dom_ranger::range_on_edge (vrange &r, edge e, tree expr) +{ + basic_block bb = e->src; + unsigned idx; + if ((idx = tracer.header ("range_on_edge "))) +{ + fprintf (dump_file, "%d->%d for ",e->src->index, e->d
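To make the intended use concrete, here is a hedged sketch of a DOM-order driver for the new class (the walker wrapper is hypothetical; pre_bb, range_of_stmt and post_bb are the methods added above):

/* Hypothetical dom-walk wrapper driving dom_ranger.  */
class fast_vrp_dom_walker : public dom_walker
{
public:
  fast_vrp_dom_walker (dom_ranger &dr)
    : dom_walker (CDI_DOMINATORS), m_dr (dr) {}

  edge before_dom_children (basic_block bb) final override
  {
    m_dr.pre_bb (bb);                     // pick up incoming edge ranges
    for (auto gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
      {
        gimple *stmt = gsi_stmt (gsi);
        tree lhs = gimple_get_lhs (stmt);
        if (lhs && TREE_CODE (lhs) == SSA_NAME)
          {
            Value_Range r (TREE_TYPE (lhs));
            m_dr.range_of_stmt (r, stmt, lhs);   // evaluate and record
          }
      }
    m_dr.post_bb (bb);                    // record ranges for outgoing edges
    return NULL;
  }

private:
  dom_ranger &m_dr;
};

// ...and in the pass body:
//   dom_ranger dr;
//   fast_vrp_dom_walker walker (dr);
//   walker.walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));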
[COMMITTED 1/3] Add outgoing range vector calculation API.
This patch adds 2 routines that can be called to generate GORI information. The primary API is: bool gori_on_edge (class ssa_cache &r, edge e, range_query *query = NULL, gimple_outgoing_range *ogr = NULL); This will populate an ssa-cache R with any ranges that are generated by edge E. It will use QUERY, if provided, to satisfy any incoming values. If OGR is provided, it is used to pick up hard edge values, like TRUE, FALSE, or switch edges. It currently only works for TRUE/FALSE conditionals, and doesn't try to solve complex logical combinations, e.g. (a < 6 && b > 6) || (a > 10 || b < 3), as those can get exponential and require multiple evaluations of the IL to satisfy. It will fully utilize range-ops however, and so comes up with many of the ranges ranger does. It also provides the "raw" ranges on the edge, i.e. it doesn't try to figure out anything outside the current basic block, but rather reflects exactly what the edge indicates. For example: <bb 2> : x.0_1 = (unsigned int) x_20(D); _2 = x.0_1 + 4294967292; if (_2 > 4) goto <bb 3>; [INV] else goto <bb 4>; [INV] produces Edge ranges BB 2->3 x.0_1 : [irange] unsigned int [0, 3][9, +INF] _2 : [irange] unsigned int [5, +INF] x_20(D) : [irange] int [-INF, 3][9, +INF] Edge ranges BB 2->4 x.0_1 : [irange] unsigned int [4, 8] MASK 0xf VALUE 0x0 _2 : [irange] unsigned int [0, 4] x_20(D) : [irange] int [4, 8] MASK 0xf VALUE 0x0 It performs a linear walk through just the required statements, so each of the above vectors is generated by visiting each of the 3 statements exactly once, so it's pretty quick. The other entry point is: bool gori_name_on_edge (vrange &r, tree name, edge e, range_query *q); This does basically the same thing, except it only looks at whether NAME has a range, and returns it if it does, with no other overhead. Pushed. From 52c1e2c805bc2fd7a30583dce3608b738f3a5ce4 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Tue, 15 Aug 2023 17:29:58 -0400 Subject: [PATCH 1/3] Add outgoing range vector calculation API Provide a GORI API which can produce a range vector for all outgoing ranges on an edge without any of the other infrastructure. * gimple-range-gori.cc (gori_stmt_info::gori_stmt_info): New. (gori_calc_operands): New. (gori_on_edge): New. (gori_name_helper): New. (gori_name_on_edge): New. * gimple-range-gori.h (gori_on_edge): New prototype. (gori_name_on_edge): New prototype. --- gcc/gimple-range-gori.cc | 213 +++ gcc/gimple-range-gori.h | 15 +++ 2 files changed, 228 insertions(+) diff --git a/gcc/gimple-range-gori.cc b/gcc/gimple-range-gori.cc index 2694e551d73..1b5eda43390 100644 --- a/gcc/gimple-range-gori.cc +++ b/gcc/gimple-range-gori.cc @@ -1605,3 +1605,216 @@ gori_export_iterator::get_name () } return NULL_TREE; } + +// This is a helper class to set up STMT with a known LHS for further GORI +// processing. + +class gori_stmt_info : public gimple_range_op_handler +{ +public: + gori_stmt_info (vrange &lhs, gimple *stmt, range_query *q); + Value_Range op1_range; + Value_Range op2_range; + tree ssa1; + tree ssa2; +}; + + +// Uses query Q to get the known ranges on STMT with a LHS range +// for op1_range and op2_range and set ssa1 and ssa2 if either or both of +// those operands are SSA_NAMES. + +gori_stmt_info::gori_stmt_info (vrange &lhs, gimple *stmt, range_query *q) + : gimple_range_op_handler (stmt) +{ + ssa1 = NULL; + ssa2 = NULL; + // Don't handle switches as yet for vector processing. + if (is_a <gswitch *> (stmt)) +return; + + // No further processing for VARYING or undefined. 
+ if (lhs.undefined_p () || lhs.varying_p ()) +return; + + // If there is no range-op handler, we are also done. + if (!*this) +return; + + // Only evaluate logical cases if both operands must be the same as the LHS. + // Otherwise its becomes exponential in time, as well as more complicated. + if (is_gimple_logical_p (stmt)) +{ + gcc_checking_assert (range_compatible_p (lhs.type (), boolean_type_node)); + enum tree_code code = gimple_expr_code (stmt); + if (code == TRUTH_OR_EXPR || code == BIT_IOR_EXPR) + { + // [0, 0] = x || y means both x and y must be zero. + if (!lhs.singleton_p () || !lhs.zero_p ()) + return; + } + else if (code == TRUTH_AND_EXPR || code == BIT_AND_EXPR) + { + // [1, 1] = x && y means both x and y must be one. + if (!lhs.singleton_p () || lhs.zero_p ()) + return; + } +} + + tree op1 = operand1 (); + tree op2 = operand2 (); + ssa1 = gimple_range_ssa_p (op1); + ssa2 = gimple_range_ssa_p (op2); + // If both operands are the same, only process one of them. + if (ssa1 && ssa1 == ssa2) +ssa2 = NULL_TREE; + + // Extract current ranges for the operands. + fur_stmt src (stmt, q); + if (op1) +{ + op1_range.set_type (TREE_TYPE (op1)); + src.get_operand (op1_range, o
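And a hedged sketch of the lighter-weight entry point, querying a single name on a single edge (the dump wrapper around it is hypothetical; the prototype is the one added above):

  /* Inside a pass that already has a range_query available.  */
  Value_Range r (TREE_TYPE (name));
  if (gori_name_on_edge (r, name, e, get_range_query (cfun)))
    {
      fprintf (dump_file, "%d->%d : ", e->src->index, e->dest->index);
      print_generic_expr (dump_file, name, TDF_SLIM);
      fprintf (dump_file, " : ");
      r.dump (dump_file);
      fprintf (dump_file, "\n");
    }
  /* else NAME's range is not refined by edge E.  */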
[COMMITTED 3/3] Create a fast VRP pass
This patch adds a fast VRP pass. It is not invoked from anywhere, so should cause no issues. If you want to utilize it, simply add a new pass, ie: --- a/gcc/passes.def +++ b/gcc/passes.def @@ -92,6 +92,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_phiprop); NEXT_PASS (pass_fre, true /* may_iterate */); NEXT_PASS (pass_early_vrp); + NEXT_PASS (pass_fast_vrp); NEXT_PASS (pass_merge_phi); NEXT_PASS (pass_dse); NEXT_PASS (pass_cd_dce, false /* update_address_taken_p */); it will generate a dump file with the extension .fvrp. pushed. From f4e2dac53fd62fbf2af95e0bf26d24e929fa1f66 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Mon, 2 Oct 2023 18:32:49 -0400 Subject: [PATCH 3/3] Create a fast VRP pass * timevar.def (TV_TREE_FAST_VRP): New. * tree-pass.h (make_pass_fast_vrp): New prototype. * tree-vrp.cc (class fvrp_folder): New. (fvrp_folder::fvrp_folder): New. (fvrp_folder::~fvrp_folder): New. (fvrp_folder::value_of_expr): New. (fvrp_folder::value_on_edge): New. (fvrp_folder::value_of_stmt): New. (fvrp_folder::pre_fold_bb): New. (fvrp_folder::post_fold_bb): New. (fvrp_folder::pre_fold_stmt): New. (fvrp_folder::fold_stmt): New. (execute_fast_vrp): New. (pass_data_fast_vrp): New. (pass_vrp:execute): Check for fast VRP pass. (make_pass_fast_vrp): New. --- gcc/timevar.def | 1 + gcc/tree-pass.h | 1 + gcc/tree-vrp.cc | 124 3 files changed, 126 insertions(+) diff --git a/gcc/timevar.def b/gcc/timevar.def index 9523598f60e..d21b08c030d 100644 --- a/gcc/timevar.def +++ b/gcc/timevar.def @@ -160,6 +160,7 @@ DEFTIMEVAR (TV_TREE_TAIL_MERGE , "tree tail merge") DEFTIMEVAR (TV_TREE_VRP , "tree VRP") DEFTIMEVAR (TV_TREE_VRP_THREADER , "tree VRP threader") DEFTIMEVAR (TV_TREE_EARLY_VRP, "tree Early VRP") +DEFTIMEVAR (TV_TREE_FAST_VRP , "tree Fast VRP") DEFTIMEVAR (TV_TREE_COPY_PROP, "tree copy propagation") DEFTIMEVAR (TV_FIND_REFERENCED_VARS , "tree find ref. vars") DEFTIMEVAR (TV_TREE_PTA , "tree PTA") diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h index eba2d54ac76..9c4b1e4185c 100644 --- a/gcc/tree-pass.h +++ b/gcc/tree-pass.h @@ -470,6 +470,7 @@ extern gimple_opt_pass *make_pass_check_data_deps (gcc::context *ctxt); extern gimple_opt_pass *make_pass_copy_prop (gcc::context *ctxt); extern gimple_opt_pass *make_pass_isolate_erroneous_paths (gcc::context *ctxt); extern gimple_opt_pass *make_pass_early_vrp (gcc::context *ctxt); +extern gimple_opt_pass *make_pass_fast_vrp (gcc::context *ctxt); extern gimple_opt_pass *make_pass_vrp (gcc::context *ctxt); extern gimple_opt_pass *make_pass_assumptions (gcc::context *ctxt); extern gimple_opt_pass *make_pass_uncprop (gcc::context *ctxt); diff --git a/gcc/tree-vrp.cc b/gcc/tree-vrp.cc index 4f8c7745461..19d8f995d70 100644 --- a/gcc/tree-vrp.cc +++ b/gcc/tree-vrp.cc @@ -1092,6 +1092,106 @@ execute_ranger_vrp (struct function *fun, bool warn_array_bounds_p, return 0; } +// Implement a Fast VRP folder. Not quite as effective but faster. + +class fvrp_folder : public substitute_and_fold_engine +{ +public: + fvrp_folder (dom_ranger *dr) : substitute_and_fold_engine (), + m_simplifier (dr) + { m_dom_ranger = dr; } + + ~fvrp_folder () { } + + tree value_of_expr (tree name, gimple *s = NULL) override + { +// Shortcircuit subst_and_fold callbacks for abnormal ssa_names. +if (TREE_CODE (name) == SSA_NAME && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (name)) + return NULL; +return m_dom_ranger->value_of_expr (name, s); + } + + tree value_on_edge (edge e, tree name) override + { +// Shortcircuit subst_and_fold callbacks for abnormal ssa_names. 
+if (TREE_CODE (name) == SSA_NAME && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (name)) + return NULL; +return m_dom_ranger->value_on_edge (e, name); + } + + tree value_of_stmt (gimple *s, tree name = NULL) override + { +// Shortcircuit subst_and_fold callbacks for abnormal ssa_names. +if (TREE_CODE (name) == SSA_NAME && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (name)) + return NULL; +return m_dom_ranger->value_of_stmt (s, name); + } + + void pre_fold_bb (basic_block bb) override + { +m_dom_ranger->pre_bb (bb); +// Now process the PHIs in advance. +gphi_iterator psi = gsi_start_phis (bb); +for ( ; !gsi_end_p (psi); gsi_next (&psi)) + { + tree name = gimple_range_ssa_p (PHI_RESULT (psi.phi ())); + if (name) + { + Value_Range vr(TREE_TYPE (name)); + m_dom_ranger->range_of_stmt (vr, psi.phi (), name); + } + } + } + + void post_fold_bb (basic_block bb) override + { +m_dom_ranger->post_bb (bb); + } + + void pre_fold_stmt (gimple *s) override + { +// Ensure range_of_stmt has been called
[COMMITTED 0/3] Add a FAST VRP pass.
The following set of 3 patches provides the infrastructure for a fast vrp pass. The pass is currently not invoked anywhere, but I wanted to get the infrastructure bits in place now... just in case we want to use it somewhere. It clearly bootstraps with no regressions since it isn't being invoked :-) I have however bootstrapped it with calls to the new fast-vrp pass immediately following EVRP, and as an EVRP replacement. This is primarily to ensure it isn't doing anything harmful. That is a test of sorts :-). I also ran it instead of EVRP, and it bootstraps, but does trigger a few regressions, all related to relation processing, which it doesn't do. Patch one provides a new API for GORI which simply provides a list of all the ranges that it can generate on an outgoing edge. It utilizes the sparse ssa-cache, and simply sets the outgoing range as determined by the edge. It's very efficient, only walking up the chain once and not generating any other utility structures. This provides fast and easy access to any info an edge may provide. There is a second API for querying a specific name instead of asking for all the ranges. It should be pretty solid as it simply invokes range-ops and other components the same way the larger GORI engine does, it just puts them together in a different way. Patch 2 is the new DOM ranger. It assumes it will be called in DOM order, and evaluates the statements, and tracks any ranges on outgoing edges. Queries for ranges walk the dom tree looking for a range until it finds one on an edge or hits the definition block. There are additional efficiencies that can be employed, and I'll eventually get back to them. Patch 3 is the FAST VRP pass and folder. It's pretty straightforward, invokes the new DOM ranger, and enables you to add NEXT_PASS (pass_fast_vrp) in passes.def. Timewise, it is currently about twice as fast as EVRP. It does basic range evaluation and folds PHIs, etc. It does *not* do relation processing or any of the fancier things we do (like statement side effects). A little additional work can reduce the memory footprint further too. I have done no experiments as yet as to the cost of adding relations, but it would be pretty straightforward as it is just reusing all the same components the main ranger does. Andrew
Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
On Thu, Oct 5, 2023 at 12:48 PM Tamar Christina wrote: > > > -Original Message- > > From: Richard Sandiford > > Sent: Thursday, October 5, 2023 8:29 PM > > To: Tamar Christina > > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw > > ; Marcus Shawcroft > > ; Kyrylo Tkachov > > Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign. > > > > Tamar Christina writes: > > > Hi All, > > > > > > This adds an implementation for masked copysign along with an > > > optimized pattern for masked copysign (x, -1). > > > > It feels like we're ending up with a lot of AArch64-specific code that just > > hard- > > codes the observation that changing the sign is equivalent to changing the > > top > > bit. We then need to make sure that we choose the best way of changing the > > top bit for any given situation. > > > > Hard-coding the -1/negative case is one instance of that. But it looks > > like we > > also fail to use the best sequence for SVE2. E.g. > > [https://godbolt.org/z/ajh3MM5jv]: > > > > #include > > > > void f(double *restrict a, double *restrict b) { > > for (int i = 0; i < 100; ++i) > > a[i] = __builtin_copysign(a[i], b[i]); } > > > > void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) { > > for (int i = 0; i < 100; ++i) > > a[i] = (a[i] & ~c) | (b[i] & c); } > > > > gives: > > > > f: > > mov x2, 0 > > mov w3, 100 > > whilelo p7.d, wzr, w3 > > .L2: > > ld1dz30.d, p7/z, [x0, x2, lsl 3] > > ld1dz31.d, p7/z, [x1, x2, lsl 3] > > and z30.d, z30.d, #0x7fff > > and z31.d, z31.d, #0x8000 > > orr z31.d, z31.d, z30.d > > st1dz31.d, p7, [x0, x2, lsl 3] > > incdx2 > > whilelo p7.d, w2, w3 > > b.any .L2 > > ret > > g: > > mov x3, 0 > > mov w4, 100 > > mov z29.d, x2 > > whilelo p7.d, wzr, w4 > > .L6: > > ld1dz30.d, p7/z, [x0, x3, lsl 3] > > ld1dz31.d, p7/z, [x1, x3, lsl 3] > > bsl z31.d, z31.d, z30.d, z29.d > > st1dz31.d, p7, [x0, x3, lsl 3] > > incdx3 > > whilelo p7.d, w3, w4 > > b.any .L6 > > ret > > > > I saw that you originally tried to do this in match.pd and that the > > decision was > > to fold to copysign instead. But perhaps there's a compromise where isel > > does > > something with the (new) copysign canonical form? > > I.e. could we go with your new version of the match.pd patch, and add some > > isel stuff as a follow-on? > > > > Sure if that's what's desired But.. > > The example you posted above is for instance worse for x86 > https://godbolt.org/z/x9ccqxW6T > where the first operation has a dependency chain of 2 and the latter of 3. > It's likely any > open coding of this operation is going to hurt a target. But that is because it is not using andn when it should be. That would be https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94790 (scalar fix but not vector) and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 IIRC. AARCH64 already has a pattern to match the above which is why it works there but not x86_64. Thanks, Andrew > > So I'm unsure what isel transform this into... > > Tamar > > > Not saying no to this patch, just thought that the above was worth > > considering. > > > > [I agree with Andrew's comments FWIW.] > > > > Thanks, > > Richard > > > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > > > > > Ok for master? > > > > > > Thanks, > > > Tamar > > > > > > gcc/ChangeLog: > > > > > > PR tree-optimization/109154 > > > * config/aarch64/aarch64-sve.md (cond_copysign): New. > > > > > > gcc/testsuite/ChangeLog: > > > > > > PR tree-optimization/109154 > > > * gcc.target/aarch64/sve/fneg-abs_5.c: New test. 
> > > > > > --- inline copy of patch -- > > > diff --git a/gcc/config/aarch64/aarch64-sve.md > > > b/gcc/config/aarch64/aarch64-sve.md > > > index > > > > > 071400c820a5b106ddf9dc9faebb117975d74ea0..00ca30c24624dc661254 > > 568f45b6 > > > 1a14aa11c305 1006
[PATCH] MATCH: Fix infinite loop between `vec_cond(vec_cond(a, b, 0), c, d)` and `a & b`
Match has a pattern which converts `vec_cond(vec_cond(a,b,0), c, d)` into `vec_cond(a & b, c, d)` but since in this case a is a comparison fold will change `a & b` back into `vec_cond(a,b,0)` which causes an infinite loop. The best way to fix this is to enable the patterns for vec_cond(*,vec_cond,*) only for GIMPLE so we don't get an infinite loop for fold any more. Note this is a latent bug since these patterns were added in r11-2577-g229752afe3156a and was exposed by r14-3350-g47b833a9abe1 where now able to remove a VIEW_CONVERT_EXPR. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR middle-end/111699 gcc/ChangeLog: * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): Enable only for GIMPLE. gcc/testsuite/ChangeLog: * gcc.c-torture/compile/pr111699-1.c: New test. --- gcc/match.pd | 5 + gcc/testsuite/gcc.c-torture/compile/pr111699-1.c | 7 +++ 2 files changed, 12 insertions(+) create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr111699-1.c diff --git a/gcc/match.pd b/gcc/match.pd index 4bdd83e6e06..31bfd8b6b68 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -5045,6 +5045,10 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) /* (v ? w : 0) ? a : b is just (v & w) ? a : b Currently disabled after pass lvec because ARM understands VEC_COND_EXPR but not a plain v==w fed to BIT_IOR_EXPR. */ +#if GIMPLE +/* These can only be done in gimple as fold likes to convert: + (CMP) & N into (CMP) ? N : 0 + and we try to match the same pattern again and again. */ (simplify (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2) (if (optimize_vectors_before_lowering_p () && types_match (@0, @3)) @@ -5079,6 +5083,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (vec_cond @0 @3 (vec_cond:s @1 @2 @3)) (if (optimize_vectors_before_lowering_p () && types_match (@0, @1)) (vec_cond (bit_and (bit_not @0) @1) @2 @3))) +#endif /* Canonicalize mask ? { 0, ... } : { -1, ...} to ~mask if the mask types are compatible. */ diff --git a/gcc/testsuite/gcc.c-torture/compile/pr111699-1.c b/gcc/testsuite/gcc.c-torture/compile/pr111699-1.c new file mode 100644 index 000..87b127ed199 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/compile/pr111699-1.c @@ -0,0 +1,7 @@ +typedef unsigned char __attribute__((__vector_size__ (8))) V; + +void +foo (V *v) +{ + *v = (V) 0x107B9A7FF >= (*v <= 0); +} -- 2.39.3
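For reference, the oscillation described above looks roughly like this (GIMPLE-style sketch; a is a vector comparison, the SSA names are hypothetical):

  _1 = VEC_COND_EXPR <a, b, { 0, ... }>;
  x  = VEC_COND_EXPR <_1, c, d>;
    --> match.pd:  _2 = a & b;   x = VEC_COND_EXPR <_2, c, d>;
    --> fold:      a & b  becomes  VEC_COND_EXPR <a, b, { 0, ... }>   (since a is a comparison)

at which point the first pattern matches again, and fold and match.pd keep undoing each other. Restricting the vec_cond-of-vec_cond patterns to GIMPLE breaks the cycle.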
[committed] amdgcn: silence warning
I've just committed this simple patch to silence an enum warning. Andrewamdgcn: silence warning gcc/ChangeLog: * config/gcn/gcn.cc (print_operand): Adjust xcode type to fix warning. diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc index f6cff659703..ef3b6472a52 100644 --- a/gcc/config/gcn/gcn.cc +++ b/gcc/config/gcn/gcn.cc @@ -6991,7 +6991,7 @@ print_operand_address (FILE *file, rtx mem) void print_operand (FILE *file, rtx x, int code) { - int xcode = x ? GET_CODE (x) : 0; + rtx_code xcode = x ? GET_CODE (x) : UNKNOWN; bool invert = false; switch (code) {
[committed] amdgcn: switch mov insns to compact syntax
I've just committed this patch. It should have no functional changes except to make it easier to add new alternatives into the alternative-heavy move instructions. Andrewamdgcn: switch mov insns to compact syntax The move instructions typically have many alternatives (and I'm about to add more) so are good candidates for the new syntax. This patch only converts the patterns where there are no significant changes to the generated files. The other patterns can be converted another time. gcc/ChangeLog: * config/gcn/gcn-valu.md (*mov): Convert to compact syntax. (mov_exec): Likewise. (mov_sgprbase): Likewise. * config/gcn/gcn.md (*mov_insn): Likewise. (*movti_insn): Likewise. diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md index 284dda73da9..32b170e8522 100644 --- a/gcc/config/gcn/gcn-valu.md +++ b/gcc/config/gcn/gcn-valu.md @@ -457,23 +457,21 @@ (define_insn "*mov" (set_attr "length" "4,8")]) (define_insn "mov_exec" - [(set (match_operand:V_1REG 0 "nonimmediate_operand" "=v, v, v, v, v, m") + [(set (match_operand:V_1REG 0 "nonimmediate_operand") (vec_merge:V_1REG - (match_operand:V_1REG 1 "general_operand" "vA, B, v,vA, m, v") - (match_operand:V_1REG 2 "gcn_alu_or_unspec_operand" -"U0,U0,vA,vA,U0,U0") - (match_operand:DI 3 "register_operand" " e, e,cV,Sv, e, e"))) - (clobber (match_scratch: 4 "=X, X, X, X,&v,&v"))] + (match_operand:V_1REG 1 "general_operand") + (match_operand:V_1REG 2 "gcn_alu_or_unspec_operand") + (match_operand:DI 3 "register_operand"))) + (clobber (match_scratch: 4))] "!MEM_P (operands[0]) || REG_P (operands[1])" - "@ - v_mov_b32\t%0, %1 - v_mov_b32\t%0, %1 - v_cndmask_b32\t%0, %2, %1, vcc - v_cndmask_b32\t%0, %2, %1, %3 - # - #" - [(set_attr "type" "vop1,vop1,vop2,vop3a,*,*") - (set_attr "length" "4,8,4,8,16,16")]) + {@ [cons: =0, 1, 2, 3, =4; attrs: type, length] + [v,vA,U0,e ,X ;vop1 ,4 ] v_mov_b32\t%0, %1 + [v,B ,U0,e ,X ;vop1 ,8 ] v_mov_b32\t%0, %1 + [v,v ,vA,cV,X ;vop2 ,4 ] v_cndmask_b32\t%0, %2, %1, vcc + [v,vA,vA,Sv,X ;vop3a,8 ] v_cndmask_b32\t%0, %2, %1, %3 + [v,m ,U0,e ,&v;*,16] # + [m,v ,U0,e ,&v;*,16] # + }) ; This variant does not accept an unspec, but does permit MEM ; read/modify/write which is necessary for maskstore. 
@@ -644,19 +642,18 @@ (define_insn "mov_exec" ; flat_load v, vT (define_insn "mov_sgprbase" - [(set (match_operand:V_1REG 0 "nonimmediate_operand" "= v, v, v, m") + [(set (match_operand:V_1REG 0 "nonimmediate_operand") (unspec:V_1REG - [(match_operand:V_1REG 1 "general_operand" " vA,vB, m, v")] + [(match_operand:V_1REG 1 "general_operand")] UNSPEC_SGPRBASE)) - (clobber (match_operand: 2 "register_operand" "=&v,&v,&v,&v"))] + (clobber (match_operand: 2 "register_operand"))] "lra_in_progress || reload_completed" - "@ - v_mov_b32\t%0, %1 - v_mov_b32\t%0, %1 - # - #" - [(set_attr "type" "vop1,vop1,*,*") - (set_attr "length" "4,8,12,12")]) + {@ [cons: =0, 1, =2; attrs: type, length] + [v,vA,&v;vop1,4 ] v_mov_b32\t%0, %1 + [v,vB,&v;vop1,8 ] ^ + [v,m ,&v;* ,12] # + [m,v ,&v;* ,12] # + }) (define_insn "mov_sgprbase" [(set (match_operand:V_2REG 0 "nonimmediate_operand" "= v, v, m") @@ -676,17 +673,17 @@ (define_insn "mov_sgprbase" (set_attr "length" "8,12,12")]) (define_insn "mov_sgprbase" - [(set (match_operand:V_4REG 0 "nonimmediate_operand" "= v, v, m") + [(set (match_operand:V_4REG 0 "nonimmediate_operand") (unspec:V_4REG - [(match_operand:V_4REG 1 "general_operand" "vDB, m, v")] + [(match_operand:V_4REG 1 "general_operand")] UNSPEC_SGPRBASE)) - (clobber (match_operand: 2 "register_operand" "=&v,&v,&v"))] + (clobber (match_operand: 2 "register_operand"))] "lra_in_progress || reload_completed" - "v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1\;v_mov_b32\t%J0, %J1\;v_mov_b32\t%K0, %K1 - # - #" - [(set_attr "type" "vmult,*,*") - (set_attr "length" "8,12,12")]) + {@ [cons: =0, 1, =2; attrs: type, length] + [v,vDB,&v;vmult,8 ] v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1\;v_mov_b32\t%J0, %J1\;v_mov_b32\t%K0, %K1 + [v,m ,&v;*,12] # + [m,v ,&v;*,12] # + }) ; reload_in was once a standard name, but here it's only referenced by ; gcn_secondary_reload. It allows a reload with a scratch register. diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md index 7065acf402b..30fe9e34a35 100644 --- a/gcc/config/gcn/gcn.md +++ b/gcc/config/gcn/gcn.md @@ -542,87 +542,76 @@ (define_insn "*movbi" ; 32bit move pattern (define_insn "*mov_insn" - [(set (match_operand:SISF 0 "nonimmediate_operand" - "=SD,SD,SD,SD,RB,Sm,RS,v,Sg, v, v,RF,v,RLRG, v,SD, v,RM") - (match_operand:SISF 1 "gcn_load_operand" - "SSA, J, B,RB,Sm,RS,Sm,v, v,S
Re: [PATCH] test: Isolate slp-1.c check of target supports vect_strided5
On 15/09/2023 10:16, Juzhe-Zhong wrote: This test failed in RISC-V: FAIL: gcc.dg/vect/slp-1.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorizing stmts using SLP" 4 FAIL: gcc.dg/vect/slp-1.c scan-tree-dump-times vect "vectorizing stmts using SLP" 4 Because this loop: /* SLP with unrolling by 8. */ for (i = 0; i < N; i++) { out[i*5] = 8; out[i*5 + 1] = 7; out[i*5 + 2] = 81; out[i*5 + 3] = 28; out[i*5 + 4] = 18; } is using vect_load_lanes with array size = 5. instead of SLP. When we adjust the COST of LANES load store, then it will use SLP. gcc/testsuite/ChangeLog: * gcc.dg/vect/slp-1.c: Add vect_stried5. --- gcc/testsuite/gcc.dg/vect/slp-1.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/slp-1.c b/gcc/testsuite/gcc.dg/vect/slp-1.c index 82e4f6469fb..d4a13f12df6 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-1.c +++ b/gcc/testsuite/gcc.dg/vect/slp-1.c @@ -122,5 +122,5 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 4 loops" 1 "vect" } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" } } */ - +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target {! vect_strided5 } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_strided5 } } } */ This patch causes a test regression on amdgcn because vect_strided5 is true (because check_effective_target_vect_fully_masked is true), but the testcase still gives the message 4 times. Perhaps because amdgcn uses masking and not vect_load_lanes? Andrew
Re: [PATCH]middle-end match.pd: optimize fneg (fabs (x)) to x | (1 << signbit(x)) [PR109154]
> >>>> --- a/gcc/match.pd > > >>>> +++ b/gcc/match.pd > > >>>> @@ -1074,45 +1074,43 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > >>>> > > >>>> /* cos(copysign(x, y)) -> cos(x). Similarly for cosh. */ > > >>>> (for coss (COS COSH) > > >>>> - copysigns (COPYSIGN) > > >>>> - (simplify > > >>>> - (coss (copysigns @0 @1)) > > >>>> - (coss @0))) > > >>>> + (for copysigns (COPYSIGN_ALL) > > >>> > > >>> So this ends up generating for example the match > > >>> (cosf (copysignl ...)) which doesn't make much sense. > > >>> > > >>> The lock-step iteration did > > >>> (cosf (copysignf ..)) ... (ifn_cos (ifn_copysign ...)) > > >>> which is leaner but misses the case of > > >>> (cosf (ifn_copysign ..)) - that's probably what you are > > >>> after with this change. > > >>> > > >>> That said, there isn't a nice solution (without altering the match.pd > > >>> IL). There's the explicit solution, spelling out all combinations. > > >>> > > >>> So if we want to go with yout pragmatic solution changing this > > >>> to use COPYSIGN_ALL isn't necessary, only changing the lock-step > > >>> for iteration to a cross product for iteration is. > > >>> > > >>> Changing just this pattern to > > >>> > > >>> (for coss (COS COSH) > > >>> (for copysigns (COPYSIGN) > > >>> (simplify > > >>> (coss (copysigns @0 @1)) > > >>> (coss @0 > > >>> > > >>> increases the total number of gimple-match-x.cc lines from > > >>> 234988 to 235324. > > >> > > >> I guess the difference between this and the later suggestions is that > > >> this one allows builtin copysign to be paired with ifn cos, which would > > >> be potentially useful in other situations. (It isn't here because > > >> ifn_cos is rarely provided.) How much of the growth is due to that, > > >> and much of it is from nonsensical combinations like > > >> (builtin_cosf (builtin_copysignl ...))? > > >> > > >> If it's mostly from nonsensical combinations then would it be possible > > >> to make genmatch drop them? > > >> > > >>> The alternative is to do > > >>> > > >>> (for coss (COS COSH) > > >>> copysigns (COPYSIGN) > > >>> (simplify > > >>> (coss (copysigns @0 @1)) > > >>> (coss @0)) > > >>> (simplify > > >>> (coss (IFN_COPYSIGN @0 @1)) > > >>> (coss @0))) > > >>> > > >>> which properly will diagnose a duplicate pattern. Ther are > > >>> currently no operator lists with just builtins defined (that > > >>> could be fixed, see gencfn-macros.cc), supposed we'd have > > >>> COS_C we could do > > >>> > > >>> (for coss (COS_C COSH_C IFN_COS IFN_COSH) > > >>> copysigns (COPYSIGN_C COPYSIGN_C IFN_COPYSIGN IFN_COPYSIGN > > >>> IFN_COPYSIGN IFN_COPYSIGN IFN_COPYSIGN IFN_COPYSIGN IFN_COPYSIGN > > >>> IFN_COPYSIGN) > > >>> (simplify > > >>> (coss (copysigns @0 @1)) > > >>> (coss @0))) > > >>> > > >>> which of course still looks ugly ;) (some syntax extension like > > >>> allowing to specify IFN_COPYSIGN*8 would be nice here and easy > > >>> enough to do) > > >>> > > >>> Can you split out the part changing COPYSIGN to COPYSIGN_ALL, > > >>> re-do it to only split the fors, keeping COPYSIGN and provide > > >>> some statistics on the gimple-match-* size? I think this might > > >>> be the pragmatic solution for now. > > >>> > > >>> Richard - can you think of a clever way to express the desired > > >>> iteration? How do RTL macro iterations address cases like this? > > >> > > >> I don't think .md files have an equivalent construct, unfortunately. > > >> (I also regret some of the choices I made for .md iterators, but that's > > >> another story.) 
> > >> > > >> Perhaps an alternative to the *8 thing would be "IFN
Re: [PATCH] test: Isolate slp-1.c check of target supports vect_strided5
On 07/10/2023 02:04, juzhe.zh...@rivai.ai wrote: Thanks for reporting it. I think we may need to change it into: + /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target {! vect_load_lanes } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_strided5 && vect_load_lanes } } } */ Could you verify it whether it work for you ? You need an additional set of curly braces in the second line to avoid a syntax error message, but I get a pass with that change. Thanks Andrew
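For reference, the second directive with the additional set of curly braces described above would read roughly as follows (a sketch of the suggested fix, not necessarily the exact committed line):

/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target { vect_strided5 && vect_load_lanes } } } } */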
[COMMITTED] Remove unused get_identity_relation.
I added this routine for Aldy when he thought we were going to have to add explicit versions for unordered relations. It seems that with accurate tracking of NANs, we do not need the explicit versions in the oracle, so we will not need this identity routine to pick the appropriate version of VREL_EQ... as there is only one. As it stands, it always returns VREL_EQ, so simply use VREL_EQ in the 2 calling locations. Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed. Andrew From 5ee51119d1345f3f13af784455a4ae466766912b Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Mon, 9 Oct 2023 10:01:11 -0400 Subject: [PATCH 1/2] Remove unused get_identity_relation. Turns out we didn't need this, as there are no unordered relations managed by the oracle. * gimple-range-gori.cc (gori_compute::compute_operand1_range): Do not call get_identity_relation. (gori_compute::compute_operand2_range): Ditto. * value-relation.cc (get_identity_relation): Remove. * value-relation.h (get_identity_relation): Remove prototype. --- gcc/gimple-range-gori.cc | 10 ++ gcc/value-relation.cc| 14 -- gcc/value-relation.h | 3 --- 3 files changed, 2 insertions(+), 25 deletions(-) diff --git a/gcc/gimple-range-gori.cc b/gcc/gimple-range-gori.cc index 1b5eda43390..887da0ff094 100644 --- a/gcc/gimple-range-gori.cc +++ b/gcc/gimple-range-gori.cc @@ -1146,10 +1146,7 @@ gori_compute::compute_operand1_range (vrange &r, // If op1 == op2, create a new trio for just this call. if (op1 == op2 && gimple_range_ssa_p (op1)) - { - relation_kind k = get_identity_relation (op1, op1_range); - trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), k); - } + trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), VREL_EQ); if (!handler.calc_op1 (r, lhs, op2_range, trio)) return false; } @@ -1225,10 +1222,7 @@ gori_compute::compute_operand2_range (vrange &r, // If op1 == op2, create a new trio for this stmt. if (op1 == op2 && gimple_range_ssa_p (op1)) -{ - relation_kind k = get_identity_relation (op1, op1_range); - trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), k); -} +trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), VREL_EQ); // Intersect with range for op2 based on lhs and op1. if (!handler.calc_op2 (r, lhs, op1_range, trio)) return false; diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc index 8fea4aad345..a2ae39692a6 100644 --- a/gcc/value-relation.cc +++ b/gcc/value-relation.cc @@ -183,20 +183,6 @@ relation_transitive (relation_kind r1, relation_kind r2) return relation_kind (rr_transitive_table[r1][r2]); } -// When operands of a statement are identical ssa_names, return the -// approriate relation between operands for NAME == NAME, given RANGE. -// -relation_kind -get_identity_relation (tree name, vrange &range ATTRIBUTE_UNUSED) -{ - // Return VREL_UNEQ when it is supported for floats as appropriate. - if (frange::supports_p (TREE_TYPE (name))) -return VREL_EQ; - - // Otherwise return VREL_EQ. - return VREL_EQ; -} - // This vector maps a relation to the equivalent tree code. static const tree_code relation_to_code [VREL_LAST] = { diff --git a/gcc/value-relation.h b/gcc/value-relation.h index f00f84f93b6..be6e277421b 100644 --- a/gcc/value-relation.h +++ b/gcc/value-relation.h @@ -91,9 +91,6 @@ inline bool relation_equiv_p (relation_kind r) void print_relation (FILE *f, relation_kind rel); -// Return relation for NAME == NAME with RANGE. -relation_kind get_identity_relation (tree name, vrange &range); - class relation_oracle { public: -- 2.41.0
[COMMITTED] PR tree-optimization/111694 - Ensure float equivalences include + and - zero.
When ranger propagates ranges in the on-entry cache, it also check for equivalences and incorporates the equivalence into the range for a name if it is known. With floating point values, the equivalence that is generated by comparison must also take into account that if the equivalence contains zero, both positive and negative zeros could be in the range. This PR demonstrates that once we establish an equivalence, even though we know one value may only have a positive zero, the equivalence may have been formed earlier and included a negative zero This patch pessimistically assumes that if the equivalence contains zero, we should include both + and - 0 in the equivalence that we utilize. I audited the other places, and found no other place where this issue might arise. Cache propagation is the only place where we augment the range with random equivalences. Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed. Andrew From b0892b1fc637fadf14d7016858983bc5776a1e69 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Mon, 9 Oct 2023 10:15:07 -0400 Subject: [PATCH 2/2] Ensure float equivalences include + and - zero. A floating point equivalence may not properly reflect both signs of zero, so be pessimsitic and ensure both signs are included. PR tree-optimization/111694 gcc/ * gimple-range-cache.cc (ranger_cache::fill_block_cache): Adjust equivalence range. * value-relation.cc (adjust_equivalence_range): New. * value-relation.h (adjust_equivalence_range): New prototype. gcc/testsuite/ * gcc.dg/pr111694.c: New. --- gcc/gimple-range-cache.cc | 3 +++ gcc/testsuite/gcc.dg/pr111694.c | 19 +++ gcc/value-relation.cc | 19 +++ gcc/value-relation.h| 3 +++ 4 files changed, 44 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/pr111694.c diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc index 3c819933c4e..89c0845457d 100644 --- a/gcc/gimple-range-cache.cc +++ b/gcc/gimple-range-cache.cc @@ -1470,6 +1470,9 @@ ranger_cache::fill_block_cache (tree name, basic_block bb, basic_block def_bb) { if (rel != VREL_EQ) range_cast (equiv_range, type); + else + adjust_equivalence_range (equiv_range); + if (block_result.intersect (equiv_range)) { if (DEBUG_RANGE_CACHE) diff --git a/gcc/testsuite/gcc.dg/pr111694.c b/gcc/testsuite/gcc.dg/pr111694.c new file mode 100644 index 000..a70b03069dc --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr111694.c @@ -0,0 +1,19 @@ +/* PR tree-optimization/111009 */ +/* { dg-do run } */ +/* { dg-options "-O2" } */ + +#define signbit(x) __builtin_signbit(x) + +static void test(double l, double r) +{ + if (l == r && (signbit(l) || signbit(r))) +; + else +__builtin_abort(); +} + +int main() +{ + test(0.0, -0.0); +} + diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc index a2ae39692a6..0326fe7cde6 100644 --- a/gcc/value-relation.cc +++ b/gcc/value-relation.cc @@ -183,6 +183,25 @@ relation_transitive (relation_kind r1, relation_kind r2) return relation_kind (rr_transitive_table[r1][r2]); } +// When one name is an equivalence of another, ensure the equivalence +// range is correct. Specifically for floating point, a +0 is also +// equivalent to a -0 which may not be reflected. See PR 111694. + +void +adjust_equivalence_range (vrange &range) +{ + if (range.undefined_p () || !is_a (range)) +return; + + frange fr = as_a (range); + // If range includes 0 make sure both signs of zero are included. 
+ if (fr.contains_p (dconst0) || fr.contains_p (dconstm0)) +{ + frange zeros (range.type (), dconstm0, dconst0); + range.union_ (zeros); +} + } + // This vector maps a relation to the equivalent tree code. static const tree_code relation_to_code [VREL_LAST] = { diff --git a/gcc/value-relation.h b/gcc/value-relation.h index be6e277421b..31d48908678 100644 --- a/gcc/value-relation.h +++ b/gcc/value-relation.h @@ -91,6 +91,9 @@ inline bool relation_equiv_p (relation_kind r) void print_relation (FILE *f, relation_kind rel); +// Adjust range as an equivalence. +void adjust_equivalence_range (vrange &range); + class relation_oracle { public: -- 2.41.0
[PATCH] MATCH: [PR111679] Add alternative simplification of `a | ((~a) ^ b)`
So currently we have a simplification for `a | ~(a ^ b)` but that does not match the case where we had originally `(~a) | (a ^ b)` so we need to add a new pattern that matches that and uses bitwise_inverted_equal_p that also catches comparisons too. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR tree-optimization/111679 gcc/ChangeLog: * match.pd (`a | ((~a) ^ b)`): New pattern. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/bitops-5.c: New test. --- gcc/match.pd | 8 +++ gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c | 27 2 files changed, 35 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c diff --git a/gcc/match.pd b/gcc/match.pd index 31bfd8b6b68..49740d189a7 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -1350,6 +1350,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) && TYPE_PRECISION (TREE_TYPE (@0)) == 1) (bit_ior @0 (bit_xor @1 { build_one_cst (type); } +/* a | ((~a) ^ b) --> a | (~b) (alt version of the above 2) */ +(simplify + (bit_ior:c @0 (bit_xor:cs @1 @2)) + (with { bool wascmp; } + (if (bitwise_inverted_equal_p (@0, @1, wascmp) + && (!wascmp || element_precision (type) == 1)) + (bit_ior @0 (bit_not @2) + /* (a | b) | (a &^ b) --> a | b */ (for op (bit_and bit_xor) (simplify diff --git a/gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c b/gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c new file mode 100644 index 000..990610e3002 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */ +/* PR tree-optimization/111679 */ + +int f1(int a, int b) +{ +return (~a) | (a ^ b); // ~(a & b) or (~a) | (~b) +} + +_Bool fb(_Bool c, _Bool d) +{ +return (!c) | (c ^ d); // ~(c & d) or (~c) | (~d) +} + +_Bool fb1(int x, int y) +{ +_Bool a = x == 10, b = y > 100; +return (!a) | (a ^ b); // ~(a & b) or (~a) | (~b) +// or (x != 10) | (y <= 100) +} + +/* { dg-final { scan-tree-dump-not "bit_xor_expr, " "optimized" } } */ +/* { dg-final { scan-tree-dump-times "bit_not_expr, " 2 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "bit_and_expr, " 2 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "bit_ior_expr, " 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "ne_expr, _\[0-9\]+, x_\[0-9\]+" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "le_expr, _\[0-9\]+, y_\[0-9\]+" 1 "optimized" } } */ -- 2.39.3
Re: [PATCH] use get_range_query to replace get_global_range_query
On Tue, Oct 10, 2023 at 12:02 AM Richard Biener wrote: > > On Tue, 10 Oct 2023, Jiufu Guo wrote: > > > Hi, > > > > For "get_global_range_query" SSA_NAME_RANGE_INFO can be queried. > > For "get_range_query", it could get more context-aware range info. > > And look at the implementation of "get_range_query", it returns > > global range if no local fun info. > > > > So, if not quering for SSA_NAME, it would be ok to use get_range_query > > to replace get_global_range_query. > > > > Patch https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630389.html, > > Uses get_range_query could handle more cases. > > > > This patch replaces get_global_range_query by get_range_query for > > most possible code pieces (but deoes not draft new test cases). > > > > Pass bootstrap & regtest on ppc64{,le} and x86_64. > > Is this ok for trunk. > > See below > > > > > BR, > > Jeff (Jiufu Guo) > > > > gcc/ChangeLog: > > > > * builtins.cc (expand_builtin_strnlen): Replace get_global_range_query > > by get_range_query. > > * fold-const.cc (expr_not_equal_to): Likewise. > > * gimple-fold.cc (size_must_be_zero_p): Likewise. > > * gimple-range-fold.cc (fur_source::fur_source): Likewise. > > * gimple-ssa-warn-access.cc (check_nul_terminated_array): Likewise. > > * tree-dfa.cc (get_ref_base_and_extent): Likewise. > > * tree-ssa-loop-split.cc (split_at_bb_p): Likewise. > > * tree-ssa-loop-unswitch.cc > > (evaluate_control_stmt_using_entry_checks): > > Likewise. > > > > --- > > gcc/builtins.cc | 2 +- > > gcc/fold-const.cc | 6 +- > > gcc/gimple-fold.cc| 6 ++ > > gcc/gimple-range-fold.cc | 4 +--- > > gcc/gimple-ssa-warn-access.cc | 2 +- > > gcc/tree-dfa.cc | 5 + > > gcc/tree-ssa-loop-split.cc| 2 +- > > gcc/tree-ssa-loop-unswitch.cc | 2 +- > > 8 files changed, 9 insertions(+), 20 deletions(-) > > > > diff --git a/gcc/builtins.cc b/gcc/builtins.cc > > index cb90bd03b3e..4e0a77ff8e0 100644 > > --- a/gcc/builtins.cc > > +++ b/gcc/builtins.cc > > @@ -3477,7 +3477,7 @@ expand_builtin_strnlen (tree exp, rtx target, > > machine_mode target_mode) > > > >wide_int min, max; > >value_range r; > > - get_global_range_query ()->range_of_expr (r, bound); > > + get_range_query (cfun)->range_of_expr (r, bound); > > expand doesn't have a ranger instance so this is a no-op. I'm unsure > if it would be safe given we're half GIMPLE, half RTL. Please leave it > out. It definitely does not work and can't as I tried to enable a ranger instance and it didn't work. I wrote up my experience here: https://gcc.gnu.org/pipermail/gcc/2023-September/242407.html Thanks, Andrew Pinski > > >if (r.varying_p () || r.undefined_p ()) > > return NULL_RTX; > >min = r.lower_bound (); > > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc > > index 4f8561509ff..15134b21b9f 100644 > > --- a/gcc/fold-const.cc > > +++ b/gcc/fold-const.cc > > @@ -11056,11 +11056,7 @@ expr_not_equal_to (tree t, const wide_int &w) > >if (!INTEGRAL_TYPE_P (TREE_TYPE (t))) > > return false; > > > > - if (cfun) > > - get_range_query (cfun)->range_of_expr (vr, t); > > - else > > - get_global_range_query ()->range_of_expr (vr, t); > > - > > + get_range_query (cfun)->range_of_expr (vr, t); > > These kind of changes look obvious. 
> > >if (!vr.undefined_p () && !vr.contains_p (w)) > > return true; > >/* If T has some known zero bits and W has any of those bits set, > > diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc > > index dc89975270c..853edd9e5d4 100644 > > --- a/gcc/gimple-fold.cc > > +++ b/gcc/gimple-fold.cc > > @@ -876,10 +876,8 @@ size_must_be_zero_p (tree size) > >wide_int zero = wi::zero (TYPE_PRECISION (type)); > >value_range valid_range (type, zero, ssize_max); > >value_range vr; > > - if (cfun) > > -get_range_query (cfun)->range_of_expr (vr, size); > > - else > > -get_global_range_query ()->range_of_expr (vr, size); > > + get_range_query (cfun)->range_of_expr (vr, size); > > + > >if (vr.undefined_p ()) > > vr.set_varying (TREE_TYPE (size)); > >vr.intersect (valid_range); > > diff --git a/gcc/gimple-range-fold.cc b/gcc/gimple-ran
Re: [PATCH] RISC-V Regression: Fix FAIL of bb-slp-pr65935.c for RVV
On 10/10/2023 02:39, Juzhe-Zhong wrote: Here is the reference comparing dump IR between ARM SVE and RVV. https://godbolt.org/z/zqess8Gss We can see RVV has one more dump IR: optimized: basic block part vectorized using 128 byte vectors since RVV has 1024 bit vectors. The codegen is reasonable good. However, I saw GCN also has 1024 bit vector. This patch may cause this case FAIL in GCN port ? Hi, GCN folk, could you check this patch in GCN port for me ? This patch *fixes* an existing test fail on GCN. :) It's probably one of the many I've never had time to analyze (and optimizing more than expected makes it low priority). LGTM Andrew
Re: [committed] [PR target/93062] RISC-V: Handle long conditional branches for RISC-V
I remembered another concern since we discussed this patch privately. Using ra for long calls results in a sequence that will corrupt the return-address stack. Corrupting the RAS is potentially more costly than mispredicting a branch, since it can result in a cascading sequence of mispredictions as the program returns up the stack. Of course, if these long calls are dynamically quite rare, this isn't the end of the world. But it's always preferable to use a register other than ra or t0 to avoid this performance reduction. I know nothing about the complexity of register scavenging, but it would be nice to opportunistically use a scratch register (other than t0), falling back to ra only when necessary. Tangentially, I noticed the patch uses `jump label, ra' for far branches but uses `call label' for far jumps. These corrupt the RAS in opposite ways (the former pops the RAS and the latter pushes it. Any reason for using a different sequence in one than the other? On Tue, Oct 10, 2023 at 3:11 PM Jeff Law wrote: > > > Ventana has had a variant of this patch from Andrew W. in its tree for > at least a year. I'm dusting it off and submitting it on Andrew's behalf. > > There's multiple approaches we could be using here. > > First we could make $ra fixed and use it as the scratch register for the > long branch sequences. > > Second, we could add a match_scratch to all the conditional branch > patterns and allow the register allocator to assign the scratch register > from the pool of GPRs. > > Third we could do register scavenging. This can usually work, though it > can get complex in some scenarios. > > Forth we could use trampolines for extended reach. > > Andrew's original patch did a bit of the first approach (make $ra fixed) > and mostly the second approach. The net is it was probably the worst in > terms of impacting code generation -- we lost a register *and* forced > every branch instruction to get a scratch register allocated. > > I had expected the second approach to produce better code than the > first, but that wasn't actually the case in practice. It's probably a > combination of allocating a GPR at every branch point (even with a life > of a single insn, there's a cost) and perhaps the additional operands on > conditional branches spoiling simplistic pattern matching in one or more > passes. > > In addition to performing better based on dynamic instruction counts, > the first approach is significantly simpler to implement. Given those > two positives, that's what I've chosen to go with. Yes it does remove > $ra from the set of registers available, but the impact of that is *tiny*. > > If someone wanted to dive into one of the other approaches to address a > real world impact, that's great. If that happens I would strongly > suggest also evaluating perlbench from spec2017. It seems particularly > sensitive to this issue in terms of approach #2's impact on code generation. > > I've built & regression tested this variant on the vt1 configuration > without regressions. Earlier versions have been bootstrapped as well. > > Pushed to the trunk, > > Jeff >
[PATCH] MATCH: [PR111282] Simplify `a & (b ^ ~a)` to `a & b`
While `a & (b ^ ~a)` is optimized to `a & b` on the rtl level, it is always good to optimize this at the gimple level and allows us to match a few extra things including where a is a comparison. Note I had to update/change the testcase and-1.c to avoid matching this case as we can match -2 and 1 as bitwise inversions. PR tree-optimization/111282 gcc/ChangeLog: * match.pd (`a & ~(a ^ b)`, `a & (a == b)`, `a & ((~a) ^ b)`): New patterns. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/and-1.c: Update testcase to avoid matching `~1 & (a ^ 1)` simplification. * gcc.dg/tree-ssa/bitops-6.c: New test. --- gcc/match.pd | 20 ++ gcc/testsuite/gcc.dg/tree-ssa/and-1.c| 6 ++--- gcc/testsuite/gcc.dg/tree-ssa/bitops-6.c | 33 3 files changed, 56 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/bitops-6.c diff --git a/gcc/match.pd b/gcc/match.pd index 49740d189a7..26b05c157c1 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -1358,6 +1358,26 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) && (!wascmp || element_precision (type) == 1)) (bit_ior @0 (bit_not @2) +/* a & ~(a ^ b) --> a & b */ +(simplify + (bit_and:c @0 (bit_not (bit_xor:c @0 @1))) + (bit_and @0 @1)) + +/* a & (a == b) --> a & b (boolean version of the above). */ +(simplify + (bit_and:c @0 (nop_convert? (eq:c @0 @1))) + (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) + && TYPE_PRECISION (TREE_TYPE (@0)) == 1) + (bit_and @0 @1))) + +/* a & ((~a) ^ b) --> a & b (alt version of the above 2) */ +(simplify + (bit_and:c @0 (bit_xor:c @1 @2)) + (with { bool wascmp; } + (if (bitwise_inverted_equal_p (@0, @1, wascmp) + && (!wascmp || element_precision (type) == 1)) + (bit_and @0 @2 + /* (a | b) | (a &^ b) --> a | b */ (for op (bit_and bit_xor) (simplify diff --git a/gcc/testsuite/gcc.dg/tree-ssa/and-1.c b/gcc/testsuite/gcc.dg/tree-ssa/and-1.c index 276c2b9bd8a..27d38907eea 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/and-1.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/and-1.c @@ -2,10 +2,10 @@ /* { dg-options "-O -fdump-tree-optimized-raw" } */ int f(int in) { - in = in | 3; - in = in ^ 1; + in = in | 7; + in = in ^ 3; in = (in & ~(unsigned long)1); return in; } -/* { dg-final { scan-tree-dump-not "bit_and_expr" "optimized" } } */ +/* { dg-final { scan-tree-dump-not "bit_and_expr, " "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/bitops-6.c b/gcc/testsuite/gcc.dg/tree-ssa/bitops-6.c new file mode 100644 index 000..e6ab2fd6c71 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/bitops-6.c @@ -0,0 +1,33 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */ +/* PR tree-optimization/111282 */ + + +int f(int a, int b) +{ + return a & (b ^ ~a); // a & b +} + +_Bool fb(_Bool x, _Bool y) +{ + return x & (y ^ !x); // x & y +} + +int fa(int w, int z) +{ + return (~w) & (w ^ z); // ~w & z +} + +int fcmp(int x, int y) +{ + _Bool a = x == 2; + _Bool b = y == 1; + return a & (b ^ !a); // (x == 2) & (y == 1) +} + +/* { dg-final { scan-tree-dump-not "bit_xor_expr, " "optimized" } } */ +/* { dg-final { scan-tree-dump-times "bit_and_expr, " 4 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "bit_not_expr, " 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-not "ne_expr, ""optimized" } } */ +/* { dg-final { scan-tree-dump-times "eq_expr, " 2 "optimized" } } */ + -- 2.39.3
Re: [committed] [PR target/93062] RISC-V: Handle long conditional branches for RISC-V
On Tue, Oct 10, 2023 at 8:26 PM Jeff Law wrote: > > > > On 10/10/23 18:24, Andrew Waterman wrote: > > I remembered another concern since we discussed this patch privately. > > Using ra for long calls results in a sequence that will corrupt the > > return-address stack. > Yup. We've actually got data on that internally, it's not showing up in > a significant way in practice. > > >I know nothing > > about the complexity of register scavenging, but it would be nice to > > opportunistically use a scratch register (other than t0), falling back > > to ra only when necessary. > The nice thing about making $ra fixed is some can add a register > scavenging approach, then fall back to $ra if they're unable to find a > register to reuse. > > > > > Tangentially, I noticed the patch uses `jump label, ra' for far > > branches but uses `call label' for far jumps. These corrupt the RAS > > in opposite ways (the former pops the RAS and the latter pushes it. > > Any reason for using a different sequence in one than the other? > I'd noticed it as well -- that's the way it was in the patch that was > already in Ventana's tree ;-) My plan was to address that separately > after dropping in enough infrastructure to allow me to force everything > to be far branches for testing purposes. Sounds like we're thinking many of the same thoughts... thanks for dragging this patch towards the finish line! > > jeff
[COMMITTED][GCC13] PR tree-optimization/111694 - Ensure float equivalences include + and - zero.
Similar patch which was checked into trunk last week. slight tweak needed as dconstm0 was not exported in gcc 13, otherwise functionally the same Bootstrapped on x86_64-pc-linux-gnu. pushed. Andrew commit f0efc4b25cba1bd35b08b7dfbab0f8fc81b55c66 Author: Andrew MacLeod Date: Mon Oct 9 13:40:15 2023 -0400 Ensure float equivalences include + and - zero. A floating point equivalence may not properly reflect both signs of zero, so be pessimsitic and ensure both signs are included. PR tree-optimization/111694 gcc/ * gimple-range-cache.cc (ranger_cache::fill_block_cache): Adjust equivalence range. * value-relation.cc (adjust_equivalence_range): New. * value-relation.h (adjust_equivalence_range): New prototype. gcc/testsuite/ * gcc.dg/pr111694.c: New. diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc index 2314478d558..e4e75943632 100644 --- a/gcc/gimple-range-cache.cc +++ b/gcc/gimple-range-cache.cc @@ -1258,6 +1258,9 @@ ranger_cache::fill_block_cache (tree name, basic_block bb, basic_block def_bb) { if (rel != VREL_EQ) range_cast (equiv_range, type); + else + adjust_equivalence_range (equiv_range); + if (block_result.intersect (equiv_range)) { if (DEBUG_RANGE_CACHE) diff --git a/gcc/testsuite/gcc.dg/pr111694.c b/gcc/testsuite/gcc.dg/pr111694.c new file mode 100644 index 000..a70b03069dc --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr111694.c @@ -0,0 +1,19 @@ +/* PR tree-optimization/111009 */ +/* { dg-do run } */ +/* { dg-options "-O2" } */ + +#define signbit(x) __builtin_signbit(x) + +static void test(double l, double r) +{ + if (l == r && (signbit(l) || signbit(r))) +; + else +__builtin_abort(); +} + +int main() +{ + test(0.0, -0.0); +} + diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc index 30a02d3c9d3..fc792a4d5bc 100644 --- a/gcc/value-relation.cc +++ b/gcc/value-relation.cc @@ -183,6 +183,25 @@ relation_transitive (relation_kind r1, relation_kind r2) return relation_kind (rr_transitive_table[r1][r2]); } +// When one name is an equivalence of another, ensure the equivalence +// range is correct. Specifically for floating point, a +0 is also +// equivalent to a -0 which may not be reflected. See PR 111694. + +void +adjust_equivalence_range (vrange &range) +{ + if (range.undefined_p () || !is_a (range)) +return; + + frange fr = as_a (range); + REAL_VALUE_TYPE dconstm0 = dconst0; + dconstm0.sign = 1; + frange zeros (range.type (), dconstm0, dconst0); + // If range includes a 0 make sure both signs of zero are included. + if (fr.intersect (zeros) && !fr.undefined_p ()) +range.union_ (zeros); + } + // This vector maps a relation to the equivalent tree code. static const tree_code relation_to_code [VREL_LAST] = { diff --git a/gcc/value-relation.h b/gcc/value-relation.h index 3177ecb1ad0..6412cbbe98b 100644 --- a/gcc/value-relation.h +++ b/gcc/value-relation.h @@ -91,6 +91,9 @@ inline bool relation_equiv_p (relation_kind r) void print_relation (FILE *f, relation_kind rel); +// Adjust range as an equivalence. +void adjust_equivalence_range (vrange &range); + class relation_oracle { public:
Re: RISC-V: Support CORE-V XCVMAC and XCVALU extensions
On Wed, Oct 11, 2023 at 6:01 PM juzhe.zh...@rivai.ai wrote: > > ../../../../gcc/gcc/doc/extend.texi:21708: warning: node next `RISC-V Vector > Intrinsics' in menu `CORE-V Built-in Functions' and in sectioning `RX > Built-in Functions' differ > ../../../../gcc/gcc/doc/extend.texi:21716: warning: node `RX Built-in > Functions' is next for `CORE-V Built-in Functions' in menu but not in > sectioning > ../../../../gcc/gcc/doc/extend.texi:21716: warning: node `RISC-V Vector > Intrinsics' is prev for `CORE-V Built-in Functions' in menu but not in > sectioning > ../../../../gcc/gcc/doc/extend.texi:21716: warning: node up `CORE-V Built-in > Functions' in menu `Target Builtins' and in sectioning `RISC-V Vector > Intrinsics' differ > ../../../../gcc/gcc/doc/extend.texi:21708: node `RISC-V Vector Intrinsics' > lacks menu item for `CORE-V Built-in Functions' despite being its Up target > ../../../../gcc/gcc/doc/extend.texi:21889: warning: node prev `RX Built-in > Functions' in menu `CORE-V Built-in Functions' and in sectioning `RISC-V > Vector Intrinsics' differ > In file included from ../../../../gcc/gcc/gensupport.cc:26:0: > ../../../../gcc/gcc/rtl.h:66:26: warning: ‘rtx_def::code’ is too small to > hold all values of ‘enum rtx_code’ > #define RTX_CODE_BITSIZE 8 > ^ > ../../../../gcc/gcc/rtl.h:318:33: note: in expansion of macro > ‘RTX_CODE_BITSIZE’ >ENUM_BITFIELD(rtx_code) code: RTX_CODE_BITSIZE; > ^~~~ > > make[2]: *** [Makefile:3534: doc/gcc.info] Error 1 > make[2]: *** Waiting for unfinished jobs > rm gfdl.pod gcc.pod gcov-dump.pod gcov-tool.pod fsf-funding.pod gpl.pod > cpp.pod gcov.pod lto-dump.pod > make[2]: Leaving directory > '/work/home/jzzhong/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/build-gcc-newlib-stage1/gcc' > make[1]: *** [Makefile:4648: all-gcc] Error 2 > make[1]: Leaving directory > '/work/home/jzzhong/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/build-gcc-newlib-stage1' > make: *** [Makefile:590: stamps/build-gcc-newlib-stage1] Error 2 This is also recorded as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111777 . It breaks more than just RISCV; it depends on the version of texinfo that is installed too. Thanks, Andrew > > > juzhe.zh...@rivai.ai
[COMMITTED] PR tree-optimization/111622 - Do not add partial equivalences with no uses.
Technically PR 111622 exposes a bug in GCC 13, but it's been papered over on trunk by this: commit 9ea74d235c7e7816b996a17c61288f02ef767985 Author: Richard Biener Date: Thu Sep 14 09:31:23 2023 +0200 tree-optimization/111294 - better DCE after forwprop This removes a lot of dead statements, but those statements were being added to the list of partial equivalences and causing some serious compile time issues. Ranger's cache loops through equivalences when it's propagating on-entry values, so if the partial equivalence list is very large, it can consume a lot of time. Typically, partial equivalence lists are small. In this case, a lot of dead stmts were not removed, so there was no redundancy elimination and it was causing an issue. This patch actually speeds things up a hair in the normal case too. Bootstrapped on x86_64-pc-linux-gnu with no regressions. pushed. Andrew
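For illustration only, a hypothetical reduction in the spirit of the PR 111622 testcase (not the actual reproducer): a function containing many casts whose results are never used. Each such cast can register a partial equivalence between its result and its operand, so the per-name equivalence lists grow with the number of dead statements and on-entry cache propagation ends up walking all of them.

void
f (long long x)
{
  /* Each result below has zero uses; with the fix, no partial
     equivalence is registered for these names.  */
  short s0 = (short) x;
  char c0 = (char) x;
  int i0 = (int) x;
  /* ...repeated many more times in the original testcase...  */
}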
[COMMITTED] [GCC13] PR tree-optimization/111622 - Do not add partial equivalences with no uses.
There are a lot of dead statements in this testcase which are casts. These were being added to the list of partial equivalences and causing some serious compile time issues. Ranger's cache loops through equivalences when it's propagating on-entry values, so if the partial equivalence list is very large, it can consume a lot of time. Typically, partial equivalence lists are small. In this case, a lot of dead stmts were not removed, so there was no redundancy elimination and it was causing an issue. Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed. Andrew From 425964b77ab5b9631e914965a7397303215c77a1 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Thu, 12 Oct 2023 17:06:36 -0400 Subject: [PATCH] Do not add partial equivalences with no uses. PR tree-optimization/111622 * value-relation.cc (equiv_oracle::add_partial_equiv): Do not register a partial equivalence if an operand has no uses. --- gcc/value-relation.cc | 9 + 1 file changed, 9 insertions(+) diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc index fc792a4d5bc..0ed5f93d184 100644 --- a/gcc/value-relation.cc +++ b/gcc/value-relation.cc @@ -389,6 +389,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) // In either case, if PE2 has an entry, we simply do nothing. if (pe2.members) return; + // If there are no uses of op2, do not register. + if (has_zero_uses (op2)) + return; // PE1 is the LHS and already has members, so everything in the set // should be a slice of PE2 rather than PE1. pe2.code = pe_min (r, pe1.code); @@ -406,6 +409,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) } if (pe2.members) { + // If there are no uses of op1, do not register. + if (has_zero_uses (op1)) + return; pe1.ssa_base = pe2.ssa_base; // If pe2 is a 16 bit value, but only an 8 bit copy, we can't be any // more than an 8 bit equivalence here, so choose MIN value. @@ -415,6 +421,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) } else { + // If there are no uses of either operand, do not register. + if (has_zero_uses (op1) || has_zero_uses (op2)) + return; // Neither name has an entry, simply create op1 as slice of op2. pe2.code = bits_to_pe (TYPE_PRECISION (TREE_TYPE (op2))); if (pe2.code == VREL_VARYING) -- 2.41.0
Re: [COMMITTED] PR tree-optimization/111622 - Do not add partial equivalences with no uses.
of course the patch would be handy... On 10/13/23 09:23, Andrew MacLeod wrote: Technically PR 111622 exposes a bug in GCC 13, but its been papered over on trunk by this: commit 9ea74d235c7e7816b996a17c61288f02ef767985 Author: Richard Biener Date: Thu Sep 14 09:31:23 2023 +0200 tree-optimization/111294 - better DCE after forwprop This removes a lot of dead statements, but those statements were being added to the list of partial equivalences and causing some serious compile time issues. Rangers cache loops through equivalences when its propagating on-entry values, so if the partial equivalence list is very large, it can consume a lot of time. Typically, partial equivalence lists are small. In this case, a lot of dead stmts were not removed, so there was no redundancy elimination and it was causing an issue. This patch actually speeds things up a hair in the normal case too. Bootstrapped on x86_64-pc-linux-gnu with no regressions. pushed. Andrew From 4eea3c1872a941089cafa105a11d8e40b1a55929 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Thu, 12 Oct 2023 17:06:36 -0400 Subject: [PATCH] Do not add partial equivalences with no uses. PR tree-optimization/111622 * value-relation.cc (equiv_oracle::add_partial_equiv): Do not register a partial equivalence if an operand has no uses. --- gcc/value-relation.cc | 9 + 1 file changed, 9 insertions(+) diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc index 0326fe7cde6..c0f513a0eb1 100644 --- a/gcc/value-relation.cc +++ b/gcc/value-relation.cc @@ -392,6 +392,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) // In either case, if PE2 has an entry, we simply do nothing. if (pe2.members) return; + // If there are no uses of op2, do not register. + if (has_zero_uses (op2)) + return; // PE1 is the LHS and already has members, so everything in the set // should be a slice of PE2 rather than PE1. pe2.code = pe_min (r, pe1.code); @@ -409,6 +412,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) } if (pe2.members) { + // If there are no uses of op1, do not register. + if (has_zero_uses (op1)) + return; pe1.ssa_base = pe2.ssa_base; // If pe2 is a 16 bit value, but only an 8 bit copy, we can't be any // more than an 8 bit equivalence here, so choose MIN value. @@ -418,6 +424,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) } else { + // If there are no uses of either operand, do not register. + if (has_zero_uses (op1) || has_zero_uses (op2)) + return; // Neither name has an entry, simply create op1 as slice of op2. pe2.code = bits_to_pe (TYPE_PRECISION (TREE_TYPE (op2))); if (pe2.code == VREL_VARYING) -- 2.41.0
[PATCH] MATCH: [PR111432] Simplify `a & (x | CST)` to a when we know that (a & ~CST) == 0
This adds the simplification `a & (x | CST)` to a when we know that `(a & ~CST) == 0`, in a similar fashion to how `a & CST` is handled. I looked into handling `a | (x & CST)` but I don't see any decent simplifications happening. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR tree-optimization/111432 gcc/ChangeLog: * match.pd (`a & (x | CST)`): New pattern. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/bitops-7.c: New test. --- gcc/match.pd | 8  gcc/testsuite/gcc.dg/tree-ssa/bitops-7.c | 24  2 files changed, 32 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/bitops-7.c diff --git a/gcc/match.pd b/gcc/match.pd index 51e5065d086..45624f3dcb4 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -1550,6 +1550,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) && wi::bit_and_not (get_nonzero_bits (@0), wi::to_wide (@1)) == 0) @0)) + +/* `a & (x | CST)` -> a if we know that (a & ~CST) == 0 */ +(simplify + (bit_and:c SSA_NAME@0 (bit_ior @1 INTEGER_CST@2)) + (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) + && wi::bit_and_not (get_nonzero_bits (@0), wi::to_wide (@2)) == 0) + @0)) + /* x | C -> C if we know that x & ~C == 0. */ (simplify (bit_ior SSA_NAME@0 INTEGER_CST@1) diff --git a/gcc/testsuite/gcc.dg/tree-ssa/bitops-7.c b/gcc/testsuite/gcc.dg/tree-ssa/bitops-7.c new file mode 100644 index 000..7fb18db3a11 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/bitops-7.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-options "-O1 -fdump-tree-optimized-raw" } */ +/* PR tree-optimization/111432 */ + +int +foo3(int c, int bb) +{ + if ((bb & ~3)!=0) __builtin_unreachable(); + return (bb & (c|3)); +} + +int +foo_bool(int c, _Bool bb) +{ + return (bb & (c|7)); +} + +/* Both of these functions should be able to remove the `IOR` and `AND` + as the only bits that are non-zero for bb is set on the other side + of the `AND`. + */ + +/* { dg-final { scan-tree-dump-not "bit_ior_expr, " "optimized" } } */ +/* { dg-final { scan-tree-dump-not "bit_and_expr, " "optimized" } } */ -- 2.39.3
[PATCH 2/2] [c] Fix PR 101364: ICE after error due to diagnose_arglist_conflict not checking for error
When checking to see if a function declaration has a conflict due to promotions, there is no test to see whether the type was an error mark before c_type_promotes_to is called. c_type_promotes_to is not ready for error_mark and causes an ICE. This adds a check for error before the call to c_type_promotes_to. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR c/101364 gcc/c/ChangeLog: * c-decl.cc (diagnose_arglist_conflict): Test for error mark before calling c_type_promotes_to. gcc/testsuite/ChangeLog: * gcc.dg/pr101364-1.c: New test. --- gcc/c/c-decl.cc | 3 ++- gcc/testsuite/gcc.dg/pr101364-1.c | 8 2 files changed, 10 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.dg/pr101364-1.c diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc index 5822faf01b4..eb2df08c0a7 100644 --- a/gcc/c/c-decl.cc +++ b/gcc/c/c-decl.cc @@ -1899,7 +1899,8 @@ diagnose_arglist_conflict (tree newdecl, tree olddecl, break; } - if (c_type_promotes_to (type) != type) + if (!error_operand_p (type) + && c_type_promotes_to (type) != type) { inform (input_location, "an argument type that has a default " "promotion cannot match an empty parameter name list " diff --git a/gcc/testsuite/gcc.dg/pr101364-1.c b/gcc/testsuite/gcc.dg/pr101364-1.c new file mode 100644 index 000..e7c94a05553 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr101364-1.c @@ -0,0 +1,8 @@ +/* { dg-do compile } */ +/* { dg-options "-std=c90 "} */ + +void fruit(); /* { dg-message "previous declaration" } */ +void fruit( /* { dg-error "conflicting types for" } */ +int b[x], /* { dg-error "undeclared " } */ +short c) +{} /* { dg-message "an argument type that has a" } */ -- 2.39.3
[PATCH 1/2] Fix ICE due to c_safe_arg_type_equiv_p not checking for error_mark node
This is a simple error recovery issue when c_safe_arg_type_equiv_p was added in r8-5312-gc65e18d3331aa999. The issue is that after an error, an argument type (of a function type) might turn into an error mark node and c_safe_arg_type_equiv_p was not ready for that. So this just adds a check for error operand for its arguments before getting the main variant. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR c/101285 gcc/c/ChangeLog: * c-typeck.cc (c_safe_arg_type_equiv_p): Return true for error operands early. gcc/testsuite/ChangeLog: * gcc.dg/pr101285-1.c: New test. --- gcc/c/c-typeck.cc | 3 +++ gcc/testsuite/gcc.dg/pr101285-1.c | 10 ++ 2 files changed, 13 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/pr101285-1.c diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc index e55e887da14..6e044b4afbc 100644 --- a/gcc/c/c-typeck.cc +++ b/gcc/c/c-typeck.cc @@ -5960,6 +5960,9 @@ handle_warn_cast_qual (location_t loc, tree type, tree otype) static bool c_safe_arg_type_equiv_p (tree t1, tree t2) { + if (error_operand_p (t1) || error_operand_p (t2)) +return true; + t1 = TYPE_MAIN_VARIANT (t1); t2 = TYPE_MAIN_VARIANT (t2); diff --git a/gcc/testsuite/gcc.dg/pr101285-1.c b/gcc/testsuite/gcc.dg/pr101285-1.c new file mode 100644 index 000..831e35f7662 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr101285-1.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-W -Wall" } */ +const int b; +typedef void (*ft1)(int[b++]); /* { dg-error "read-only variable" } */ +void bar(int * z); +void baz() +{ +(ft1) bar; /* { dg-warning "statement with no effect" } */ +} + -- 2.39.3
[PATCH] MATCH: Improve `A CMP 0 ? A : -A` set of patterns to use bitwise_equal_p.
This improves the `A CMP 0 ? A : -A` set of match patterns to use bitwise_equal_p which allows an nop cast between signed and unsigned. This allows catching a few extra cases which were not being caught before. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. gcc/ChangeLog: PR tree-optimization/101541 * match.pd (A CMP 0 ? A : -A): Improve using bitwise_equal_p. gcc/testsuite/ChangeLog: PR tree-optimization/101541 * gcc.dg/tree-ssa/phi-opt-36.c: New test. * gcc.dg/tree-ssa/phi-opt-37.c: New test. --- gcc/match.pd | 49 - gcc/testsuite/gcc.dg/tree-ssa/phi-opt-36.c | 51 ++ gcc/testsuite/gcc.dg/tree-ssa/phi-opt-37.c | 24 ++ 3 files changed, 104 insertions(+), 20 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-36.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-37.c diff --git a/gcc/match.pd b/gcc/match.pd index 45624f3dcb4..142e2dfbeb1 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -5668,42 +5668,51 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) /* A == 0 ? A : -Asame as -A */ (for cmp (eq uneq) (simplify - (cnd (cmp @0 zerop) @0 (negate@1 @0)) -(if (!HONOR_SIGNED_ZEROS (type)) + (cnd (cmp @0 zerop) @2 (negate@1 @2)) +(if (!HONOR_SIGNED_ZEROS (type) +&& bitwise_equal_p (@0, @2)) @1)) (simplify - (cnd (cmp @0 zerop) zerop (negate@1 @0)) -(if (!HONOR_SIGNED_ZEROS (type)) + (cnd (cmp @0 zerop) zerop (negate@1 @2)) +(if (!HONOR_SIGNED_ZEROS (type) +&& bitwise_equal_p (@0, @2)) @1)) ) /* A != 0 ? A : -Asame as A */ (for cmp (ne ltgt) (simplify - (cnd (cmp @0 zerop) @0 (negate @0)) -(if (!HONOR_SIGNED_ZEROS (type)) - @0)) + (cnd (cmp @0 zerop) @1 (negate @1)) +(if (!HONOR_SIGNED_ZEROS (type) +&& bitwise_equal_p (@0, @1)) + @1)) (simplify - (cnd (cmp @0 zerop) @0 integer_zerop) -(if (!HONOR_SIGNED_ZEROS (type)) - @0)) + (cnd (cmp @0 zerop) @1 integer_zerop) +(if (!HONOR_SIGNED_ZEROS (type) +&& bitwise_equal_p (@0, @1)) + @1)) ) /* A >=/> 0 ? A : -Asame as abs (A) */ (for cmp (ge gt) (simplify - (cnd (cmp @0 zerop) @0 (negate @0)) -(if (!HONOR_SIGNED_ZEROS (type) -&& !TYPE_UNSIGNED (type)) - (abs @0 + (cnd (cmp @0 zerop) @1 (negate @1)) +(if (!HONOR_SIGNED_ZEROS (TREE_TYPE(@0)) +&& !TYPE_UNSIGNED (TREE_TYPE(@0)) +&& bitwise_equal_p (@0, @1)) + (if (TYPE_UNSIGNED (type)) + (absu:type @0) + (abs @0) /* A <=/< 0 ? A : -Asame as -abs (A) */ (for cmp (le lt) (simplify - (cnd (cmp @0 zerop) @0 (negate @0)) -(if (!HONOR_SIGNED_ZEROS (type) -&& !TYPE_UNSIGNED (type)) - (if (ANY_INTEGRAL_TYPE_P (type) - && !TYPE_OVERFLOW_WRAPS (type)) + (cnd (cmp @0 zerop) @1 (negate @1)) +(if (!HONOR_SIGNED_ZEROS (TREE_TYPE(@0)) +&& !TYPE_UNSIGNED (TREE_TYPE(@0)) +&& bitwise_equal_p (@0, @1)) + (if ((ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0)) + && !TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))) + || TYPE_UNSIGNED (type)) (with { - tree utype = unsigned_type_for (type); + tree utype = unsigned_type_for (TREE_TYPE(@0)); } (convert (negate (absu:utype @0 (negate (abs @0) diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-36.c b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-36.c new file mode 100644 index 000..4baf9f82a22 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-36.c @@ -0,0 +1,51 @@ +/* { dg-options "-O2 -fdump-tree-phiopt" } */ + +unsigned f0(int A) +{ + unsigned t = A; +// A == 0? A : -Asame as -A + if (A == 0) return t; + return -t; +} + +unsigned f1(int A) +{ + unsigned t = A; +// A != 0? A : -Asame as A + if (A != 0) return t; + return -t; +} +unsigned f2(int A) +{ + unsigned t = A; +// A >= 0? 
A : -Asame as abs (A) + if (A >= 0) return t; + return -t; +} +unsigned f3(int A) +{ + unsigned t = A; +// A > 0? A : -Asame as abs (A) + if (A > 0) return t; + return -t; +} +unsigned f4(int A) +{ + unsigned t = A; +// A <= 0? A : -Asame as -abs (A) + if (A <= 0) return t; + return -t; +} +unsigned f5(int A) +{ + unsigned t = A; +// A < 0? A : -Asame as -abs (A) + if (A < 0) return t; + return -t; +} + +/* f4 and f5 are not allowed to be optimized in early phi-opt. */ +/* { dg-final { scan-tree-dump-times "if " 2 "phiopt1" } } */ +/* { dg-final { scan-tree-dump-not "if " "phiopt2" } } */ + + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-37.c b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-37.c new file mode 100644 index 000..f1ff472aaff --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-37.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-options "-O1 -fdump-tree-phiopt1" } */ + +unsigned abs_with_convert0 (int x) +{ +unsigned int y = x
[PATCH] Improve factor_out_conditional_operation for conversions and constants
In the case of a NOP conversion (precisions of the 2 types are equal), factoring out the conversion can be done even if int_fits_type_p returns false and even when the conversion is defined by a statement inside the conditional. Since it is a NOP conversion there is no zero/sign extending happening which is why it is ok to be done here. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. gcc/ChangeLog: PR tree-optimization/104376 PR tree-optimization/101541 * tree-ssa-phiopt.cc (factor_out_conditional_operation): Allow nop conversions even if it is defined by a statement inside the conditional. gcc/testsuite/ChangeLog: PR tree-optimization/101541 * gcc.dg/tree-ssa/phi-opt-38.c: New test. --- gcc/testsuite/gcc.dg/tree-ssa/phi-opt-38.c | 44 ++ gcc/tree-ssa-phiopt.cc | 8 +++- 2 files changed, 50 insertions(+), 2 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-38.c diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-38.c b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-38.c new file mode 100644 index 000..ca04d1619e6 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-38.c @@ -0,0 +1,44 @@ +/* { dg-options "-O2 -fdump-tree-phiopt" } */ + +unsigned f0(int A) +{ +// A == 0? A : -Asame as -A + if (A == 0) return A; + return -A; +} + +unsigned f1(int A) +{ +// A != 0? A : -Asame as A + if (A != 0) return A; + return -A; +} +unsigned f2(int A) +{ +// A >= 0? A : -Asame as abs (A) + if (A >= 0) return A; + return -A; +} +unsigned f3(int A) +{ +// A > 0? A : -Asame as abs (A) + if (A > 0) return A; + return -A; +} +unsigned f4(int A) +{ +// A <= 0? A : -Asame as -abs (A) + if (A <= 0) return A; + return -A; +} +unsigned f5(int A) +{ +// A < 0? A : -Asame as -abs (A) + if (A < 0) return A; + return -A; +} + +/* f4 and f5 are not allowed to be optimized in early phi-opt. */ +/* { dg-final { scan-tree-dump-times "if" 2 "phiopt1" } } */ +/* { dg-final { scan-tree-dump-not "if" "phiopt2" } } */ + diff --git a/gcc/tree-ssa-phiopt.cc b/gcc/tree-ssa-phiopt.cc index 312a6f9082b..0ab8fad5898 100644 --- a/gcc/tree-ssa-phiopt.cc +++ b/gcc/tree-ssa-phiopt.cc @@ -310,7 +310,9 @@ factor_out_conditional_operation (edge e0, edge e1, gphi *phi, return NULL; /* If arg1 is an INTEGER_CST, fold it to new type. */ if (INTEGRAL_TYPE_P (TREE_TYPE (new_arg0)) - && int_fits_type_p (arg1, TREE_TYPE (new_arg0))) + && (int_fits_type_p (arg1, TREE_TYPE (new_arg0)) + || TYPE_PRECISION (TREE_TYPE (new_arg0)) + == TYPE_PRECISION (TREE_TYPE (arg1 { if (gimple_assign_cast_p (arg0_def_stmt)) { @@ -323,7 +325,9 @@ factor_out_conditional_operation (edge e0, edge e1, gphi *phi, its basic block, because then it is possible this could enable further optimizations (minmax replacement etc.). See PR71016. */ - if (new_arg0 != gimple_cond_lhs (cond_stmt) + if (TYPE_PRECISION (TREE_TYPE (new_arg0)) + != TYPE_PRECISION (TREE_TYPE (arg1)) + && new_arg0 != gimple_cond_lhs (cond_stmt) && new_arg0 != gimple_cond_rhs (cond_stmt) && gimple_bb (arg0_def_stmt) == e0->src) { -- 2.34.1
[PATCH] [PR31531] MATCH: Improve ~a < ~b and ~a < CST, allow a nop cast inbetween ~ and a/b
Currently we able to simplify `~a CMP ~b` to `b CMP a` but we should allow a nop conversion in between the `~` and the `a` which can show up. A similarly thing should be done for `~a CMP CST`. I had originally submitted the `~a CMP CST` case as https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585088.html; I noticed we should do the same thing for the `~a CMP ~b` case and combined it with that one here. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR tree-optimization/31531 gcc/ChangeLog: * match.pd (~X op ~Y): Allow for an optional nop convert. (~X op C): Likewise. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/pr31531-1.c: New test. * gcc.dg/tree-ssa/pr31531-2.c: New test. --- gcc/match.pd | 10 --- gcc/testsuite/gcc.dg/tree-ssa/pr31531-1.c | 19 + gcc/testsuite/gcc.dg/tree-ssa/pr31531-2.c | 34 +++ 3 files changed, 59 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr31531-1.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr31531-2.c diff --git a/gcc/match.pd b/gcc/match.pd index 51e5065d086..e76ec1ec034 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -5944,18 +5944,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) /* Fold ~X op ~Y as Y op X. */ (for cmp (simple_comparison) (simplify - (cmp (bit_not@2 @0) (bit_not@3 @1)) + (cmp (nop_convert1?@4 (bit_not@2 @0)) (nop_convert2? (bit_not@3 @1))) (if (single_use (@2) && single_use (@3)) - (cmp @1 @0 + (with { tree otype = TREE_TYPE (@4); } +(cmp (convert:otype @1) (convert:otype @0)) /* Fold ~X op C as X op' ~C, where op' is the swapped comparison. */ (for cmp (simple_comparison) scmp (swapped_simple_comparison) (simplify - (cmp (bit_not@2 @0) CONSTANT_CLASS_P@1) + (cmp (nop_convert? (bit_not@2 @0)) CONSTANT_CLASS_P@1) (if (single_use (@2) && (TREE_CODE (@1) == INTEGER_CST || TREE_CODE (@1) == VECTOR_CST)) - (scmp @0 (bit_not @1) + (with { tree otype = TREE_TYPE (@1); } +(scmp (convert:otype @0) (bit_not @1)) (for cmp (simple_comparison) (simplify diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr31531-1.c b/gcc/testsuite/gcc.dg/tree-ssa/pr31531-1.c new file mode 100644 index 000..c27299151eb --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr31531-1.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ +/* PR tree-optimization/31531 */ + +int f(int a) +{ + int b = ~a; + return b<0; +} + + +int f1(unsigned a) +{ + int b = ~a; + return b<0; +} +/* We should convert the above two functions from b <0 to ((int)a) >= 0. */ +/* { dg-final { scan-tree-dump-times ">= 0" 2 "optimized"} } */ +/* { dg-final { scan-tree-dump-times "~" 0 "optimized"} } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr31531-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr31531-2.c new file mode 100644 index 000..865ea292215 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr31531-2.c @@ -0,0 +1,34 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ +/* PR tree-optimization/31531 */ + +int f0(unsigned x, unsigned t) +{ +x = ~x; +t = ~t; +int xx = x; +int tt = t; +return tt < xx; +} + +int f1(unsigned x, int t) +{ +x = ~x; +t = ~t; +int xx = x; +int tt = t; +return tt < xx; +} + +int f2(int x, unsigned t) +{ +x = ~x; +t = ~t; +int xx = x; +int tt = t; +return tt < xx; +} + + +/* We should be able to remove all ~ from the above functions. */ +/* { dg-final { scan-tree-dump-times "~" 0 "optimized"} } */ -- 2.39.3
Re: [PATCH] Add files to discourage submissions of PRs to the GitHub mirror.
On Mon, Oct 16, 2023, 16:39 Eric Gallager wrote: > Currently there is an unofficial mirror of GCC on GitHub that people > sometimes submit pull requests to: > https://github.com/gcc-mirror/gcc > However, this is not the proper way to contribute to GCC, so that means > that someone (usually Jonathan Wakely) has to go through the PRs and > manually tell people that they're sending their PRs to the wrong place. > One thing that would help mitigate this problem would be files in a > special .github directory that GitHub would automatically open when > contributors attempt to open a PR, that would then tell them the proper > way to contribute instead. This patch attempts to add two such files. > They are written in Markdown, which I'm realizing might require some > special handling in this repository, since the ".md" extension is also > used for GCC's "Machine Description" files here, but I'm not quite sure > how to go about handling that. Also note that I adapted these files from > equivalent files in the git repository for Git itself: > https://github.com/git/git/blob/master/.github/CONTRIBUTING.md > https://github.com/git/git/blob/master/.github/PULL_REQUEST_TEMPLATE.md > What do people think? > I think this is a great idea. Is a similar one for opening issues too? Thanks, Andrew ChangeLog: > > * .github/CONTRIBUTING.md: New file. > * .github/PULL_REQUEST_TEMPLATE.md: New file. > --- > .github/CONTRIBUTING.md | 18 ++ > .github/PULL_REQUEST_TEMPLATE.md | 5 + > 2 files changed, 23 insertions(+) > create mode 100644 .github/CONTRIBUTING.md > create mode 100644 .github/PULL_REQUEST_TEMPLATE.md > > diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md > new file mode 100644 > index ..4f7b3abca5f4 > --- /dev/null > +++ b/.github/CONTRIBUTING.md > @@ -0,0 +1,18 @@ > +## Contributing to GCC > + > +Thanks for taking the time to contribute to GCC! Please be advised that > if you are > +viewing this on `github.com`, that the mirror there is unofficial and > unmonitored. > +The GCC community does not use `github.com` for their contributions. > Instead, we use > +a mailing list (`gcc-patches@gcc.gnu.org`) for code submissions, code > +reviews, and bug reports. > + > +Perhaps one day it will be possible to use [GitGitGadget]( > https://gitgitgadget.github.io/) to > +conveniently send Pull Requests commits to GCC's mailing list, the way > that the Git project currently allows it to be used to send PRs to their > mailing list, but until that day arrives, please send your patches to the > mailing list manually. > + > +Please read ["Contributing to GCC"](https://gcc.gnu.org/contribute.html) > on the main GCC website > +to learn how the GCC project is managed, and how you can work with it. > +In addition, we highly recommend you to read [our guidelines for > read-write Git access](https://gcc.gnu.org/gitwrite.html). > + > +Or, you can follow the ["Contributing to GCC in 10 easy steps"]( > https://gcc.gnu.org/wiki/GettingStarted#Basics:_Contributing_to_GCC_in_10_easy_steps) > section of the ["Getting Started" page]( > https://gcc.gnu.org/wiki/GettingStarted) on [the wiki]( > https://gcc.gnu.org/wiki) for another example of the contribution process. > + > +Your friendly GCC community! > diff --git a/.github/PULL_REQUEST_TEMPLATE.md > b/.github/PULL_REQUEST_TEMPLATE.md > new file mode 100644 > index ..6417392c8cf3 > --- /dev/null > +++ b/.github/PULL_REQUEST_TEMPLATE.md > @@ -0,0 +1,5 @@ > +Thanks for taking the time to contribute to GCC! 
Please be advised that > if you are > +viewing this on `github.com`, that the mirror there is unofficial and > unmonitored. > +The GCC community does not use `github.com` for their contributions. > Instead, we use > +a mailing list (`gcc-patches@gcc.gnu.org`) for code submissions, code > reviews, and > +bug reports. Please send patches there instead. >
Re: [PATCH 11/11] aarch64: Add new load/store pair fusion pass.
On Tue, Oct 17, 2023 at 1:52 PM Alex Coplan wrote: > > This adds a new aarch64-specific RTL-SSA pass dedicated to forming load > and store pairs (LDPs and STPs). > > As a motivating example for the kind of thing this improves, take the > following testcase: > > extern double c[20]; > > double f(double x) > { > double y = x*x; > y += c[16]; > y += c[17]; > y += c[18]; > y += c[19]; > return y; > } > > for which we currently generate (at -O2): > > f: > adrpx0, c > add x0, x0, :lo12:c > ldp d31, d29, [x0, 128] > ldr d30, [x0, 144] > fmadd d0, d0, d0, d31 > ldr d31, [x0, 152] > faddd0, d0, d29 > faddd0, d0, d30 > faddd0, d0, d31 > ret > > but with the pass, we generate: > > f: > .LFB0: > adrpx0, c > add x0, x0, :lo12:c > ldp d31, d29, [x0, 128] > fmadd d0, d0, d0, d31 > ldp d30, d31, [x0, 144] > faddd0, d0, d29 > faddd0, d0, d30 > faddd0, d0, d31 > ret > > The pass is local (only considers a BB at a time). In theory, it should > be possible to extend it to run over EBBs, at least in the case of pure > (MEM_READONLY_P) loads, but this is left for future work. > > The pass works by identifying two kinds of bases: tree decls obtained > via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos. > If a candidate memory access has a MEM_EXPR base, then we track it via > this base, and otherwise if it is of a simple reg + form, we track > it via the RTL-SSA def_info for the register. > > For each BB, for a given kind of base, we build up a hash table mapping > the base to an access_group. The access_group data structure holds a > list of accesses at each offset relative to the same base. It uses a > splay tree to support efficient insertion (while walking the bb), and > the nodes are chained using a linked list to support efficient > iteration (while doing the transformation). > > For each base, we then iterate over the access_group to identify > adjacent accesses, and try to form load/store pairs for those insns that > access adjacent memory. > > The pass is currently run twice, both before and after register > allocation. The first copy of the pass is run late in the pre-RA RTL > pipeline, immediately after sched1, since it was found that sched1 was > increasing register pressure when the pass was run before. The second > copy of the pass runs immediately before peephole2, so as to get any > opportunities that the existing ldp/stp peepholes can handle. > > There are some cases that we punt on before RA, e.g. > accesses relative to eliminable regs (such as the soft frame pointer). > We do this since we can't know the elimination offset before RA, and we > want to avoid the RA reloading the offset (due to being out of ldp/stp > immediate range) as this can generate worse code. > > The post-RA copy of the pass is there to pick up the crumbs that were > left behind / things we punted on in the pre-RA pass. Among other > things, it's needed to handle accesses relative to the stack pointer > (see the previous patch in the series for an example). It can also > handle code that didn't exist at the time the pre-RA pass was run (spill > code, prologue/epilogue code). 
> > The following table shows the effect of the passes on code size in > SPEC CPU 2017 with -Os -flto=auto -mcpu=neoverse-v1: > > +-+-+--+-+ > |Benchmark| Pre-RA pass | Post-RA pass | Overall | > +-+-+--+-+ > | 541.leela_r | 0.04% | -0.03% | 0.01% | > | 502.gcc_r | -0.07% | -0.02% | -0.09% | > | 510.parest_r| -0.06% | -0.04% | -0.10% | > | 505.mcf_r | -0.12% | 0.00%| -0.12% | > | 500.perlbench_r | -0.12% | -0.02% | -0.15% | > | 520.omnetpp_r | -0.13% | -0.03% | -0.16% | > | 538.imagick_r | -0.17% | -0.02% | -0.19% | > | 525.x264_r | -0.17% | -0.02% | -0.19% | > | 544.nab_r | -0.22% | -0.01% | -0.23% | > | 557.xz_r| -0.27% | -0.01% | -0.28% | > | 507.cactuBSSN_r | -0.26% | -0.03% | -0.29% | > | 526.blender_r | -0.37% | -0.02% | -0.38% | > | 523.xalancbmk_r | -0.41% | -0.01% | -0.42% | > | 531.deepsjeng_r | -0.41% | -0.05% | -0.46% | > | 511.povray_r| -0.60% | -0.05% | -0.65% | > | 548.exchange2_r | -0.55% | -0.32% | -0.86% | > | 527.cam4_r | -0.82% | -0.16% | -0.98% | > | 503.bwaves_r| -0.63% | -0.41% | -1.04% | > | 521.wrf_r | -1.04% | -0.06% | -1.10% | > | 549.fotonik3d_r | -0.91% | -0.35% | -1.26% | > | 554.roms_r | -1.20% | -0.20% | -1.40% | > | 519.lbm_r | -1.91% | 0.00%| -1
aarch64: Replace duplicated selftests
Pushed as obvious. gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_test_fractional_cost): Test <= instead of testing < twice. diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 2b0de7ca0389be6698c329b54f9501b8ec09183f..9c3c0e705e2e6ea3b55b4a5f1e7d3360f91eb51d 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -27529,18 +27529,18 @@ aarch64_test_fractional_cost () ASSERT_EQ (cf (2, 3) * 5, cf (10, 3)); ASSERT_EQ (14 * cf (11, 21), cf (22, 3)); - ASSERT_TRUE (cf (4, 15) < cf (5, 15)); - ASSERT_FALSE (cf (5, 15) < cf (5, 15)); - ASSERT_FALSE (cf (6, 15) < cf (5, 15)); - ASSERT_TRUE (cf (1, 3) < cf (2, 5)); - ASSERT_TRUE (cf (1, 12) < cf (1, 6)); - ASSERT_FALSE (cf (5, 3) < cf (5, 3)); - ASSERT_TRUE (cf (239, 240) < 1); - ASSERT_FALSE (cf (240, 240) < 1); - ASSERT_FALSE (cf (241, 240) < 1); - ASSERT_FALSE (2 < cf (207, 104)); - ASSERT_FALSE (2 < cf (208, 104)); - ASSERT_TRUE (2 < cf (209, 104)); + ASSERT_TRUE (cf (4, 15) <= cf (5, 15)); + ASSERT_TRUE (cf (5, 15) <= cf (5, 15)); + ASSERT_FALSE (cf (6, 15) <= cf (5, 15)); + ASSERT_TRUE (cf (1, 3) <= cf (2, 5)); + ASSERT_TRUE (cf (1, 12) <= cf (1, 6)); + ASSERT_TRUE (cf (5, 3) <= cf (5, 3)); + ASSERT_TRUE (cf (239, 240) <= 1); + ASSERT_TRUE (cf (240, 240) <= 1); + ASSERT_FALSE (cf (241, 240) <= 1); + ASSERT_FALSE (2 <= cf (207, 104)); + ASSERT_TRUE (2 <= cf (208, 104)); + ASSERT_TRUE (2 <= cf (209, 104)); ASSERT_TRUE (cf (4, 15) < cf (5, 15)); ASSERT_FALSE (cf (5, 15) < cf (5, 15));
[0/3] target_version and aarch64 function multiversioning
This series adds support for function multiversioning on aarch64. There are a few minor issues in patch 2/3, that I intend to fix in future versions or follow-up patches. I also have some open questions about the correctness of existing function multiversioning implementations [1], that could affect some details of this patch series. Patches 1/3 and 2/3 both pass regression testing on x86. Patch 2/3 requires adding function multiversioning tests to aarch64, which I haven't included yet. Patch 3/3 demonstrates a potential approach for improving consistency of symbol naming between target_clones and target/target_version multiversioning, but would require agreement on how to resolve some of the issues discussed in [1]. Thanks, Andrew [1] https://gcc.gnu.org/pipermail/gcc/2023-October/242686.html
[1/3] Add support for target_version attribute
This patch adds support for the "target_version" attribute to the middle end and the C++ frontend, which will be used to implement function multiversioning in the aarch64 backend. Note that C++ is currently the only frontend which supports multiversioning using the "target" attribute, whereas the "target_clones" attribute is additionally supported in C, D and Ada. Support for the target_version attribute will be extended to C at a later date. Targets that currently use the "target" attribute for function multiversioning (i.e. i386 and rs6000) are not affected by this patch. I could have implemented the target hooks slightly differently, by reusing the valid_attribute_p hook and adding attribute name checks to each backend implementation (c.f. the aarch64 implementation in patch 2/3). Would this be preferable? Otherwise, is this ok for master? gcc/c-family/ChangeLog: * c-attribs.cc (handle_target_version_attribute): New. (c_common_attribute_table): Add target_version. (handle_target_clones_attribute): Add conflict with target_version attribute. gcc/ChangeLog: * attribs.cc (is_function_default_version): Update comment to specify incompatibility with target_version attributes. * cgraphclones.cc (cgraph_node::create_version_clone_with_body): Call valid_version_attribute_p for target_version attributes. * target.def (valid_version_attribute_p): New hook. (expanded_clones_attribute): New hook. * doc/tm.texi.in: Add new hooks. * doc/tm.texi: Regenerate. * multiple_target.cc (create_dispatcher_calls): Remove redundant is_function_default_version check. (expand_target_clones): Use target hook for attribute name. * targhooks.cc (default_target_option_valid_version_attribute_p): New. * targhooks.h (default_target_option_valid_version_attribute_p): New. * tree.h (DECL_FUNCTION_VERSIONED): Update comment to include target_version attributes. gcc/cp/ChangeLog: * decl2.cc (check_classfn): Update comment to include target_version attributes. diff --git a/gcc/attribs.cc b/gcc/attribs.cc index b1300018d1e8ed8e02ded1ea721dc192a6d32a49..a3c4a81e8582ea4fd06b9518bf51fad7c998ddd6 100644 --- a/gcc/attribs.cc +++ b/gcc/attribs.cc @@ -1233,8 +1233,9 @@ make_dispatcher_decl (const tree decl) return func_decl; } -/* Returns true if decl is multi-versioned and DECL is the default function, - that is it is not tagged with target specific optimization. */ +/* Returns true if DECL is multi-versioned using the target attribute, and this + is the default version. This function can only be used for targets that do + not support the "target_version" attribute. 
*/ bool is_function_default_version (const tree decl) diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc index 072cfb69147bd6b314459c0bd48a0c1fb92d3e4d..1a224c036277d51ab4dc0d33a403177bd226e48a 100644 --- a/gcc/c-family/c-attribs.cc +++ b/gcc/c-family/c-attribs.cc @@ -148,6 +148,7 @@ static tree handle_alloc_align_attribute (tree *, tree, tree, int, bool *); static tree handle_assume_aligned_attribute (tree *, tree, tree, int, bool *); static tree handle_assume_attribute (tree *, tree, tree, int, bool *); static tree handle_target_attribute (tree *, tree, tree, int, bool *); +static tree handle_target_version_attribute (tree *, tree, tree, int, bool *); static tree handle_target_clones_attribute (tree *, tree, tree, int, bool *); static tree handle_optimize_attribute (tree *, tree, tree, int, bool *); static tree ignore_attribute (tree *, tree, tree, int, bool *); @@ -480,6 +481,8 @@ const struct attribute_spec c_common_attribute_table[] = handle_error_attribute, NULL }, { "target", 1, -1, true, false, false, false, handle_target_attribute, NULL }, + { "target_version", 1, -1, true, false, false, false, + handle_target_version_attribute, NULL }, { "target_clones", 1, -1, true, false, false, false, handle_target_clones_attribute, NULL }, { "optimize", 1, -1, true, false, false, false, @@ -5569,6 +5572,45 @@ handle_target_attribute (tree *node, tree name, tree args, int flags, return NULL_TREE; } +/* Handle a "target_version" attribute. */ + +static tree +handle_target_version_attribute (tree *node, tree name, tree args, int flags, + bool *no_add_attrs) +{ + /* Ensure we have a function type. */ + if (TREE_CODE (*node) != FUNCTION_DECL) +{ + warning (OPT_Wattributes, "%qE attribute ignored", name); + *no_add_attrs = true; +} + else if (lookup_attribute ("target_clones", DECL_ATTRIBUTES (*node))) +{ + warning (OPT_Wattributes, "%qE attribute ignored due to conflict " + "with %qs attribute
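As a usage sketch to accompany the description above (my own illustration, not taken from the patch; it requires the C++ front end, the only one wired up here, and the "sve" feature string is a placeholder from the ACLE draft rather than something this middle-end patch defines):

/* Each annotated definition is one version; "default" marks the
   fallback.  Calls to get_lane_count are dispatched through an
   ifunc resolver generated by the target.  */
__attribute__ ((target_version ("default")))
int get_lane_count (void) { return 4; }

__attribute__ ((target_version ("sve")))
int get_lane_count (void) { return 8; }

int use (void) { return get_lane_count (); }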
[2/3] [aarch64] Add function multiversioning support
This adds initial support for function multiversion on aarch64 using the target_version and target_clones attributes. This mostly follows the Beta specification in the ACLE [1], with a few diffences that remain to be fixed: - Symbol mangling for target_clones differs from that for target_version and does not match the mangling specified in the ACLE. This inconsistency is also present in i386 and rs6000 mangling. - The target_clones attribute does not currently support an implicit "default" version. - Unrecognised target names in a target_clones attribute should be ignored (with an optional warning), but currently cause an error to be raised instead. - There is no option to disable function multiversioning at compile time. - There is no support for function multiversioning in C, since this is not yet enabled in the frontend. On the other hand, this patch happens to enable multiversioning in Ada and D as well, using their existing frontend support. This patch relies on adding functionality to libgcc, to support: - struct { unsigned long long features; } __aarch64_cpu_features; - void __init_cpu_features (void); - void __init_cpu_features_resolver (unsigned long hwcap, const __ifunc_arg_t *arg); This support matches the interface currently used in LLVM's compiler-rt, and will be implemented in a future patch (which will be merged before merging this patch). This version of the patch incorrectly uses __init_cpu_features in the ifunc resolvers, which could lead to invalid library calls at load time. I will fix this to use __init_cpu_features_resolver in a future version of the patch. [1] https://github.com/ARM-software/acle/blob/main/main/acle.md#function-multi-versioning gcc/ChangeLog: * attribs.cc (decl_attributes): Pass attribute name to target hook. * config/aarch64/aarch64.cc (aarch64_process_target_version_attr): New. (aarch64_option_valid_attribute_p): Add check and support for target_version attribute. (enum CPUFeatures): New list of for bitmask positions. (aarch64_fmv_feature_data): New. (get_feature_bit): New. (get_feature_mask_for_version): New. (compare_feature_masks): New. (aarch64_compare_version_priority): New. (make_resolver_func): New. (add_condition_to_bb): New. (compare_feature_version_info): New. (dispatch_function_versions): New. (aarch64_generate_version_dispatcher_body): New. (aarch64_get_function_versions_dispatcher): New. (aarch64_common_function_versions): New. (aarch64_mangle_decl_assembler_name): New. (TARGET_OPTION_VALID_VERSION_ATTRIBUTE_P): New implementation. (TARGET_OPTION_EXPANDED_CLONES_ATTRIBUTE): New implementation. (TARGET_OPTION_FUNCTION_VERSIONS): New implementation. (TARGET_COMPARE_VERSION_PRIORITY): New implementation. (TARGET_GENERATE_VERSION_DISPATCHER_BODY): New implementation. (TARGET_GET_FUNCTION_VERSIONS_DISPATCHER): New implementation. (TARGET_MANGLE_DECL_ASSEMBLER_NAME): New implementation. diff --git a/gcc/attribs.cc b/gcc/attribs.cc index a3c4a81e8582ea4fd06b9518bf51fad7c998ddd6..cc935b502028392ebdc105f940900f01f79196a7 100644 --- a/gcc/attribs.cc +++ b/gcc/attribs.cc @@ -657,7 +657,8 @@ decl_attributes (tree *node, tree attributes, int flags, options to the attribute((target(...))) list. 
*/ if (TREE_CODE (*node) == FUNCTION_DECL && current_target_pragma - && targetm.target_option.valid_attribute_p (*node, NULL_TREE, + && targetm.target_option.valid_attribute_p (*node, + get_identifier("target"), current_target_pragma, 0)) { tree cur_attr = lookup_attribute ("target", attributes); diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 9c3c0e705e2e6ea3b55b4a5f1e7d3360f91eb51d..ca0e2a2507ffdbf99e17b77240504bf2d175b9c0 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -19088,11 +19088,70 @@ aarch64_process_target_attr (tree args) return true; } +/* Parse the tree in ARGS that contains the targeti_version attribute + information and update the global target options space. */ + +bool +aarch64_process_target_version_attr (tree args) +{ + if (TREE_CODE (args) == TREE_LIST) +{ + if (TREE_CHAIN (args)) + { + error ("attribute % has multiple values"); + return false; + } + args = TREE_VALUE (args); +} + + if (!args || TREE_CODE (args) != STRING_CST) +{ + error ("attribute % argument not a string"); + return false; +} + + const char *str = TREE_STRING_POINTER (args); + if (strcmp (str, "default") == 0) +return true; + + auto with_plus = std::string ("+") + str; + enum aarch_parse_opt_result parse_res; + auto isa_flags
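For comparison with the target_version example in 1/3, a target_clones usage sketch (again my own illustration; the feature names follow the ACLE document linked above, and "default" is spelled out explicitly because, as noted, an implicit default clone is not yet supported):

/* One definition, cloned per feature set; the generated ifunc resolver
   consults __aarch64_cpu_features (initialised via the libgcc hooks
   described above) to pick a clone at load time.  */
__attribute__ ((target_clones ("default", "dotprod", "sve")))
int dot (const signed char *a, const signed char *b, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += a[i] * b[i];
  return s;
}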
[3/3] WIP/RFC: Fix name mangling for target_clones
This is a partial patch to make the mangling of function version names for target_clones match those generated using the target or target_version attributes. It modifies the name of function versions, but does not yet rename the resolved symbol, resulting in a duplicate symbol name (and an error at assembly time). Is this sort of approach ok? Should I create an extra target hook to be called here, so that the target_clones mangling can be target-specific but not necessarily the same as for target attribute versioning? diff --git a/gcc/cgraphclones.cc b/gcc/cgraphclones.cc index 8af6b23d8c0306920e0fdcb3559ef047a16689f4..15672c02c6f9d6043a36bf081067f08d1ab834e5 100644 --- a/gcc/cgraphclones.cc +++ b/gcc/cgraphclones.cc @@ -1033,11 +1033,6 @@ cgraph_node::create_version_clone_with_body else new_decl = copy_node (old_decl); - /* Generate a new name for the new version. */ - tree fnname = (version_decl ? clone_function_name_numbered (old_decl, suffix) - : clone_function_name (old_decl, suffix)); - DECL_NAME (new_decl) = fnname; - SET_DECL_ASSEMBLER_NAME (new_decl, fnname); SET_DECL_RTL (new_decl, NULL); DECL_VIRTUAL_P (new_decl) = 0; @@ -1065,6 +1060,24 @@ cgraph_node::create_version_clone_with_body return NULL; } + /* Generate a new name for the new version. */ + if (version_decl) +{ + tree fnname = (clone_function_name_numbered (old_decl, suffix)); + DECL_NAME (new_decl) = fnname; + SET_DECL_ASSEMBLER_NAME (new_decl, fnname); +} + else +{ + /* Add target version mangling. We assume that the target hook will +produce the same mangled name as it would have produced if the decl +had already been versioned when the hook was previously called. */ + tree fnname = DECL_ASSEMBLER_NAME (old_decl); + DECL_NAME (new_decl) = fnname; + fnname = targetm.mangle_decl_assembler_name (new_decl, fnname); + SET_DECL_ASSEMBLER_NAME (new_decl, fnname); +} + /* When the old decl was a con-/destructor make sure the clone isn't. */ DECL_STATIC_CONSTRUCTOR (new_decl) = 0; DECL_STATIC_DESTRUCTOR (new_decl) = 0; diff --git a/gcc/multiple_target.cc b/gcc/multiple_target.cc index 3db57c2b13d612a37240d9dcf58ad21b2286633c..d9aec9a5ab532701b4a1877b440f3a553ffa28e2 100644 --- a/gcc/multiple_target.cc +++ b/gcc/multiple_target.cc @@ -162,7 +162,12 @@ create_dispatcher_calls (struct cgraph_node *node) } } - tree fname = clone_function_name (node->decl, "default"); + /* Add version mangling to default decl name. We assume that the target + hook will produce the same mangled name as it would have produced if the + decl had already been versioned when the hook was previously called. */ + tree fname = DECL_ASSEMBLER_NAME (node->decl); + DECL_NAME (node->decl) = fname; + fname = targetm.mangle_decl_assembler_name (node->decl, fname); symtab->change_decl_assembler_name (node->decl, fname); if (node->definition)
[COMMITTED] Fix expansion of `(a & 2) != 1`
I had a thinko in r14-1600-ge60593f3881c72a96a3fa4844d73e8a2cd14f670 where we would remove the `& CST` part if we ended up not calling expand_single_bit_test. This fixes the problem by introducing a new variable that will be used for calling expand_single_bit_test. As far as I know this can only show up when disabling optimization passes, since the above form would otherwise have been optimized away. Committed as obvious after a bootstrap/test on x86_64-linux-gnu. PR middle-end/111863 gcc/ChangeLog: * expr.cc (do_store_flag): Don't overwrite arg0 when stripping off `& POW2`. gcc/testsuite/ChangeLog: * gcc.c-torture/execute/pr111863-1.c: New test. --- gcc/expr.cc | 9 + gcc/testsuite/gcc.c-torture/execute/pr111863-1.c | 16 2 files changed, 21 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr111863-1.c diff --git a/gcc/expr.cc b/gcc/expr.cc index 8aed3fc6cbe..763bd82c59f 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -13206,14 +13206,15 @@ do_store_flag (sepops ops, rtx target, machine_mode mode) || integer_pow2p (arg1)) && (TYPE_PRECISION (ops->type) != 1 || TYPE_UNSIGNED (ops->type))) { - wide_int nz = tree_nonzero_bits (arg0); - gimple *srcstmt = get_def_for_expr (arg0, BIT_AND_EXPR); + tree narg0 = arg0; + wide_int nz = tree_nonzero_bits (narg0); + gimple *srcstmt = get_def_for_expr (narg0, BIT_AND_EXPR); /* If the defining statement was (x & POW2), then use that instead of the non-zero bits. */ if (srcstmt && integer_pow2p (gimple_assign_rhs2 (srcstmt))) { nz = wi::to_wide (gimple_assign_rhs2 (srcstmt)); - arg0 = gimple_assign_rhs1 (srcstmt); + narg0 = gimple_assign_rhs1 (srcstmt); } if (wi::popcount (nz) == 1 @@ -13227,7 +13228,7 @@ do_store_flag (sepops ops, rtx target, machine_mode mode) type = lang_hooks.types.type_for_mode (mode, unsignedp); return expand_single_bit_test (loc, tcode, -arg0, +narg0, bitnum, type, target, mode); } } diff --git a/gcc/testsuite/gcc.c-torture/execute/pr111863-1.c b/gcc/testsuite/gcc.c-torture/execute/pr111863-1.c new file mode 100644 index 000..4e27fe631b2 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/execute/pr111863-1.c @@ -0,0 +1,16 @@ +/* { dg-options " -fno-tree-ccp -fno-tree-dominator-opts -fno-tree-vrp" } */ + +__attribute__((noipa)) +int f(int a) +{ +a &= 2; +return a != 1; +} +int main(void) +{ +int t = f(1); +if (!t) +__builtin_abort(); +__builtin_printf("%d\n",t); +return 0; +} -- 2.39.3
[PATCH] aarch64: [PR110986] Emit csinv again for `a ? ~b : b`
After r14-3110-g7fb65f10285, the canonical form for `a ? ~b : b` changed to be `-(a) ^ b` that means for aarch64 we need to add a few new insn patterns to be able to catch this and change it to be what is the canonical form for the aarch64 backend. A secondary pattern was needed to support a zero_extended form too; this adds a testcase for all 3 cases. Bootstrapped and tested on aarch64-linux-gnu with no regressions. PR target/110986 gcc/ChangeLog: * config/aarch64/aarch64.md (*cmov_insn_insv): New pattern. (*cmov_uxtw_insn_insv): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/cond_op-1.c: New test. --- gcc/config/aarch64/aarch64.md| 46 gcc/testsuite/gcc.target/aarch64/cond_op-1.c | 20 + 2 files changed, 66 insertions(+) create mode 100644 gcc/testsuite/gcc.target/aarch64/cond_op-1.c diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 32c7adc8928..59cd0415937 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -4413,6 +4413,52 @@ (define_insn "*csinv3_uxtw_insn3" [(set_attr "type" "csel")] ) +;; There are two canonical forms for `cmp ? ~a : a`. +;; This is the second form and is here to help combine. +;; Support `-(cmp) ^ a` into `cmp ? ~a : a` +;; The second pattern is to support the zero extend'ed version. + +(define_insn_and_split "*cmov_insn_insv" + [(set (match_operand:GPI 0 "register_operand" "=r") +(xor:GPI +(neg:GPI + (match_operator:GPI 1 "aarch64_comparison_operator" + [(match_operand 2 "cc_register" "") (const_int 0)])) +(match_operand:GPI 3 "general_operand" "r")))] + "can_create_pseudo_p ()" + "#" + "&& true" + [(set (match_dup 0) + (if_then_else:GPI (match_dup 1) + (not:GPI (match_dup 3)) + (match_dup 3)))] + { +operands[3] = force_reg (mode, operands[3]); + } + [(set_attr "type" "csel")] +) + +(define_insn_and_split "*cmov_uxtw_insn_insv" + [(set (match_operand:DI 0 "register_operand" "=r") +(zero_extend:DI +(xor:SI + (neg:SI + (match_operator:SI 1 "aarch64_comparison_operator" + [(match_operand 2 "cc_register" "") (const_int 0)])) + (match_operand:SI 3 "general_operand" "r"] + "can_create_pseudo_p ()" + "#" + "&& true" + [(set (match_dup 0) + (if_then_else:DI (match_dup 1) + (zero_extend:DI (not:SI (match_dup 3))) + (zero_extend:DI (match_dup 3] + { +operands[3] = force_reg (SImode, operands[3]); + } + [(set_attr "type" "csel")] +) + ;; If X can be loaded by a single CNT[BHWD] instruction, ;; ;;A = UMAX (B, X) diff --git a/gcc/testsuite/gcc.target/aarch64/cond_op-1.c b/gcc/testsuite/gcc.target/aarch64/cond_op-1.c new file mode 100644 index 000..e6c7821127e --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/cond_op-1.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ +/* PR target/110986 */ + + +long long full(unsigned a, unsigned b) +{ + return a ? ~b : b; +} +unsigned fuu(unsigned a, unsigned b) +{ + return a ? ~b : b; +} +long long f(unsigned long long a, unsigned long long b) +{ + return a ? ~b : b; +} + +/* { dg-final { scan-assembler-times "csinv\tw\[0-9\]*" 2 } } */ +/* { dg-final { scan-assembler-times "csinv\tx\[0-9\]*" 1 } } */ -- 2.39.3
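A short check of the equivalence the split relies on (my own sketch, not part of the patch): for a condition c that is 0 or 1, -(c) is either all zeros or all ones, so xor'ing it with b yields b or ~b respectively, which is exactly the select that csinv performs:

/* Sketch: the canonical form after r14-3110 versus the conditional
   select the new patterns turn it back into.  */
int forms_agree (unsigned a, unsigned b)
{
  unsigned c = (a != 0);            /* one-bit condition             */
  unsigned via_xor = (0u - c) ^ b;  /* -(a != 0) ^ b, canonical form */
  unsigned via_sel = c ? ~b : b;    /* what csinv implements         */
  return via_xor == via_sel;        /* always 1                      */
}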
[committed] amdgcn: deprecate Fiji device and multilib
The build has been failing for the last few days because LLVM removed support for the HSACOv3 binary metadata format, which we were still using for the Fiji multilib. The LLVM commit has now been reverted (thank you Pierre van Houtryve), but it's only a temporary repreive. This patch removes Fiji from the default configuration, and updates the documentation accordingly, but no more. Those that still use Fiji devices can re-enable it by configuring using --with-arch=fiji. Why not remove Fiji support entirely? This is simply because about one third of our test farm conists of Fiji devices and we can't replace them quickly. Andrewamdgcn: deprecate Fiji device and multilib LLVM wants to remove it, which breaks our build. This patch means that most users won't notice that change, when it comes, and those that do will have chosen to enable Fiji explicitly. I'm selecting gfx900 as the new default as that's the least likely for users to want, which means most users will specify -march explicitly, which means we'll be free to change the default again, when we need to, without breaking anybody's makefiles. gcc/ChangeLog: * config.gcc (amdgcn): Switch default to --with-arch=gfx900. Implement support for --with-multilib-list. * config/gcn/t-gcn-hsa: Likewise. * doc/install.texi: Likewise. * doc/invoke.texi: Mark Fiji deprecated. diff --git a/gcc/config.gcc b/gcc/config.gcc index 37311fcd075..9c397156868 100644 --- a/gcc/config.gcc +++ b/gcc/config.gcc @@ -4538,7 +4538,19 @@ case "${target}" in ;; esac done - [ "x$with_arch" = x ] && with_arch=fiji + [ "x$with_arch" = x ] && with_arch=gfx900 + + case "x${with_multilib_list}" in + x | xno) + TM_MULTILIB_CONFIG= + ;; + xdefault | xyes) + TM_MULTILIB_CONFIG=`echo "gfx900,gfx906,gfx908,gfx90a" | sed "s/${with_arch},\?//;s/,$//"` + ;; + *) + TM_MULTILIB_CONFIG="${with_multilib_list}" + ;; + esac ;; hppa*-*-*) diff --git a/gcc/config/gcn/t-gcn-hsa b/gcc/config/gcn/t-gcn-hsa index ea27122e484..18db7075356 100644 --- a/gcc/config/gcn/t-gcn-hsa +++ b/gcc/config/gcn/t-gcn-hsa @@ -42,8 +42,12 @@ ALL_HOST_OBJS += gcn-run.o gcn-run$(exeext): gcn-run.o +$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ $< -ldl -MULTILIB_OPTIONS = march=gfx900/march=gfx906/march=gfx908/march=gfx90a -MULTILIB_DIRNAMES = gfx900 gfx906 gfx908 gfx90a +empty := +space := $(empty) $(empty) +comma := , +multilib_list := $(subst $(comma),$(space),$(TM_MULTILIB_CONFIG)) +MULTILIB_OPTIONS = $(subst $(space),/,$(addprefix march=,$(multilib_list))) +MULTILIB_DIRNAMES = $(multilib_list) gcn-tree.o: $(srcdir)/config/gcn/gcn-tree.cc $(COMPILE) $< diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi index 31f2234640f..4035e8020b2 100644 --- a/gcc/doc/install.texi +++ b/gcc/doc/install.texi @@ -1236,8 +1236,8 @@ sysv, aix. @itemx --without-multilib-list Specify what multilibs to build. @var{list} is a comma separated list of values, possibly consisting of a single value. Currently only implemented -for aarch64*-*-*, arm*-*-*, loongarch*-*-*, riscv*-*-*, sh*-*-* and -x86-64-*-linux*. The accepted values and meaning for each target is given +for aarch64*-*-*, amdgcn*-*-*, arm*-*-*, loongarch*-*-*, riscv*-*-*, sh*-*-* +and x86-64-*-linux*. The accepted values and meaning for each target is given below. @table @code @@ -1250,6 +1250,15 @@ default run-time library will be built. If @var{list} is default set of libraries is selected based on the value of @option{--target}. 
+@item amdgcn*-*-* +@var{list} is a comma separated list of ISA names (allowed values: @code{fiji}, +@code{gfx900}, @code{gfx906}, @code{gfx908}, @code{gfx90a}). It ought not +include the name of the default ISA, specified via @option{--with-arch}. If +@var{list} is empty, then there will be no multilibs and only the default +run-time library will be built. If @var{list} is @code{default} or +@option{--with-multilib-list=} is not specified, then the default set of +libraries is selected. + @item arm*-*-* @var{list} is a comma separated list of @code{aprofile} and @code{rmprofile} to build multilibs for A or R and M architecture @@ -3922,6 +3931,12 @@ To run the binaries, install the HSA Runtime from the @file{libexec/gcc/amdhsa-amdhsa/@var{version}/gcn-run} to launch them on the GPU. +To enable support for GCN3 Fiji devices (gfx803), GCC has to be configured with +@option{--with-arch=@code{fiji}} or +@option{--with-multilib-list=@code{fiji},...}. Note that support for Fiji +devices has been removed in ROCm 4.0 and support in LLVM is deprecated and will +be removed in the future. + @html @end html diff --git a/gcc/doc/i
[PATCH] wwwdocs: gcc-14: mark amdgcn fiji deprecated
OK to commit? Andrew gcc-14: mark amdgcn fiji deprecated diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html index c817dde4..91ab8132 100644 --- a/htdocs/gcc-14/changes.html +++ b/htdocs/gcc-14/changes.html @@ -178,6 +178,16 @@ a work-in-progress. +AMD Radeon (GCN) + + + The Fiji device support is now deprecated and will be removed from a + future release. The default compiler configuration no longer uses Fiji + as the default device, and no longer includes the Fiji libraries. Both + can be restored by configuring with --with-arch=fiji. + The default device architecture is now gfx900 (Vega). + +
[PATCH] c: [PR104822] Don't warn about converting NULL to different sso endian
In a similar way to how we don't warn about NULL pointer constant conversion to a different named address space, we should not warn about conversion to a different sso endianness either. This adds the simple check. Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR c/104822 gcc/c/ChangeLog: * c-typeck.cc (convert_for_assignment): Check for null pointer before warning about an incompatible scalar storage order. gcc/testsuite/ChangeLog: * gcc.dg/sso-18.c: New test. --- gcc/c/c-typeck.cc | 1 + gcc/testsuite/gcc.dg/sso-18.c | 16 2 files changed, 17 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/sso-18.c diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc index 6e044b4afbc..f39dc71d593 100644 --- a/gcc/c/c-typeck.cc +++ b/gcc/c/c-typeck.cc @@ -7449,6 +7449,7 @@ convert_for_assignment (location_t location, location_t expr_loc, tree type, /* See if the pointers point to incompatible scalar storage orders. */ if (warn_scalar_storage_order + && !null_pointer_constant_p (rhs) && (AGGREGATE_TYPE_P (ttl) && TYPE_REVERSE_STORAGE_ORDER (ttl)) != (AGGREGATE_TYPE_P (ttr) && TYPE_REVERSE_STORAGE_ORDER (ttr))) { diff --git a/gcc/testsuite/gcc.dg/sso-18.c b/gcc/testsuite/gcc.dg/sso-18.c new file mode 100644 index 000..799a0c858f2 --- /dev/null +++ b/gcc/testsuite/gcc.dg/sso-18.c @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* PR c/104822 */ + +#include + +struct Sb { + int i; +} __attribute__((scalar_storage_order("big-endian"))); +struct Sl { + int i; +} __attribute__((scalar_storage_order("little-endian"))); + +/* Neither of these should warn about incompatible scalar storage order + as NULL pointers are compatible with either endianness. */ +struct Sb *pb = NULL; /* { dg-bogus "" } */ +struct Sl *pl = NULL; /* { dg-bogus "" } */ -- 2.39.3
[PATCH] c: [PR100532] Fix ICE when an argument was an error mark
In the case of convert_argument, we would return the same expression back rather than error_mark_node after the error message about trying to convert to an incomplete type. This causes issues in the gimplifier trying to see if another conversion is needed. The code here dates back to before the revision history, so it may be that nobody ever noticed we should return error_mark_node. Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR c/100532 gcc/c/ChangeLog: * c-typeck.cc (convert_argument): After erroring out about an incomplete type return error_mark_node. gcc/testsuite/ChangeLog: * gcc.dg/pr100532-1.c: New test. --- gcc/c/c-typeck.cc | 2 +- gcc/testsuite/gcc.dg/pr100532-1.c | 7 +++ 2 files changed, 8 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.dg/pr100532-1.c diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc index 6e044b4afbc..8f8562936dc 100644 --- a/gcc/c/c-typeck.cc +++ b/gcc/c/c-typeck.cc @@ -3367,7 +3367,7 @@ convert_argument (location_t ploc, tree function, tree fundecl, { error_at (ploc, "type of formal parameter %d is incomplete", parmnum + 1); - return val; + return error_mark_node; } /* Optionally warn about conversions that differ from the default diff --git a/gcc/testsuite/gcc.dg/pr100532-1.c b/gcc/testsuite/gcc.dg/pr100532-1.c new file mode 100644 index 000..81e37c60415 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr100532-1.c @@ -0,0 +1,7 @@ +/* { dg-do compile } */ +/* PR c/100532 */ + +typedef __SIZE_TYPE__ size_t; +void *memcpy(void[], const void *, size_t); /* { dg-error "declaration of type name" } */ +void c(void) { memcpy(c, "a", 2); } /* { dg-error "type of formal parameter" } */ + -- 2.34.1
Re: [1/3] Add support for target_version attribute
On Thu, Oct 19, 2023 at 07:04:09AM +, Richard Biener wrote: > On Wed, 18 Oct 2023, Andrew Carlotti wrote: > > > This patch adds support for the "target_version" attribute to the middle > > end and the C++ frontend, which will be used to implement function > > multiversioning in the aarch64 backend. > > > > Note that C++ is currently the only frontend which supports > > multiversioning using the "target" attribute, whereas the > > "target_clones" attribute is additionally supported in C, D and Ada. > > Support for the target_version attribute will be extended to C at a > > later date. > > > > Targets that currently use the "target" attribute for function > > multiversioning (i.e. i386 and rs6000) are not affected by this patch. > > > > > > I could have implemented the target hooks slightly differently, by reusing > > the > > valid_attribute_p hook and adding attribute name checks to each backend > > implementation (c.f. the aarch64 implementation in patch 2/3). Would this > > be > > preferable? > > > > Otherwise, is this ok for master? > > This lacks user-level documentation in doc/extend.texi (where > target_clones is documented). Good point. I'll add documentation updates as a separate patch in the series (rather than documenting the state after this patch, in which the attribute is supported on zero targets). I think the existing documentation for target and target_clones needs some improvement as well. > Was there any discussion/description of why target_clones cannot > be made work for aarch64? > > Richard. The second patch in this series does include support for target_clones on aarch64. However, the support in that patch is not fully compliant with our ACLE specification. I also have some unresolved questions about the correctness of current function multiversioning implementations using ifuncs across translation units, which could affect how we want to implement it for aarch64. Andrew > > > > gcc/c-family/ChangeLog: > > > > * c-attribs.cc (handle_target_version_attribute): New. > > (c_common_attribute_table): Add target_version. > > (handle_target_clones_attribute): Add conflict with > > target_version attribute. > > > > gcc/ChangeLog: > > > > * attribs.cc (is_function_default_version): Update comment to > > specify incompatibility with target_version attributes. > > * cgraphclones.cc (cgraph_node::create_version_clone_with_body): > > Call valid_version_attribute_p for target_version attributes. > > * target.def (valid_version_attribute_p): New hook. > > (expanded_clones_attribute): New hook. > > * doc/tm.texi.in: Add new hooks. > > * doc/tm.texi: Regenerate. > > * multiple_target.cc (create_dispatcher_calls): Remove redundant > > is_function_default_version check. > > (expand_target_clones): Use target hook for attribute name. > > * targhooks.cc (default_target_option_valid_version_attribute_p): > > New. > > * targhooks.h (default_target_option_valid_version_attribute_p): > > New. > > * tree.h (DECL_FUNCTION_VERSIONED): Update comment to include > > target_version attributes. > > > > gcc/cp/ChangeLog: > > > > * decl2.cc (check_classfn): Update comment to include > > target_version attributes. 
> > > > > > diff --git a/gcc/attribs.cc b/gcc/attribs.cc > > index > > b1300018d1e8ed8e02ded1ea721dc192a6d32a49..a3c4a81e8582ea4fd06b9518bf51fad7c998ddd6 > > 100644 > > --- a/gcc/attribs.cc > > +++ b/gcc/attribs.cc > > @@ -1233,8 +1233,9 @@ make_dispatcher_decl (const tree decl) > >return func_decl; > > } > > > > -/* Returns true if decl is multi-versioned and DECL is the default > > function, > > - that is it is not tagged with target specific optimization. */ > > +/* Returns true if DECL is multi-versioned using the target attribute, and > > this > > + is the default version. This function can only be used for targets > > that do > > + not support the "target_version" attribute. */ > > > > bool > > is_function_default_version (const tree decl) > > diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc > > index > > 072cfb69147bd6b314459c0bd48a0c1fb92d3e4d..1a224c036277d51ab4dc0d33a403177bd226e48a > > 100644 > > --- a/gcc/c-family/c-attribs.cc > > +++ b/gcc/c-family/c-attribs.cc > > @@ -148,6 +148,7 @@ static tree handle_alloc_align_attribute (tree *, tree,
Re: [PATCH] [ARC] Add support for HS4x cpus.
* Claudiu Zissulescu [2018-06-13 12:09:18 +0300]: > From: Claudiu Zissulescu > > This patch adds support for two ARCHS variations. > > Ok to apply? > Claudiu Sorry for the delay, this looks fine. Thanks, Andrew > > gcc/ > 2017-03-10 Claudiu Zissulescu > > * config/arc/arc-arch.h (arc_tune_attr): Add new tune parameters > for ARCHS4x. > * config/arc/arc-cpus.def (hs4x): New cpu. > (hs4xd): Likewise. > * config/arc/arc-tables.opt: Regenerate. > * config/arc/arc.c (arc_sched_issue_rate): New function. > (TARGET_SCHED_ISSUE_RATE): Define. > (TARGET_SCHED_EXPOSED_PIPELINE): Likewise. > * config/arc/arc.md (attr type): Add fpu_fuse, fpu_sdiv, fpu_ddiv, > fpu_cvt. > (attr tune): Add ARCHS4x tune values. > (attr tune_dspmpy): Define. > (*tst): Correct instruction type. > * config/arc/arcHS.md: Don't use this automaton for ARCHS4x cpus. > * config/arc/arcHS4x.md: New file. > * config/arc/fpu.md: Update instruction type attributes. > * config/arc/t-multilib: Regenerate. > --- > gcc/config/arc/arc-arch.h | 5 +- > gcc/config/arc/arc-cpus.def | 8 +- > gcc/config/arc/arc-tables.opt | 6 + > gcc/config/arc/arc.c | 19 +++ > gcc/config/arc/arc.md | 24 +++- > gcc/config/arc/arcHS.md | 6 + > gcc/config/arc/arcHS4x.md | 221 ++ > gcc/config/arc/fpu.md | 16 +-- > 8 files changed, 289 insertions(+), 16 deletions(-) > create mode 100644 gcc/config/arc/arcHS4x.md > > diff --git a/gcc/config/arc/arc-arch.h b/gcc/config/arc/arc-arch.h > index 64866dd529b..01f95946623 100644 > --- a/gcc/config/arc/arc-arch.h > +++ b/gcc/config/arc/arc-arch.h > @@ -73,7 +73,10 @@ enum arc_tune_attr > ARC_TUNE_ARC600, > ARC_TUNE_ARC700_4_2_STD, > ARC_TUNE_ARC700_4_2_XMAC, > -ARC_TUNE_CORE_3 > +ARC_TUNE_CORE_3, > +ARC_TUNE_ARCHS4X, > +ARC_TUNE_ARCHS4XD, > +ARC_TUNE_ARCHS4XD_SLOW >}; > > /* CPU specific properties. 
*/ > diff --git a/gcc/config/arc/arc-cpus.def b/gcc/config/arc/arc-cpus.def > index 1fce81f6933..4aa422f1a39 100644 > --- a/gcc/config/arc/arc-cpus.def > +++ b/gcc/config/arc/arc-cpus.def > @@ -59,10 +59,12 @@ ARC_CPU (archs,hs, FL_MPYOPT_2|FL_DIVREM|FL_LL64, > NONE) > ARC_CPU (hs34,hs, FL_MPYOPT_2, NONE) > ARC_CPU (hs38,hs, FL_MPYOPT_9|FL_DIVREM|FL_LL64, NONE) > ARC_CPU (hs38_linux, hs, FL_MPYOPT_9|FL_DIVREM|FL_LL64|FL_FPU_FPUD_ALL, NONE) > +ARC_CPU (hs4x, hs, FL_MPYOPT_9|FL_DIVREM|FL_LL64, ARCHS4X) > +ARC_CPU (hs4xd, hs, FL_MPYOPT_9|FL_DIVREM|FL_LL64, ARCHS4XD) > > -ARC_CPU (arc600, 6xx, FL_BS, ARC600) > -ARC_CPU (arc600_norm, 6xx, FL_BS|FL_NORM, ARC600) > -ARC_CPU (arc600_mul64, 6xx, FL_BS|FL_NORM|FL_MUL64, ARC600) > +ARC_CPU (arc600, 6xx, FL_BS, ARC600) > +ARC_CPU (arc600_norm, 6xx, FL_BS|FL_NORM, ARC600) > +ARC_CPU (arc600_mul64,6xx, FL_BS|FL_NORM|FL_MUL64, ARC600) > ARC_CPU (arc600_mul32x16, 6xx, FL_BS|FL_NORM|FL_MUL32x16, ARC600) > ARC_CPU (arc601, 6xx, 0, ARC600) > ARC_CPU (arc601_norm, 6xx, FL_NORM, ARC600) > diff --git a/gcc/config/arc/arc-tables.opt b/gcc/config/arc/arc-tables.opt > index 3b17b3de7d5..2afaf5bd83c 100644 > --- a/gcc/config/arc/arc-tables.opt > +++ b/gcc/config/arc/arc-tables.opt > @@ -63,6 +63,12 @@ Enum(processor_type) String(hs38) Value(PROCESSOR_hs38) > EnumValue > Enum(processor_type) String(hs38_linux) Value(PROCESSOR_hs38_linux) > > +EnumValue > +Enum(processor_type) String(hs4x) Value(PROCESSOR_hs4x) > + > +EnumValue > +Enum(processor_type) String(hs4xd) Value(PROCESSOR_hs4xd) > + > EnumValue > Enum(processor_type) String(arc600) Value(PROCESSOR_arc600) > > diff --git a/gcc/config/arc/arc.c b/gcc/config/arc/arc.c > index 2bedc9af37e..03a2f4223c0 100644 > --- a/gcc/config/arc/arc.c > +++ b/gcc/config/arc/arc.c > @@ -483,6 +483,22 @@ arc_autovectorize_vector_sizes (vector_sizes *sizes) > } > } > > + > +/* Implements target hook TARGET_SCHED_ISSUE_RATE. */ > +static int > +arc_sched_issue_rate (void) > +{ > + switch (arc_tune) > +{ > +case TUNE_ARCHS4X: > +case TUNE_ARCHS4XD: > + return 3; > +default: > + break; > +} > + return 1; > +} > + > /* TARGET_PRESERVE_RELOAD_P is still awaiting patch re-evaluation / review. > */ > static bool arc_preserve_reload_p (rtx in) ATTRIBUTE_UNUSED; > static rtx arc_delegitimize_address (rtx); > @@ -565,6 +581,9 @@ static rtx arc_legitimize_address_0 (rtx, rtx, > machine_mode mode); > #undef TARGET_SCHE
Re: [PATCH, GCC, AARCH64] Add support for +profile extension
On Mon, Jul 9, 2018 at 6:21 AM Andre Vieira (lists) wrote: > > Hi, > > This patch adds support for the Statistical Profiling Extension (SPE) on > AArch64. Even though the compiler will not generate code any differently > given this extension, it will need to pass it on to the assembler in > order to let it correctly assemble inline asm containing accesses to the > extension's system registers. The same applies when using the > preprocessor on an assembly file as this first must pass through cc1. > > I left the hwcaps string for SPE empty as the kernel does not define a > feature string for this extension. The current effect of this is that > driver will disable profile feature bit in GCC. This is OK though > because we don't, nor do we ever, enable this feature bit, as codegen is > not affect by the SPE support and more importantly the driver will still > pass the extension down to the assembler regardless. > > Boostrapped aarch64-none-linux-gnu and ran regression tests. > > Is it OK for trunk? I use a similar patch for the last year and half. Thanks, Andrew > > gcc/ChangeLog: > 2018-07-09 Andre Vieira > > * config/aarch64/aarch64-option-extensions.def: New entry for profile > extension. > * config/aarch64/aarch64.h (AARCH64_FL_PROFILE): New. > * doc/invoke.texi (aarch64-feature-modifiers): New entry for profile > extension. > > gcc/testsuite/ChangeLog: > 2018-07-09 Andre Vieira > > * gcc.target/aarch64/profile.c: New test.
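To make the use case concrete, a sketch of the kind of inline asm this enables (my own example, not from the patch; the register name comes from the Armv8.2 SPE specification and the -march string is only illustrative):

/* Compile with e.g. -march=armv8.2-a+profile so cc1/gas accept the SPE
   system register used in the asm; code generation is otherwise
   unchanged by the extension.  */
static inline unsigned long read_spe_buffer_ptr (void)
{
  unsigned long val;
  __asm__ volatile ("mrs %0, pmbptr_el1" : "=r" (val));
  return val;
}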
Re: [RFC] Fix recent popcount change is breaking
On Tue, Jul 10, 2018 at 6:14 PM Kugan Vivekanandarajah wrote: > > On 10 July 2018 at 23:17, Richard Biener wrote: > > On Tue, Jul 10, 2018 at 3:06 PM Kugan Vivekanandarajah > > wrote: > >> > >> Hi, > >> > >> Jeff told me that the recent popcount built-in detection is causing > >> kernel build issues as > >> ERROR: "__popcountsi2" > >> [drivers/net/wireless/broadcom/brcm80211/brcmfmac/brcmfmac.ko] undefined! > >> > >> I could also reproduce this. AFIK, we should check if the libfunc is > >> defined while checking popcount? > >> > >> I am testing the attached RFC patch. Is this reasonable? > > > > It doesn't work that way, all targets have this libfunc in libgcc. This > > means > > the kernel has to provide it. The only thing you could do is restrict > > replacement of CALL_EXPRs (in SCEV cprop) to those the target > > natively supports. > > How about restricting it in expression_expensive_p ? Is that what you > wanted. Attached patch does this. > Bootstrap and regression testing progressing. Seems like that should go into is_inexpensive_builtin instead which is just tested right below. Thanks, Andrew > > Thanks, > Kugan > > > > > Richard. > > > >> Thanks, > >> Kugan > >> > >> gcc/ChangeLog: > >> > >> 2018-07-10 Kugan Vivekanandarajah > >> > >> * tree-ssa-loop-niter.c (number_of_iterations_popcount): Check > >> if libfunc for popcount is available.
Re: [RFC] Fix recent popcount change is breaking
On Tue, Jul 10, 2018 at 6:35 PM Kugan Vivekanandarajah wrote: > > Hi Andrew, > > On 11 July 2018 at 11:19, Andrew Pinski wrote: > > On Tue, Jul 10, 2018 at 6:14 PM Kugan Vivekanandarajah > > wrote: > >> > >> On 10 July 2018 at 23:17, Richard Biener > >> wrote: > >> > On Tue, Jul 10, 2018 at 3:06 PM Kugan Vivekanandarajah > >> > wrote: > >> >> > >> >> Hi, > >> >> > >> >> Jeff told me that the recent popcount built-in detection is causing > >> >> kernel build issues as > >> >> ERROR: "__popcountsi2" > >> >> [drivers/net/wireless/broadcom/brcm80211/brcmfmac/brcmfmac.ko] > >> >> undefined! > >> >> > >> >> I could also reproduce this. AFIK, we should check if the libfunc is > >> >> defined while checking popcount? > >> >> > >> >> I am testing the attached RFC patch. Is this reasonable? > >> > > >> > It doesn't work that way, all targets have this libfunc in libgcc. This > >> > means > >> > the kernel has to provide it. The only thing you could do is restrict > >> > replacement of CALL_EXPRs (in SCEV cprop) to those the target > >> > natively supports. > >> > >> How about restricting it in expression_expensive_p ? Is that what you > >> wanted. Attached patch does this. > >> Bootstrap and regression testing progressing. > > > > Seems like that should go into is_inexpensive_builtin instead which > > is just tested right below. > > I hought about that. is_inexpensive_builtin is used in various other > places including some inlining decision so wasn't sure if it is the > right thing. Happy to change it if that is the right thing to do. I audited all of the users (and their users if it is used in a wrapper) and found that is_inexpensive_builtin should return false for this builtin if it is a function call in the end; there are other builtins which should be checked the similar way but I think we should not going to force you to do the similar thing for those builtins. Thanks, Andrew > > Thanks, > Kugan > > > > Thanks, > > Andrew > > > >> > >> Thanks, > >> Kugan > >> > >> > > >> > Richard. > >> > > >> >> Thanks, > >> >> Kugan > >> >> > >> >> gcc/ChangeLog: > >> >> > >> >> 2018-07-10 Kugan Vivekanandarajah > >> >> > >> >> * tree-ssa-loop-niter.c (number_of_iterations_popcount): Check > >> >> if libfunc for popcount is available.
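For context, the kind of source that triggers the problem (my own reduction, not from the thread): the new niter analysis recognises the classic bit-clearing loop and rewrites it as __builtin_popcount, which on targets without a native popcount instruction expands to a __popcountsi2 libcall, the very symbol the kernel modules are missing:

/* Sketch of the idiom number_of_iterations_popcount detects.  */
int count_bits (unsigned int x)
{
  int n = 0;
  while (x)
    {
      x &= x - 1;   /* clears the lowest set bit each iteration */
      n++;
    }
  return n;
}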
Re: [PATCH 1/4] [ARC] Add more additional register names
All the patches in this series look fine. Thanks, Andrew * Claudiu Zissulescu [2018-07-16 15:29:42 +0300]: > From: claziss > > gcc/ > 2017-06-14 Claudiu Zissulescu > > * config/arc/arc.h (ADDITIONAL_REGISTER_NAMES): Add additional > register names. > --- > gcc/config/arc/arc.h | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/gcc/config/arc/arc.h b/gcc/config/arc/arc.h > index 1780034aabe..3648314eaca 100644 > --- a/gcc/config/arc/arc.h > +++ b/gcc/config/arc/arc.h > @@ -1215,7 +1215,15 @@ extern char rname56[], rname57[], rname58[], rname59[]; > {\ >{"ilink", 29},\ >{"r29",29},\ > - {"r30",30} \ > + {"r30",30},\ > + {"r40",40},\ > + {"r41",41},\ > + {"r42",42},\ > + {"r43",43},\ > + {"r56",56},\ > + {"r57",57},\ > + {"r58",58},\ > + {"r59",59} \ > } > > /* Entry to the insn conditionalizer. */ > -- > 2.17.1 >
Re: [PATCH][AARCH64] PR target/84521 Fix frame pointer corruption with -fomit-frame-pointer with __builtin_setjmp
On Tue, Jul 31, 2018 at 2:43 PM James Greenhalgh wrote: > > On Thu, Jul 12, 2018 at 12:01:09PM -0500, Sudakshina Das wrote: > > Hi Eric > > > > On 27/06/18 12:22, Wilco Dijkstra wrote: > > > Eric Botcazou wrote: > > > > > >>> This test can easily be changed not to use optimize since it doesn't > > >>> look > > >>> like it needs it. We really need to tests these builtins properly, > > >>> otherwise they will continue to fail on most targets. > > >> > > >> As far as I can see PR target/84521 has been reported only for Aarch64 > > >> so I'd > > >> just leave the other targets alone (and avoid propagating FUD if > > >> possible). > > > > > > It's quite obvious from PR84521 that this is an issue affecting all > > > targets. > > > Adding better generic tests for __builtin_setjmp can only be a good thing. > > > > > > Wilco > > > > > > > This conversation seems to have died down and I would like to > > start it again. I would agree with Wilco's suggestion about > > keeping the test in the generic folder. I have removed the > > optimize attribute and the effect is still the same. It passes > > on AArch64 with this patch and it currently fails on x86 > > trunk (gcc version 9.0.0 20180712 (experimental) (GCC)) > > on -O1 and above. > > > I don't see where the FUD comes in here; either this builtin has a defined > semantics across targets and they are adhered to, or the builtin doesn't have > well defined semantics, or the targets fail to implement those semantics. The problem comes from the fact the builtins are not documented at all. See PR59039 for the issue on them not being documented. Thanks, Andrew > > I think this should go in as is. If other targets are unhappy with the > failing test they should fix their target or skip the test if it is not > appropriate. > > You may want to CC some of the maintainers of platforms you know to fail as > a courtesy on the PR (add your testcase, and add failing targets and their > maintainers to that PR) before committing so it doesn't come as a complete > surprise. > > This is OK with some attempt to get target maintainers involved in the > conversation before commit. > > Thanks, > James > > > diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h > > index f284e74..9792d28 100644 > > --- a/gcc/config/aarch64/aarch64.h > > +++ b/gcc/config/aarch64/aarch64.h > > @@ -473,7 +473,9 @@ extern unsigned aarch64_architecture_version; > > #define EH_RETURN_STACKADJ_RTX gen_rtx_REG (Pmode, R4_REGNUM) > > #define EH_RETURN_HANDLER_RTX aarch64_eh_return_handler_rtx () > > > > -/* Don't use __builtin_setjmp until we've defined it. */ > > +/* Don't use __builtin_setjmp until we've defined it. > > + CAUTION: This macro is only used during exception unwinding. > > + Don't fall for its name. */ > > #undef DONT_USE_BUILTIN_SETJMP > > #define DONT_USE_BUILTIN_SETJMP 1 > > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > > index 01f35f8..4266a3d 100644 > > --- a/gcc/config/aarch64/aarch64.c > > +++ b/gcc/config/aarch64/aarch64.c > > @@ -3998,7 +3998,7 @@ static bool > > aarch64_needs_frame_chain (void) > > { > >/* Force a frame chain for EH returns so the return address is at FP+8. > > */ > > - if (frame_pointer_needed || crtl->calls_eh_return) > > + if (frame_pointer_needed || crtl->calls_eh_return || > > cfun->has_nonlocal_label) > > return true; > > > >/* A leaf function cannot have calls or write LR. 
*/ > > @@ -12218,6 +12218,13 @@ aarch64_expand_builtin_va_start (tree valist, rtx > > nextarg ATTRIBUTE_UNUSED) > >expand_expr (t, const0_rtx, VOIDmode, EXPAND_NORMAL); > > } > > > > +/* Implement TARGET_BUILTIN_SETJMP_FRAME_VALUE. */ > > +static rtx > > +aarch64_builtin_setjmp_frame_value (void) > > +{ > > + return hard_frame_pointer_rtx; > > +} > > + > > /* Implement TARGET_GIMPLIFY_VA_ARG_EXPR. */ > > > > static tree > > @@ -17744,6 +17751,9 @@ aarch64_run_selftests (void) > > #undef TARGET_FOLD_BUILTIN > > #define TARGET_FOLD_BUILTIN aarch64_fold_builtin > > > > +#undef TARGET_BUILTIN_SETJMP_FRAME_VALUE > > +#define TARGET_BUILTIN_SETJMP_FRAME_VALUE > > aarch64_builtin_setjmp_frame_value > > + > > #undef TARGET_FUNCTION_ARG > > #define TARGET_FUNCTION_ARG aarc
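Since the thread notes that these builtins are essentially undocumented (PR59039), a minimal usage sketch for reference (my own example, matching what the PR 84521 test exercises): the buffer is an array of five pointer-sized words and the second argument to __builtin_longjmp must be the constant 1.

/* Sketch: non-local jump via the internal setjmp/longjmp builtins.  */
static void *buf[5];

__attribute__ ((noinline)) static void do_jump (void)
{
  __builtin_longjmp (buf, 1);
}

int roundtrip (void)
{
  if (__builtin_setjmp (buf) == 0)
    {
      do_jump ();
      return -1;    /* never reached */
    }
  return 0;         /* reached via the longjmp */
}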
[PATCH] Add COMPLEX_VECTOR_INT modes
Hi all, I want to implement a vector DIVMOD libfunc for amdgcn, but I can't just do it because the GCC middle-end models DIVMOD's return value as "complex int" type, and there are no vector equivalents of that type. Therefore, this patch adds minimal support for "complex vector int" modes. I have not attempted to provide any means to use these modes from C, so they're really only useful for DIVMOD. The actual libfunc implementation will pack the data into wider vector modes manually. A knock-on effect of this is that I needed to increase the range of "mode_unit_size" (several of the vector modes supported by amdgcn exceed the previous 255-byte limit). Since this change would add a large number of new, unused modes to many architectures, I have elected to *not* enable them, by default, in machmode.def (where the other complex modes are created). The new modes are therefore inactive on all architectures but amdgcn, for now. OK for mainline? (I've not done a full test yet, but I will.) Thanks AndrewAdd COMPLEX_VECTOR_INT modes for amdgcn This enables only minimal support for complex types containing integer vectors with the intention of allowing vectorized divmod libfunc operations (these return a pair of integers modelled as a complex number). There's no way to declare variables of this mode in the front-end, and no attempt to support it everywhere that complex modes can exist; the only use-case, at present, is the implicit use by divmod calls generated by the middle-end. In order to prevent unexpected problems with other architectures these modes are only enabled for amdgcn. gcc/ChangeLog: * config/gcn/gcn-modes.def: Initialize COMPLEX_VECTOR_INT modes. * genmodes.cc (complex_class): Support MODE_COMPLEX_VECTOR_INT. (complete_mode): Likewise. (emit_mode_unit_size): Upgrade mode_unit_size type to short. (emit_mode_adjustments): Support MODE_COMPLEX_VECTOR_INT. * machmode.def: Mention MODE_COMPLEX_VECTOR_INT. * machmode.h (mode_to_unit_size): Upgrade type to short. * mode-classes.def: Add MODE_COMPLEX_VECTOR_INT. * stor-layout.cc (int_mode_for_mode): Support MODE_COMPLEX_VECTOR_INT. * tree.cc (build_complex_type): Allow VECTOR_INTEGER_TYPE_P. diff --git a/gcc/config/gcn/gcn-modes.def b/gcc/config/gcn/gcn-modes.def index 1357bec825d..486168fbeb3 100644 --- a/gcc/config/gcn/gcn-modes.def +++ b/gcc/config/gcn/gcn-modes.def @@ -121,3 +121,6 @@ ADJUST_ALIGNMENT (V2TI, 16); ADJUST_ALIGNMENT (V2HF, 2); ADJUST_ALIGNMENT (V2SF, 4); ADJUST_ALIGNMENT (V2DF, 8); + +/* These are used for vectorized divmod. */ +COMPLEX_MODES (VECTOR_INT); diff --git a/gcc/genmodes.cc b/gcc/genmodes.cc index 715787b8f48..d472ee5a9a3 100644 --- a/gcc/genmodes.cc +++ b/gcc/genmodes.cc @@ -125,6 +125,7 @@ complex_class (enum mode_class c) case MODE_INT: return MODE_COMPLEX_INT; case MODE_PARTIAL_INT: return MODE_COMPLEX_INT; case MODE_FLOAT: return MODE_COMPLEX_FLOAT; +case MODE_VECTOR_INT: return MODE_COMPLEX_VECTOR_INT; default: error ("no complex class for class %s", mode_class_names[c]); return MODE_RANDOM; @@ -382,6 +383,7 @@ complete_mode (struct mode_data *m) case MODE_COMPLEX_INT: case MODE_COMPLEX_FLOAT: +case MODE_COMPLEX_VECTOR_INT: /* Complex modes should have a component indicated, but no more. 
*/ validate_mode (m, UNSET, UNSET, SET, UNSET, UNSET); m->ncomponents = 2; @@ -1173,10 +1175,10 @@ inline __attribute__((__always_inline__))\n\ #else\n\ extern __inline__ __attribute__((__always_inline__, __gnu_inline__))\n\ #endif\n\ -unsigned char\n\ +unsigned short\n\ mode_unit_size_inline (machine_mode mode)\n\ {\n\ - extern CONST_MODE_UNIT_SIZE unsigned char mode_unit_size[NUM_MACHINE_MODES];\ + extern CONST_MODE_UNIT_SIZE unsigned short mode_unit_size[NUM_MACHINE_MODES];\ \n\ gcc_assert (mode >= 0 && mode < NUM_MACHINE_MODES);\n\ switch (mode)\n\ @@ -1683,7 +1685,7 @@ emit_mode_unit_size (void) int c; struct mode_data *m; - print_maybe_const_decl ("%sunsigned char", "mode_unit_size", + print_maybe_const_decl ("%sunsigned short", "mode_unit_size", "NUM_MACHINE_MODES", adj_bytesize); for_all_modes (c, m) @@ -1873,6 +1875,7 @@ emit_mode_adjustments (void) { case MODE_COMPLEX_INT: case MODE_COMPLEX_FLOAT: +case MODE_COMPLEX_VECTOR_INT: printf (" mode_size[E_%smode] = 2*s;\n", m->name); printf (" mode_unit_size[E_%smode] = s;\n", m->name); printf (" mode_base_align[E_%smode] = s & (~s + 1);\n", @@ -1920,6 +1923,7 @@ emit_mode_adjustments (void) { case MODE_COMPLEX_INT: case MODE_COMPLEX_FLOAT: + case MODE_COMPLEX_VECTOR_INT: printf (" mode_base_align[E_%smode] = s;\n", m->name); break; diff --git a/gcc/machmode.def b/gcc/machmode
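For reference, here is a minimal illustration (my own, not taken from the patch) of the kind of loop this targets: because both results of the same division are wanted, the middle-end merges the two operations into a single DIVMOD call whose result is modelled as a complex int, so the vectorized libcall needs a complex *vector* int mode to describe its return value.

/* Illustration only: the quotient and remainder of the same division are
   both needed, so the middle-end emits one DIVMOD libcall per vector.  */
void
divmod_loop (int *restrict quot, int *restrict rem,
	     const int *restrict a, const int *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    {
      quot[i] = a[i] / b[i];
      rem[i] = a[i] % b[i];
    }
}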
Re: [patch] amdgcn: Change -m(no-)xnack to -mxnack=(on,off,any)
OK. Andrew
On 26/05/2023 15:58, Tobias Burnus wrote: (Update the syntax of the amdgcn commandline option in anticipation of later patches; while -m(no-)xnack is in mainline since r12-2396-gaad32a00b7d2b6 (for PR100208), -mxnack (contrary to -msram-ecc) is currently mostly a stub for later patches and is documented as such in invoke.texi. Thus, this change should have no (or only a minimal) effect on users.) For GCN, GCC currently supports -mxnack / -mno-xnack arguments, matching +xnack and -xnack when passed to the LLVM linker. However, since V4 the latter supports three states: besides on/off there is now also unspecified. That matches the semantics of sram(-)ecc, which GCC already implements as 'on'/'off' and 'any'. Cf. https://llvm.org/docs/AMDGPUUsage.html#target-features and https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc/AMD-GCN-Options.html The attached patch uses the sram-ecc flag syntax now also for xnack. Note that currently only 'no' is supported, which is ensured via a 'sorry'. Hence, the default is 'no'. I assume we want to change the default once XNACK is working - therefore, the documentation only states the current default as a comment. The changes were picked from the patch "amdgcn: Support XNACK mode" at - https://gcc.gnu.org/pipermail/gcc-patches/2022-July/597991.html - OG12 0229066ecb24421d48e3e0d56f31c30cc1affdab - OG13 cbc3dd01de8788587a2b641efcb838058303b5ab but only includes the changes related to the commandline option, excluding the other changes like those to insns. It additionally updates invoke.texi (using the wording from -msram-ecc). (I actually encountered this issue because of the non-updated manual.) Tested with full bootstrap, regtesting running, but not expecting surprises. OK for mainline? Tobias PS: For FIJI, "" is passed – that's ensured by NO_XNACK in the ASM_SPEC and the 'switch' later in output_file_start (unchanged), otherwise 'xnack-' is used (via the default in gcn.opt for the compiler and via XNACKOPT for the command line.)
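For illustration only (my own invocations, not from the patch): after this change a GCN compile would spell the option as, e.g., "amdgcn-amdhsa-gcc -march=gfx908 -mxnack=off -c test.c", while "-mxnack=on" or "-mxnack=any" would presumably still be rejected with the 'sorry' until XNACK support lands.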
Re: [PATCH] Add COMPLEX_VECTOR_INT modes
On 30/05/2023 07:26, Richard Biener wrote: On Fri, May 26, 2023 at 4:35 PM Andrew Stubbs wrote: Hi all, I want to implement a vector DIVMOD libfunc for amdgcn, but I can't just do it because the GCC middle-end models DIVMOD's return value as "complex int" type, and there are no vector equivalents of that type. Therefore, this patch adds minimal support for "complex vector int" modes. I have not attempted to provide any means to use these modes from C, so they're really only useful for DIVMOD. The actual libfunc implementation will pack the data into wider vector modes manually. A knock-on effect of this is that I needed to increase the range of "mode_unit_size" (several of the vector modes supported by amdgcn exceed the previous 255-byte limit). Since this change would add a large number of new, unused modes to many architectures, I have elected to *not* enable them, by default, in machmode.def (where the other complex modes are created). The new modes are therefore inactive on all architectures but amdgcn, for now. OK for mainline? (I've not done a full test yet, but I will.) I think it makes more sense to map vector CSImode to vector SImode with the double number of lanes. In fact since divmod is a libgcc function I wonder where your vector variant would reside and how GCC decides to emit calls to it? That is, there's no way to OMP simd declare this function? The divmod implementation lives in libgcc. It's not too difficult to write using vector extensions and some asm tricks. I did try an OMP simd declare implementation, but it didn't vectorize well, and that's a yack I don't wish to shave right now. In any case, the OMP simd declare will not help us here, directly, because the DIVMOD transformation happens too late in the pass pipeline, long after ifcvt and vect. My implementation (not yet posted), uses a libfunc and the TARGET_EXPAND_DIVMOD_LIBFUNC hook in the standard way. It just needs the complex vector modes to exist. Using vectors twice the length is problematic also. If I create a new V128SImode that spans across two 64-lane vector registers then that will probably have the desired effect ("real" quotient in v8, "imaginary" remainder in v9), but if I use V64SImode to represent two V32SImode vectors then that's a one-register mode, and I'll have to use a permutation (a memory operation) to extract lanes 32-63 into lanes 0-31, and if we ever want to implement instructions that operate on these modes (as opposed to the odd/even add/sub complex patterns we have now) then the masking will be all broken and we'd need to constantly disassemble the double length vectors to operate on them. The implementation I proposed is essentially a struct containing two vectors placed in consecutive registers. This is the natural representation for the architecture. Anyway, you don't like this patch and I see that AArch64 is picking apart BLKmode to see if there's complex inside, so maybe I can make something like that work here? AArch64 doesn't seem to use TARGET_EXPAND_DIVMOD_LIBFUNC though, and I'm pretty sure the problem I was trying to solve was in the way the expand pass handles the BLKmode complex, outside the control of the backend hook (I'm still paging this stuff back in, post vacation). Thanks Andrew
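To make the shape of that concrete, here is a rough sketch (mine, not the unposted patch) of what a TARGET_EXPAND_DIVMOD_LIBFUNC implementation could look like once the complex vector modes exist; it is modelled on how existing targets expand scalar divmod libcalls, and the names are illustrative only.

/* Sketch: expand a vector divmod libcall whose return value is in the new
   complex vector int mode; the quotient and remainder are the two halves
   of the returned value.  */
static void
gcn_expand_divmod_libfunc (rtx libfunc, machine_mode mode,
			   rtx op0, rtx op1, rtx *quot, rtx *rem)
{
  machine_mode cmode = GET_MODE_COMPLEX_MODE (mode);
  rtx libval = emit_library_call_value (libfunc, NULL_RTX, LCT_NORMAL,
					cmode, op0, mode, op1, mode);
  *quot = simplify_gen_subreg (mode, libval, cmode, 0);
  *rem = simplify_gen_subreg (mode, libval, cmode, GET_MODE_SIZE (mode));
}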
Re: [Patch] libgomp: plugin-gcn - support 'unified_address'
On 06/06/2023 16:33, Tobias Burnus wrote: Andrew: Does the GCN change look okay to you? This patch permits using GCN devices with 'omp requires unified_address', which in principle works already, except that the requirement handling disabled it. (It also updates libgomp.texi for this change and likewise for an older nvptx change.) I will later add a testcase → https://gcc.gnu.org/PR109837 However, the patch was tested with the respective sollve_vv testcase with an additional fix applied on top → https://github.com/SOLLVE/sollve_vv/pull/737 (I do note that with the USM patches for OG12/OG13, unified_address is accepted, cf. OG13 https://gcc.gnu.org/g:3ddf3565faee70e8c910d90ab0c80e71813a0ba1 , but USM itself goes much beyond what we need here.)
OK, I think this is fine. I was going to do this with the patch series soon anyway. Andrew
Re: [PATCH] Add COMPLEX_VECTOR_INT modes
On 07/06/2023 20:42, Richard Sandiford wrote: I don't know if this helps (probably not), but we have a similar situation on AArch64: a 64-bit mode like V8QI can be doubled to a 128-bit vector or to a pair of 64-bit vectors. We used V16QI for the former and "V2x8QI" for the latter. V2x8QI is forced to come after V16QI in the mode list, and so it is only ever used through explicit choice. But both modes are functionally vectors of 16 QIs. OK, that's interesting, but how do you map "complex int" vectors to that mode? I tried to figure it out, but there's no DIVMOD support so I couldn't just do a straight comparison. Thanks Andrew
Re: [PATCH] Add COMPLEX_VECTOR_INT modes
On 09/06/2023 10:02, Richard Sandiford wrote: Andrew Stubbs writes: On 07/06/2023 20:42, Richard Sandiford wrote: I don't know if this helps (probably not), but we have a similar situation on AArch64: a 64-bit mode like V8QI can be doubled to a 128-bit vector or to a pair of 64-bit vectors. We used V16QI for the former and "V2x8QI" for the latter. V2x8QI is forced to come after V16QI in the mode list, and so it is only ever used through explicit choice. But both modes are functionally vectors of 16 QIs. OK, that's interesting, but how do you map "complex int" vectors to that mode? I tried to figure it out, but there's no DIVMOD support so I couldn't just do a straight comparison. Yeah, we don't do that currently. Instead we make TARGET_ARRAY_MODE return V2x8QI for an array of 2 V8QIs (which is OK, since V2x8QI has 64-bit rather than 128-bit alignment). So we should use it for a complex-y type like: struct { res_type res[2]; }; In principle we should be able to do the same for: struct { res_type a, b; }; but that isn't supported yet. I think it would need a new target hook along the lines of TARGET_ARRAY_MODE, but for structs rather than arrays. The advantage of this from AArch64's PoV is that it extends to 3x and 4x tuples as well, whereas complex is obviously for pairs only. I don't know if it would be acceptable to use that kind of struct wrapper for the divmod code though (for the vector case only). Looking again, I don't think this will help because GCN does not have an instruction that loads vectors that are back-to-back, hence there's little benefit in adding the tuple mode. However, GCN does have instructions that effectively load 2, 3, or 4 vectors that are *interleaved*, which would be the likely case for complex numbers (or pixel colour data!) I need to figure out how to move forward with this patch, please; if the new complex modes are not acceptable then I think I need to reimplement DIVMOD (maybe the scalars can remain as-is), but it's not clear to me what that would look like. Andrew
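For what it's worth, the AArch64 arrangement described above boils down to a target hook of roughly this shape (a simplified sketch of mine, not the actual aarch64 implementation):

/* Sketch: map an array of two V8QI vectors to the V2x8QI tuple mode, so
   the pair is handled as one value while keeping 64-bit alignment.  */
static opt_machine_mode
example_array_mode (machine_mode mode, unsigned HOST_WIDE_INT nelems)
{
  if (mode == V8QImode && nelems == 2)
    return V2x8QImode;
  return opt_machine_mode ();
}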
[PATCH] vect: Vectorize via libfuncs
This patch allows vectorization when operators are available as libfuncs, rather than only as insns. This will be useful for amdgcn where we plan to vectorize loops that contain integer division or modulus, but don't want to generate inline instructions for the division algorithm every time. The change should not affect architectures that do not define vector-mode libfuncs. OK for mainline? Andrew
vect: vectorize via libfuncs This patch allows vectorization when the libfuncs are defined. gcc/ChangeLog: * tree-vect-generic.cc: Include optabs-libfuncs.h. (get_compute_type): Check optab_libfunc. * tree-vect-stmts.cc: Include optabs-libfuncs.h. (vectorizable_operation): Check optab_libfunc. diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc index b7d4a919c55..4d784a70c0d 100644 --- a/gcc/tree-vect-generic.cc +++ b/gcc/tree-vect-generic.cc @@ -44,6 +44,7 @@ along with GCC; see the file COPYING3. If not see #include "gimple-fold.h" #include "gimple-match.h" #include "recog.h" /* FIXME: for insn_data */ +#include "optabs-libfuncs.h" /* Build a ternary operation and gimplify it. Emit code before GSI. @@ -1714,7 +1715,8 @@ get_compute_type (enum tree_code code, optab op, tree type) machine_mode compute_mode = TYPE_MODE (compute_type); if (VECTOR_MODE_P (compute_mode)) { - if (op && optab_handler (op, compute_mode) != CODE_FOR_nothing) + if (op && (optab_handler (op, compute_mode) != CODE_FOR_nothing +|| optab_libfunc (op, compute_mode))) return compute_type; if (code == MULT_HIGHPART_EXPR && can_mult_highpart_p (compute_mode, diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index a7acc032d47..71a8cf2c6d4 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -56,6 +56,7 @@ along with GCC; see the file COPYING3. If not see #include "gimple-fold.h" #include "regs.h" #include "attribs.h" +#include "optabs-libfuncs.h" /* For lang_hooks.types.type_for_mode. */ #include "langhooks.h" @@ -6528,8 +6529,8 @@ vectorizable_operation (vec_info *vinfo, "no optab.\n"); return false; } - target_support_p = (optab_handler (optab, vec_mode) - != CODE_FOR_nothing); + target_support_p = (optab_handler (optab, vec_mode) != CODE_FOR_nothing + || optab_libfunc (optab, vec_mode)); } bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);
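As a usage sketch (mine, not part of the patch): for the optab_libfunc checks above to find anything, the target has to register named libfuncs for the vector modes, e.g. from its TARGET_INIT_LIBFUNCS hook. The mode and symbol names below are illustrative only.

/* Sketch: advertise vector division/modulus libfuncs so that
   get_compute_type/vectorizable_operation treat the operation as
   supported and expansion emits a libcall instead of open-coding it.  */
static void
example_init_libfuncs (void)
{
  set_optab_libfunc (sdiv_optab, V64SImode, "__divv64si3");
  set_optab_libfunc (udiv_optab, V64SImode, "__udivv64si3");
  set_optab_libfunc (smod_optab, V64SImode, "__modv64si3");
  set_optab_libfunc (umod_optab, V64SImode, "__umodv64si3");
}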
Re: [PATCH 3/3] AVX512 fully masked vectorization
RISC-V here since they are going to get both masks and lengths registered I think. The vect_prepare_for_masked_peels hunk might run into issues with SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE looked odd. Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run the testsuite with --param vect-partial-vector-usage=2 with and without -fno-vect-cost-model and filed two bugs, one ICE (PR110221) and one latent wrong-code (PR110237). There's followup work to be done to try enabling masked epilogues for x86-64 by default (when AVX512 is enabled, possibly only when -mprefer-vector-width=512). Getting cost modeling and decision right is going to be challenging. Any comments? OK? Btw, testing on GCN would be welcome - the _avx512 paths could work for it so in case the while_ult path fails (not sure if it ever does) it could get _avx512 style masking. Likewise testing on ARM just to see I didn't break anything here.
I don't have SVE hardware so testing is probably meaningless. I can set some tests going. Is vect.exp enough? Andrew
Re: [PATCH 3/3] AVX512 fully masked vectorization
On 14/06/2023 15:29, Richard Biener wrote: Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: This implemens fully masked vectorization or a masked epilog for AVX512 style masks which single themselves out by representing each lane with a single bit and by using integer modes for the mask (both is much like GCN). AVX512 is also special in that it doesn't have any instruction to compute the mask from a scalar IV like SVE has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar IV (mainly to avoid dependences on mask generation, a suitable mask test instruction is available). This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well. Like RVV code generation prefers a decrementing IV though IVOPTs messes things up in some cases removing that IV to eliminate it with an incrementing one used for address generation. One of the motivating testcases is from PR108410 which in turn is extracted from x264 where large size vectorization shows issues with small trip loops. Execution time there improves compared to classic AVX512 with AVX2 epilogues for the cases of less than 32 iterations. size scalar 128 256 512512e512f 19.42 11.329.35 11.17 15.13 16.89 25.726.536.666.667.628.56 34.495.105.105.745.085.73 44.104.334.295.213.794.25 63.783.853.864.762.542.85 83.641.893.764.501.922.16 123.562.213.754.261.261.42 163.360.831.064.160.951.07 203.391.421.334.070.750.85 243.230.661.724.220.620.70 283.181.092.044.200.540.61 323.160.470.410.410.470.53 343.160.670.610.560.440.50 383.190.950.950.820.400.45 423.090.581.211.130.360.40 'size' specifies the number of actual iterations, 512e is for a masked epilog and 512f for the fully masked loop. From 4 scalar iterations on the AVX512 masked epilog code is clearly the winner, the fully masked variant is clearly worse and it's size benefit is also tiny. Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right? Yes. This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path. GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps). This patch does not enable using fully masked loops or masked epilogues by default. More work on cost modeling and vectorization kind selection on x86_64 is necessary for this. Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE which could be exploited further to unify some of the flags we have right now but there didn't seem to be many easy things to merge, so I'm leaving this for followups. 
Mask requirements as registered by vect_record_loop_mask are kept in their original form and recorded in a hash_set now instead of being processed to a vector of rgroup_controls. Instead that's now left to the final analysis phase which tries forming the rgroup_controls vector using while_ult and if that fails now tries AVX512 style which needs a different organization and instead fills a hash_map with the relevant info. vect_get_loop_mask now has two implementations, one for the two mask styles we then have. I have decided against interweaving vect_set_loop_condition_partial_vectors with conditions to do AVX512 style masking and instead opted to "duplicate" this to vect_set_loop_condition_partial_vectors_avx512. Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512. I was split between making 'vec_loop_masks' a class with methods, possibly merging in the _len stuff into a single registry. It seemed to be too many changes for the purpose of getti
Re: [PATCH 3/3] AVX512 fully masked vectorization
On 15/06/2023 10:58, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 14/06/2023 15:29, Richard Biener wrote: Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: This implemens fully masked vectorization or a masked epilog for AVX512 style masks which single themselves out by representing each lane with a single bit and by using integer modes for the mask (both is much like GCN). AVX512 is also special in that it doesn't have any instruction to compute the mask from a scalar IV like SVE has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar IV (mainly to avoid dependences on mask generation, a suitable mask test instruction is available). This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well. Like RVV code generation prefers a decrementing IV though IVOPTs messes things up in some cases removing that IV to eliminate it with an incrementing one used for address generation. One of the motivating testcases is from PR108410 which in turn is extracted from x264 where large size vectorization shows issues with small trip loops. Execution time there improves compared to classic AVX512 with AVX2 epilogues for the cases of less than 32 iterations. size scalar 128 256 512512e512f 19.42 11.329.35 11.17 15.13 16.89 25.726.536.666.667.628.56 34.495.105.105.745.085.73 44.104.334.295.213.794.25 63.783.853.864.762.542.85 83.641.893.764.501.922.16 123.562.213.754.261.261.42 163.360.831.064.160.951.07 203.391.421.334.070.750.85 243.230.661.724.220.620.70 283.181.092.044.200.540.61 323.160.470.410.410.470.53 343.160.670.610.560.440.50 383.190.950.950.820.400.45 423.090.581.211.130.360.40 'size' specifies the number of actual iterations, 512e is for a masked epilog and 512f for the fully masked loop. From 4 scalar iterations on the AVX512 masked epilog code is clearly the winner, the fully masked variant is clearly worse and it's size benefit is also tiny. Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right? Yes. This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path. GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps). This patch does not enable using fully masked loops or masked epilogues by default. More work on cost modeling and vectorization kind selection on x86_64 is necessary for this. Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE which could be exploited further to unify some of the flags we have right now but there didn't seem to be many easy things to merge, so I'm leaving this for followups. 
Mask requirements as registered by vect_record_loop_mask are kept in their original form and recorded in a hash_set now instead of being processed to a vector of rgroup_controls. Instead that's now left to the final analysis phase which tries forming the rgroup_controls vector using while_ult and if that fails now tries AVX512 style which needs a different organization and instead fills a hash_map with the relevant info. vect_get_loop_mask now has two implementations, one for the two mask styles we then have. I have decided against interweaving vect_set_loop_condition_partial_vectors with conditions to do AVX512 style masking and instead opted to "duplicate" this to vect_set_loop_condition_partial_vectors_avx512. Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512. I was split between making 'vec_loop_masks' a class with methods, possibly merging in the _len
Re: [PATCH 3/3] AVX512 fully masked vectorization
On 15/06/2023 12:06, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 10:58, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 14/06/2023 15:29, Richard Biener wrote: Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: This implemens fully masked vectorization or a masked epilog for AVX512 style masks which single themselves out by representing each lane with a single bit and by using integer modes for the mask (both is much like GCN). AVX512 is also special in that it doesn't have any instruction to compute the mask from a scalar IV like SVE has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar IV (mainly to avoid dependences on mask generation, a suitable mask test instruction is available). This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well. Like RVV code generation prefers a decrementing IV though IVOPTs messes things up in some cases removing that IV to eliminate it with an incrementing one used for address generation. One of the motivating testcases is from PR108410 which in turn is extracted from x264 where large size vectorization shows issues with small trip loops. Execution time there improves compared to classic AVX512 with AVX2 epilogues for the cases of less than 32 iterations. size scalar 128 256 512512e512f 19.42 11.329.35 11.17 15.13 16.89 25.726.536.666.667.628.56 34.495.105.105.745.085.73 44.104.334.295.213.794.25 63.783.853.864.762.542.85 83.641.893.764.501.922.16 123.562.213.754.261.261.42 163.360.831.064.160.951.07 203.391.421.334.070.750.85 243.230.661.724.220.620.70 283.181.092.044.200.540.61 323.160.470.410.410.470.53 343.160.670.610.560.440.50 383.190.950.950.820.400.45 423.090.581.211.130.360.40 'size' specifies the number of actual iterations, 512e is for a masked epilog and 512f for the fully masked loop. From 4 scalar iterations on the AVX512 masked epilog code is clearly the winner, the fully masked variant is clearly worse and it's size benefit is also tiny. Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right? Yes. This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path. GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps). This patch does not enable using fully masked loops or masked epilogues by default. More work on cost modeling and vectorization kind selection on x86_64 is necessary for this. 
Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE which could be exploited further to unify some of the flags we have right now but there didn't seem to be many easy things to merge, so I'm leaving this for followups. Mask requirements as registered by vect_record_loop_mask are kept in their original form and recorded in a hash_set now instead of being processed to a vector of rgroup_controls. Instead that's now left to the final analysis phase which tries forming the rgroup_controls vector using while_ult and if that fails now tries AVX512 style which needs a different organization and instead fills a hash_map with the relevant info. vect_get_loop_mask now has two implementations, one for the two mask styles we then have. I have decided against interweaving vect_set_loop_condition_partial_vectors with conditions to do AVX512 style masking and instead opted to "duplicate" this to vect_set_loop_condition_partial_vectors_avx512. Likewise for vect_verify_full_masking vs vect_verify_full_masking_av
Re: [PATCH 3/3] AVX512 fully masked vectorization
On 15/06/2023 14:34, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 12:06, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 10:58, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 14/06/2023 15:29, Richard Biener wrote: Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: This implemens fully masked vectorization or a masked epilog for AVX512 style masks which single themselves out by representing each lane with a single bit and by using integer modes for the mask (both is much like GCN). AVX512 is also special in that it doesn't have any instruction to compute the mask from a scalar IV like SVE has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar IV (mainly to avoid dependences on mask generation, a suitable mask test instruction is available). This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well. Like RVV code generation prefers a decrementing IV though IVOPTs messes things up in some cases removing that IV to eliminate it with an incrementing one used for address generation. One of the motivating testcases is from PR108410 which in turn is extracted from x264 where large size vectorization shows issues with small trip loops. Execution time there improves compared to classic AVX512 with AVX2 epilogues for the cases of less than 32 iterations. size scalar 128 256 512512e512f 19.42 11.329.35 11.17 15.13 16.89 25.726.536.666.667.628.56 34.495.105.105.745.085.73 44.104.334.295.213.794.25 63.783.853.864.762.542.85 83.641.893.764.501.922.16 123.562.213.754.261.261.42 163.360.831.064.160.951.07 203.391.421.334.070.750.85 243.230.661.724.220.620.70 283.181.092.044.200.540.61 323.160.470.410.410.470.53 343.160.670.610.560.440.50 383.190.950.950.820.400.45 423.090.581.211.130.360.40 'size' specifies the number of actual iterations, 512e is for a masked epilog and 512f for the fully masked loop. From 4 scalar iterations on the AVX512 masked epilog code is clearly the winner, the fully masked variant is clearly worse and it's size benefit is also tiny. Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right? Yes. This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path. GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps). This patch does not enable using fully masked loops or masked epilogues by default. More work on cost modeling and vectorization kind selection on x86_64 is necessary for this. 
Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE which could be exploited further to unify some of the flags we have right now but there didn't seem to be many easy things to merge, so I'm leaving this for followups. Mask requirements as registered by vect_record_loop_mask are kept in their original form and recorded in a hash_set now instead of being processed to a vector of rgroup_controls. Instead that's now left to the final analysis phase which tries forming the rgroup_controls vector using while_ult and if that fails now tries AVX512 style which needs a different organization and instead fills a hash_map with the relevant info. vect_get_loop_mask now has two implementations, one for the two mask styles we then have. I have decided against interweaving vect_set_loop_condition_partial_vectors with conditions to do AVX512 style masking and instead opted to "duplicate" this to vect_set_loop_cond
Re: [PATCH 3/3] AVX512 fully masked vectorization
On 15/06/2023 15:00, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 14:34, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 12:06, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 10:58, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 14/06/2023 15:29, Richard Biener wrote: Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: This implemens fully masked vectorization or a masked epilog for AVX512 style masks which single themselves out by representing each lane with a single bit and by using integer modes for the mask (both is much like GCN). AVX512 is also special in that it doesn't have any instruction to compute the mask from a scalar IV like SVE has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar IV (mainly to avoid dependences on mask generation, a suitable mask test instruction is available). This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well. Like RVV code generation prefers a decrementing IV though IVOPTs messes things up in some cases removing that IV to eliminate it with an incrementing one used for address generation. One of the motivating testcases is from PR108410 which in turn is extracted from x264 where large size vectorization shows issues with small trip loops. Execution time there improves compared to classic AVX512 with AVX2 epilogues for the cases of less than 32 iterations. size scalar 128 256 512512e512f 19.42 11.329.35 11.17 15.13 16.89 25.726.536.666.667.628.56 34.495.105.105.745.085.73 44.104.334.295.213.794.25 63.783.853.864.762.542.85 83.641.893.764.501.922.16 123.562.213.754.261.261.42 163.360.831.064.160.951.07 203.391.421.334.070.750.85 243.230.661.724.220.620.70 283.181.092.044.200.540.61 323.160.470.410.410.470.53 343.160.670.610.560.440.50 383.190.950.950.820.400.45 423.090.581.211.130.360.40 'size' specifies the number of actual iterations, 512e is for a masked epilog and 512f for the fully masked loop. From 4 scalar iterations on the AVX512 masked epilog code is clearly the winner, the fully masked variant is clearly worse and it's size benefit is also tiny. Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right? Yes. This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path. GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps). This patch does not enable using fully masked loops or masked epilogues by default. 
More work on cost modeling and vectorization kind selection on x86_64 is necessary for this. Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE which could be exploited further to unify some of the flags we have right now but there didn't seem to be many easy things to merge, so I'm leaving this for followups. Mask requirements as registered by vect_record_loop_mask are kept in their original form and recorded in a hash_set now instead of being processed to a vector of rgroup_controls. Instead that's now left to the final analysis phase which tries forming the rgroup_controls vector using while_ult and if that fails now tries AVX512 style which needs a different organization and instead fills a hash_map with the relevant info. vect_get_loop_mask now has two implementations, one for the two mask styles we then have. I have decided against interweaving vect_set_loop_condition_partial_vectors wit
[PATCH 00/17] openmp, nvptx, amdgcn: 5.0 Memory Allocators
This patch series implements OpenMP allocators for low-latency memory on nvptx, unified shared memory on both nvptx and amdgcn, and generic pinned memory support for all Linux hosts (an nvptx-specific implementation using Cuda pinned memory is planned for the future, as is low-latency memory on amdgcn). Patches 01 to 14 are reposts of patches previously submitted, now forward ported to the current master branch and with the various follow-up patches folded in. Where it conflicts with the new memkind implementation the memkind takes precedence (but there's currently no way to implement memory that's both high-bandwidth and pinned anyway). Patches 15 to 17 are new work. I can probably approve these myself, but they can't be committed until the rest of the series is approved. Andrew Andrew Stubbs (11): libgomp, nvptx: low-latency memory allocator libgomp: pinned memory libgomp, openmp: Add ompx_pinned_mem_alloc openmp, nvptx: low-lat memory access traits openmp, nvptx: ompx_unified_shared_mem_alloc openmp: Add -foffload-memory openmp: allow requires unified_shared_memory openmp: -foffload-memory=pinned amdgcn: Support XNACK mode amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK amdgcn: libgomp plugin USM implementation Hafiz Abid Qadeer (6): openmp: Use libgomp memory allocation functions with unified shared memory. Add parsing support for allocate directive (OpenMP 5.0) Translate allocate directive (OpenMP 5.0). Handle cleanup of omp allocated variables (OpenMP 5.0). Gimplify allocate directive (OpenMP 5.0). Lower allocate directive (OpenMP 5.0). gcc/c/c-parser.cc | 22 +- gcc/common.opt| 16 + gcc/config/gcn/gcn-hsa.h | 3 +- gcc/config/gcn/gcn-opts.h | 10 +- gcc/config/gcn/gcn-valu.md| 29 +- gcc/config/gcn/gcn.cc | 62 ++- gcc/config/gcn/gcn.md | 113 +++-- gcc/config/gcn/gcn.opt| 18 +- gcc/config/gcn/mkoffload.cc | 56 ++- gcc/coretypes.h | 7 + gcc/cp/parser.cc | 22 +- gcc/doc/gimple.texi | 38 +- gcc/doc/invoke.texi | 16 +- gcc/fortran/dump-parse-tree.cc| 3 + gcc/fortran/gfortran.h| 5 +- gcc/fortran/match.h | 1 + gcc/fortran/openmp.cc | 242 ++- gcc/fortran/parse.cc | 10 +- gcc/fortran/resolve.cc| 1 + gcc/fortran/st.cc | 1 + gcc/fortran/trans-decl.cc | 20 + gcc/fortran/trans-openmp.cc | 50 +++ gcc/fortran/trans.cc | 1 + gcc/gimple-pretty-print.cc| 37 ++ gcc/gimple.cc | 12 + gcc/gimple.def| 6 + gcc/gimple.h | 60 ++- gcc/gimplify.cc | 19 + gcc/gsstruct.def | 1 + gcc/omp-builtins.def | 3 + gcc/omp-low.cc| 383 + gcc/passes.def| 1 + .../c-c++-common/gomp/alloc-pinned-1.c| 28 ++ gcc/testsuite/c-c++-common/gomp/usm-1.c | 4 + gcc/testsuite/c-c++-common/gomp/usm-2.c | 46 +++ gcc/testsuite/c-c++-common/gomp/usm-3.c | 44 ++ gcc/testsuite/c-c++-common/gomp/usm-4.c | 4 + gcc/testsuite/g++.dg/gomp/usm-1.C | 32 ++ gcc/testsuite/g++.dg/gomp/usm-2.C | 30 ++ gcc/testsuite/g++.dg/gomp/usm-3.C | 38 ++ gcc/testsuite/gfortran.dg/gomp/allocate-4.f90 | 112 + gcc/testsuite/gfortran.dg/gomp/allocate-5.f90 | 73 gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 | 84 gcc/testsuite/gfortran.dg/gomp/allocate-7.f90 | 13 + gcc/testsuite/gfortran.dg/gomp/allocate-8.f90 | 15 + gcc/testsuite/gfortran.dg/gomp/usm-1.f90 | 6 + gcc/testsuite/gfortran.dg/gomp/usm-2.f90 | 16 + gcc/testsuite/gfortran.dg/gomp/usm-3.f90 | 13 + gcc/testsuite/gfortran.dg/gomp/usm-4.f90 | 6 + gcc/tree-core.h | 9 + gcc/tree-pass.h | 1 + gcc/tree-pretty-print.cc | 23 ++ gcc/tree.cc | 1 + gcc/tree.def | 4 + gcc/tree.h| 15 + include/cuda/cuda.h | 12 + libgomp/allocator.c | 304 ++ libgomp/config/linux/allocator.c | 137 +++ libgomp/config/nvptx/allocator.c | 387 
++ libgomp/conf
[PATCH 02/17] libgomp: pinned memory
Implement the OpenMP pinned memory trait on Linux hosts using the mlock syscall. Pinned allocations are performed using mmap, not malloc, to ensure that they can be unpinned safely when freed. libgomp/ChangeLog: * allocator.c (MEMSPACE_ALLOC): Add PIN. (MEMSPACE_CALLOC): Add PIN. (MEMSPACE_REALLOC): Add PIN. (MEMSPACE_FREE): Add PIN. (xmlock): New function. (omp_init_allocator): Don't disallow the pinned trait. (omp_aligned_alloc): Add pinning to all MEMSPACE_* calls. (omp_aligned_calloc): Likewise. (omp_realloc): Likewise. (omp_free): Likewise. * config/linux/allocator.c: New file. * config/nvptx/allocator.c (MEMSPACE_ALLOC): Add PIN. (MEMSPACE_CALLOC): Add PIN. (MEMSPACE_REALLOC): Add PIN. (MEMSPACE_FREE): Add PIN. * testsuite/libgomp.c/alloc-pinned-1.c: New test. * testsuite/libgomp.c/alloc-pinned-2.c: New test. * testsuite/libgomp.c/alloc-pinned-3.c: New test. * testsuite/libgomp.c/alloc-pinned-4.c: New test. --- libgomp/allocator.c | 67 ++ libgomp/config/linux/allocator.c | 99 ++ libgomp/config/nvptx/allocator.c | 8 +- libgomp/testsuite/libgomp.c/alloc-pinned-1.c | 95 + libgomp/testsuite/libgomp.c/alloc-pinned-2.c | 101 ++ libgomp/testsuite/libgomp.c/alloc-pinned-3.c | 130 ++ libgomp/testsuite/libgomp.c/alloc-pinned-4.c | 132 +++ 7 files changed, 602 insertions(+), 30 deletions(-) create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-1.c create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-2.c create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-3.c create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-4.c diff --git a/libgomp/allocator.c b/libgomp/allocator.c index 9b33bcf529b..54310ab93ca 100644 --- a/libgomp/allocator.c +++ b/libgomp/allocator.c @@ -39,16 +39,20 @@ /* These macros may be overridden in config//allocator.c. */ #ifndef MEMSPACE_ALLOC -#define MEMSPACE_ALLOC(MEMSPACE, SIZE) malloc (SIZE) +#define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \ + (PIN ? NULL : malloc (SIZE)) #endif #ifndef MEMSPACE_CALLOC -#define MEMSPACE_CALLOC(MEMSPACE, SIZE) calloc (1, SIZE) +#define MEMSPACE_CALLOC(MEMSPACE, SIZE, PIN) \ + (PIN ? NULL : calloc (1, SIZE)) #endif #ifndef MEMSPACE_REALLOC -#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) realloc (ADDR, SIZE) +#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE, OLDPIN, PIN) \ + ((PIN) || (OLDPIN) ? NULL : realloc (ADDR, SIZE)) #endif #ifndef MEMSPACE_FREE -#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) free (ADDR) +#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \ + (PIN ? NULL : free (ADDR)) #endif /* Map the predefined allocators to the correct memory space. @@ -351,10 +355,6 @@ omp_init_allocator (omp_memspace_handle_t memspace, int ntraits, break; } - /* No support for this so far. */ - if (data.pinned) -return omp_null_allocator; - ret = gomp_malloc (sizeof (struct omp_allocator_data)); *ret = data; #ifndef HAVE_SYNC_BUILTINS @@ -481,7 +481,8 @@ retry: } else #endif - ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size); + ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size, + allocator_data->pinned); if (ptr == NULL) { #ifdef HAVE_SYNC_BUILTINS @@ -511,7 +512,8 @@ retry: = (allocator_data ? 
allocator_data->memspace : predefined_alloc_mapping[allocator]); - ptr = MEMSPACE_ALLOC (memspace, new_size); + ptr = MEMSPACE_ALLOC (memspace, new_size, +allocator_data && allocator_data->pinned); } if (ptr == NULL) goto fail; @@ -542,9 +544,9 @@ fail: #ifdef LIBGOMP_USE_MEMKIND || memkind #endif - || (allocator_data - && allocator_data->pool_size < ~(uintptr_t) 0) - || !allocator_data) + || !allocator_data + || allocator_data->pool_size < ~(uintptr_t) 0 + || allocator_data->pinned) { allocator = omp_default_mem_alloc; goto retry; @@ -596,6 +598,7 @@ omp_free (void *ptr, omp_allocator_handle_t allocator) struct omp_mem_header *data; omp_memspace_handle_t memspace __attribute__((unused)) = omp_default_mem_space; + int pinned __attribute__((unused)) = false; if (ptr == NULL) return; @@ -627,6 +630,7 @@ omp_free (void *ptr, omp_allocator_handle_t allocator) #endif memspace = allocator_data->memspace; + pinned = allocator_data->pinned; } else { @@ -651,7 +655,7 @@ omp_free (void *ptr, omp_allocator_handle_t allocator) memspace = predefined_alloc_mapping[data->allocator]; } - MEMSPACE_FREE (memspace, data->ptr, data->size); + MEMSPACE_FREE (memspace, data->ptr, data->size, pinned); } ialias (omp_free) @@ -767,7 +771,8 @@ retry: } else #endif - ptr = MEMSPACE_CALLOC (allocator_data->memspace, new_size); + ptr =
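For context, here is a minimal usage sketch of the trait this patch implements (plain OpenMP 5.x allocator API; nothing here is specific to the patch):

#include <omp.h>

int
main (void)
{
  /* Request host memory that the new allocator code pins with mlock.  */
  omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
  omp_allocator_handle_t pinned
    = omp_init_allocator (omp_default_mem_space, 1, traits);
  int *p = (int *) omp_alloc (1024 * sizeof (int), pinned);
  if (!p)
    return 1;
  omp_free (p, pinned);
  omp_destroy_allocator (pinned);
  return 0;
}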
[PATCH 01/17] libgomp, nvptx: low-latency memory allocator
This patch adds support for allocating low-latency ".shared" memory on NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc. The memory can be allocated, reallocated, and freed using a basic but fast algorithm, is thread safe and the size of the low-latency heap can be configured using the GOMP_NVPTX_LOWLAT_POOL environment variable. The use of the PTX dynamic_smem_size feature means that low-latency allocator will not work with the PTX 3.1 multilib. libgomp/ChangeLog: * allocator.c (MEMSPACE_ALLOC): New macro. (MEMSPACE_CALLOC): New macro. (MEMSPACE_REALLOC): New macro. (MEMSPACE_FREE): New macro. (dynamic_smem_size): New constants. (omp_alloc): Use MEMSPACE_ALLOC. Implement fall-backs for predefined allocators. (omp_free): Use MEMSPACE_FREE. (omp_calloc): Use MEMSPACE_CALLOC. Implement fall-backs for predefined allocators. (omp_realloc): Use MEMSPACE_REALLOC and MEMSPACE_ALLOC.. Implement fall-backs for predefined allocators. * config/nvptx/team.c (__nvptx_lowlat_heap_root): New variable. (__nvptx_lowlat_pool): New asm varaible. (gomp_nvptx_main): Initialize the low-latency heap. * plugin/plugin-nvptx.c (lowlat_pool_size): New variable. (GOMP_OFFLOAD_init_device): Read the GOMP_NVPTX_LOWLAT_POOL envvar. (GOMP_OFFLOAD_run): Apply lowlat_pool_size. * config/nvptx/allocator.c: New file. * testsuite/libgomp.c/allocators-1.c: New test. * testsuite/libgomp.c/allocators-2.c: New test. * testsuite/libgomp.c/allocators-3.c: New test. * testsuite/libgomp.c/allocators-4.c: New test. * testsuite/libgomp.c/allocators-5.c: New test. * testsuite/libgomp.c/allocators-6.c: New test. co-authored-by: Kwok Cheung Yeung --- libgomp/allocator.c| 235 - libgomp/config/nvptx/allocator.c | 370 + libgomp/config/nvptx/team.c| 28 ++ libgomp/plugin/plugin-nvptx.c | 23 +- libgomp/testsuite/libgomp.c/allocators-1.c | 56 libgomp/testsuite/libgomp.c/allocators-2.c | 64 libgomp/testsuite/libgomp.c/allocators-3.c | 42 +++ libgomp/testsuite/libgomp.c/allocators-4.c | 196 +++ libgomp/testsuite/libgomp.c/allocators-5.c | 63 libgomp/testsuite/libgomp.c/allocators-6.c | 117 +++ 10 files changed, 1110 insertions(+), 84 deletions(-) create mode 100644 libgomp/config/nvptx/allocator.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-1.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-2.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-3.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-4.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-5.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-6.c diff --git a/libgomp/allocator.c b/libgomp/allocator.c index b04820b8cf9..9b33bcf529b 100644 --- a/libgomp/allocator.c +++ b/libgomp/allocator.c @@ -37,6 +37,34 @@ #define omp_max_predefined_alloc omp_thread_mem_alloc +/* These macros may be overridden in config//allocator.c. */ +#ifndef MEMSPACE_ALLOC +#define MEMSPACE_ALLOC(MEMSPACE, SIZE) malloc (SIZE) +#endif +#ifndef MEMSPACE_CALLOC +#define MEMSPACE_CALLOC(MEMSPACE, SIZE) calloc (1, SIZE) +#endif +#ifndef MEMSPACE_REALLOC +#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) realloc (ADDR, SIZE) +#endif +#ifndef MEMSPACE_FREE +#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) free (ADDR) +#endif + +/* Map the predefined allocators to the correct memory space. + The index to this table is the omp_allocator_handle_t enum value. */ +static const omp_memspace_handle_t predefined_alloc_mapping[] = { + omp_default_mem_space, /* omp_null_allocator. */ + omp_default_mem_space, /* omp_default_mem_alloc. 
*/ + omp_large_cap_mem_space, /* omp_large_cap_mem_alloc. */ + omp_default_mem_space, /* omp_const_mem_alloc. */ + omp_high_bw_mem_space, /* omp_high_bw_mem_alloc. */ + omp_low_lat_mem_space, /* omp_low_lat_mem_alloc. */ + omp_low_lat_mem_space, /* omp_cgroup_mem_alloc. */ + omp_low_lat_mem_space, /* omp_pteam_mem_alloc. */ + omp_low_lat_mem_space, /* omp_thread_mem_alloc. */ +}; + enum gomp_memkind_kind { GOMP_MEMKIND_NONE = 0, @@ -453,7 +481,7 @@ retry: } else #endif - ptr = malloc (new_size); + ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size); if (ptr == NULL) { #ifdef HAVE_SYNC_BUILTINS @@ -478,7 +506,13 @@ retry: } else #endif - ptr = malloc (new_size); + { + omp_memspace_handle_t memspace __attribute__((unused)) + = (allocator_data + ? allocator_data->memspace + : predefined_alloc_mapping[allocator]); + ptr = MEMSPACE_ALLOC (memspace, new_size); + } if (ptr == NULL) goto fail; } @@ -496,35 +530,38 @@ retry: return ret; fail: - if (allocator_data) + int fallback = (allocator_da
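A usage sketch for reference (mine, not taken from the patch): the low-latency space is reached through the predefined omp_low_lat_mem_alloc allocator inside a target region, and the pool can be enlarged with, e.g., GOMP_NVPTX_LOWLAT_POOL=65536 in the environment.

#include <omp.h>

int
main (void)
{
  int ok = 0;
#pragma omp target map(from:ok)
  {
    /* Allocate from the .shared low-latency pool; if it is exhausted the
       allocator falls back according to its fallback trait.  */
    int *p = (int *) omp_alloc (64 * sizeof (int), omp_low_lat_mem_alloc);
    ok = (p != 0);
    omp_free (p, omp_low_lat_mem_alloc);
  }
  return ok ? 0 : 1;
}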
[PATCH 04/17] openmp, nvptx: low-lat memory access traits
The NVPTX low latency memory is not accessible outside the team that allocates it, and therefore should be unavailable for allocators with the access trait "all". This change means that the omp_low_lat_mem_alloc predefined allocator now implicitly implies the "pteam" trait. libgomp/ChangeLog: * allocator.c (MEMSPACE_VALIDATE): New macro. (omp_aligned_alloc): Use MEMSPACE_VALIDATE. (omp_aligned_calloc): Likewise. (omp_realloc): Likewise. * config/nvptx/allocator.c (nvptx_memspace_validate): New function. (MEMSPACE_VALIDATE): New macro. * testsuite/libgomp.c/allocators-4.c (main): Add access trait. * testsuite/libgomp.c/allocators-6.c (main): Add access trait. * testsuite/libgomp.c/allocators-7.c: New test. --- libgomp/allocator.c| 15 + libgomp/config/nvptx/allocator.c | 11 libgomp/testsuite/libgomp.c/allocators-4.c | 7 ++- libgomp/testsuite/libgomp.c/allocators-6.c | 7 ++- libgomp/testsuite/libgomp.c/allocators-7.c | 68 ++ 5 files changed, 102 insertions(+), 6 deletions(-) create mode 100644 libgomp/testsuite/libgomp.c/allocators-7.c diff --git a/libgomp/allocator.c b/libgomp/allocator.c index 029d0d40a36..48ab0782e6b 100644 --- a/libgomp/allocator.c +++ b/libgomp/allocator.c @@ -54,6 +54,9 @@ #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \ (PIN ? NULL : free (ADDR)) #endif +#ifndef MEMSPACE_VALIDATE +#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) 1 +#endif /* Map the predefined allocators to the correct memory space. The index to this table is the omp_allocator_handle_t enum value. */ @@ -438,6 +441,10 @@ retry: if (__builtin_add_overflow (size, new_size, &new_size)) goto fail; + if (allocator_data + && !MEMSPACE_VALIDATE (allocator_data->memspace, allocator_data->access)) +goto fail; + if (__builtin_expect (allocator_data && allocator_data->pool_size < ~(uintptr_t) 0, 0)) { @@ -733,6 +740,10 @@ retry: if (__builtin_add_overflow (size_temp, new_size, &new_size)) goto fail; + if (allocator_data + && !MEMSPACE_VALIDATE (allocator_data->memspace, allocator_data->access)) +goto fail; + if (__builtin_expect (allocator_data && allocator_data->pool_size < ~(uintptr_t) 0, 0)) { @@ -964,6 +975,10 @@ retry: goto fail; old_size = data->size; + if (allocator_data + && !MEMSPACE_VALIDATE (allocator_data->memspace, allocator_data->access)) +goto fail; + if (__builtin_expect (allocator_data && allocator_data->pool_size < ~(uintptr_t) 0, 0)) { diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c index f740b97f6ac..0102680b717 100644 --- a/libgomp/config/nvptx/allocator.c +++ b/libgomp/config/nvptx/allocator.c @@ -358,6 +358,15 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, return realloc (addr, size); } +static inline int +nvptx_memspace_validate (omp_memspace_handle_t memspace, unsigned access) +{ + /* Disallow use of low-latency memory when it must be accessible by + all threads. 
*/ + return (memspace != omp_low_lat_mem_space + || access != omp_atv_all); +} + #define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \ nvptx_memspace_alloc (MEMSPACE, SIZE) #define MEMSPACE_CALLOC(MEMSPACE, SIZE, PIN) \ @@ -366,5 +375,7 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, nvptx_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE) #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \ nvptx_memspace_free (MEMSPACE, ADDR, SIZE) +#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \ + nvptx_memspace_validate (MEMSPACE, ACCESS) #include "../../allocator.c" diff --git a/libgomp/testsuite/libgomp.c/allocators-4.c b/libgomp/testsuite/libgomp.c/allocators-4.c index 9fa6aa1624f..cae27ea33c1 100644 --- a/libgomp/testsuite/libgomp.c/allocators-4.c +++ b/libgomp/testsuite/libgomp.c/allocators-4.c @@ -23,10 +23,11 @@ main () #pragma omp target { /* Ensure that the memory we get *is* low-latency with a null-fallback. */ -omp_alloctrait_t traits[1] - = { { omp_atk_fallback, omp_atv_null_fb } }; +omp_alloctrait_t traits[2] + = { { omp_atk_fallback, omp_atv_null_fb }, + { omp_atk_access, omp_atv_pteam } }; omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space, - 1, traits); + 2, traits); int size = 4; diff --git a/libgomp/testsuite/libgomp.c/allocators-6.c b/libgomp/testsuite/libgomp.c/allocators-6.c index 90bf73095ef..c03233df582 100644 --- a/libgomp/testsuite/libgomp.c/allocators-6.c +++ b/libgomp/testsuite/libgomp.c/allocators-6.c @@ -23,10 +23,11 @@ main () #pragma omp target { /* Ensure that the memory we get *is* low-latency with a null-fallback. */ -omp_alloctrait_t traits[1] - = { { omp_atk_fallback, omp_atv_null_fb } }; +omp_alloctrait_t traits[2] + = { { omp_atk_fallback, omp_atv_null_fb },
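In other words (an illustrative snippet of mine, not from the patch): an allocator over omp_low_lat_mem_space whose access trait is omp_atv_all now fails MEMSPACE_VALIDATE, so with a null fallback the allocation returns NULL instead of handing out team-local memory that other teams could not see.

#include <omp.h>

/* Sketch: on the device, low-latency memory with access 'all' is refused;
   with the null fallback this returns NULL.  */
void *
try_low_lat_all (void)
{
  omp_alloctrait_t traits[] = { { omp_atk_fallback, omp_atv_null_fb },
				{ omp_atk_access, omp_atv_all } };
  omp_allocator_handle_t a
    = omp_init_allocator (omp_low_lat_mem_space, 2, traits);
  return omp_alloc (64, a);
}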
[PATCH 03/17] libgomp, openmp: Add ompx_pinned_mem_alloc
This creates a new predefined allocator as a shortcut for using pinned memory with OpenMP. The name uses the OpenMP extension space and is intended to be consistent with other OpenMP implementations currently in development. The allocator is equivalent to using a custom allocator with the pinned trait and the null fallback trait. libgomp/ChangeLog: * allocator.c (omp_max_predefined_alloc): Update. (omp_aligned_alloc): Support ompx_pinned_mem_alloc. (omp_free): Likewise. (omp_aligned_calloc): Likewise. (omp_realloc): Likewise. * omp.h.in (omp_allocator_handle_t): Add ompx_pinned_mem_alloc. * omp_lib.f90.in: Add ompx_pinned_mem_alloc. * testsuite/libgomp.c/alloc-pinned-5.c: New test. * testsuite/libgomp.c/alloc-pinned-6.c: New test. * testsuite/libgomp.fortran/alloc-pinned-1.f90: New test. --- libgomp/allocator.c | 60 +++ libgomp/omp.h.in | 1 + libgomp/omp_lib.f90.in| 2 + libgomp/testsuite/libgomp.c/alloc-pinned-5.c | 90 libgomp/testsuite/libgomp.c/alloc-pinned-6.c | 101 ++ .../libgomp.fortran/alloc-pinned-1.f90| 16 +++ 6 files changed, 252 insertions(+), 18 deletions(-) create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-5.c create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-6.c create mode 100644 libgomp/testsuite/libgomp.fortran/alloc-pinned-1.f90 diff --git a/libgomp/allocator.c b/libgomp/allocator.c index 54310ab93ca..029d0d40a36 100644 --- a/libgomp/allocator.c +++ b/libgomp/allocator.c @@ -35,7 +35,7 @@ #include #endif -#define omp_max_predefined_alloc omp_thread_mem_alloc +#define omp_max_predefined_alloc ompx_pinned_mem_alloc /* These macros may be overridden in config//allocator.c. */ #ifndef MEMSPACE_ALLOC @@ -67,6 +67,7 @@ static const omp_memspace_handle_t predefined_alloc_mapping[] = { omp_low_lat_mem_space, /* omp_cgroup_mem_alloc. */ omp_low_lat_mem_space, /* omp_pteam_mem_alloc. */ omp_low_lat_mem_space, /* omp_thread_mem_alloc. */ + omp_default_mem_space, /* ompx_pinned_mem_alloc. */ }; enum gomp_memkind_kind @@ -512,8 +513,11 @@ retry: = (allocator_data ? allocator_data->memspace : predefined_alloc_mapping[allocator]); - ptr = MEMSPACE_ALLOC (memspace, new_size, -allocator_data && allocator_data->pinned); + int pinned __attribute__((unused)) + = (allocator_data + ? allocator_data->pinned + : allocator == ompx_pinned_mem_alloc); + ptr = MEMSPACE_ALLOC (memspace, new_size, pinned); } if (ptr == NULL) goto fail; @@ -534,7 +538,8 @@ retry: fail: int fallback = (allocator_data ? allocator_data->fallback - : allocator == omp_default_mem_alloc + : (allocator == omp_default_mem_alloc + || allocator == ompx_pinned_mem_alloc) ? omp_atv_null_fb : omp_atv_default_mem_fb); switch (fallback) @@ -653,6 +658,7 @@ omp_free (void *ptr, omp_allocator_handle_t allocator) #endif memspace = predefined_alloc_mapping[data->allocator]; + pinned = (data->allocator == ompx_pinned_mem_alloc); } MEMSPACE_FREE (memspace, data->ptr, data->size, pinned); @@ -802,8 +808,11 @@ retry: = (allocator_data ? allocator_data->memspace : predefined_alloc_mapping[allocator]); - ptr = MEMSPACE_CALLOC (memspace, new_size, - allocator_data && allocator_data->pinned); + int pinned __attribute__((unused)) + = (allocator_data + ? allocator_data->pinned + : allocator == ompx_pinned_mem_alloc); + ptr = MEMSPACE_CALLOC (memspace, new_size, pinned); } if (ptr == NULL) goto fail; @@ -824,7 +833,8 @@ retry: fail: int fallback = (allocator_data ? allocator_data->fallback - : allocator == omp_default_mem_alloc + : (allocator == omp_default_mem_alloc + || allocator == ompx_pinned_mem_alloc) ? 
omp_atv_null_fb : omp_atv_default_mem_fb); switch (fallback) @@ -1026,11 +1036,15 @@ retry: else #endif if (prev_size) - new_ptr = MEMSPACE_REALLOC (allocator_data->memspace, data->ptr, -data->size, new_size, -(free_allocator_data - && free_allocator_data->pinned), -allocator_data->pinned); + { + int was_pinned __attribute__((unused)) + = (free_allocator_data + ? free_allocator_data->pinned + : free_allocator == ompx_pinned_mem_alloc); + new_ptr = MEMSPACE_REALLOC (allocator_data->memspace, data->ptr, + data->size, new_size, was_pinned, + allocator_data->pinned); + } else new_ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size, allocator_data->pinned); @@ -1079,10 +1093,16 @@ retry: = (allocator_data ? allocator_data->memspace : predefined_alloc_mapping[allocator]); + int was_pinned __attribute__((unused)) + = (free_allocator_data
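For illustration (not part of the patch), a minimal C sketch of how the new allocator is expected to be used, in the spirit of the alloc-pinned-5/6 tests added above; because the allocator carries the null fallback trait, a failed pinned allocation yields NULL rather than falling back to ordinary heap memory:

#include <omp.h>
#include <stdio.h>

int
main (void)
{
  /* Request 1 MiB of pinned (page-locked) host memory.  */
  char *p = (char *) omp_alloc (1 << 20, ompx_pinned_mem_alloc);
  if (p == NULL)
    {
      /* Null fallback: pinning failed, e.g. because the locked-memory
         ulimit is too low.  */
      fprintf (stderr, "pinned allocation failed\n");
      return 1;
    }
  p[0] = 1;
  omp_free (p, ompx_pinned_mem_alloc);
  return 0;
}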
[PATCH 09/17] openmp: Use libgomp memory allocation functions with unified shared memory.
This patches changes calls to malloc/free/calloc/realloc and operator new to memory allocation functions in libgomp with allocator=ompx_unified_shared_mem_alloc. This helps existing code to benefit from the unified shared memory. The libgomp does the correct thing with all the mapping constructs and there is no memory copies if the pointer is pointing to unified shared memory. We only replace replacable new operator and not the class member or placement new. gcc/ChangeLog: * omp-low.cc (usm_transform): New function. (make_pass_usm_transform): Likewise. (class pass_usm_transform): New. * passes.def: Add pass_usm_transform. * tree-pass.h (make_pass_usm_transform): New declaration. gcc/testsuite/ChangeLog: * c-c++-common/gomp/usm-2.c: New test. * c-c++-common/gomp/usm-3.c: New test. * g++.dg/gomp/usm-1.C: New test. * g++.dg/gomp/usm-2.C: New test. * g++.dg/gomp/usm-3.C: New test. * gfortran.dg/gomp/usm-2.f90: New test. * gfortran.dg/gomp/usm-3.f90: New test. libgomp/ChangeLog: * testsuite/libgomp.c/usm-6.c: New test. * testsuite/libgomp.c++/usm-1.C: Likewise. co-authored-by: Andrew Stubbs --- gcc/omp-low.cc | 174 +++ gcc/passes.def | 1 + gcc/testsuite/c-c++-common/gomp/usm-2.c | 46 ++ gcc/testsuite/c-c++-common/gomp/usm-3.c | 44 ++ gcc/testsuite/g++.dg/gomp/usm-1.C| 32 + gcc/testsuite/g++.dg/gomp/usm-2.C| 30 gcc/testsuite/g++.dg/gomp/usm-3.C| 38 + gcc/testsuite/gfortran.dg/gomp/usm-2.f90 | 16 +++ gcc/testsuite/gfortran.dg/gomp/usm-3.f90 | 13 ++ gcc/tree-pass.h | 1 + libgomp/testsuite/libgomp.c++/usm-1.C| 54 +++ libgomp/testsuite/libgomp.c/usm-6.c | 92 12 files changed, 541 insertions(+) create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-2.c create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-3.c create mode 100644 gcc/testsuite/g++.dg/gomp/usm-1.C create mode 100644 gcc/testsuite/g++.dg/gomp/usm-2.C create mode 100644 gcc/testsuite/g++.dg/gomp/usm-3.C create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-2.f90 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-3.f90 create mode 100644 libgomp/testsuite/libgomp.c++/usm-1.C create mode 100644 libgomp/testsuite/libgomp.c/usm-6.c diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index ba612e5c67d..cdadd6f0c96 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -15097,6 +15097,180 @@ make_pass_diagnose_omp_blocks (gcc::context *ctxt) { return new pass_diagnose_omp_blocks (ctxt); } + +/* Provide transformation required for using unified shared memory + by replacing calls to standard memory allocation functions with + function provided by the libgomp. */ + +static tree +usm_transform (gimple_stmt_iterator *gsi_p, bool *, + struct walk_stmt_info *wi) +{ + gimple *stmt = gsi_stmt (*gsi_p); + /* ompx_unified_shared_mem_alloc is 10. 
*/ + const unsigned int unified_shared_mem_alloc = 10; + + switch (gimple_code (stmt)) +{ +case GIMPLE_CALL: + { + gcall *gs = as_a (stmt); + tree fndecl = gimple_call_fndecl (gs); + if (fndecl) + { + tree allocator = build_int_cst (pointer_sized_int_node, + unified_shared_mem_alloc); + const char *name = IDENTIFIER_POINTER (DECL_NAME (fndecl)); + if ((strcmp (name, "malloc") == 0) + || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL) + && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_MALLOC) + || DECL_IS_REPLACEABLE_OPERATOR_NEW_P (fndecl) + || strcmp (name, "omp_target_alloc") == 0) + { + tree omp_alloc_type + = build_function_type_list (ptr_type_node, size_type_node, + pointer_sized_int_node, + NULL_TREE); + tree repl = build_fn_decl ("omp_alloc", omp_alloc_type); + tree size = gimple_call_arg (gs, 0); + gimple *g = gimple_build_call (repl, 2, size, allocator); + gimple_call_set_lhs (g, gimple_call_lhs (gs)); + gimple_set_location (g, gimple_location (stmt)); + gsi_replace (gsi_p, g, true); + } + else if (strcmp (name, "aligned_alloc") == 0) + { + /* May be we can also use this for new operator with + std::align_val_t parameter. */ + tree omp_alloc_type + = build_function_type_list (ptr_type_node, size_type_node, + size_type_node, + pointer_sized_int_node, + NULL_TREE); + tree repl = build_fn_decl ("omp_aligned_alloc", + omp_alloc_type); + tree align = gimple_call_arg (gs, 0); + tree size = gimple_call_arg (gs, 1); + gimple *g = gimple_build_call (repl, 3, align, size, + allocator); + gimple_call_set_lhs (g, gimple_call_lhs (gs)); + gimple_set_location (g, gimple_location (stmt)); + gsi_replace (gsi_p, g, true); + } + else if ((strcmp (name, "calloc&
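In effect the pass rewrites ordinary allocation calls as if the user had requested the unified-shared-memory allocator directly. A hedged before/after sketch (the wrapper functions below are purely illustrative; only the redirection of malloc to omp_alloc, and of aligned_alloc to omp_aligned_alloc, is taken from the patch):

#include <stdlib.h>
#include <omp.h>

/* Original user code, compiled with unified shared memory enabled.  */
void *
alloc_buffer (size_t n)
{
  return malloc (n);
}

/* What usm_transform effectively turns the call into; allocator 10 is
   ompx_unified_shared_mem_alloc.  aligned_alloc is redirected to
   omp_aligned_alloc in the same way.  */
void *
alloc_buffer_transformed (size_t n)
{
  return omp_alloc (n, ompx_unified_shared_mem_alloc);
}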
[PATCH 06/17] openmp: Add -foffload-memory
Add a new option. It's inactive until I add some follow-up patches. gcc/ChangeLog: * common.opt: Add -foffload-memory and its enum values. * coretypes.h (enum offload_memory): New. * doc/invoke.texi: Document -foffload-memory. --- gcc/common.opt | 16 gcc/coretypes.h | 7 +++ gcc/doc/invoke.texi | 16 +++- 3 files changed, 38 insertions(+), 1 deletion(-) diff --git a/gcc/common.opt b/gcc/common.opt index e7a51e882ba..8d76980fbbb 100644 --- a/gcc/common.opt +++ b/gcc/common.opt @@ -2213,6 +2213,22 @@ Enum(offload_abi) String(ilp32) Value(OFFLOAD_ABI_ILP32) EnumValue Enum(offload_abi) String(lp64) Value(OFFLOAD_ABI_LP64) +foffload-memory= +Common Joined RejectNegative Enum(offload_memory) Var(flag_offload_memory) Init(OFFLOAD_MEMORY_NONE) +-foffload-memory=[none|unified|pinned] Use an offload memory optimization. + +Enum +Name(offload_memory) Type(enum offload_memory) UnknownError(Unknown offload memory option %qs) + +EnumValue +Enum(offload_memory) String(none) Value(OFFLOAD_MEMORY_NONE) + +EnumValue +Enum(offload_memory) String(unified) Value(OFFLOAD_MEMORY_UNIFIED) + +EnumValue +Enum(offload_memory) String(pinned) Value(OFFLOAD_MEMORY_PINNED) + fomit-frame-pointer Common Var(flag_omit_frame_pointer) Optimization When possible do not generate stack frames. diff --git a/gcc/coretypes.h b/gcc/coretypes.h index 08b9ac9094c..dd52d5bb113 100644 --- a/gcc/coretypes.h +++ b/gcc/coretypes.h @@ -206,6 +206,13 @@ enum offload_abi { OFFLOAD_ABI_ILP32 }; +/* Types of memory optimization for an offload device. */ +enum offload_memory { + OFFLOAD_MEMORY_NONE, + OFFLOAD_MEMORY_UNIFIED, + OFFLOAD_MEMORY_PINNED +}; + /* Types of profile update methods. */ enum profile_update { PROFILE_UPDATE_SINGLE, diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index d5ff1018372..3df39bb06e3 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -202,7 +202,7 @@ in the following sections. -fno-builtin -fno-builtin-@var{function} -fcond-mismatch @gol -ffreestanding -fgimple -fgnu-tm -fgnu89-inline -fhosted @gol -flax-vector-conversions -fms-extensions @gol --foffload=@var{arg} -foffload-options=@var{arg} @gol +-foffload=@var{arg} -foffload-options=@var{arg} -foffload-memory=@var{arg} @gol -fopenacc -fopenacc-dim=@var{geom} @gol -fopenmp -fopenmp-simd @gol -fpermitted-flt-eval-methods=@var{standard} @gol @@ -2708,6 +2708,20 @@ Typical command lines are -foffload-options=amdgcn-amdhsa=-march=gfx906 -foffload-options=-lm @end smallexample +@item -foffload-memory=none +@itemx -foffload-memory=unified +@itemx -foffload-memory=pinned +@opindex foffload-memory +@cindex OpenMP offloading memory modes +Enable a memory optimization mode to use with OpenMP. The default behavior, +@option{-foffload-memory=none}, is to do nothing special (unless enabled via +a requires directive in the code). @option{-foffload-memory=unified} is +equivalent to @code{#pragma omp requires unified_shared_memory}. +@option{-foffload-memory=pinned} forces all host memory to be pinned (this +mode may require the user to increase the ulimit setting for locked memory). +All translation units must select the same setting to avoid undefined +behavior. + @item -fopenacc @opindex fopenacc @cindex OpenACC accelerator programming
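As a concrete example of the documented equivalence (a sketch, not part of the patch): building a file with 'gcc -fopenmp -foffload-memory=unified test.c' is intended to behave as if each translation unit contained

#pragma omp requires unified_shared_memory

whereas -foffload-memory=pinned has no directive equivalent and simply pins all host memory at start-up (see patch 08 later in this series).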
[PATCH 05/17] openmp, nvptx: ompx_unified_shared_mem_alloc
This adds support for using Cuda Managed Memory with omp_alloc. It will be used as the underpinnings for "requires unified_shared_memory" in a later patch. There are two new predefined allocators, ompx_unified_shared_mem_alloc and ompx_host_mem_alloc, plus corresponding memory spaces, which can be used to allocate memory in the "managed" space and explicitly on the host (it is intended that "malloc" will be intercepted by the compiler). The nvptx plugin is modified to make the necessary Cuda calls, and libgomp is modified to switch to shared-memory mode for USM allocated mappings. include/ChangeLog: * cuda/cuda.h (CUdevice_attribute): Add definitions for CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR and CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR. (CUmemAttach_flags): New. (CUpointer_attribute): New. (cuMemAllocManaged): New prototype. (cuPointerGetAttribute): New prototype. libgomp/ChangeLog: * allocator.c (omp_max_predefined_alloc): Update. (omp_aligned_alloc): Don't fallback ompx_host_mem_alloc. (omp_aligned_calloc): Likewise. (omp_realloc): Likewise. * config/linux/allocator.c (linux_memspace_alloc): Handle USM. (linux_memspace_calloc): Handle USM. (linux_memspace_free): Handle USM. (linux_memspace_realloc): Handle USM. * config/nvptx/allocator.c (nvptx_memspace_alloc): Reject ompx_host_mem_alloc. (nvptx_memspace_calloc): Likewise. (nvptx_memspace_realloc): Likewise. * libgomp-plugin.h (GOMP_OFFLOAD_usm_alloc): New prototype. (GOMP_OFFLOAD_usm_free): New prototype. (GOMP_OFFLOAD_is_usm_ptr): New prototype. * libgomp.h (gomp_usm_alloc): New prototype. (gomp_usm_free): New prototype. (gomp_is_usm_ptr): New prototype. (struct gomp_device_descr): Add USM functions. * omp.h.in (omp_memspace_handle_t): Add ompx_unified_shared_mem_space and ompx_host_mem_space. (omp_allocator_handle_t): Add ompx_unified_shared_mem_alloc and ompx_host_mem_alloc. * omp_lib.f90.in: Likewise. * plugin/cuda-lib.def (cuMemAllocManaged): Add new call. (cuPointerGetAttribute): Likewise. * plugin/plugin-nvptx.c (nvptx_alloc): Add "usm" parameter. Call cuMemAllocManaged as appropriate. (GOMP_OFFLOAD_get_num_devices): Allow GOMP_REQUIRES_UNIFIED_ADDRESS and GOMP_REQUIRES_UNIFIED_SHARED_MEMORY. (GOMP_OFFLOAD_alloc): Move internals to ... (GOMP_OFFLOAD_alloc_1): ... this, and add usm parameter. (GOMP_OFFLOAD_usm_alloc): New function. (GOMP_OFFLOAD_usm_free): New function. (GOMP_OFFLOAD_is_usm_ptr): New function. * target.c (gomp_map_vars_internal): Add USM support. (gomp_usm_alloc): New function. (gomp_usm_free): New function. (gomp_load_plugin_for_device): New function. * testsuite/libgomp.c/usm-1.c: New test. * testsuite/libgomp.c/usm-2.c: New test. * testsuite/libgomp.c/usm-3.c: New test. * testsuite/libgomp.c/usm-4.c: New test. * testsuite/libgomp.c/usm-5.c: New test. co-authored-by: Kwok Cheung Yeung squash! 
openmp, nvptx: ompx_unified_shared_mem_alloc --- include/cuda/cuda.h | 12 ++ libgomp/allocator.c | 13 -- libgomp/config/linux/allocator.c| 48 ++ libgomp/config/nvptx/allocator.c| 6 +++ libgomp/libgomp-plugin.h| 3 ++ libgomp/libgomp.h | 6 +++ libgomp/omp.h.in| 4 ++ libgomp/omp_lib.f90.in | 8 libgomp/plugin/cuda-lib.def | 2 + libgomp/plugin/plugin-nvptx.c | 47 ++--- libgomp/target.c| 64 + libgomp/testsuite/libgomp.c/usm-1.c | 24 +++ libgomp/testsuite/libgomp.c/usm-2.c | 32 +++ libgomp/testsuite/libgomp.c/usm-3.c | 35 libgomp/testsuite/libgomp.c/usm-4.c | 36 libgomp/testsuite/libgomp.c/usm-5.c | 28 + 16 files changed, 340 insertions(+), 28 deletions(-) create mode 100644 libgomp/testsuite/libgomp.c/usm-1.c create mode 100644 libgomp/testsuite/libgomp.c/usm-2.c create mode 100644 libgomp/testsuite/libgomp.c/usm-3.c create mode 100644 libgomp/testsuite/libgomp.c/usm-4.c create mode 100644 libgomp/testsuite/libgomp.c/usm-5.c diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h index 3938d05d150..8135e7c9247 100644 --- a/include/cuda/cuda.h +++ b/include/cuda/cuda.h @@ -77,9 +77,19 @@ typedef enum { CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39, CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40, + CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75, + CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
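A rough C sketch of the kind of program the new libgomp.c/usm-*.c tests exercise (the exact test contents are not reproduced here; the zero-copy behaviour for USM mappings is as described above):

#include <omp.h>
#include <stdlib.h>

int
main (void)
{
  int n = 1024;
  int *a = omp_alloc (n * sizeof (int), ompx_unified_shared_mem_alloc);
  if (a == NULL)
    return 0;  /* No managed-memory allocation available.  */

  for (int i = 0; i < n; i++)
    a[i] = i;

  /* The buffer lives in managed (unified) memory, so libgomp switches the
     mapping to shared-memory mode and no host/device copy is performed.  */
  #pragma omp target map(tofrom: a[0:n])
  for (int i = 0; i < n; i++)
    a[i] *= 2;

  for (int i = 0; i < n; i++)
    if (a[i] != 2 * i)
      abort ();

  omp_free (a, ompx_unified_shared_mem_alloc);
  return 0;
}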
[PATCH 12/17] Handle cleanup of omp allocated variables (OpenMP 5.0).
Currently we are only handling omp allocate directive that is associated with an allocate statement. This statement results in malloc and free calls. The malloc calls are easy to get to as they are in the same block as allocate directive. But the free calls come in a separate cleanup block. To help any later passes finding them, an allocate directive is generated in the cleanup block with kind=free. The normal allocate directive is given kind=allocate. gcc/fortran/ChangeLog: * gfortran.h (struct access_ref): Declare new members omp_allocated and omp_allocated_end. * openmp.cc (gfc_match_omp_allocate): Set new_st.resolved_sym to NULL. (prepare_omp_allocated_var_list_for_cleanup): New function. (gfc_resolve_omp_allocate): Call it. * trans-decl.cc (gfc_trans_deferred_vars): Process omp_allocated. * trans-openmp.cc (gfc_trans_omp_allocate): Set kind for the stmt generated for allocate directive. gcc/ChangeLog: * tree-core.h (struct tree_base): Add comments. * tree-pretty-print.cc (dump_generic_node): Handle allocate directive kind. * tree.h (OMP_ALLOCATE_KIND_ALLOCATE): New define. (OMP_ALLOCATE_KIND_FREE): Likewise. gcc/testsuite/ChangeLog: * gfortran.dg/gomp/allocate-6.f90: Test kind of allocate directive. --- gcc/fortran/gfortran.h| 1 + gcc/fortran/openmp.cc | 30 +++ gcc/fortran/trans-decl.cc | 20 + gcc/fortran/trans-openmp.cc | 6 gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 | 3 +- gcc/tree-core.h | 6 gcc/tree-pretty-print.cc | 4 +++ gcc/tree.h| 4 +++ 8 files changed, 73 insertions(+), 1 deletion(-) diff --git a/gcc/fortran/gfortran.h b/gcc/fortran/gfortran.h index 755469185a6..c6f58341cf3 100644 --- a/gcc/fortran/gfortran.h +++ b/gcc/fortran/gfortran.h @@ -1829,6 +1829,7 @@ typedef struct gfc_symbol gfc_array_spec *as; struct gfc_symbol *result; /* function result symbol */ gfc_component *components; /* Derived type components */ + gfc_omp_namelist *omp_allocated, *omp_allocated_end; /* Defined only for Cray pointees; points to their pointer. 
*/ struct gfc_symbol *cp_pointer; diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc index 38003890bb0..4c94bc763b5 100644 --- a/gcc/fortran/openmp.cc +++ b/gcc/fortran/openmp.cc @@ -6057,6 +6057,7 @@ gfc_match_omp_allocate (void) new_st.op = EXEC_OMP_ALLOCATE; new_st.ext.omp_clauses = c; + new_st.resolved_sym = NULL; gfc_free_expr (allocator); return MATCH_YES; } @@ -9548,6 +9549,34 @@ gfc_resolve_oacc_routines (gfc_namespace *ns) } } +static void +prepare_omp_allocated_var_list_for_cleanup (gfc_omp_namelist *cn, locus loc) +{ + gfc_symbol *proc = cn->sym->ns->proc_name; + gfc_omp_namelist *p, *n; + + for (n = cn; n; n = n->next) +{ + if (n->sym->attr.allocatable && !n->sym->attr.save + && !n->sym->attr.result && !proc->attr.is_main_program) + { + p = gfc_get_omp_namelist (); + p->sym = n->sym; + p->expr = gfc_copy_expr (n->expr); + p->where = loc; + p->next = NULL; + if (proc->omp_allocated == NULL) + proc->omp_allocated_end = proc->omp_allocated = p; + else + { + proc->omp_allocated_end->next = p; + proc->omp_allocated_end = p; + } + + } +} +} + static void check_allocate_directive_restrictions (gfc_symbol *sym, gfc_expr *omp_al, gfc_namespace *ns, locus loc) @@ -9678,6 +9707,7 @@ gfc_resolve_omp_allocate (gfc_code *code, gfc_namespace *ns) code->loc); } } + prepare_omp_allocated_var_list_for_cleanup (cn, code->loc); } diff --git a/gcc/fortran/trans-decl.cc b/gcc/fortran/trans-decl.cc index 6493cc2f6b1..326365f22fc 100644 --- a/gcc/fortran/trans-decl.cc +++ b/gcc/fortran/trans-decl.cc @@ -4588,6 +4588,26 @@ gfc_trans_deferred_vars (gfc_symbol * proc_sym, gfc_wrapped_block * block) } } + /* Generate a dummy allocate pragma with free kind so that cleanup + of those variables which were allocated using the allocate statement + associated with an allocate clause happens correctly. */ + + if (proc_sym->omp_allocated) +{ + gfc_clear_new_st (); + new_st.op = EXEC_OMP_ALLOCATE; + gfc_omp_clauses *c = gfc_get_omp_clauses (); + c->lists[OMP_LIST_ALLOCATOR] = proc_sym->omp_allocated; + new_st.ext.omp_clauses = c; + /* This is just a hacky way to convey to handler that we are + dealing with cleanup here. Saves us from using another field + for it. */ + new_st.resolved_sym = proc_sym->omp_allocated->sym; + gfc_add_init_cleanup (block, NULL, + gfc_trans_omp_directive (&new_st)); + gfc_free_omp_clauses (c); + proc_sym->omp_allocated = NULL; +} /* Initialize the INTENT(OUT) derived type dummy argu
[PATCH 07/17] openmp: allow requires unified_shared_memory
This is the front-end portion of the Unified Shared Memory implementation. It removes the "sorry, unimplemented message" in C, C++, and Fortran, and sets flag_offload_memory, but is otherwise inactive, for now. It also checks that -foffload-memory isn't set to an incompatible mode. gcc/c/ChangeLog: * c-parser.cc (c_parser_omp_requires): Allow "requires unified_share_memory" and "unified_address". gcc/cp/ChangeLog: * parser.cc (cp_parser_omp_requires): Allow "requires unified_share_memory" and "unified_address". gcc/fortran/ChangeLog: * openmp.cc (gfc_match_omp_requires): Allow "requires unified_share_memory" and "unified_address". gcc/testsuite/ChangeLog: * c-c++-common/gomp/usm-1.c: New test. * c-c++-common/gomp/usm-4.c: New test. * gfortran.dg/gomp/usm-1.f90: New test. * gfortran.dg/gomp/usm-4.f90: New test. --- gcc/c/c-parser.cc| 22 -- gcc/cp/parser.cc | 22 -- gcc/fortran/openmp.cc| 13 + gcc/testsuite/c-c++-common/gomp/usm-1.c | 4 gcc/testsuite/c-c++-common/gomp/usm-4.c | 4 gcc/testsuite/gfortran.dg/gomp/usm-1.f90 | 6 ++ gcc/testsuite/gfortran.dg/gomp/usm-4.f90 | 6 ++ 7 files changed, 73 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-1.c create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-4.c create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-1.f90 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-4.f90 diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc index 9c02141e2c6..c30f67cd2da 100644 --- a/gcc/c/c-parser.cc +++ b/gcc/c/c-parser.cc @@ -22726,9 +22726,27 @@ c_parser_omp_requires (c_parser *parser) enum omp_requires this_req = (enum omp_requires) 0; if (!strcmp (p, "unified_address")) - this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + { + this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "% is incompatible with the " + "selected %<-foffload-memory%> option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "unified_shared_memory")) - this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + { + this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "% is incompatible with the " + "selected %<-foffload-memory%> option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "dynamic_allocators")) this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS; else if (!strcmp (p, "reverse_offload")) diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc index df657a3fb2b..3deafc7c928 100644 --- a/gcc/cp/parser.cc +++ b/gcc/cp/parser.cc @@ -46860,9 +46860,27 @@ cp_parser_omp_requires (cp_parser *parser, cp_token *pragma_tok) enum omp_requires this_req = (enum omp_requires) 0; if (!strcmp (p, "unified_address")) - this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + { + this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "% is incompatible with the " + "selected %<-foffload-memory%> option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "unified_shared_memory")) - this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + { + this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "% is incompatible with the " + "selected %<-foffload-memory%> option"); + flag_offload_memory = 
OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "dynamic_allocators")) this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS; else if (!strcmp (p, "reverse_offload")) diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc index bd4ff259fe0..91bf8a3c50d 100644 --- a/gcc/fortran/openmp.cc +++ b/gcc/fortran/openmp.cc @@ -29,6 +29,7 @@ along with GCC; see the file COPYING3. If not see #include "diagnostic.h" #include "gomp-constants.h" #include "target-memory.h" /* For gfc_encode_character. */ +#include "options.h" /* Match an end of OpenMP directive. End of OpenMP directive is optional whitespace, followed by '\n' or comment '!'. */ @@ -5556,6 +5557,12 @@ gfc_match_omp_requires (void) requires_clause = OMP_REQ_UNIFIED_ADDRESS; if (requires_clauses & OMP_REQ_UNIFIED_ADDRESS) goto duplicate_clause; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + gfc_error_now ("unified_address
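For illustration, the user-visible effect of this patch (a sketch; the diagnostic text is taken from the error calls above):

/* Previously rejected with "sorry, unimplemented"; now accepted and
   switches the compilation to unified-shared-memory mode.  */
#pragma omp requires unified_shared_memory

int x;

/* Compiling the same file with -foffload-memory=pinned is diagnosed:
   "unified_shared_memory is incompatible with the selected
   -foffload-memory option".  */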
[PATCH 11/17] Translate allocate directive (OpenMP 5.0).
gcc/fortran/ChangeLog: * trans-openmp.cc (gfc_trans_omp_clauses): Handle OMP_LIST_ALLOCATOR. (gfc_trans_omp_allocate): New function. (gfc_trans_omp_directive): Handle EXEC_OMP_ALLOCATE. gcc/ChangeLog: * tree-pretty-print.cc (dump_omp_clause): Handle OMP_CLAUSE_ALLOCATOR. (dump_generic_node): Handle OMP_ALLOCATE. * tree.def (OMP_ALLOCATE): New. * tree.h (OMP_ALLOCATE_CLAUSES): Likewise. (OMP_ALLOCATE_DECL): Likewise. (OMP_ALLOCATE_ALLOCATOR): Likewise. * tree.cc (omp_clause_num_ops): Add entry for OMP_CLAUSE_ALLOCATOR. gcc/testsuite/ChangeLog: * gfortran.dg/gomp/allocate-6.f90: New test. --- gcc/fortran/trans-openmp.cc | 44 gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 | 72 +++ gcc/tree-core.h | 3 + gcc/tree-pretty-print.cc | 19 + gcc/tree.cc | 1 + gcc/tree.def | 4 ++ gcc/tree.h| 11 +++ 7 files changed, 154 insertions(+) create mode 100644 gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 diff --git a/gcc/fortran/trans-openmp.cc b/gcc/fortran/trans-openmp.cc index de27ed52c02..3ee63e416ed 100644 --- a/gcc/fortran/trans-openmp.cc +++ b/gcc/fortran/trans-openmp.cc @@ -2728,6 +2728,28 @@ gfc_trans_omp_clauses (stmtblock_t *block, gfc_omp_clauses *clauses, } } break; + case OMP_LIST_ALLOCATOR: + for (; n != NULL; n = n->next) + if (n->sym->attr.referenced) + { + tree t = gfc_trans_omp_variable (n->sym, false); + if (t != error_mark_node) + { + tree node = build_omp_clause (input_location, + OMP_CLAUSE_ALLOCATOR); + OMP_ALLOCATE_DECL (node) = t; + if (n->expr) + { + tree allocator_; + gfc_init_se (&se, NULL); + gfc_conv_expr (&se, n->expr); + allocator_ = gfc_evaluate_now (se.expr, block); + OMP_ALLOCATE_ALLOCATOR (node) = allocator_; + } + omp_clauses = gfc_trans_add_clause (node, omp_clauses); + } + } + break; case OMP_LIST_LINEAR: { gfc_expr *last_step_expr = NULL; @@ -4982,6 +5004,26 @@ gfc_trans_omp_atomic (gfc_code *code) return gfc_finish_block (&block); } +static tree +gfc_trans_omp_allocate (gfc_code *code) +{ + stmtblock_t block; + tree stmt; + + gfc_omp_clauses *clauses = code->ext.omp_clauses; + gcc_assert (clauses); + + gfc_start_block (&block); + stmt = make_node (OMP_ALLOCATE); + TREE_TYPE (stmt) = void_type_node; + OMP_ALLOCATE_CLAUSES (stmt) = gfc_trans_omp_clauses (&block, clauses, + code->loc, false, + true); + gfc_add_expr_to_block (&block, stmt); + gfc_merge_block_scope (&block); + return gfc_finish_block (&block); +} + static tree gfc_trans_omp_barrier (void) { @@ -7488,6 +7530,8 @@ gfc_trans_omp_directive (gfc_code *code) { switch (code->op) { +case EXEC_OMP_ALLOCATE: + return gfc_trans_omp_allocate (code); case EXEC_OMP_ATOMIC: return gfc_trans_omp_atomic (code); case EXEC_OMP_BARRIER: diff --git a/gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 b/gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 new file mode 100644 index 000..2de2b52ee44 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 @@ -0,0 +1,72 @@ +! { dg-do compile } +! 
{ dg-additional-options "-fdump-tree-original" } + +module omp_lib_kinds + use iso_c_binding, only: c_int, c_intptr_t + implicit none + private :: c_int, c_intptr_t + integer, parameter :: omp_allocator_handle_kind = c_intptr_t + + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_null_allocator = 0 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_default_mem_alloc = 1 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_large_cap_mem_alloc = 2 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_const_mem_alloc = 3 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_high_bw_mem_alloc = 4 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_low_lat_mem_alloc = 5 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_cgroup_mem_alloc = 6 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_pteam_mem_alloc = 7 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_thread_mem_alloc = 8 +end module + + +subroutine foo(x, y, al) + use omp_lib_kinds + implicit none + +type :: my_type + integer :: i + integer :: j + real :: x +end type + + integer :: x + integer :: y + integer (kind=omp_allocator_handle_kind) :: al + + integer, allocatable :: var1 + integer, allocatable :: var2 + real, allocatable :: var3(:,:) + type (my_type), allocatable :: var4 + integer, pointer :: pii, parr(:) + + character, allocatable :: str1a, str1aarr(:) + character(len=5), allocatable :: str5a, str5aarr(:) + + !$
[PATCH 14/17] Lower allocate directive (OpenMP 5.0).
This patch looks for malloc/free calls that were generated by allocate statement that is associated with allocate directive and replaces them with GOMP_alloc and GOMP_free. gcc/ChangeLog: * omp-low.cc (scan_sharing_clauses): Handle OMP_CLAUSE_ALLOCATOR. (scan_omp_allocate): New. (scan_omp_1_stmt): Call it. (lower_omp_allocate): New function. (lower_omp_1): Call it. gcc/testsuite/ChangeLog: * gfortran.dg/gomp/allocate-6.f90: Add tests. * gfortran.dg/gomp/allocate-7.f90: New test. * gfortran.dg/gomp/allocate-8.f90: New test. libgomp/ChangeLog: * testsuite/libgomp.fortran/allocate-2.f90: New test. --- gcc/omp-low.cc| 139 ++ gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 | 9 ++ gcc/testsuite/gfortran.dg/gomp/allocate-7.f90 | 13 ++ gcc/testsuite/gfortran.dg/gomp/allocate-8.f90 | 15 ++ .../testsuite/libgomp.fortran/allocate-2.f90 | 48 ++ 5 files changed, 224 insertions(+) create mode 100644 gcc/testsuite/gfortran.dg/gomp/allocate-7.f90 create mode 100644 gcc/testsuite/gfortran.dg/gomp/allocate-8.f90 create mode 100644 libgomp/testsuite/libgomp.fortran/allocate-2.f90 diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index cdadd6f0c96..7d1a2a0d795 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -1746,6 +1746,7 @@ scan_sharing_clauses (tree clauses, omp_context *ctx) case OMP_CLAUSE_FINALIZE: case OMP_CLAUSE_TASK_REDUCTION: case OMP_CLAUSE_ALLOCATE: + case OMP_CLAUSE_ALLOCATOR: break; case OMP_CLAUSE_ALIGNED: @@ -1963,6 +1964,7 @@ scan_sharing_clauses (tree clauses, omp_context *ctx) case OMP_CLAUSE_FINALIZE: case OMP_CLAUSE_FILTER: case OMP_CLAUSE__CONDTEMP_: + case OMP_CLAUSE_ALLOCATOR: break; case OMP_CLAUSE__CACHE_: @@ -3033,6 +3035,16 @@ scan_omp_simd_scan (gimple_stmt_iterator *gsi, gomp_for *stmt, maybe_lookup_ctx (new_stmt)->for_simd_scan_phase = true; } +/* Scan an OpenMP allocate directive. */ + +static void +scan_omp_allocate (gomp_allocate *stmt, omp_context *outer_ctx) +{ + omp_context *ctx; + ctx = new_omp_context (stmt, outer_ctx); + scan_sharing_clauses (gimple_omp_allocate_clauses (stmt), ctx); +} + /* Scan an OpenMP sections directive. */ static void @@ -4332,6 +4344,9 @@ scan_omp_1_stmt (gimple_stmt_iterator *gsi, bool *handled_ops_p, insert_decl_map (&ctx->cb, var, var); } break; +case GIMPLE_OMP_ALLOCATE: + scan_omp_allocate (as_a (stmt), ctx); + break; default: *handled_ops_p = false; break; @@ -8768,6 +8783,125 @@ lower_omp_single_simple (gomp_single *single_stmt, gimple_seq *pre_p) gimple_seq_add_stmt (pre_p, gimple_build_label (flabel)); } +static void +lower_omp_allocate (gimple_stmt_iterator *gsi_p, omp_context *ctx) +{ + gomp_allocate *st = as_a (gsi_stmt (*gsi_p)); + tree clauses = gimple_omp_allocate_clauses (st); + int kind = gimple_omp_allocate_kind (st); + gcc_assert (kind == GF_OMP_ALLOCATE_KIND_ALLOCATE + || kind == GF_OMP_ALLOCATE_KIND_FREE); + + for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) +{ + if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_ALLOCATOR) + continue; + + bool allocate = (kind == GF_OMP_ALLOCATE_KIND_ALLOCATE); + /* The allocate directives that appear in a target region must specify + an allocator clause unless a requires directive with the + dynamic_allocators clause is present in the same compilation unit. 
*/ + if (OMP_ALLOCATE_ALLOCATOR (c) == NULL_TREE + && ((omp_requires_mask & OMP_REQUIRES_DYNAMIC_ALLOCATORS) == 0) + && omp_maybe_offloaded_ctx (ctx)) + error_at (OMP_CLAUSE_LOCATION (c), "% directive must" + " specify an allocator here"); + + tree var = OMP_ALLOCATE_DECL (c); + + gimple_stmt_iterator gsi = *gsi_p; + for (gsi_next (&gsi); !gsi_end_p (gsi); gsi_next (&gsi)) + { + gimple *stmt = gsi_stmt (gsi); + + if (gimple_code (stmt) != GIMPLE_CALL + || (allocate && gimple_call_fndecl (stmt) + != builtin_decl_explicit (BUILT_IN_MALLOC)) + || (!allocate && gimple_call_fndecl (stmt) + != builtin_decl_explicit (BUILT_IN_FREE))) + continue; + const gcall *gs = as_a (stmt); + tree allocator = OMP_ALLOCATE_ALLOCATOR (c) + ? OMP_ALLOCATE_ALLOCATOR (c) + : integer_zero_node; + if (allocate) + { + tree lhs = gimple_call_lhs (gs); + if (lhs && TREE_CODE (lhs) == SSA_NAME) + { + gimple_stmt_iterator gsi2 = gsi; + gsi_next (&gsi2); + gimple *assign = gsi_stmt (gsi2); + if (gimple_code (assign) == GIMPLE_ASSIGN) + { + lhs = gimple_assign_lhs (as_a (assign)); + if (lhs == NULL_TREE + || TREE_CODE (lhs) != COMPONENT_REF) + continue; + lhs = TREE_OPERAND (lhs, 0); + } + } + + if (lhs == var) + { + unsigned HOST_WIDE_INT ialign = 0; + tree align; + if (TYPE_P (var)) + ialign = TYPE_ALIGN_UNIT (var); + else + ialign = DECL_ALIGN_UNIT (var);
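At the C level the rewrite amounts to the following sketch (the wrapper functions are purely illustrative; the GOMP_alloc/GOMP_free argument order is as in the existing libgomp entry points):

#include <stdlib.h>
#include <stdint.h>

extern void *GOMP_alloc (size_t alignment, size_t size, uintptr_t allocator);
extern void GOMP_free (void *ptr, uintptr_t allocator);

/* Shape of what the Fortran front end emits for the ALLOCATE statement
   associated with the directive ...  */
void
allocate_before (void **p, size_t size)
{
  *p = malloc (size);
  /* ... */
  free (*p);
}

/* ... and what lower_omp_allocate turns it into, using the allocator from
   the directive's clause and the alignment of the allocated variable.  */
void
allocate_after (void **p, size_t size, size_t align, uintptr_t allocator)
{
  *p = GOMP_alloc (align, size, allocator);
  /* ... */
  GOMP_free (*p, allocator);
}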
[PATCH 08/17] openmp: -foffload-memory=pinned
Implement the -foffload-memory=pinned option such that libgomp is instructed to enable fully-pinned memory at start-up. The option is intended to provide a performance boost to certain offload programs without modifying the code. This feature only works on Linux, at present, and simply calls mlockall to enable always-on memory pinning. It requires that the ulimit feature is set high enough to accommodate all the program's memory usage. In this mode the ompx_pinned_memory_alloc feature is disabled as it is not needed and may conflict. gcc/ChangeLog: * omp-builtins.def (BUILT_IN_GOMP_ENABLE_PINNED_MODE): New. * omp-low.cc (omp_enable_pinned_mode): New function. (execute_lower_omp): Call omp_enable_pinned_mode. libgomp/ChangeLog: * config/linux/allocator.c (always_pinned_mode): New variable. (GOMP_enable_pinned_mode): New function. (linux_memspace_alloc): Disable pinning when always_pinned_mode set. (linux_memspace_calloc): Likewise. (linux_memspace_free): Likewise. (linux_memspace_realloc): Likewise. * libgomp.map: Add GOMP_enable_pinned_mode. * testsuite/libgomp.c/alloc-pinned-7.c: New test. gcc/testsuite/ChangeLog: * c-c++-common/gomp/alloc-pinned-1.c: New test. --- gcc/omp-builtins.def | 3 + gcc/omp-low.cc| 66 +++ .../c-c++-common/gomp/alloc-pinned-1.c| 28 libgomp/config/linux/allocator.c | 26 libgomp/libgomp.map | 1 + libgomp/target.c | 4 +- libgomp/testsuite/libgomp.c/alloc-pinned-7.c | 63 ++ 7 files changed, 190 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-7.c diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def index ee5213eedcf..276dd7812f2 100644 --- a/gcc/omp-builtins.def +++ b/gcc/omp-builtins.def @@ -470,3 +470,6 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_WARNING, "GOMP_warning", BT_FN_VOID_CONST_PTR_SIZE, ATTR_NOTHROW_LEAF_LIST) DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ERROR, "GOMP_error", BT_FN_VOID_CONST_PTR_SIZE, ATTR_COLD_NORETURN_NOTHROW_LEAF_LIST) +DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ENABLE_PINNED_MODE, + "GOMP_enable_pinned_mode", + BT_FN_VOID, ATTR_NOTHROW_LIST) diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index d73c165f029..ba612e5c67d 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -14620,6 +14620,68 @@ lower_omp (gimple_seq *body, omp_context *ctx) input_location = saved_location; } +/* Emit a constructor function to enable -foffload-memory=pinned + at runtime. Libgomp handles the OS mode setting, but we need to trigger + it by calling GOMP_enable_pinned mode before the program proper runs. 
*/ + +static void +omp_enable_pinned_mode () +{ + static bool visited = false; + if (visited) +return; + visited = true; + + /* Create a new function like this: + + static void __attribute__((constructor)) + __set_pinned_mode () + { + GOMP_enable_pinned_mode (); + } + */ + + tree name = get_identifier ("__set_pinned_mode"); + tree voidfntype = build_function_type_list (void_type_node, NULL_TREE); + tree decl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL, name, voidfntype); + + TREE_STATIC (decl) = 1; + TREE_USED (decl) = 1; + DECL_ARTIFICIAL (decl) = 1; + DECL_IGNORED_P (decl) = 0; + TREE_PUBLIC (decl) = 0; + DECL_UNINLINABLE (decl) = 1; + DECL_EXTERNAL (decl) = 0; + DECL_CONTEXT (decl) = NULL_TREE; + DECL_INITIAL (decl) = make_node (BLOCK); + BLOCK_SUPERCONTEXT (DECL_INITIAL (decl)) = decl; + DECL_STATIC_CONSTRUCTOR (decl) = 1; + DECL_ATTRIBUTES (decl) = tree_cons (get_identifier ("constructor"), + NULL_TREE, NULL_TREE); + + tree t = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL_TREE, + void_type_node); + DECL_ARTIFICIAL (t) = 1; + DECL_IGNORED_P (t) = 1; + DECL_CONTEXT (t) = decl; + DECL_RESULT (decl) = t; + + push_struct_function (decl); + init_tree_ssa (cfun); + + tree calldecl = builtin_decl_explicit (BUILT_IN_GOMP_ENABLE_PINNED_MODE); + gcall *call = gimple_build_call (calldecl, 0); + + gimple_seq seq = NULL; + gimple_seq_add_stmt (&seq, call); + gimple_set_body (decl, gimple_build_bind (NULL_TREE, seq, NULL)); + + cfun->function_end_locus = UNKNOWN_LOCATION; + cfun->curr_properties |= PROP_gimple_any; + pop_cfun (); + cgraph_node::add_new_function (decl, true); +} + /* Main entry point. */ static unsigned int @@ -14676,6 +14738,10 @@ execute_lower_omp (void) for (auto task_stmt : task_cpyfns) finalize_task_copyfn (task_stmt); task_cpyfns.release (); + + if (flag_offload_memory == OFFLOAD_MEMORY_PINNED) +omp_enable_pinned_mode (); + return 0; } diff --git a/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c b/gcc/testsuite/c-c++-common/gomp/alloc-pi
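Spelled out, the constructor built by omp_enable_pinned_mode corresponds to the C code already quoted in its comment; running a program built this way normally also needs the locked-memory limit raised (e.g. 'ulimit -l unlimited'), as noted above:

extern void GOMP_enable_pinned_mode (void);

static void __attribute__((constructor))
__set_pinned_mode (void)
{
  GOMP_enable_pinned_mode ();
}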
[PATCH 13/17] Gimplify allocate directive (OpenMP 5.0).
gcc/ChangeLog: * doc/gimple.texi: Describe GIMPLE_OMP_ALLOCATE. * gimple-pretty-print.cc (dump_gimple_omp_allocate): New function. (pp_gimple_stmt_1): Call it. * gimple.cc (gimple_build_omp_allocate): New function. * gimple.def (GIMPLE_OMP_ALLOCATE): New node. * gimple.h (enum gf_mask): Add GF_OMP_ALLOCATE_KIND_MASK, GF_OMP_ALLOCATE_KIND_ALLOCATE and GF_OMP_ALLOCATE_KIND_FREE. (struct gomp_allocate): New. (is_a_helper ::test): New. (is_a_helper ::test): New. (gimple_build_omp_allocate): Declare. (gimple_omp_subcode): Replace GIMPLE_OMP_TEAMS with GIMPLE_OMP_ALLOCATE. (gimple_omp_allocate_set_clauses): New. (gimple_omp_allocate_set_kind): Likewise. (gimple_omp_allocate_clauses): Likewise. (gimple_omp_allocate_kind): Likewise. (CASE_GIMPLE_OMP): Add GIMPLE_OMP_ALLOCATE. * gimplify.cc (gimplify_omp_allocate): New. (gimplify_expr): Call it. * gsstruct.def (GSS_OMP_ALLOCATE): Define. gcc/testsuite/ChangeLog: * gfortran.dg/gomp/allocate-6.f90: Add tests. --- gcc/doc/gimple.texi | 38 +++- gcc/gimple-pretty-print.cc| 37 gcc/gimple.cc | 12 gcc/gimple.def| 6 ++ gcc/gimple.h | 60 ++- gcc/gimplify.cc | 19 ++ gcc/gsstruct.def | 1 + gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 | 4 +- 8 files changed, 173 insertions(+), 4 deletions(-) diff --git a/gcc/doc/gimple.texi b/gcc/doc/gimple.texi index dd9149377f3..67b9061f3a7 100644 --- a/gcc/doc/gimple.texi +++ b/gcc/doc/gimple.texi @@ -420,6 +420,9 @@ kinds, along with their relationships to @code{GSS_} values (layouts) and + gomp_continue |layout: GSS_OMP_CONTINUE, code: GIMPLE_OMP_CONTINUE | + + gomp_allocate + |layout: GSS_OMP_ALLOCATE, code: GIMPLE_OMP_ALLOCATE + | + gomp_atomic_load |layout: GSS_OMP_ATOMIC_LOAD, code: GIMPLE_OMP_ATOMIC_LOAD | @@ -454,6 +457,7 @@ The following table briefly describes the GIMPLE instruction set. @item @code{GIMPLE_GOTO} @tab x @tab x @item @code{GIMPLE_LABEL} @tab x @tab x @item @code{GIMPLE_NOP} @tab x @tab x +@item @code{GIMPLE_OMP_ALLOCATE} @tab x @tab x @item @code{GIMPLE_OMP_ATOMIC_LOAD} @tab x @tab x @item @code{GIMPLE_OMP_ATOMIC_STORE} @tab x @tab x @item @code{GIMPLE_OMP_CONTINUE} @tab x @tab x @@ -1029,6 +1033,7 @@ Return a deep copy of statement @code{STMT}. * @code{GIMPLE_LABEL}:: * @code{GIMPLE_GOTO}:: * @code{GIMPLE_NOP}:: +* @code{GIMPLE_OMP_ALLOCATE}:: * @code{GIMPLE_OMP_ATOMIC_LOAD}:: * @code{GIMPLE_OMP_ATOMIC_STORE}:: * @code{GIMPLE_OMP_CONTINUE}:: @@ -1729,6 +1734,38 @@ Build a @code{GIMPLE_NOP} statement. Returns @code{TRUE} if statement @code{G} is a @code{GIMPLE_NOP}. @end deftypefn +@node @code{GIMPLE_OMP_ALLOCATE} +@subsection @code{GIMPLE_OMP_ALLOCATE} +@cindex @code{GIMPLE_OMP_ALLOCATE} + +@deftypefn {GIMPLE function} gomp_allocate *gimple_build_omp_allocate ( @ +tree clauses, int kind) +Build a @code{GIMPLE_OMP_ALLOCATE} statement. @code{CLAUSES} is the clauses +associated with this node. @code{KIND} is the enumeration value +@code{GF_OMP_ALLOCATE_KIND_ALLOCATE} if this directive allocates memory +or @code{GF_OMP_ALLOCATE_KIND_FREE} if it de-allocates. +@end deftypefn + +@deftypefn {GIMPLE function} void gimple_omp_allocate_set_clauses ( @ +gomp_allocate *g, tree clauses) +Set the @code{CLAUSES} for a @code{GIMPLE_OMP_ALLOCATE}. +@end deftypefn + +@deftypefn {GIMPLE function} tree gimple_omp_aallocate_clauses ( @ +const gomp_allocate *g) +Get the @code{CLAUSES} of a @code{GIMPLE_OMP_ALLOCATE}. +@end deftypefn + +@deftypefn {GIMPLE function} void gimple_omp_allocate_set_kind ( @ +gomp_allocate *g, int kind) +Set the @code{KIND} for a @code{GIMPLE_OMP_ALLOCATE}. 
+@end deftypefn + +@deftypefn {GIMPLE function} tree gimple_omp_allocate_kind ( @ +const gomp_atomic_load *g) +Get the @code{KIND} of a @code{GIMPLE_OMP_ALLOCATE}. +@end deftypefn + @node @code{GIMPLE_OMP_ATOMIC_LOAD} @subsection @code{GIMPLE_OMP_ATOMIC_LOAD} @cindex @code{GIMPLE_OMP_ATOMIC_LOAD} @@ -1760,7 +1797,6 @@ const gomp_atomic_load *g) Get the @code{RHS} of an atomic set. @end deftypefn - @node @code{GIMPLE_OMP_ATOMIC_STORE} @subsection @code{GIMPLE_OMP_ATOMIC_STORE} @cindex @code{GIMPLE_OMP_ATOMIC_STORE} diff --git a/gcc/gimple-pretty-print.cc b/gcc/gimple-pretty-print.cc index ebd87b20a0a..bb961a900df 100644 --- a/gcc/gimple-pretty-print.cc +++ b/gcc/gimple-pretty-print.cc @@ -1967,6 +1967,38 @@ dump_gimple_omp_critical (pretty_printer *buffer, const gomp_critical *gs, } } +static void +dump_gimple_omp_allocate (pretty_printer *buffer, const gomp_allocate *gs, + int spc, dump_flags_t fl
[PATCH 10/17] Add parsing support for allocate directive (OpenMP 5.0)
Currently we only make use of this directive when it is associated with an allocate statement. gcc/fortran/ChangeLog: * dump-parse-tree.cc (show_omp_node): Handle EXEC_OMP_ALLOCATE. (show_code_node): Likewise. * gfortran.h (enum gfc_statement): Add ST_OMP_ALLOCATE. (OMP_LIST_ALLOCATOR): New enum value. (enum gfc_exec_op): Add EXEC_OMP_ALLOCATE. * match.h (gfc_match_omp_allocate): New function. * openmp.cc (enum omp_mask1): Add OMP_CLAUSE_ALLOCATOR. (OMP_ALLOCATE_CLAUSES): New define. (gfc_match_omp_allocate): New function. (resolve_omp_clauses): Add ALLOCATOR in clause_names. (omp_code_to_statement): Handle EXEC_OMP_ALLOCATE. (EMPTY_VAR_LIST): New define. (check_allocate_directive_restrictions): New function. (gfc_resolve_omp_allocate): Likewise. (gfc_resolve_omp_directive): Handle EXEC_OMP_ALLOCATE. * parse.cc (decode_omp_directive): Handle ST_OMP_ALLOCATE. (next_statement): Likewise. (gfc_ascii_statement): Likewise. * resolve.cc (gfc_resolve_code): Handle EXEC_OMP_ALLOCATE. * st.cc (gfc_free_statement): Likewise. * trans.cc (trans_code): Likewise --- gcc/fortran/dump-parse-tree.cc| 3 + gcc/fortran/gfortran.h| 4 +- gcc/fortran/match.h | 1 + gcc/fortran/openmp.cc | 199 +- gcc/fortran/parse.cc | 10 +- gcc/fortran/resolve.cc| 1 + gcc/fortran/st.cc | 1 + gcc/fortran/trans.cc | 1 + gcc/testsuite/gfortran.dg/gomp/allocate-4.f90 | 112 ++ gcc/testsuite/gfortran.dg/gomp/allocate-5.f90 | 73 +++ 10 files changed, 400 insertions(+), 5 deletions(-) create mode 100644 gcc/testsuite/gfortran.dg/gomp/allocate-4.f90 create mode 100644 gcc/testsuite/gfortran.dg/gomp/allocate-5.f90 diff --git a/gcc/fortran/dump-parse-tree.cc b/gcc/fortran/dump-parse-tree.cc index 5352008a63d..e0c6c0d9d96 100644 --- a/gcc/fortran/dump-parse-tree.cc +++ b/gcc/fortran/dump-parse-tree.cc @@ -2003,6 +2003,7 @@ show_omp_node (int level, gfc_code *c) case EXEC_OACC_CACHE: name = "CACHE"; is_oacc = true; break; case EXEC_OACC_ENTER_DATA: name = "ENTER DATA"; is_oacc = true; break; case EXEC_OACC_EXIT_DATA: name = "EXIT DATA"; is_oacc = true; break; +case EXEC_OMP_ALLOCATE: name = "ALLOCATE"; break; case EXEC_OMP_ATOMIC: name = "ATOMIC"; break; case EXEC_OMP_BARRIER: name = "BARRIER"; break; case EXEC_OMP_CANCEL: name = "CANCEL"; break; @@ -2204,6 +2205,7 @@ show_omp_node (int level, gfc_code *c) || c->op == EXEC_OMP_TARGET_UPDATE || c->op == EXEC_OMP_TARGET_ENTER_DATA || c->op == EXEC_OMP_TARGET_EXIT_DATA || c->op == EXEC_OMP_SCAN || c->op == EXEC_OMP_DEPOBJ || c->op == EXEC_OMP_ERROR + || c->op == EXEC_OMP_ALLOCATE || (c->op == EXEC_OMP_ORDERED && c->block == NULL)) return; if (c->op == EXEC_OMP_SECTIONS || c->op == EXEC_OMP_PARALLEL_SECTIONS) @@ -3329,6 +3331,7 @@ show_code_node (int level, gfc_code *c) case EXEC_OACC_CACHE: case EXEC_OACC_ENTER_DATA: case EXEC_OACC_EXIT_DATA: +case EXEC_OMP_ALLOCATE: case EXEC_OMP_ATOMIC: case EXEC_OMP_CANCEL: case EXEC_OMP_CANCELLATION_POINT: diff --git a/gcc/fortran/gfortran.h b/gcc/fortran/gfortran.h index 696aadd7db6..755469185a6 100644 --- a/gcc/fortran/gfortran.h +++ b/gcc/fortran/gfortran.h @@ -259,7 +259,7 @@ enum gfc_statement ST_OACC_CACHE, ST_OACC_KERNELS_LOOP, ST_OACC_END_KERNELS_LOOP, ST_OACC_SERIAL_LOOP, ST_OACC_END_SERIAL_LOOP, ST_OACC_SERIAL, ST_OACC_END_SERIAL, ST_OACC_ENTER_DATA, ST_OACC_EXIT_DATA, ST_OACC_ROUTINE, - ST_OACC_ATOMIC, ST_OACC_END_ATOMIC, + ST_OACC_ATOMIC, ST_OACC_END_ATOMIC, ST_OMP_ALLOCATE, ST_OMP_ATOMIC, ST_OMP_BARRIER, ST_OMP_CRITICAL, ST_OMP_END_ATOMIC, ST_OMP_END_CRITICAL, ST_OMP_END_DO, ST_OMP_END_MASTER, ST_OMP_END_ORDERED, ST_OMP_END_PARALLEL, 
ST_OMP_END_PARALLEL_DO, ST_OMP_END_PARALLEL_SECTIONS, @@ -1398,6 +1398,7 @@ enum OMP_LIST_USE_DEVICE_ADDR, OMP_LIST_NONTEMPORAL, OMP_LIST_ALLOCATE, + OMP_LIST_ALLOCATOR, OMP_LIST_HAS_DEVICE_ADDR, OMP_LIST_ENTER, OMP_LIST_NUM /* Must be the last. */ @@ -2908,6 +2909,7 @@ enum gfc_exec_op EXEC_OACC_DATA, EXEC_OACC_HOST_DATA, EXEC_OACC_LOOP, EXEC_OACC_UPDATE, EXEC_OACC_WAIT, EXEC_OACC_CACHE, EXEC_OACC_ENTER_DATA, EXEC_OACC_EXIT_DATA, EXEC_OACC_ATOMIC, EXEC_OACC_DECLARE, + EXEC_OMP_ALLOCATE, EXEC_OMP_CRITICAL, EXEC_OMP_DO, EXEC_OMP_FLUSH, EXEC_OMP_MASTER, EXEC_OMP_ORDERED, EXEC_OMP_PARALLEL, EXEC_OMP_PARALLEL_DO, EXEC_OMP_PARALLEL_SECTIONS, EXEC_OMP_PARALLEL_WORKSHARE, diff --git a/gcc/fortran/match.h b/gcc/fortran/match.h index 495c93e0b5c..fe43d4b3fd3 100644 --- a/gcc/fortran/match.h +++ b/gcc/fortran/match.h @@ -149,6 +149,7 @@ match gfc_match_oacc_routine (
[PATCH 17/17] amdgcn: libgomp plugin USM implementation
Implement the Unified Shared Memory API calls in the GCN plugin. The allocate and free are pretty straight-forward because all "target" memory allocations are compatible with USM, on the right hardware. However, there's no known way to check what memory region was used, after the fact, so we use a splay tree to record allocations so we can answer "is_usm_ptr" later. libgomp/ChangeLog: * plugin/plugin-gcn.c (GOMP_OFFLOAD_get_num_devices): Allow GOMP_REQUIRES_UNIFIED_ADDRESS and GOMP_REQUIRES_UNIFIED_SHARED_MEMORY. (struct usm_splay_tree_key_s): New. (usm_splay_compare): New. (splay_tree_prefix): New. (GOMP_OFFLOAD_usm_alloc): New. (GOMP_OFFLOAD_usm_free): New. (GOMP_OFFLOAD_is_usm_ptr): New. (GOMP_OFFLOAD_supported_features): Move into the OpenMP API fold. Add GOMP_REQUIRES_UNIFIED_ADDRESS and GOMP_REQUIRES_UNIFIED_SHARED_MEMORY. (gomp_fatal): New. (splay_tree_c): New. * testsuite/lib/libgomp.exp (check_effective_target_omp_usm): New. * testsuite/libgomp.c++/usm-1.C: Use dg-require-effective-target. * testsuite/libgomp.c-c++-common/requires-1.c: Likewise. * testsuite/libgomp.c/usm-1.c: Likewise. * testsuite/libgomp.c/usm-2.c: Likewise. * testsuite/libgomp.c/usm-3.c: Likewise. * testsuite/libgomp.c/usm-4.c: Likewise. * testsuite/libgomp.c/usm-5.c: Likewise. * testsuite/libgomp.c/usm-6.c: Likewise. --- libgomp/plugin/plugin-gcn.c | 104 +- libgomp/testsuite/lib/libgomp.exp | 22 libgomp/testsuite/libgomp.c++/usm-1.C | 2 +- .../libgomp.c-c++-common/requires-1.c | 1 + libgomp/testsuite/libgomp.c/usm-1.c | 1 + libgomp/testsuite/libgomp.c/usm-2.c | 1 + libgomp/testsuite/libgomp.c/usm-3.c | 1 + libgomp/testsuite/libgomp.c/usm-4.c | 1 + libgomp/testsuite/libgomp.c/usm-5.c | 2 +- libgomp/testsuite/libgomp.c/usm-6.c | 2 +- 10 files changed, 133 insertions(+), 4 deletions(-) diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c index ea327bf2ca0..6a9ff5cd93e 100644 --- a/libgomp/plugin/plugin-gcn.c +++ b/libgomp/plugin/plugin-gcn.c @@ -3226,7 +3226,11 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask) if (!init_hsa_context ()) return 0; /* Return -1 if no omp_requires_mask cannot be fulfilled but - devices were present. */ + devices were present. + Note: not all devices support USM, but the compiler refuses to create + binaries for those that don't anyway. */ + omp_requires_mask &= ~(GOMP_REQUIRES_UNIFIED_ADDRESS + | GOMP_REQUIRES_UNIFIED_SHARED_MEMORY); if (hsa_context.agent_count > 0 && omp_requires_mask != 0) return -1; return hsa_context.agent_count; @@ -3810,6 +3814,89 @@ GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars, GOMP_PLUGIN_target_task_completion, async_data); } +/* Use a splay tree to track USM allocations. */ + +typedef struct usm_splay_tree_node_s *usm_splay_tree_node; +typedef struct usm_splay_tree_s *usm_splay_tree; +typedef struct usm_splay_tree_key_s *usm_splay_tree_key; + +struct usm_splay_tree_key_s { + void *addr; + size_t size; +}; + +static inline int +usm_splay_compare (usm_splay_tree_key x, usm_splay_tree_key y) +{ + if ((x->addr <= y->addr && x->addr + x->size > y->addr) + || (y->addr <= x->addr && y->addr + y->size > x->addr)) +return 0; + + return (x->addr > y->addr ? 1 : -1); +} + +#define splay_tree_prefix usm +#include "../splay-tree.h" + +static struct usm_splay_tree_s usm_map = { NULL }; + +/* Allocate memory suitable for Unified Shared Memory. + + In fact, AMD memory need only be "coarse grained", which target + allocations already are. We do need to track allocations so that + GOMP_OFFLOAD_is_usm_ptr can look them up. 
*/ + +void * +GOMP_OFFLOAD_usm_alloc (int device, size_t size) +{ + void *ptr = GOMP_OFFLOAD_alloc (device, size); + + usm_splay_tree_node node = malloc (sizeof (struct usm_splay_tree_node_s)); + node->key.addr = ptr; + node->key.size = size; + node->left = NULL; + node->right = NULL; + usm_splay_tree_insert (&usm_map, node); + + return ptr; +} + +/* Free memory allocated via GOMP_OFFLOAD_usm_alloc. */ + +bool +GOMP_OFFLOAD_usm_free (int device, void *ptr) +{ + struct usm_splay_tree_key_s key = { ptr, 1 }; + usm_splay_tree_key node = usm_splay_tree_lookup (&usm_map, &key); + if (node) +{ + usm_splay_tree_remove (&usm_map, &key); + free (node); +} + + return GOMP_OFFLOAD_free (device, ptr); +} + +/* True if the memory was allocated via GOMP_OFFLOAD_usm_alloc. */ + +bool +GOMP_OFFLOAD_is_usm_ptr (void *ptr) +{ + struct usm_splay_tree_key_s key = { ptr, 1 }; + return usm_splay_tree_lookup (&usm_map, &key); +} + +/* Indicate which GOMP_REQUIRES_* features are supported. */ + +bool +GO
[PATCH 15/17] amdgcn: Support XNACK mode
The XNACK feature allows memory load instructions to restart safely following a page-miss interrupt. This is useful for shared-memory devices, like APUs, and to implement OpenMP Unified Shared Memory. To support the feature we must be able to set the appropriate meta-data and set the load instructions to early-clobber. When the port supports scheduling of s_waitcnt instructions there will be further requirements. gcc/ChangeLog: * config/gcn/gcn-hsa.h (XNACKOPT): New macro. (ASM_SPEC): Use XNACKOPT. * config/gcn/gcn-opts.h (enum sram_ecc_type): Rename to ... (enum hsaco_attr_type): ... this, and generalize the names. (TARGET_XNACK): New macro. * config/gcn/gcn-valu.md (gather_insn_1offset): Add xnack compatible alternatives. (gather_insn_2offsets): Likewise. * config/gcn/gcn.c (gcn_option_override): Permit -mxnack for devices other than Fiji. (gcn_expand_epilogue): Remove early-clobber problems. (output_file_start): Emit xnack attributes. (gcn_hsa_declare_function_name): Obey -mxnack setting. * config/gcn/gcn.md (xnack): New attribute. (enabled): Rework to include "xnack" attribute. (*movbi): Add xnack compatible alternatives. (*mov_insn): Likewise. (*mov_insn): Likewise. (*mov_insn): Likewise. (*movti_insn): Likewise. * config/gcn/gcn.opt (-mxnack): Add the "on/off/any" syntax. (sram_ecc_type): Rename to ... (hsaco_attr_type: ... this.) * config/gcn/mkoffload.c (SET_XNACK_ANY): New macro. (TEST_XNACK): Delete. (TEST_XNACK_ANY): New macro. (TEST_XNACK_ON): New macro. (main): Support the new -mxnack=on/off/any syntax. --- gcc/config/gcn/gcn-hsa.h| 3 +- gcc/config/gcn/gcn-opts.h | 10 ++-- gcc/config/gcn/gcn-valu.md | 29 - gcc/config/gcn/gcn.cc | 34 ++- gcc/config/gcn/gcn.md | 113 +++- gcc/config/gcn/gcn.opt | 18 +++--- gcc/config/gcn/mkoffload.cc | 19 -- 7 files changed, 140 insertions(+), 86 deletions(-) diff --git a/gcc/config/gcn/gcn-hsa.h b/gcc/config/gcn/gcn-hsa.h index b3079cebb43..fd08947574f 100644 --- a/gcc/config/gcn/gcn-hsa.h +++ b/gcc/config/gcn/gcn-hsa.h @@ -81,12 +81,13 @@ extern unsigned int gcn_local_sym_hash (const char *name); /* In HSACOv4 no attribute setting means the binary supports "any" hardware configuration. The name of the attribute also changed. */ #define SRAMOPT "msram-ecc=on:-mattr=+sramecc;msram-ecc=off:-mattr=-sramecc" +#define XNACKOPT "mxnack=on:-mattr=+xnack;mxnack=off:-mattr=-xnack" /* Use LLVM assembler and linker options. 
*/ #define ASM_SPEC "-triple=amdgcn--amdhsa " \ "%:last_arg(%{march=*:-mcpu=%*}) " \ "%{!march=*|march=fiji:--amdhsa-code-object-version=3} " \ - "%{" NO_XNACK "mxnack:-mattr=+xnack;:-mattr=-xnack} " \ + "%{" NO_XNACK XNACKOPT "}" \ "%{" NO_SRAM_ECC SRAMOPT "} " \ "-filetype=obj" #define LINK_SPEC "--pie --export-dynamic" diff --git a/gcc/config/gcn/gcn-opts.h b/gcc/config/gcn/gcn-opts.h index b62dfb45f59..07ddc79cda3 100644 --- a/gcc/config/gcn/gcn-opts.h +++ b/gcc/config/gcn/gcn-opts.h @@ -48,11 +48,13 @@ extern enum gcn_isa { #define TARGET_M0_LDS_LIMIT (TARGET_GCN3) #define TARGET_PACKED_WORK_ITEMS (TARGET_CDNA2_PLUS) -enum sram_ecc_type +#define TARGET_XNACK (flag_xnack != HSACO_ATTR_OFF) + +enum hsaco_attr_type { - SRAM_ECC_OFF, - SRAM_ECC_ON, - SRAM_ECC_ANY + HSACO_ATTR_OFF, + HSACO_ATTR_ON, + HSACO_ATTR_ANY }; #endif diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md index abe46201344..ec114db9dd1 100644 --- a/gcc/config/gcn/gcn-valu.md +++ b/gcc/config/gcn/gcn-valu.md @@ -741,13 +741,13 @@ (define_expand "gather_expr" {}) (define_insn "gather_insn_1offset" - [(set (match_operand:V_ALL 0 "register_operand" "=v") + [(set (match_operand:V_ALL 0 "register_operand" "=v,&v") (unspec:V_ALL - [(plus: (match_operand: 1 "register_operand" " v") + [(plus: (match_operand: 1 "register_operand" " v, v") (vec_duplicate: - (match_operand 2 "immediate_operand" " n"))) - (match_operand 3 "immediate_operand" " n") - (match_operand 4 "immediate_operand" " n") + (match_operand 2 "immediate_operand" " n, n"))) + (match_operand 3 "immediate_operand" " n, n") + (match_operand 4 "immediate_operand" " n, n") (mem:BLK (scratch))] UNSPEC_GATHER))] "(AS_FLAT_P (INTVAL (operands[3])) @@ -777,7 +777,8 @@ (define_insn "gather_insn_1offset" return buf; } [(set_attr "type" "flat") - (set_attr "length" "12")]) + (set_attr "length" "12") + (set_attr "xnack" "off,on")]) (define_insn "gather_insn_1offset_ds" [(set (match_operand:V_ALL 0 "register_operand" "=v") @@ -802,17 +803,18 @@ (define_insn "gather_insn_1offset_ds" (set_attr "length" "12")]) (define_insn "gather_insn_2o
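On the command line the new tri-state option can be passed through to the offload compiler in the usual way, e.g. (a sketch reusing the -foffload-options spelling from the invoke.texi example earlier in this series):

gcc -fopenmp -foffload=amdgcn-amdhsa \
    -foffload-options=amdgcn-amdhsa=-mxnack=on test.c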
[PATCH 16/17] amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK
The AMD GCN runtime must be set to the correct mode for Unified Shared Memory to work, but this is not always clear at compile and link time due to the split nature of the offload compilation pipeline. This patch sets a new attribute on OpenMP offload functions to ensure that the information is passed all the way to the backend. The backend then places a marker in the assembler code for mkoffload to find. Finally mkoffload places a constructor function into the final program to ensure that the HSA_XNACK environment variable passes the correct mode to the GPU. The HSA_XNACK variable must be set before the HSA runtime is even loaded, so it makes more sense to have this set within the constructor than at some point later within libgomp or the GCN plugin. gcc/ChangeLog: * config/gcn/gcn.c (unified_shared_memory_enabled): New variable. (gcn_init_cumulative_args): Handle attribute "omp unified memory". (gcn_hsa_declare_function_name): Emit "MKOFFLOAD OPTIONS: USM+". * config/gcn/mkoffload.c (TEST_XNACK_OFF): New macro. (process_asm): Detect "MKOFFLOAD OPTIONS: USM+". Emit configure_xnack constructor, as required. * omp-low.c (create_omp_child_function): Add attribute "omp unified memory". --- gcc/config/gcn/gcn.cc | 28 +++- gcc/config/gcn/mkoffload.cc | 37 - gcc/omp-low.cc | 4 3 files changed, 67 insertions(+), 2 deletions(-) diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc index 4df05453604..88cc505597e 100644 --- a/gcc/config/gcn/gcn.cc +++ b/gcc/config/gcn/gcn.cc @@ -68,6 +68,11 @@ static bool ext_gcn_constants_init = 0; enum gcn_isa gcn_isa = ISA_GCN3; /* Default to GCN3. */ +/* Record whether the host compiler added "omp unifed memory" attributes to + any functions. We can then pass this on to mkoffload to ensure xnack is + compatible there too. */ +static bool unified_shared_memory_enabled = false; + /* Reserve this much space for LDS (for propagating variables from worker-single mode to worker-partitioned mode), per workgroup. Global analysis could calculate an exact bound, but we don't do that yet. @@ -2542,6 +2547,25 @@ gcn_init_cumulative_args (CUMULATIVE_ARGS *cum /* Argument info to init */ , if (!caller && cfun->machine->normal_function) gcn_detect_incoming_pointer_arg (fndecl); + if (fndecl && lookup_attribute ("omp unified memory", + DECL_ATTRIBUTES (fndecl))) +{ + unified_shared_memory_enabled = true; + + switch (gcn_arch) + { + case PROCESSOR_FIJI: + case PROCESSOR_VEGA10: + case PROCESSOR_VEGA20: + error ("GPU architecture does not support Unified Shared Memory"); + default: + ; + } + + if (flag_xnack == HSACO_ATTR_OFF) + error ("Unified Shared Memory is enabled, but XNACK is disabled"); +} + reinit_regs (); } @@ -5458,12 +5482,14 @@ gcn_hsa_declare_function_name (FILE *file, const char *name, tree) assemble_name (file, name); fputs (":\n", file); - /* This comment is read by mkoffload. */ + /* These comments are read by mkoffload. */ if (flag_openacc) fprintf (file, "\t;; OPENACC-DIMS: %d, %d, %d : %s\n", oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_GANG), oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_WORKER), oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_VECTOR), name); + if (unified_shared_memory_enabled) +fprintf (asm_out_file, "\t;; MKOFFLOAD OPTIONS: USM+\n"); } /* Implement TARGET_ASM_SELECT_SECTION. 
diff --git a/gcc/config/gcn/mkoffload.cc b/gcc/config/gcn/mkoffload.cc index cb8903c27cb..5741d0a917b 100644 --- a/gcc/config/gcn/mkoffload.cc +++ b/gcc/config/gcn/mkoffload.cc @@ -80,6 +80,8 @@ == EF_AMDGPU_FEATURE_XNACK_ANY_V4) #define TEST_XNACK_ON(VAR) ((VAR & EF_AMDGPU_FEATURE_XNACK_V4) \ == EF_AMDGPU_FEATURE_XNACK_ON_V4) +#define TEST_XNACK_OFF(VAR) ((VAR & EF_AMDGPU_FEATURE_XNACK_V4) \ + == EF_AMDGPU_FEATURE_XNACK_OFF_V4) #define SET_SRAM_ECC_ON(VAR) VAR = ((VAR & ~EF_AMDGPU_FEATURE_SRAMECC_V4) \ | EF_AMDGPU_FEATURE_SRAMECC_ON_V4) @@ -474,6 +476,7 @@ static void process_asm (FILE *in, FILE *out, FILE *cfile) { int fn_count = 0, var_count = 0, dims_count = 0, regcount_count = 0; + bool unified_shared_memory_enabled = false; struct obstack fns_os, dims_os, regcounts_os; obstack_init (&fns_os); obstack_init (&dims_os); @@ -498,6 +501,7 @@ process_asm (FILE *in, FILE *out, FILE *cfile) fn_count += 2; char buf[1000]; + char dummy; enum { IN_CODE, IN_METADATA, @@ -517,6 +521,9 @@ process_asm (FILE *in, FILE *out, FILE *cfile) dims_count++; } + if (sscanf (buf, " ;; MKOFFLOAD OPTIONS: USM+%c", &dummy) > 0) + unified_shared_memory_enabled = true; + break; } case IN_METADATA: @@ -565,7 +572,6 @@ process_asm (FILE *in, FILE *out, FILE *cfile) } } - char dummy; if (sscanf (buf, " .section
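For illustration only: the constructor that mkoffload emits (named configure_xnack in the ChangeLog above) is not visible in the truncated diff, but its job can be sketched as follows. The body below is an assumption written for this summary, not code from the patch; only the diagnostic string matches the one quoted in the review discussion that follows. getenv/setenv are standard POSIX calls.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: force HSA_XNACK into the environment before the HSA runtime is
   loaded, and refuse to run if the user already set an incompatible value.
   "1" assumes the binary was built with XNACK (USM) enabled.  */
static void __attribute__((constructor))
configure_xnack (void)
{
  const char *val = getenv ("HSA_XNACK");
  if (val == NULL || val[0] == '\0')
    setenv ("HSA_XNACK", "1", 1);
  else if (strcmp (val, "1") != 0)
    {
      fprintf (stderr, "error: HSA_XNACK=%s is incompatible; please unset\n",
               val);
      exit (1);
    }
}

Because this runs as an ELF constructor in the host program, it executes before libgomp opens the GCN plugin and before the HSA runtime reads the variable, which is the ordering requirement described above.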
Re: [PATCH 08/17] openmp: -foffload-memory=pinned
On 07/07/2022 12:54, Tobias Burnus wrote: Hi Andrew, On 07.07.22 12:34, Andrew Stubbs wrote: Implement the -foffload-memory=pinned option such that libgomp is instructed to enable fully-pinned memory at start-up. The option is intended to provide a performance boost to certain offload programs without modifying the code. ... gcc/ChangeLog: * omp-builtins.def (BUILT_IN_GOMP_ENABLE_PINNED_MODE): New. * omp-low.cc (omp_enable_pinned_mode): New function. (execute_lower_omp): Call omp_enable_pinned_mode. libgomp/ChangeLog: * config/linux/allocator.c (always_pinned_mode): New variable. (GOMP_enable_pinned_mode): New function. (linux_memspace_alloc): Disable pinning when always_pinned_mode set. (linux_memspace_calloc): Likewise. (linux_memspace_free): Likewise. (linux_memspace_realloc): Likewise. * libgomp.map: Add GOMP_enable_pinned_mode. * testsuite/libgomp.c/alloc-pinned-7.c: New test. ... ... --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -14620,6 +14620,68 @@ lower_omp (gimple_seq *body, omp_context *ctx) input_location = saved_location; } +/* Emit a constructor function to enable -foffload-memory=pinned + at runtime. Libgomp handles the OS mode setting, but we need to trigger + it by calling GOMP_enable_pinned mode before the program proper runs. */ + +static void +omp_enable_pinned_mode () Is there a reason not to use the mechanism of OpenMP's 'requires' directive for this? (Okay, I have to admit that the final patch was only committed on Monday. But still ...) Possibly, I had most of this done before then. I'll have a look next time I visit this patch. The Cuda-specific solution can't work this way anyway, because there's no mlockall equivalent, so I will make conditional adjustments anyway. Likewise, the 'requires' mechanism could then also be used in '[PATCH 16/17] amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK'. No, I don't think so; that environment variable needs to be set before the libraries are loaded or it's too late. There are other ways to achieve the same thing, by leaving messages for the libgomp plugin to pick up, perhaps, but it's all extra complexity for no real gain. Andrew
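The mlockall remark above hints at how the Linux implementation achieves fully-pinned memory at start-up. The following is a sketch of that idea only: the names always_pinned_mode and GOMP_enable_pinned_mode come from the ChangeLog, but the body is an assumption, not the libgomp code from the patch.

#include <stdbool.h>
#include <sys/mman.h>

static bool always_pinned_mode = false;

/* Sketch: called once, from code the compiler emits when
   -foffload-memory=pinned is in force.  After locking every current and
   future page of the process, the memspace hooks can skip their
   per-allocation pinning work.  */
void
GOMP_enable_pinned_mode (void)
{
  if (mlockall (MCL_CURRENT | MCL_FUTURE) == 0)
    always_pinned_mode = true;
}

A CUDA-based implementation has no equivalent one-shot call, which is why the reply above anticipates conditional adjustments for that case.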
Re: [PATCH 08/17] openmp: -foffload-memory=pinned
On 08/07/2022 10:00, Tobias Burnus wrote: On 08.07.22 00:18, Andrew Stubbs wrote: Likewise, the 'requires' mechanism could then also be used in '[PATCH 16/17] amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK'. No, I don't think so; that environment variable needs to be set before the libraries are loaded or it's too late. There are other ways to achieve the same thing, by leaving messages for the libgomp plugin to pick up, perhaps, but it's all extra complexity for no real gain. I think we talk about two different things: (a) where (and when) to check/set the environment variable. I think this part is fine. You could consider moving the generated code for 'configure_xnack' code into the existing 'init' constructor function, but it does not really matter. (Nor does the order in which the constructor function runs.) (I also do not see any benefit of moving it to libgomp. The message could then be suppressed if no device available or similar tricky, but I do not see any real advantage of moving it.) Longer side note: I think the message "error: HSA_XNACK=%%s is incompatible; please unset" could be clearer. Both in terms who issues it and that it talks about an environment variable. Maybe: "libgomp: fatal error: Environment variable HSA_XNACK=%s is incompatible with GCN offloading; please unset" or something like that. (I did misuse 'libgomp:' for this; I am not sure that makes sense or is even more misleading.) – I am also not sure GCN fits that well, given that CDNA is not GCN. But that is a general problem. But in any case, adding "fatal", "environment variable" and ... offloading makes surely sense, IMHO. It's not incompatible with GCN offloading, only with the XNACK mode in which the binary was compiled (i.e. USM on or off). The message could be less terse, indeed. I went through a variety of messages for this and couldn't find one that I liked. How about... fatal error: HSA_XNACK=%s is set but this program was compiled for HSA_XNACK=%s; please unset your environment variable. (b) How the value is made available inside both gcc/config/gcn/gcn.cc and in mkoffload.cc. I was talking about (b). Namely: omp_requires_mask is already available in gcc/config/gcn/gcn.cc and mkoffload.cc. Thus, there is no reason to reinvent the wheel and coming up with another means to pass the same kind of data to the very same files. (You still might want to add another flag to it (assuming 'omp requires unified_shared_memory' alias OMP_REQUIRES_UNIFIED_SHARED_MEMORY is insufficient.) OK, this is a new feature that I probably should investigate. Thanks Andrew
[PATCH] openmp: fix max_vf setting for amdgcn offloading
This patch ensures that the maximum vectorization factor used to set the "safelen" attribute on "omp simd" constructs is suitable for all the configured offload devices. Right now it makes the proper adjustment for NVPTX, but otherwise just uses a value suitable for the host system (always x86_64 in the case of amdgcn). This typically ends up being 16 where 64 is the minimum for vectorization to work properly on GCN. There is a potential problem that one "safelen" must be set for *all* offload devices, which means it can't be perfect for all devices. However I believe that too big is always OK (at least for powers of two?) whereas too small is not OK, so this code always selects the largest value of max_vf, regardless of where it comes from. The existing target VF function, omp_max_simt_vf, is tangled up with the notion of whether SIMT is available or not, so I couldn't add amdgcn in there. It's tempting to have omp_max_vf do some kind of autodetection of what VF to choose, but the current implementation in omp-general.cc doesn't have access to the context in a convenient way, and nor do all the callers, so I couldn't easily do that. Instead, I have opted to add a new function, omp_max_simd_vf, which can check for the presence of amdgcn. While reviewing the callers of omp_max_vf I found one other case that looks like it ought to be tuned for the device, not just the host. In that case it's not clear how to achieve that and in fact, at least on x86_64, the way it is coded the actual value from omp_max_vf is always ignored in favour of a much larger "minimum", so I have added a comment for the next person to touch that spot and left it alone. This change gives a 10x performance improvement on the BabelStream "dot" benchmark on amdgcn and is not harmful on nvptx. OK for mainline? I will commit a backport to OG12 shortly.

Andrew

openmp: fix max_vf setting for amdgcn offloading

Ensure that the "max_vf" figure used for the "safelen" attribute is large enough for the largest configured offload device. This change gives ~10x speed improvement on the BabelStream "dot" benchmark for AMD GCN.

gcc/ChangeLog: * gimple-loop-versioning.cc (loop_versioning::loop_versioning): Add comment. * omp-general.cc (omp_max_simd_vf): New function. * omp-general.h (omp_max_simd_vf): New prototype. * omp-low.cc (lower_rec_simd_input_clauses): Select largest from omp_max_vf, omp_max_simt_vf, and omp_max_simd_vf. gcc/testsuite/ChangeLog: * lib/target-supports.exp (check_effective_target_amdgcn_offloading_enabled): New. (check_effective_target_nvptx_offloading_enabled): New. * gcc.dg/gomp/target-vf.c: New test.

diff --git a/gcc/gimple-loop-versioning.cc b/gcc/gimple-loop-versioning.cc index 6bcf6eba691..e908c27fc44 100644 --- a/gcc/gimple-loop-versioning.cc +++ b/gcc/gimple-loop-versioning.cc @@ -555,7 +555,10 @@ loop_versioning::loop_versioning (function *fn) unvectorizable code, since it is the largest size that can be handled efficiently by scalar code. omp_max_vf calculates the maximum number of bytes in a vector, when such a value is relevant - to loop optimization. */ + to loop optimization. + FIXME: this probably needs to use omp_max_simd_vf when in a target + region, but how to tell? (And MAX_FIXED_MODE_SIZE is large enough that + it doesn't actually matter.)
*/ m_maximum_scale = estimated_poly_value (omp_max_vf ()); m_maximum_scale = MAX (m_maximum_scale, MAX_FIXED_MODE_SIZE); } diff --git a/gcc/omp-general.cc b/gcc/omp-general.cc index a406c578f33..8c6fcebc4b3 100644 --- a/gcc/omp-general.cc +++ b/gcc/omp-general.cc @@ -994,6 +994,24 @@ omp_max_simt_vf (void) return 0; } +/* Return maximum SIMD width if offloading may target SIMD hardware. */ + +int +omp_max_simd_vf (void) +{ + if (!optimize) +return 0; + if (ENABLE_OFFLOADING) +for (const char *c = getenv ("OFFLOAD_TARGET_NAMES"); c;) + { + if (startswith (c, "amdgcn")) + return 64; + else if ((c = strchr (c, ':'))) + c++; + } + return 0; +} + /* Store the construct selectors as tree codes from last to first, return their number. */ diff --git a/gcc/omp-general.h b/gcc/omp-general.h index 74e90e1a71a..410343e45fa 100644 --- a/gcc/omp-general.h +++ b/gcc/omp-general.h @@ -104,6 +104,7 @@ extern gimple *omp_build_barrier (tree lhs); extern tree find_combined_omp_for (tree *, int *, void *); extern poly_uint64 omp_max_vf (void); extern int omp_max_simt_vf (void); +extern int omp_max_simd_vf (void); extern int omp_constructor_traits_to_codes (tree, enum tree_code *); extern tree omp_check_context_selector (location_t loc, tree ctx); extern void omp_mark_declare_variant (location_t loc, tree variant, diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index d73c165f029..1a9a509adb9 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -4646,7 +4646,14 @@ lowe
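As an illustration (this example is not part of the patch), the BabelStream "dot" kernel mentioned above is essentially the loop below. The safelen derived from max_vf caps the vectorization factor the GCN back end may assume for the reduction: a host-derived value of 16 falls short of the 64 lanes of a GCN vector, whereas taking the largest of omp_max_vf, omp_max_simt_vf and omp_max_simd_vf yields at least 64 when amdgcn offloading is configured.

/* Illustrative reduction loop of the BabelStream "dot" shape.  */
double
dot (int n, const double *restrict a, const double *restrict b)
{
  double sum = 0.0;
#pragma omp target teams distribute parallel for simd \
            reduction(+:sum) map(to: a[0:n], b[0:n])
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}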
[committed] amdgcn: 64-bit not
I've committed this patch to enable DImode one's-complement on amdgcn. The hardware doesn't have 64-bit not, and this isn't needed by expand which is happy to use two SImode operations, but the vectorizer isn't so clever. Vector condition masks are DImode on amdgcn, so this has been causing lots of conditional code to fail to vectorize.

Andrew

amdgcn: 64-bit not

This makes the auto-vectorizer happier when handling masks.

gcc/ChangeLog: * config/gcn/gcn.md (one_cmpldi2): New.

diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md index 033c1708e88..70a769babc4 100644 --- a/gcc/config/gcn/gcn.md +++ b/gcc/config/gcn/gcn.md @@ -1676,6 +1676,26 @@ (define_expand "si3_scc" ;; }}} ;; {{{ ALU: generic 64-bit +(define_insn_and_split "one_cmpldi2" + [(set (match_operand:DI 0 "register_operand""=Sg,v") + (not:DI (match_operand:DI 1 "gcn_alu_operand" "SgA,vSvDB"))) + (clobber (match_scratch:BI 2 "=cs,X"))] + "" + "#" + "reload_completed" + [(parallel [(set (match_dup 3) (not:SI (match_dup 4))) + (clobber (match_dup 2))]) + (parallel [(set (match_dup 5) (not:SI (match_dup 6))) + (clobber (match_dup 2))])] + { +operands[3] = gcn_operand_part (DImode, operands[0], 0); +operands[4] = gcn_operand_part (DImode, operands[1], 0); +operands[5] = gcn_operand_part (DImode, operands[0], 1); +operands[6] = gcn_operand_part (DImode, operands[1], 1); + } + [(set_attr "type" "mult")] +) + (define_code_iterator vec_and_scalar64_com [and ior xor]) (define_insn_and_split "di3"
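Illustrative only (no testcase accompanies the commit above): loops of roughly the following shape are where the vectorizer works with lane masks on amdgcn. The condition becomes a DImode mask, one bit per lane of the 64-lane vector, and operations on that mask, including one's-complement, need corresponding patterns for vectorization to proceed.

void
masked_add (int *restrict out, const int *restrict a, const int *restrict b,
            int n)
{
  for (int i = 0; i < n; i++)
    out[i] = (a[i] > 0 && b[i] > 0) ? a[i] + b[i] : 0;
}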
[committed] amdgcn: 64-bit vector shifts
I've committed this patch to implement V64DImode vector-vector and vector-scalar shifts. In particular, these are used by the SIMD "inbranch" clones that I'm working on right now, but it's an omission that ought to have been fixed anyway.

Andrew

amdgcn: 64-bit vector shifts

Enable 64-bit vector-vector and vector-scalar shifts.

gcc/ChangeLog: * config/gcn/gcn-valu.md (V_INT_noHI): New iterator. (3): Use V_INT_noHI. (v3): Likewise.

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md index abe46201344..8c33ae0c717 100644 --- a/gcc/config/gcn/gcn-valu.md +++ b/gcc/config/gcn/gcn-valu.md @@ -60,6 +60,8 @@ (define_mode_iterator V_noHI (define_mode_iterator V_INT_noQI [V64HI V64SI V64DI]) +(define_mode_iterator V_INT_noHI + [V64SI V64DI]) ; All of above (define_mode_iterator V_ALL @@ -2086,10 +2088,10 @@ (define_expand "3" }) (define_insn "3" - [(set (match_operand:V_SI 0 "register_operand" "= v") - (shiftop:V_SI - (match_operand:V_SI 1 "gcn_alu_operand" " v") - (vec_duplicate:V_SI + [(set (match_operand:V_INT_noHI 0 "register_operand" "= v") + (shiftop:V_INT_noHI + (match_operand:V_INT_noHI 1 "gcn_alu_operand" " v") + (vec_duplicate: (match_operand:SI 2 "gcn_alu_operand" "SvB"] "" "v_0\t%0, %2, %1" [(set_attr "type" "vop2") @@ -2117,10 +2119,10 @@ (define_expand "v3" }) (define_insn "v3" - [(set (match_operand:V_SI 0 "register_operand" "=v") - (shiftop:V_SI - (match_operand:V_SI 1 "gcn_alu_operand" " v") - (match_operand:V_SI 2 "gcn_alu_operand" "vB")))] + [(set (match_operand:V_INT_noHI 0 "register_operand" "=v") + (shiftop:V_INT_noHI + (match_operand:V_INT_noHI 1 "gcn_alu_operand" " v") + (match_operand: 2 "gcn_alu_operand" "vB")))] "" "v_0\t%0, %2, %1" [(set_attr "type" "vop2")
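Again illustrative rather than taken from the commit: with the V64DImode patterns present, 64-bit shift loops such as the following can use the vector-scalar form (a single shift count) and the vector-vector form (per-lane shift counts) respectively.

void
shift64 (long long *restrict out, const long long *restrict in,
         const int *restrict amount, int n, int s)
{
  for (int i = 0; i < n; i++)
    out[i] = in[i] << s;          /* vector-scalar shift */

  for (int i = 0; i < n; i++)
    out[i] = in[i] >> amount[i];  /* vector-vector shift */
}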
[PATCH] openmp-simd-clone: Match shift type
This patch adjusts the generation of SIMD "inbranch" clones that use integer masks so that the resulting code vectorizes on amdgcn. The problem was only that an amdgcn mask is DImode and the shift amount was SImode, and the difference causes vectorization to fail. OK for mainline?

Andrew

openmp-simd-clone: Match shift types

Ensure that both parameters to vector shifts use the same mode. This is most important for amdgcn where the masks are DImode.

gcc/ChangeLog: * omp-simd-clone.cc (simd_clone_adjust): Convert shift_cnt to match the mask type.

diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc index 32649bc3f9a..5d3a90730e7 100644 --- a/gcc/omp-simd-clone.cc +++ b/gcc/omp-simd-clone.cc @@ -1305,8 +1305,12 @@ simd_clone_adjust (struct cgraph_node *node) build_int_cst (TREE_TYPE (iter1), c)); gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); } + tree shift_cnt_conv = make_ssa_name (TREE_TYPE (mask)); + g = gimple_build_assign (shift_cnt_conv, + fold_convert (TREE_TYPE (mask), shift_cnt)); + gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); g = gimple_build_assign (make_ssa_name (TREE_TYPE (mask)), - RSHIFT_EXPR, mask, shift_cnt); + RSHIFT_EXPR, mask, shift_cnt_conv); gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); mask = gimple_assign_lhs (g); g = gimple_build_assign (make_ssa_name (TREE_TYPE (mask)),
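For context, an assumed example (not taken from the patch) of where such a clone arises: the "inbranch" clone of a function like add_one below receives an extra mask argument. As the diff shows, simd_clone_adjust extracts each lane's bit by right-shifting that mask, and on amdgcn the mask is a 64-bit integer while the lane count is SImode; converting the shift count to the mask's type is what removes the mismatch.

#pragma omp declare simd inbranch
int
add_one (int x)
{
  return x + 1;
}

/* A call under a condition inside a simd loop is what uses the masked
   (inbranch) clone.  */
void
apply (int *restrict a, const int *restrict flag, int n)
{
#pragma omp simd
  for (int i = 0; i < n; i++)
    if (flag[i])
      a[i] = add_one (a[i]);
}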
Re: [PATCH] openmp-simd-clone: Match shift type
On 29/07/2022 16:59, Jakub Jelinek wrote: Doing the fold_convert seems to be a wasted effort to me. Can't this be done conditional on whether some change is needed at all and just using gimple_build_assign with NOP_EXPR, so something like: I'm just not familiar enough with this stuff to run fold_convert in my head with confidence. tree shift_cvt_conv = shift_cnt; if (!useless_type_conversion_p (TREE_TYPE (mask), TREE_TYPE (shift_cnt))) { shift_cnt_conv = make_ssa_name (TREE_TYPE (mask)); g = gimple_build_assign (shift_cnt_conv, NOP_EXPR, shift_cnt); gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); } Your version gives the same output mine does, at least on amdgcn anyway. Am I OK to commit this version? Andrew openmp-simd-clone: Match shift types Ensure that both parameters to vector shifts use the same mode. This is most important for amdgcn where the masks are DImode. gcc/ChangeLog: * omp-simd-clone.cc (simd_clone_adjust): Convert shift_cnt to match the mask type. Co-authored-by: Jakub Jelinek diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc index 32649bc3f9a..58bd68b129b 100644 --- a/gcc/omp-simd-clone.cc +++ b/gcc/omp-simd-clone.cc @@ -1305,8 +1305,16 @@ simd_clone_adjust (struct cgraph_node *node) build_int_cst (TREE_TYPE (iter1), c)); gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); } + tree shift_cnt_conv = shift_cnt; + if (!useless_type_conversion_p (TREE_TYPE (mask), + TREE_TYPE (shift_cnt))) + { + shift_cnt_conv = make_ssa_name (TREE_TYPE (mask)); + g = gimple_build_assign (shift_cnt_conv, NOP_EXPR, shift_cnt); + gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); + } g = gimple_build_assign (make_ssa_name (TREE_TYPE (mask)), - RSHIFT_EXPR, mask, shift_cnt); + RSHIFT_EXPR, mask, shift_cnt_conv); gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); mask = gimple_assign_lhs (g); g = gimple_build_assign (make_ssa_name (TREE_TYPE (mask)),