[PATCH] PR58669: does not detect all cpu cores/threads
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58669 Testing: $ /usr/lib/jvm/icedtea-6/bin/java TestProcessors Processors: 8 $ /usr/lib/jvm/gcj-jdk/bin/java -version java version "1.5.0" gij (GNU libgcj) version 4.8.1 $ /usr/lib/jvm/gcj-jdk/bin/java TestProcessors Processors: 1 $ /home/andrew/build/gcj/bin/gij -version java version "1.5.0" gij (GNU libgcj) version 4.9.0 20131013 (experimental) [trunk revision 203508] $ /home/andrew/build/gcj/bin/gij TestProcessors Processors: 8 ChangeLog: 2013-10-12 Andrew John Hughes * java/lang/natRuntime.cc: (availableProcessors()): Implement. Fixes PR gcc/58669. Ok for trunk and 4.8? -- Andrew :) Free Java Software Engineer Red Hat, Inc. (http://www.redhat.com) PGP Key: 248BDC07 (https://keys.indymedia.org/) Fingerprint = EC5A 1F5E C0AD 1D15 8F1F 8F91 3B96 A578 248B DC07 Index: libjava/java/lang/natRuntime.cc === --- libjava/java/lang/natRuntime.cc (revision 203508) +++ libjava/java/lang/natRuntime.cc (working copy) @@ -48,6 +48,10 @@ #include #endif +#ifdef HAVE_UNISTD_H +#include <unistd.h> +#endif + #ifdef USE_LTDL @@ -303,8 +307,15 @@ jint java::lang::Runtime::availableProcessors (void) { - // FIXME: find the real value. - return 1; + long procs = -1; + +#ifdef HAVE_UNISTD_H + procs = sysconf(_SC_NPROCESSORS_ONLN); +#endif + + if (procs == -1) +return 1; + return (jint) procs; } jstring
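For anyone who wants to try the underlying query outside of libgcj, here is a minimal standalone C sketch (not part of the patch) of the same sysconf-based probe, with the same fall-back to 1:

#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  long procs = -1;
#ifdef _SC_NPROCESSORS_ONLN
  procs = sysconf (_SC_NPROCESSORS_ONLN);
#endif
  /* Mirror the patch: report 1 if the value is unavailable.  */
  printf ("Processors: %ld\n", procs == -1 ? 1 : procs);
  return 0;
}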
[COMMITTED/13] Fix PR 110386: backprop vs ABSU_EXPR
From: Andrew Pinski The issue here is that when backprop tries to go and strip sign ops, it skips over ABSU_EXPR, but ABSU_EXPR not only does an ABS, it also changes the type to unsigned. Since strip_sign_op_1 is only supposed to strip off sign changing operands and not ones that change types, removing ABSU_EXPR here is correct. We don't handle nop conversions, so this doesn't cause any missed optimizations either. Committed to the GCC 13 branch after bootstrapping and testing on x86_64-linux-gnu with no regressions. PR tree-optimization/110386 gcc/ChangeLog: * gimple-ssa-backprop.cc (strip_sign_op_1): Remove ABSU_EXPR. gcc/testsuite/ChangeLog: * gcc.c-torture/compile/pr110386-1.c: New test. * gcc.c-torture/compile/pr110386-2.c: New test. (cherry picked from commit 2bbac12ea7bd8a3eef5382e1b13f6019df4ec03f) --- gcc/gimple-ssa-backprop.cc | 1 - gcc/testsuite/gcc.c-torture/compile/pr110386-1.c | 9 + gcc/testsuite/gcc.c-torture/compile/pr110386-2.c | 11 +++ 3 files changed, 20 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr110386-1.c create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr110386-2.c diff --git a/gcc/gimple-ssa-backprop.cc b/gcc/gimple-ssa-backprop.cc index 65a65590017..dcb15ed4f61 100644 --- a/gcc/gimple-ssa-backprop.cc +++ b/gcc/gimple-ssa-backprop.cc @@ -694,7 +694,6 @@ strip_sign_op_1 (tree rhs) switch (gimple_assign_rhs_code (assign)) { case ABS_EXPR: - case ABSU_EXPR: case NEGATE_EXPR: return gimple_assign_rhs1 (assign); diff --git a/gcc/testsuite/gcc.c-torture/compile/pr110386-1.c b/gcc/testsuite/gcc.c-torture/compile/pr110386-1.c new file mode 100644 index 000..4fcc977ad16 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/compile/pr110386-1.c @@ -0,0 +1,9 @@ + +int f(int a) +{ +int c = c < 0 ? c : -c; +c = -c; +unsigned b = c; +unsigned t = b*a; +return t*t; +} diff --git a/gcc/testsuite/gcc.c-torture/compile/pr110386-2.c b/gcc/testsuite/gcc.c-torture/compile/pr110386-2.c new file mode 100644 index 000..c60e1b6994b --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/compile/pr110386-2.c @@ -0,0 +1,11 @@ +/* { dg-do compile { target i?86-*-* x86_64-*-* } } */ +/* { dg-options "-mavx" } */ + +#include <immintrin.h> + +__m128i do_stuff(__m128i XMM0) { + __m128i ABS0 = _mm_abs_epi32(XMM0); + __m128i MUL0 = _mm_mullo_epi32(ABS0, XMM0); + __m128i MUL1 = _mm_mullo_epi32(MUL0, MUL0); + return MUL1; +} -- 2.39.3
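A hedged illustration (not from the patch) of why the two tree codes differ for this transform: ABS_EXPR is a pure sign operation, while an ABSU_EXPR result lives in the unsigned type, so stripping it would also drop the signed-to-unsigned conversion:

/* ABS: only the sign changes, and the sign-insensitive use (b * b)
   lets backprop strip it.  */
int
use_abs (int a)
{
  int b = a < 0 ? -a : a;      /* ABS_EXPR */
  return b * b;
}

/* ABSU-like: the absolute value is produced directly in unsigned int,
   so replacing b with a would change the type, not just the sign.  */
unsigned int
use_absu (int a)
{
  unsigned int b = a < 0 ? -(unsigned int) a : (unsigned int) a;
  return b * b;
}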
[COMMITTED/13] Fix PR 111331: wrong code for `a > 28 ? MIN : 29`
From: Andrew Pinski The problem here is that after r6-7425-ga9fee7cdc3c62d0e51730, the comparison used to decide whether the transformation could be done was using the wrong value. Instead of checking that the inner bound was LE (for MIN; GE for MAX) the outer value, it was comparing the bound against the value used in the condition, which was wrong. Committed to the GCC 13 branch after bootstrapping and testing on x86_64-linux-gnu. gcc/ChangeLog: PR tree-optimization/111331 * tree-ssa-phiopt.cc (minmax_replacement): Fix the LE/GE comparison for the `(a CMP CST1) ? max : a` optimization. gcc/testsuite/ChangeLog: PR tree-optimization/111331 * gcc.c-torture/execute/pr111331-1.c: New test. * gcc.c-torture/execute/pr111331-2.c: New test. * gcc.c-torture/execute/pr111331-3.c: New test. (cherry picked from commit 30e6ee074588bacefd2dfe745b188bb20c81fe5e) --- .../gcc.c-torture/execute/pr111331-1.c| 17 + .../gcc.c-torture/execute/pr111331-2.c| 19 +++ .../gcc.c-torture/execute/pr111331-3.c| 15 +++ gcc/tree-ssa-phiopt.cc| 8 4 files changed, 55 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr111331-1.c create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr111331-2.c create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr111331-3.c diff --git a/gcc/testsuite/gcc.c-torture/execute/pr111331-1.c b/gcc/testsuite/gcc.c-torture/execute/pr111331-1.c new file mode 100644 index 000..4c7f4fdbaa9 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/execute/pr111331-1.c @@ -0,0 +1,17 @@ +int a; +int b; +int c(int d, int e, int f) { + if (d < e) +return e; + if (d > f) +return f; + return d; +} +int main() { + int g = -1; + a = c(b + 30, 29, g + 29); + volatile t = a; + if (t != 28) +__builtin_abort(); + return 0; +} diff --git a/gcc/testsuite/gcc.c-torture/execute/pr111331-2.c b/gcc/testsuite/gcc.c-torture/execute/pr111331-2.c new file mode 100644 index 000..5c677f2caa9 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/execute/pr111331-2.c @@ -0,0 +1,19 @@ + +int a; +int b; + +int main() { + int d = b+30; + { +int t; +if (d < 29) + t = 29; +else + t = (d > 28) ? 28 : d; +a = t; + } + volatile int t = a; + if (a != 28) +__builtin_abort(); + return 0; +} diff --git a/gcc/testsuite/gcc.c-torture/execute/pr111331-3.c b/gcc/testsuite/gcc.c-torture/execute/pr111331-3.c new file mode 100644 index 000..213d9bdd539 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/execute/pr111331-3.c @@ -0,0 +1,15 @@ +int a; +int b; + +int main() { + int d = b+30; + { +int t; +t = d < 29 ? 29 : ((d > 28) ? 28 : d); +a = t; + } + volatile int t = a; + if (a != 28) +__builtin_abort(); + return 0; +} diff --git a/gcc/tree-ssa-phiopt.cc b/gcc/tree-ssa-phiopt.cc index a7ab6ce4ad9..c3d78d1400b 100644 --- a/gcc/tree-ssa-phiopt.cc +++ b/gcc/tree-ssa-phiopt.cc @@ -2270,7 +2270,7 @@ minmax_replacement (basic_block cond_bb, basic_block middle_bb, basic_block alt_ /* We need BOUND <= LARGER. */ if (!integer_nonzerop (fold_build2 (LE_EXPR, boolean_type_node, - bound, larger))) + bound, arg_false))) return false; } else if (operand_equal_for_phi_arg_p (arg_false, smaller) @@ -2301,7 +2301,7 @@ minmax_replacement (basic_block cond_bb, basic_block middle_bb, basic_block alt_ /* We need BOUND >= SMALLER. */ if (!integer_nonzerop (fold_build2 (GE_EXPR, boolean_type_node, - bound, smaller))) + bound, arg_false))) return false; } else @@ -2341,7 +2341,7 @@ minmax_replacement (basic_block cond_bb, basic_block middle_bb, basic_block alt_ /* We need BOUND >= LARGER. 
*/ if (!integer_nonzerop (fold_build2 (GE_EXPR, boolean_type_node, - bound, larger))) + bound, arg_true))) return false; } else if (operand_equal_for_phi_arg_p (arg_true, smaller) @@ -2368,7 +2368,7 @@ minmax_replacement (basic_block cond_bb, basic_block middle_bb, basic_block alt_ /* We need BOUND <= SMALLER. */ if (!integer_nonzerop (fold_build2 (LE_EXPR, boolean_type_node, - bound, smaller))) + bound, arg_true))) return false; } else -- 2.39.3
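As a quick sanity check, hand-evaluating the first testcase (the file-scope globals are zero-initialized, so b == 0 and g == -1) shows why 28 is the expected value:

  a = c (b + 30, 29, g + 29)  ==  c (30, 29, 28)
      d < e ?  30 < 29  -> no
      d > f ?  30 > 28  -> yes, return f = 28

so the test aborts unless the MIN/MAX replacement preserves that result.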
[COMMITTED] Return TRUE only when a global value is updated.
set_range_info should return TRUE only when it sets a new value. It was currently returning true whenever it set a value, whether it was different or not. With this change, VRP no longer overwrites global ranges DOM has set. 2 testcases needed adjusting that were expecting VRP2 to set a range but turns out it was really being set in DOM2. Instead they check for the range in the final listing... Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed. Andrew From dae5de2a2353b928cc7099a78d88a40473abefd2 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Wed, 27 Sep 2023 12:34:16 -0400 Subject: [PATCH 1/5] Return TRUE only when a global value is updated. set_range_info should return TRUE only when it sets a new value. VRP no longer overwrites global ranges DOM has set. Check for ranges in the final listing. gcc/ * tree-ssanames.cc (set_range_info): Return true only if the current value changes. gcc/testsuite/ * gcc.dg/pr93917.c: Check for ranges in final optimized listing. * gcc.dg/tree-ssa/vrp-unreachable.c: Ditto. --- gcc/testsuite/gcc.dg/pr93917.c| 4 ++-- .../gcc.dg/tree-ssa/vrp-unreachable.c | 4 ++-- gcc/tree-ssanames.cc | 24 +-- 3 files changed, 16 insertions(+), 16 deletions(-) diff --git a/gcc/testsuite/gcc.dg/pr93917.c b/gcc/testsuite/gcc.dg/pr93917.c index f09e1c41ae8..f636b77f45d 100644 --- a/gcc/testsuite/gcc.dg/pr93917.c +++ b/gcc/testsuite/gcc.dg/pr93917.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-vrp2" } */ +/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-vrp2 -fdump-tree-optimized-alias" } */ void f3(int n); @@ -19,5 +19,5 @@ void f2(int*n) /* { dg-final { scan-tree-dump-times "Global Export.*0, \\+INF" 1 "vrp1" } } */ /* { dg-final { scan-tree-dump-times "__builtin_unreachable" 1 "vrp1" } } */ -/* { dg-final { scan-tree-dump-times "Global Export.*0, \\+INF" 1 "vrp2" } } */ /* { dg-final { scan-tree-dump-times "__builtin_unreachable" 0 "vrp2" } } */ +/* { dg-final { scan-tree-dump-times "0, \\+INF" 2 "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vrp-unreachable.c b/gcc/testsuite/gcc.dg/tree-ssa/vrp-unreachable.c index 5835dfc8dbc..4aad7f1be5d 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/vrp-unreachable.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/vrp-unreachable.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-vrp1-alias -fdump-tree-vrp2-alias" } */ +/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-vrp2 -fdump-tree-optimized-alias" } */ void dead (unsigned n); void alive (unsigned n); @@ -39,4 +39,4 @@ void func (unsigned n, unsigned m) /* { dg-final { scan-tree-dump-not "dead" "vrp1" } } */ /* { dg-final { scan-tree-dump-times "builtin_unreachable" 1 "vrp1" } } */ /* { dg-final { scan-tree-dump-not "builtin_unreachable" "vrp2" } } */ -/* { dg-final { scan-tree-dump-times "fff8 VALUE 0x0" 4 "vrp2" } } */ +/* { dg-final { scan-tree-dump-times "fff8 VALUE 0x0" 2 "optimized" } } */ diff --git a/gcc/tree-ssanames.cc b/gcc/tree-ssanames.cc index 23387b90fe3..1eae411ac1c 100644 --- a/gcc/tree-ssanames.cc +++ b/gcc/tree-ssanames.cc @@ -418,10 +418,17 @@ set_range_info (tree name, const vrange &r) if (r.undefined_p () || r.varying_p ()) return false; + // Pick up the current range, or VARYING if none. 
tree type = TREE_TYPE (name); + Value_Range tmp (type); + if (range_info_p (name)) +range_info_get_range (name, tmp); + else +tmp.set_varying (type); + if (POINTER_TYPE_P (type)) { - if (r.nonzero_p ()) + if (r.nonzero_p () && !tmp.nonzero_p ()) { set_ptr_nonnull (name); return true; @@ -429,18 +436,11 @@ set_range_info (tree name, const vrange &r) return false; } - /* If a global range already exists, incorporate it. */ - if (range_info_p (name)) -{ - Value_Range tmp (type); - range_info_get_range (name, tmp); - tmp.intersect (r); - if (tmp.undefined_p ()) - return false; + // If the result doesn't change, or is undefined, return false. + if (!tmp.intersect (r) || tmp.undefined_p ()) +return false; - return range_info_set_range (name, tmp); -} - return range_info_set_range (name, r); + return range_info_set_range (name, tmp); } /* Set nonnull attribute to pointer NAME. */ -- 2.41.0
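A hedged sketch of the caller-side contract this enables (the bookkeeping flag and surrounding code are hypothetical; set_range_info and the ranger query are existing interfaces):

  /* Re-registering an identical global range now returns false, so a
     caller can use the return value to track real changes only.  */
  Value_Range r (TREE_TYPE (name));
  if (ranger.range_of_expr (r, name, stmt) && set_range_info (name, r))
    globals_updated = true;   /* hypothetical "something changed" flag */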
[COMMITTED] Remove pass counting in VRP.
Pass counting in VRP is used to decide when to call early VRP, when to pass the flag that enables warnings, and which invocation is the final pass. If you try to add additional passes, this becomes quite fragile. This patch simply chooses the pass based on the data pointer passed in, and removes the pass counter. The first FULL VRP pass invokes the warning code, and the flag passed in now represents the FINAL pass of VRP. The global pass counter is gone; as it turns out, it wasn't working well with the JIT compiler, but that had gone undetected. (Thanks to dmalcolm for helping me sort out what was going on there) Bootstraps on x86_64-pc-linux-gnu with no regressions. Pushed. Andrew From 29abc475a360ad14d5f692945f2805fba1fdc679 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Thu, 28 Sep 2023 09:19:32 -0400 Subject: [PATCH 2/5] Remove pass counting in VRP. Rather than using a pass count to decide which parameters are passed to VRP, make it explicit. * passes.def (pass_vrp): Use parameter for final pass flag. * tree-vrp.cc (vrp_pass_num): Remove. (run_warning_pass): New. (pass_vrp::my_pass): Remove. (pass_vrp::final_p): New. (pass_vrp::set_pass_param): Set final_p param. (pass_vrp::execute): Choose specific pass based on data pointer. --- gcc/passes.def | 4 ++-- gcc/tree-vrp.cc | 26 +- 2 files changed, 19 insertions(+), 11 deletions(-) diff --git a/gcc/passes.def b/gcc/passes.def index 4110a472914..2bafd60bbfb 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -221,7 +221,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_fre, true /* may_iterate */); NEXT_PASS (pass_merge_phi); NEXT_PASS (pass_thread_jumps_full, /*first=*/true); - NEXT_PASS (pass_vrp, true /* warn_array_bounds_p */); + NEXT_PASS (pass_vrp, false /* final_p*/); NEXT_PASS (pass_dse); NEXT_PASS (pass_dce); /* pass_stdarg is always run and at this point we execute @@ -348,7 +348,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); NEXT_PASS (pass_strlen); NEXT_PASS (pass_thread_jumps_full, /*first=*/false); - NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */); + NEXT_PASS (pass_vrp, true /* final_p */); /* Run CCP to compute alignment and nonzero bits. */ NEXT_PASS (pass_ccp, true /* nonzero_p */); NEXT_PASS (pass_warn_restrict); diff --git a/gcc/tree-vrp.cc b/gcc/tree-vrp.cc index d7b194f5904..05266dfe34a 100644 --- a/gcc/tree-vrp.cc +++ b/gcc/tree-vrp.cc @@ -1120,36 +1120,44 @@ const pass_data pass_data_early_vrp = ( TODO_cleanup_cfg | TODO_update_ssa | TODO_verify_all ), }; -static int vrp_pass_num = 0; +static bool run_warning_pass = true; class pass_vrp : public gimple_opt_pass { public: pass_vrp (gcc::context *ctxt, const pass_data &data_) -: gimple_opt_pass (data_, ctxt), data (data_), warn_array_bounds_p (false), - my_pass (vrp_pass_num++) - {} +: gimple_opt_pass (data_, ctxt), data (data_), + warn_array_bounds_p (false), final_p (false) + { +// Only the first VRP pass should run warnings. +if (&data == &pass_data_vrp) + { + warn_array_bounds_p = run_warning_pass; + run_warning_pass = false; + } + } /* opt_pass methods: */ opt_pass * clone () final override { return new pass_vrp (m_ctxt, data); } void set_pass_param (unsigned int n, bool param) final override { gcc_assert (n == 0); - warn_array_bounds_p = param; + final_p = param; } bool gate (function *) final override { return flag_tree_vrp != 0; } unsigned int execute (function *fun) final override { - // Early VRP pass. 
- if (my_pass == 0) - return execute_ranger_vrp (fun, /*warn_array_bounds_p=*/false, false); + if (&data == &pass_data_early_vrp) + return execute_ranger_vrp (fun, /*warn_array_bounds_p=*/false, + /*final_p=*/false); - return execute_ranger_vrp (fun, warn_array_bounds_p, my_pass == 2); + return execute_ranger_vrp (fun, warn_array_bounds_p, final_p); } private: const pass_data &data; bool warn_array_bounds_p; - int my_pass; + bool final_p; }; // class pass_vrp const pass_data pass_data_assumptions = -- 2.41.0
Re: [COMMITTED] Return TRUE only when a global value is updated.
huh. thanks, I'll have a look. Andrew On 10/3/23 11:47, David Edelsohn wrote: This patch caused a bootstrap failure on AIX. during GIMPLE pass: evrp /nasfarm/edelsohn/src/src/libgcc/libgcc2.c: In function '__gcc_bcmp': /nasfarm/edelsohn/src/src/libgcc/libgcc2.c:2910:1: internal compiler error: in get_irange, at value-range-storage.cc:343 2910 | } | ^ 0x11b7f4b7 irange_storage::get_irange(irange&, tree_node*) const /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:343 0x11b7e7af vrange_storage::get_vrange(vrange&, tree_node*) const /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:178 0x139f3d77 range_info_get_range(tree_node const*, vrange&) /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:118 0x1134b463 set_range_info(tree_node*, vrange const&) /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:425 0x116a7333 gimple_ranger::register_inferred_ranges(gimple*) /nasfarm/edelsohn/src/src/gcc/gimple-range.cc:487 0x125cef27 rvrp_folder::fold_stmt(gimple_stmt_iterator*) /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1033 0x123dd063 substitute_and_fold_dom_walker::before_dom_children(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:876 0x1176cc43 dom_walker::walk(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/domwalk.cc:311 0x123dd733 substitute_and_fold_engine::substitute_and_fold(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:999 0x123d0f5f execute_ranger_vrp(function*, bool, bool) /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1062 0x123d14ef execute /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1142
Re: [COMMITTED] Return TRUE only when a global value is updated.
Give this a try.. I'm testing it here, but x86 doesn't seem to show it anyway for some reason :-P I think i needed to handle pointers special since SSA_NAMES handle pointer ranges different. Andrew On 10/3/23 11:47, David Edelsohn wrote: This patch caused a bootstrap failure on AIX. during GIMPLE pass: evrp /nasfarm/edelsohn/src/src/libgcc/libgcc2.c: In function '__gcc_bcmp': /nasfarm/edelsohn/src/src/libgcc/libgcc2.c:2910:1: internal compiler error: in get_irange, at value-range-storage.cc:343 2910 | } | ^ 0x11b7f4b7 irange_storage::get_irange(irange&, tree_node*) const /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:343 0x11b7e7af vrange_storage::get_vrange(vrange&, tree_node*) const /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:178 0x139f3d77 range_info_get_range(tree_node const*, vrange&) /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:118 0x1134b463 set_range_info(tree_node*, vrange const&) /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:425 0x116a7333 gimple_ranger::register_inferred_ranges(gimple*) /nasfarm/edelsohn/src/src/gcc/gimple-range.cc:487 0x125cef27 rvrp_folder::fold_stmt(gimple_stmt_iterator*) /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1033 0x123dd063 substitute_and_fold_dom_walker::before_dom_children(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:876 0x1176cc43 dom_walker::walk(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/domwalk.cc:311 0x123dd733 substitute_and_fold_engine::substitute_and_fold(basic_block_def*) /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:999 0x123d0f5f execute_ranger_vrp(function*, bool, bool) /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1062 0x123d14ef execute /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1142 diff --git a/gcc/tree-ssanames.cc b/gcc/tree-ssanames.cc index 1eae411ac1c..1401f67c781 100644 --- a/gcc/tree-ssanames.cc +++ b/gcc/tree-ssanames.cc @@ -420,15 +420,11 @@ set_range_info (tree name, const vrange &r) // Pick up the current range, or VARYING if none. tree type = TREE_TYPE (name); - Value_Range tmp (type); - if (range_info_p (name)) -range_info_get_range (name, tmp); - else -tmp.set_varying (type); - if (POINTER_TYPE_P (type)) { - if (r.nonzero_p () && !tmp.nonzero_p ()) + struct ptr_info_def *pi = get_ptr_info (name); + // If R is nonnull and pi is not, set nonnull. + if (r.nonzero_p () && (!pi || !pi->pt.null)) { set_ptr_nonnull (name); return true; @@ -436,6 +432,11 @@ set_range_info (tree name, const vrange &r) return false; } + Value_Range tmp (type); + if (range_info_p (name)) +range_info_get_range (name, tmp); + else +tmp.set_varying (type); // If the result doesn't change, or is undefined, return false. if (!tmp.intersect (r) || tmp.undefined_p ()) return false;
Re: [COMMITTED] Return TRUE only when a global value is updated.
perfect. I'll check it in when my testrun is done. Thanks .. . and sorry :-) Andrew On 10/3/23 12:53, David Edelsohn wrote: AIX bootstrap is happier with the patch. Thanks, David On Tue, Oct 3, 2023 at 12:30 PM Andrew MacLeod wrote: Give this a try.. I'm testing it here, but x86 doesn't seem to show it anyway for some reason :-P I think i needed to handle pointers special since SSA_NAMES handle pointer ranges different. Andrew On 10/3/23 11:47, David Edelsohn wrote: > This patch caused a bootstrap failure on AIX. > > during GIMPLE pass: evrp > > /nasfarm/edelsohn/src/src/libgcc/libgcc2.c: In function '__gcc_bcmp': > > /nasfarm/edelsohn/src/src/libgcc/libgcc2.c:2910:1: internal compiler > error: in get_irange, at value-range-storage.cc:343 > > 2910 | } > > | ^ > > > 0x11b7f4b7 irange_storage::get_irange(irange&, tree_node*) const > > /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:343 > > 0x11b7e7af vrange_storage::get_vrange(vrange&, tree_node*) const > > /nasfarm/edelsohn/src/src/gcc/value-range-storage.cc:178 > > 0x139f3d77 range_info_get_range(tree_node const*, vrange&) > > /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:118 > > 0x1134b463 set_range_info(tree_node*, vrange const&) > > /nasfarm/edelsohn/src/src/gcc/tree-ssanames.cc:425 > > 0x116a7333 gimple_ranger::register_inferred_ranges(gimple*) > > /nasfarm/edelsohn/src/src/gcc/gimple-range.cc:487 > > 0x125cef27 rvrp_folder::fold_stmt(gimple_stmt_iterator*) > > /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1033 > > 0x123dd063 > substitute_and_fold_dom_walker::before_dom_children(basic_block_def*) > > /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:876 > > 0x1176cc43 dom_walker::walk(basic_block_def*) > > /nasfarm/edelsohn/src/src/gcc/domwalk.cc:311 > > 0x123dd733 > substitute_and_fold_engine::substitute_and_fold(basic_block_def*) > > /nasfarm/edelsohn/src/src/gcc/tree-ssa-propagate.cc:999 > > 0x123d0f5f execute_ranger_vrp(function*, bool, bool) > > /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1062 > > 0x123d14ef execute > > /nasfarm/edelsohn/src/src/gcc/tree-vrp.cc:1142 >
Re: [COMMITTED] Remove pass counting in VRP.
On 10/3/23 13:02, David Malcolm wrote: On Tue, 2023-10-03 at 10:32 -0400, Andrew MacLeod wrote: Pass counting in VRP is used to decide when to call early VRP, pass the flag to enable warnings, and when the final pass is. If you try to add additional passes, this becomes quite fragile. This patch simply chooses the pass based on the data pointer passed in, and remove the pass counter. The first FULL VRP pass invokes the warning code, and the flag passed in now represents the FINAL pass of VRP. There is no longer a global flag which, as it turns out, wasn't working well with the JIT compiler, but when undetected. (Thanks to dmalcolm for helping me sort out what was going on there) Bootstraps on x86_64-pc-linux-gnu with no regressions. Pushed. [CCing jit mailing list] I'm worried that this patch may have "papered over" an issue with libgccjit. Specifically: well, that isnt the patch that was checked in :-P Im not sure how the old version got into the commit note. Attached is the version checked in. commit 7eb5ce7f58ed4a48641e1786e4fdeb2f7fb8c5ff Author: Andrew MacLeod Date: Thu Sep 28 09:19:32 2023 -0400 Remove pass counting in VRP. Rather than using a pass count to decide which parameters are passed to VRP, makemit explicit. * passes.def (pass_vrp): Pass "final pass" flag as parameter. * tree-vrp.cc (vrp_pass_num): Remove. (pass_vrp::my_pass): Remove. (pass_vrp::pass_vrp): Add warn_p as a parameter. (pass_vrp::final_p): New. (pass_vrp::set_pass_param): Set final_p param. (pass_vrp::execute): Call execute_range_vrp with no conditions. (make_pass_vrp): Pass additional parameter. (make_pass_early_vrp): Ditto. diff --git a/gcc/passes.def b/gcc/passes.def index 4110a472914..2bafd60bbfb 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -221,7 +221,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_fre, true /* may_iterate */); NEXT_PASS (pass_merge_phi); NEXT_PASS (pass_thread_jumps_full, /*first=*/true); - NEXT_PASS (pass_vrp, true /* warn_array_bounds_p */); + NEXT_PASS (pass_vrp, false /* final_p*/); NEXT_PASS (pass_dse); NEXT_PASS (pass_dce); /* pass_stdarg is always run and at this point we execute @@ -348,7 +348,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); NEXT_PASS (pass_strlen); NEXT_PASS (pass_thread_jumps_full, /*first=*/false); - NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */); + NEXT_PASS (pass_vrp, true /* final_p */); /* Run CCP to compute alignment and nonzero bits. 
*/ NEXT_PASS (pass_ccp, true /* nonzero_p */); NEXT_PASS (pass_warn_restrict); diff --git a/gcc/tree-vrp.cc b/gcc/tree-vrp.cc index d7b194f5904..4f8c7745461 100644 --- a/gcc/tree-vrp.cc +++ b/gcc/tree-vrp.cc @@ -1120,36 +1120,32 @@ const pass_data pass_data_early_vrp = ( TODO_cleanup_cfg | TODO_update_ssa | TODO_verify_all ), }; -static int vrp_pass_num = 0; class pass_vrp : public gimple_opt_pass { public: - pass_vrp (gcc::context *ctxt, const pass_data &data_) -: gimple_opt_pass (data_, ctxt), data (data_), warn_array_bounds_p (false), - my_pass (vrp_pass_num++) - {} + pass_vrp (gcc::context *ctxt, const pass_data &data_, bool warn_p) +: gimple_opt_pass (data_, ctxt), data (data_), + warn_array_bounds_p (warn_p), final_p (false) +{ } /* opt_pass methods: */ - opt_pass * clone () final override { return new pass_vrp (m_ctxt, data); } + opt_pass * clone () final override +{ return new pass_vrp (m_ctxt, data, false); } void set_pass_param (unsigned int n, bool param) final override { gcc_assert (n == 0); - warn_array_bounds_p = param; + final_p = param; } bool gate (function *) final override { return flag_tree_vrp != 0; } unsigned int execute (function *fun) final override { - // Early VRP pass. - if (my_pass == 0) - return execute_ranger_vrp (fun, /*warn_array_bounds_p=*/false, false); - - return execute_ranger_vrp (fun, warn_array_bounds_p, my_pass == 2); + return execute_ranger_vrp (fun, warn_array_bounds_p, final_p); } private: const pass_data &data; bool warn_array_bounds_p; - int my_pass; + bool final_p; }; // class pass_vrp const pass_data pass_data_assumptions = @@ -1219,13 +1215,13 @@ public: gimple_opt_pass * make_pass_vrp (gcc::context *ctxt) { - return new pass_vrp (ctxt, pass_data_vrp); + return new pass_vrp (ctxt, pass_data_vrp, true); } gimple_opt_pass * make_pass_early_vrp (gcc::context *ctxt) { - return new pass_vrp (ctxt, pass_data_early_vrp); + return new pass_vrp (ctxt, pass_data_early_vrp, false); } gimple_opt_pass *
[COMMITTED] Don't use range_info_get_range for pointers.
Properly check for pointers instead of just using range_info_get_range. Bootstrapped on x86_64-pc-linux-gnu (and presumably AIX too :-) with no regressions. On 10/3/23 12:53, David Edelsohn wrote: AIX bootstrap is happier with the patch. Thanks, David commit d8808c37d29110872fa51b98e71aef9e160b4692 Author: Andrew MacLeod Date: Tue Oct 3 12:32:10 2023 -0400 Don't use range_info_get_range for pointers. Pointers only track null and nonnull, so we need to handle them specially. * tree-ssanames.cc (set_range_info): Use get_ptr_info for pointers rather than range_info_get_range. diff --git a/gcc/tree-ssanames.cc b/gcc/tree-ssanames.cc index 1eae411ac1c..0a32444fbdf 100644 --- a/gcc/tree-ssanames.cc +++ b/gcc/tree-ssanames.cc @@ -420,15 +420,11 @@ set_range_info (tree name, const vrange &r) // Pick up the current range, or VARYING if none. tree type = TREE_TYPE (name); - Value_Range tmp (type); - if (range_info_p (name)) -range_info_get_range (name, tmp); - else -tmp.set_varying (type); - if (POINTER_TYPE_P (type)) { - if (r.nonzero_p () && !tmp.nonzero_p ()) + struct ptr_info_def *pi = get_ptr_info (name); + // If R is nonnull and pi is not, set nonnull. + if (r.nonzero_p () && (!pi || pi->pt.null)) { set_ptr_nonnull (name); return true; @@ -436,6 +432,11 @@ set_range_info (tree name, const vrange &r) return false; } + Value_Range tmp (type); + if (range_info_p (name)) +range_info_get_range (name, tmp); + else +tmp.set_varying (type); // If the result doesn't change, or is undefined, return false. if (!tmp.intersect (r) || tmp.undefined_p ()) return false;
Re: [PATCH] ipa: Self-DCE of uses of removed call LHSs (PR 108007)
On Wed, Oct 4, 2023 at 5:08 PM Maciej W. Rozycki wrote: > > On Tue, 3 Oct 2023, Martin Jambor wrote: > > > > SSA graph may be deep so this may cause stack overflow, so I think we > > > should use worklist here (it is also easy to do). > > > > > > OK with that change. > > > Honza > > > > I have just committed the following after a bootstrap and testing on > > x86_64-linux. > > This has regressed the native `powerpc64le-linux-gnu' configuration, > which doesn't bootstrap here anymore: > > Comparing stages 2 and 3 > Bootstrap comparison failure! > powerpc64le-linux-gnu/libstdc++-v3/src/compatibility-ldbl.o differs > powerpc64le-linux-gnu/libstdc++-v3/src/.libs/compatibility-ldbl.o differs > > I have double-checked this is indeed the offending commit, the compiler > bootstraps just fine as at commit 7eb5ce7f58ed ("Remove pass counting in > VRP."). > > Shall I file a PR, or can you handle it regardless? Let me know if you > need anything from me. It is already filed as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111688 . Thanks, Andrew > > Maciej
Re: [PATCH]AArch64 Handle copysign (x, -1) expansion efficiently
_const_vec_duplicate > (operands[2])); > + if (-1 == real_to_integer (r0)) Likewise. > + { > + emit_insn (gen_ior3 (int_res, arg1, v_sign_bitmask)); > + emit_move_insn (operands[0], gen_lowpart (mode, int_res)); > + DONE; > + } > + } > + > +operands[2] = force_reg (mode, operands[2]); > +emit_insn (gen_and3 (sign, arg2, v_sign_bitmask)); > emit_insn (gen_and3 >(mant, arg1, > aarch64_simd_gen_const_vector_dup (mode, > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md > index > 24349ecdbbab875f21975f116732a9e53762d4c1..d6c581ad81615b4feb095391cbcf4f5b78fa72f1 > 100644 > --- a/gcc/config/aarch64/aarch64.md > +++ b/gcc/config/aarch64/aarch64.md > @@ -6940,12 +6940,25 @@ (define_expand "lrint2" > (define_expand "copysign3" >[(match_operand:GPF 0 "register_operand") > (match_operand:GPF 1 "register_operand") > - (match_operand:GPF 2 "register_operand")] > + (match_operand:GPF 2 "nonmemory_operand")] >"TARGET_SIMD" > { > - rtx bitmask = gen_reg_rtx (mode); > + machine_mode int_mode = mode; > + rtx bitmask = gen_reg_rtx (int_mode); >emit_move_insn (bitmask, GEN_INT (HOST_WIDE_INT_M1U > << (GET_MODE_BITSIZE (mode) - 1))); > + /* copysign (x, -1) should instead be expanded as orr with the sign > + bit. */ > + auto r0 = CONST_DOUBLE_REAL_VALUE (operands[2]); > + if (-1 == real_to_integer (r0)) Likewise. Thanks, Andrew > +{ > + emit_insn (gen_ior3 ( > + lowpart_subreg (int_mode, operands[0], mode), > + lowpart_subreg (int_mode, operands[1], mode), bitmask)); > + DONE; > +} > + > + operands[2] = force_reg (mode, operands[2]); >emit_insn (gen_copysign3_insn (operands[0], operands[1], operands[2], >bitmask)); >DONE; > > > > > --
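For reference, the special case being reviewed here rests on the fact that copysign (x, -1.0) only needs to force the sign bit on; a hedged standalone C sketch of that bit-level equivalence (not the patch's code):

#include <stdint.h>
#include <string.h>

/* copysign (x, -1.0), i.e. -fabs (x), expressed as an OR of the sign bit.  */
static double
copysign_neg1 (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);   /* view the double as raw bits */
  bits |= UINT64_C (1) << 63;        /* force the sign bit on */
  memcpy (&x, &bits, sizeof x);
  return x;
}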
[COMMITTED 2/3] Add a dom based ranger for fast VRP.
This patch adds a DOM based ranger that is intended to be used by a dom walk pass and provides basic ranges. It utilizes the new GORI edge API to find outgoing ranges on edges, and combines these with any ranges calculated during the walk up to this point. When a query is made for a range not defined in the current block, a quick dom walk is performed looking for a range either on a single-pred incoming edge or defined in the block. Its about twice the speed of current EVRP, and although there is a bit of room to improve both memory usage and speed, I'll leave that until I either get around to it or we elect to use it and it becomes more important. It also serves as a POC for anyone wanting to use the new GORI API to use edge ranges, as well as a potentially different fast VRP more similar to the old EVRP. This version performs more folding of PHI nodes as it has all the info on incoming edges, but at a slight cost, mostly memory. It does no relation processing as yet. It has been bootstrapped running right after EVRP, and as a replacement for EVRP, and since it uses existing machinery, should be reasonably solid. It is currently not invoked from anywhere. Pushed. Andrew From ad8cd713b4e489826e289551b8b8f8f708293a5b Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Fri, 28 Jul 2023 13:18:15 -0400 Subject: [PATCH 2/3] Add a dom based ranger for fast VRP. Provide a dominator based implementation of a range query. * gimple_range.cc (dom_ranger::dom_ranger): New. (dom_ranger::~dom_ranger): New. (dom_ranger::range_of_expr): New. (dom_ranger::edge_range): New. (dom_ranger::range_on_edge): New. (dom_ranger::range_in_bb): New. (dom_ranger::range_of_stmt): New. (dom_ranger::maybe_push_edge): New. (dom_ranger::pre_bb): New. (dom_ranger::post_bb): New. * gimple-range.h (class dom_ranger): New. --- gcc/gimple-range.cc | 300 gcc/gimple-range.h | 28 + 2 files changed, 328 insertions(+) diff --git a/gcc/gimple-range.cc b/gcc/gimple-range.cc index 13c3308d537..5e9bb397a20 100644 --- a/gcc/gimple-range.cc +++ b/gcc/gimple-range.cc @@ -928,3 +928,303 @@ assume_query::dump (FILE *f) } fprintf (f, "--\n"); } + +// --- + + +// Create a DOM based ranger for use by a DOM walk pass. + +dom_ranger::dom_ranger () : m_global (), m_out () +{ + m_freelist.create (0); + m_freelist.truncate (0); + m_e0.create (0); + m_e0.safe_grow_cleared (last_basic_block_for_fn (cfun)); + m_e1.create (0); + m_e1.safe_grow_cleared (last_basic_block_for_fn (cfun)); + m_pop_list = BITMAP_ALLOC (NULL); + if (dump_file && (param_ranger_debug & RANGER_DEBUG_TRACE)) +tracer.enable_trace (); +} + +// Dispose of a DOM ranger. + +dom_ranger::~dom_ranger () +{ + if (dump_file && (dump_flags & TDF_DETAILS)) +{ + fprintf (dump_file, "Non-varying global ranges:\n"); + fprintf (dump_file, "=:\n"); + m_global.dump (dump_file); +} + BITMAP_FREE (m_pop_list); + m_e1.release (); + m_e0.release (); + m_freelist.release (); +} + +// Implement range of EXPR on stmt S, and return it in R. +// Return false if no range can be calculated. 
+ +bool +dom_ranger::range_of_expr (vrange &r, tree expr, gimple *s) +{ + unsigned idx; + if (!gimple_range_ssa_p (expr)) +return get_tree_range (r, expr, s); + + if ((idx = tracer.header ("range_of_expr "))) +{ + print_generic_expr (dump_file, expr, TDF_SLIM); + if (s) + { + fprintf (dump_file, " at "); + print_gimple_stmt (dump_file, s, 0, TDF_SLIM); + } + else + fprintf (dump_file, "\n"); +} + + if (s) +range_in_bb (r, gimple_bb (s), expr); + else +m_global.range_of_expr (r, expr, s); + + if (idx) +tracer.trailer (idx, " ", true, expr, r); + return true; +} + + +// Return TRUE and the range if edge E has a range set for NAME in +// block E->src. + +bool +dom_ranger::edge_range (vrange &r, edge e, tree name) +{ + bool ret = false; + basic_block bb = e->src; + + // Check if BB has any outgoing ranges on edge E. + ssa_lazy_cache *out = NULL; + if (EDGE_SUCC (bb, 0) == e) +out = m_e0[bb->index]; + else if (EDGE_SUCC (bb, 1) == e) +out = m_e1[bb->index]; + + // If there is an edge vector and it has a range, pick it up. + if (out && out->has_range (name)) +ret = out->get_range (r, name); + + return ret; +} + + +// Return the range of EXPR on edge E in R. +// Return false if no range can be calculated. + +bool +dom_ranger::range_on_edge (vrange &r, edge e, tree expr) +{ + basic_block bb = e->src; + unsigned idx; + if ((idx = tracer.header ("range_on_edge "))) +{ + fprintf (dump_file, "%d->%d for ",e->src->index, e->d
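To make the intended use concrete, here is a hedged sketch of a DOM-order driver for the new class (the walker wrapper is hypothetical; pre_bb, range_of_stmt and post_bb are the methods added above):

/* Hypothetical dom-walk wrapper driving dom_ranger.  */
class fast_vrp_dom_walker : public dom_walker
{
public:
  fast_vrp_dom_walker (dom_ranger &dr)
    : dom_walker (CDI_DOMINATORS), m_dr (dr) {}

  edge before_dom_children (basic_block bb) final override
  {
    m_dr.pre_bb (bb);                     // pick up incoming edge ranges
    for (auto gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
      {
        gimple *stmt = gsi_stmt (gsi);
        tree lhs = gimple_get_lhs (stmt);
        if (lhs && TREE_CODE (lhs) == SSA_NAME)
          {
            Value_Range r (TREE_TYPE (lhs));
            m_dr.range_of_stmt (r, stmt, lhs);   // evaluate and record
          }
      }
    m_dr.post_bb (bb);                    // record ranges for outgoing edges
    return NULL;
  }

private:
  dom_ranger &m_dr;
};

// ...and in the pass body:
//   dom_ranger dr;
//   fast_vrp_dom_walker walker (dr);
//   walker.walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));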
[COMMITTED 1/3] Add outgoing range vector calculation API.
This patch adds 2 routines that can be called to generate GORI information. The primary API is: bool gori_on_edge (class ssa_cache &r, edge e, range_query *query = NULL, gimple_outgoing_range *ogr = NULL); This will populate an ssa-cache R with any ranges that are generated by edge E. It will use QUERY, if provided, to satisfy any incoming values. If OGR is provided, it is used to pick up hard edge values, like TRUE, FALSE, or switch edges. It currently only works for TRUE/FALSE conditionals, and doesn't try to solve complex logical combinations, e.g. (a < 6 && b > 6) || (a > 10 || b < 3), as those can get exponential and require multiple evaluations of the IL to satisfy. It will fully utilize range-ops however, and so comes up with many of the ranges ranger does. It also provides the "raw" ranges on the edge, i.e. it doesn't try to figure out anything outside the current basic block, but rather reflects exactly what the edge indicates. For example: <bb 2> : x.0_1 = (unsigned int) x_20(D); _2 = x.0_1 + 4294967292; if (_2 > 4) goto <bb 3>; [INV] else goto <bb 4>; [INV] produces Edge ranges BB 2->3 x.0_1 : [irange] unsigned int [0, 3][9, +INF] _2 : [irange] unsigned int [5, +INF] x_20(D) : [irange] int [-INF, 3][9, +INF] Edge ranges BB 2->4 x.0_1 : [irange] unsigned int [4, 8] MASK 0xf VALUE 0x0 _2 : [irange] unsigned int [0, 4] x_20(D) : [irange] int [4, 8] MASK 0xf VALUE 0x0 It performs a linear walk through just the required statements, so each of the above vectors is generated by visiting each of the 3 statements exactly once, so it's pretty quick. The other entry point is: bool gori_name_on_edge (vrange &r, tree name, edge e, range_query *q); This does basically the same thing, except it only looks at whether NAME has a range, and returns it if it does, with no other overhead. Pushed. From 52c1e2c805bc2fd7a30583dce3608b738f3a5ce4 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Tue, 15 Aug 2023 17:29:58 -0400 Subject: [PATCH 1/3] Add outgoing range vector calculation API Provide a GORI API which can produce a range vector for all outgoing ranges on an edge without any of the other infrastructure. * gimple-range-gori.cc (gori_stmt_info::gori_stmt_info): New. (gori_calc_operands): New. (gori_on_edge): New. (gori_name_helper): New. (gori_name_on_edge): New. * gimple-range-gori.h (gori_on_edge): New prototype. (gori_name_on_edge): New prototype. --- gcc/gimple-range-gori.cc | 213 +++ gcc/gimple-range-gori.h | 15 +++ 2 files changed, 228 insertions(+) diff --git a/gcc/gimple-range-gori.cc b/gcc/gimple-range-gori.cc index 2694e551d73..1b5eda43390 100644 --- a/gcc/gimple-range-gori.cc +++ b/gcc/gimple-range-gori.cc @@ -1605,3 +1605,216 @@ gori_export_iterator::get_name () } return NULL_TREE; } + +// This is a helper class to set up STMT with a known LHS for further GORI +// processing. + +class gori_stmt_info : public gimple_range_op_handler +{ +public: + gori_stmt_info (vrange &lhs, gimple *stmt, range_query *q); + Value_Range op1_range; + Value_Range op2_range; + tree ssa1; + tree ssa2; +}; + + +// Uses query Q to get the known ranges on STMT with a LHS range +// for op1_range and op2_range and set ssa1 and ssa2 if either or both of +// those operands are SSA_NAMES. + +gori_stmt_info::gori_stmt_info (vrange &lhs, gimple *stmt, range_query *q) + : gimple_range_op_handler (stmt) +{ + ssa1 = NULL; + ssa2 = NULL; + // Don't handle switches as yet for vector processing. + if (is_a <gswitch *> (stmt)) +return; + + // No further processing for VARYING or undefined. 
+ if (lhs.undefined_p () || lhs.varying_p ()) +return; + + // If there is no range-op handler, we are also done. + if (!*this) +return; + + // Only evaluate logical cases if both operands must be the same as the LHS. + // Otherwise its becomes exponential in time, as well as more complicated. + if (is_gimple_logical_p (stmt)) +{ + gcc_checking_assert (range_compatible_p (lhs.type (), boolean_type_node)); + enum tree_code code = gimple_expr_code (stmt); + if (code == TRUTH_OR_EXPR || code == BIT_IOR_EXPR) + { + // [0, 0] = x || y means both x and y must be zero. + if (!lhs.singleton_p () || !lhs.zero_p ()) + return; + } + else if (code == TRUTH_AND_EXPR || code == BIT_AND_EXPR) + { + // [1, 1] = x && y means both x and y must be one. + if (!lhs.singleton_p () || lhs.zero_p ()) + return; + } +} + + tree op1 = operand1 (); + tree op2 = operand2 (); + ssa1 = gimple_range_ssa_p (op1); + ssa2 = gimple_range_ssa_p (op2); + // If both operands are the same, only process one of them. + if (ssa1 && ssa1 == ssa2) +ssa2 = NULL_TREE; + + // Extract current ranges for the operands. + fur_stmt src (stmt, q); + if (op1) +{ + op1_range.set_type (TREE_TYPE (op1)); + src.get_operand (op1_range, o
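And a hedged sketch of the lighter-weight entry point, querying a single name on a single edge (the dump wrapper around it is hypothetical; the prototype is the one added above):

  /* Inside a pass that already has a range_query available.  */
  Value_Range r (TREE_TYPE (name));
  if (gori_name_on_edge (r, name, e, get_range_query (cfun)))
    {
      fprintf (dump_file, "%d->%d : ", e->src->index, e->dest->index);
      print_generic_expr (dump_file, name, TDF_SLIM);
      fprintf (dump_file, " : ");
      r.dump (dump_file);
      fprintf (dump_file, "\n");
    }
  /* else NAME's range is not refined by edge E.  */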
[COMMITTED 3/3] Create a fast VRP pass
This patch adds a fast VRP pass. It is not invoked from anywhere, so should cause no issues. If you want to utilize it, simply add a new pass, ie: --- a/gcc/passes.def +++ b/gcc/passes.def @@ -92,6 +92,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_phiprop); NEXT_PASS (pass_fre, true /* may_iterate */); NEXT_PASS (pass_early_vrp); + NEXT_PASS (pass_fast_vrp); NEXT_PASS (pass_merge_phi); NEXT_PASS (pass_dse); NEXT_PASS (pass_cd_dce, false /* update_address_taken_p */); it will generate a dump file with the extension .fvrp. pushed. From f4e2dac53fd62fbf2af95e0bf26d24e929fa1f66 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Mon, 2 Oct 2023 18:32:49 -0400 Subject: [PATCH 3/3] Create a fast VRP pass * timevar.def (TV_TREE_FAST_VRP): New. * tree-pass.h (make_pass_fast_vrp): New prototype. * tree-vrp.cc (class fvrp_folder): New. (fvrp_folder::fvrp_folder): New. (fvrp_folder::~fvrp_folder): New. (fvrp_folder::value_of_expr): New. (fvrp_folder::value_on_edge): New. (fvrp_folder::value_of_stmt): New. (fvrp_folder::pre_fold_bb): New. (fvrp_folder::post_fold_bb): New. (fvrp_folder::pre_fold_stmt): New. (fvrp_folder::fold_stmt): New. (execute_fast_vrp): New. (pass_data_fast_vrp): New. (pass_vrp:execute): Check for fast VRP pass. (make_pass_fast_vrp): New. --- gcc/timevar.def | 1 + gcc/tree-pass.h | 1 + gcc/tree-vrp.cc | 124 3 files changed, 126 insertions(+) diff --git a/gcc/timevar.def b/gcc/timevar.def index 9523598f60e..d21b08c030d 100644 --- a/gcc/timevar.def +++ b/gcc/timevar.def @@ -160,6 +160,7 @@ DEFTIMEVAR (TV_TREE_TAIL_MERGE , "tree tail merge") DEFTIMEVAR (TV_TREE_VRP , "tree VRP") DEFTIMEVAR (TV_TREE_VRP_THREADER , "tree VRP threader") DEFTIMEVAR (TV_TREE_EARLY_VRP, "tree Early VRP") +DEFTIMEVAR (TV_TREE_FAST_VRP , "tree Fast VRP") DEFTIMEVAR (TV_TREE_COPY_PROP, "tree copy propagation") DEFTIMEVAR (TV_FIND_REFERENCED_VARS , "tree find ref. vars") DEFTIMEVAR (TV_TREE_PTA , "tree PTA") diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h index eba2d54ac76..9c4b1e4185c 100644 --- a/gcc/tree-pass.h +++ b/gcc/tree-pass.h @@ -470,6 +470,7 @@ extern gimple_opt_pass *make_pass_check_data_deps (gcc::context *ctxt); extern gimple_opt_pass *make_pass_copy_prop (gcc::context *ctxt); extern gimple_opt_pass *make_pass_isolate_erroneous_paths (gcc::context *ctxt); extern gimple_opt_pass *make_pass_early_vrp (gcc::context *ctxt); +extern gimple_opt_pass *make_pass_fast_vrp (gcc::context *ctxt); extern gimple_opt_pass *make_pass_vrp (gcc::context *ctxt); extern gimple_opt_pass *make_pass_assumptions (gcc::context *ctxt); extern gimple_opt_pass *make_pass_uncprop (gcc::context *ctxt); diff --git a/gcc/tree-vrp.cc b/gcc/tree-vrp.cc index 4f8c7745461..19d8f995d70 100644 --- a/gcc/tree-vrp.cc +++ b/gcc/tree-vrp.cc @@ -1092,6 +1092,106 @@ execute_ranger_vrp (struct function *fun, bool warn_array_bounds_p, return 0; } +// Implement a Fast VRP folder. Not quite as effective but faster. + +class fvrp_folder : public substitute_and_fold_engine +{ +public: + fvrp_folder (dom_ranger *dr) : substitute_and_fold_engine (), + m_simplifier (dr) + { m_dom_ranger = dr; } + + ~fvrp_folder () { } + + tree value_of_expr (tree name, gimple *s = NULL) override + { +// Shortcircuit subst_and_fold callbacks for abnormal ssa_names. +if (TREE_CODE (name) == SSA_NAME && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (name)) + return NULL; +return m_dom_ranger->value_of_expr (name, s); + } + + tree value_on_edge (edge e, tree name) override + { +// Shortcircuit subst_and_fold callbacks for abnormal ssa_names. 
+if (TREE_CODE (name) == SSA_NAME && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (name)) + return NULL; +return m_dom_ranger->value_on_edge (e, name); + } + + tree value_of_stmt (gimple *s, tree name = NULL) override + { +// Shortcircuit subst_and_fold callbacks for abnormal ssa_names. +if (TREE_CODE (name) == SSA_NAME && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (name)) + return NULL; +return m_dom_ranger->value_of_stmt (s, name); + } + + void pre_fold_bb (basic_block bb) override + { +m_dom_ranger->pre_bb (bb); +// Now process the PHIs in advance. +gphi_iterator psi = gsi_start_phis (bb); +for ( ; !gsi_end_p (psi); gsi_next (&psi)) + { + tree name = gimple_range_ssa_p (PHI_RESULT (psi.phi ())); + if (name) + { + Value_Range vr(TREE_TYPE (name)); + m_dom_ranger->range_of_stmt (vr, psi.phi (), name); + } + } + } + + void post_fold_bb (basic_block bb) override + { +m_dom_ranger->post_bb (bb); + } + + void pre_fold_stmt (gimple *s) override + { +// Ensure range_of_stmt has been called
[COMMITTED 0/3] Add a FAST VRP pass.
The following set of 3 patches provides the infrastructure for a fast vrp pass. The pass is currently not invoked anywhere, but I wanted to get the infrastructure bits in place now... just in case we want to use it somewhere. It clearly bootstraps with no regressions since it isn't being invoked :-) I have however bootstrapped it with calls to the new fast-vrp pass immediately following EVRP, and as an EVRP replacement. This is primarily to ensure it isn't doing anything harmful. That is a test of sorts :-). I also ran it instead of EVRP, and it bootstraps, but does trigger a few regressions, all related to relation processing, which it doesn't do. Patch one provides a new API for GORI which simply provides a list of all the ranges that it can generate on an outgoing edge. It utilizes the sparse ssa-cache, and simply sets the outgoing range as determined by the edge. It's very efficient, only walking up the chain once and not generating any other utility structures. This provides fast and easy access to any info an edge may provide. There is a second API for querying a specific name instead of asking for all the ranges. It should be pretty solid as it simply invokes range-ops and other components the same way the larger GORI engine does, it just puts them together in a different way. Patch 2 is the new DOM ranger. It assumes it will be called in DOM order, and evaluates the statements, and tracks any ranges on outgoing edges. Queries for ranges walk the dom tree looking for a range until it finds one on an edge or hits the definition block. There are additional efficiencies that can be employed, and I'll eventually get back to them. Patch 3 is the FAST VRP pass and folder. It's pretty straightforward, invokes the new DOM ranger, and enables you to add NEXT_PASS (pass_fast_vrp) in passes.def. Timewise, it is currently about twice as fast as EVRP. It does basic range evaluation and folds PHIs, etc. It does *not* do relation processing or any of the fancier things we do (like statement side effects). A little additional work can reduce the memory footprint further too. I have done no experiments as yet as to the cost of adding relations, but it would be pretty straightforward as it is just reusing all the same components the main ranger does. Andrew
Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
On Thu, Oct 5, 2023 at 12:48 PM Tamar Christina wrote: > > > -Original Message- > > From: Richard Sandiford > > Sent: Thursday, October 5, 2023 8:29 PM > > To: Tamar Christina > > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw > > ; Marcus Shawcroft > > ; Kyrylo Tkachov > > Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign. > > > > Tamar Christina writes: > > > Hi All, > > > > > > This adds an implementation for masked copysign along with an > > > optimized pattern for masked copysign (x, -1). > > > > It feels like we're ending up with a lot of AArch64-specific code that just > > hard- > > codes the observation that changing the sign is equivalent to changing the > > top > > bit. We then need to make sure that we choose the best way of changing the > > top bit for any given situation. > > > > Hard-coding the -1/negative case is one instance of that. But it looks > > like we > > also fail to use the best sequence for SVE2. E.g. > > [https://godbolt.org/z/ajh3MM5jv]: > > > > #include > > > > void f(double *restrict a, double *restrict b) { > > for (int i = 0; i < 100; ++i) > > a[i] = __builtin_copysign(a[i], b[i]); } > > > > void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) { > > for (int i = 0; i < 100; ++i) > > a[i] = (a[i] & ~c) | (b[i] & c); } > > > > gives: > > > > f: > > mov x2, 0 > > mov w3, 100 > > whilelo p7.d, wzr, w3 > > .L2: > > ld1dz30.d, p7/z, [x0, x2, lsl 3] > > ld1dz31.d, p7/z, [x1, x2, lsl 3] > > and z30.d, z30.d, #0x7fff > > and z31.d, z31.d, #0x8000 > > orr z31.d, z31.d, z30.d > > st1dz31.d, p7, [x0, x2, lsl 3] > > incdx2 > > whilelo p7.d, w2, w3 > > b.any .L2 > > ret > > g: > > mov x3, 0 > > mov w4, 100 > > mov z29.d, x2 > > whilelo p7.d, wzr, w4 > > .L6: > > ld1dz30.d, p7/z, [x0, x3, lsl 3] > > ld1dz31.d, p7/z, [x1, x3, lsl 3] > > bsl z31.d, z31.d, z30.d, z29.d > > st1dz31.d, p7, [x0, x3, lsl 3] > > incdx3 > > whilelo p7.d, w3, w4 > > b.any .L6 > > ret > > > > I saw that you originally tried to do this in match.pd and that the > > decision was > > to fold to copysign instead. But perhaps there's a compromise where isel > > does > > something with the (new) copysign canonical form? > > I.e. could we go with your new version of the match.pd patch, and add some > > isel stuff as a follow-on? > > > > Sure if that's what's desired But.. > > The example you posted above is for instance worse for x86 > https://godbolt.org/z/x9ccqxW6T > where the first operation has a dependency chain of 2 and the latter of 3. > It's likely any > open coding of this operation is going to hurt a target. But that is because it is not using andn when it should be. That would be https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94790 (scalar fix but not vector) and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 IIRC. AARCH64 already has a pattern to match the above which is why it works there but not x86_64. Thanks, Andrew > > So I'm unsure what isel transform this into... > > Tamar > > > Not saying no to this patch, just thought that the above was worth > > considering. > > > > [I agree with Andrew's comments FWIW.] > > > > Thanks, > > Richard > > > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > > > > > Ok for master? > > > > > > Thanks, > > > Tamar > > > > > > gcc/ChangeLog: > > > > > > PR tree-optimization/109154 > > > * config/aarch64/aarch64-sve.md (cond_copysign): New. > > > > > > gcc/testsuite/ChangeLog: > > > > > > PR tree-optimization/109154 > > > * gcc.target/aarch64/sve/fneg-abs_5.c: New test. 
> > > > > > --- inline copy of patch -- > > > diff --git a/gcc/config/aarch64/aarch64-sve.md > > > b/gcc/config/aarch64/aarch64-sve.md > > > index > > > > > 071400c820a5b106ddf9dc9faebb117975d74ea0..00ca30c24624dc661254 > > 568f45b6 > > > 1a14aa11c305 1006
[PATCH] MATCH: Fix infinite loop between `vec_cond(vec_cond(a, b, 0), c, d)` and `a & b`
Match has a pattern which converts `vec_cond(vec_cond(a,b,0), c, d)` into `vec_cond(a & b, c, d)` but since in this case a is a comparison fold will change `a & b` back into `vec_cond(a,b,0)` which causes an infinite loop. The best way to fix this is to enable the patterns for vec_cond(*,vec_cond,*) only for GIMPLE so we don't get an infinite loop for fold any more. Note this is a latent bug since these patterns were added in r11-2577-g229752afe3156a and was exposed by r14-3350-g47b833a9abe1 where now able to remove a VIEW_CONVERT_EXPR. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR middle-end/111699 gcc/ChangeLog: * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): Enable only for GIMPLE. gcc/testsuite/ChangeLog: * gcc.c-torture/compile/pr111699-1.c: New test. --- gcc/match.pd | 5 + gcc/testsuite/gcc.c-torture/compile/pr111699-1.c | 7 +++ 2 files changed, 12 insertions(+) create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr111699-1.c diff --git a/gcc/match.pd b/gcc/match.pd index 4bdd83e6e06..31bfd8b6b68 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -5045,6 +5045,10 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) /* (v ? w : 0) ? a : b is just (v & w) ? a : b Currently disabled after pass lvec because ARM understands VEC_COND_EXPR but not a plain v==w fed to BIT_IOR_EXPR. */ +#if GIMPLE +/* These can only be done in gimple as fold likes to convert: + (CMP) & N into (CMP) ? N : 0 + and we try to match the same pattern again and again. */ (simplify (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2) (if (optimize_vectors_before_lowering_p () && types_match (@0, @3)) @@ -5079,6 +5083,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (vec_cond @0 @3 (vec_cond:s @1 @2 @3)) (if (optimize_vectors_before_lowering_p () && types_match (@0, @1)) (vec_cond (bit_and (bit_not @0) @1) @2 @3))) +#endif /* Canonicalize mask ? { 0, ... } : { -1, ...} to ~mask if the mask types are compatible. */ diff --git a/gcc/testsuite/gcc.c-torture/compile/pr111699-1.c b/gcc/testsuite/gcc.c-torture/compile/pr111699-1.c new file mode 100644 index 000..87b127ed199 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/compile/pr111699-1.c @@ -0,0 +1,7 @@ +typedef unsigned char __attribute__((__vector_size__ (8))) V; + +void +foo (V *v) +{ + *v = (V) 0x107B9A7FF >= (*v <= 0); +} -- 2.39.3
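For reference, the oscillation described above looks roughly like this (GIMPLE-style sketch; a is a vector comparison, the SSA names are hypothetical):

  _1 = VEC_COND_EXPR <a, b, { 0, ... }>;
  x  = VEC_COND_EXPR <_1, c, d>;
    --> match.pd:  _2 = a & b;   x = VEC_COND_EXPR <_2, c, d>;
    --> fold:      a & b  becomes  VEC_COND_EXPR <a, b, { 0, ... }>   (since a is a comparison)

at which point the first pattern matches again, and fold and match.pd keep undoing each other. Restricting the vec_cond-of-vec_cond patterns to GIMPLE breaks the cycle.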
[committed] amdgcn: silence warning
I've just committed this simple patch to silence an enum warning. Andrewamdgcn: silence warning gcc/ChangeLog: * config/gcn/gcn.cc (print_operand): Adjust xcode type to fix warning. diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc index f6cff659703..ef3b6472a52 100644 --- a/gcc/config/gcn/gcn.cc +++ b/gcc/config/gcn/gcn.cc @@ -6991,7 +6991,7 @@ print_operand_address (FILE *file, rtx mem) void print_operand (FILE *file, rtx x, int code) { - int xcode = x ? GET_CODE (x) : 0; + rtx_code xcode = x ? GET_CODE (x) : UNKNOWN; bool invert = false; switch (code) {
[committed] amdgcn: switch mov insns to compact syntax
I've just committed this patch. It should have no functional changes except to make it easier to add new alternatives into the alternative-heavy move instructions. Andrewamdgcn: switch mov insns to compact syntax The move instructions typically have many alternatives (and I'm about to add more) so are good candidates for the new syntax. This patch only converts the patterns where there are no significant changes to the generated files. The other patterns can be converted another time. gcc/ChangeLog: * config/gcn/gcn-valu.md (*mov): Convert to compact syntax. (mov_exec): Likewise. (mov_sgprbase): Likewise. * config/gcn/gcn.md (*mov_insn): Likewise. (*movti_insn): Likewise. diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md index 284dda73da9..32b170e8522 100644 --- a/gcc/config/gcn/gcn-valu.md +++ b/gcc/config/gcn/gcn-valu.md @@ -457,23 +457,21 @@ (define_insn "*mov" (set_attr "length" "4,8")]) (define_insn "mov_exec" - [(set (match_operand:V_1REG 0 "nonimmediate_operand" "=v, v, v, v, v, m") + [(set (match_operand:V_1REG 0 "nonimmediate_operand") (vec_merge:V_1REG - (match_operand:V_1REG 1 "general_operand" "vA, B, v,vA, m, v") - (match_operand:V_1REG 2 "gcn_alu_or_unspec_operand" -"U0,U0,vA,vA,U0,U0") - (match_operand:DI 3 "register_operand" " e, e,cV,Sv, e, e"))) - (clobber (match_scratch: 4 "=X, X, X, X,&v,&v"))] + (match_operand:V_1REG 1 "general_operand") + (match_operand:V_1REG 2 "gcn_alu_or_unspec_operand") + (match_operand:DI 3 "register_operand"))) + (clobber (match_scratch: 4))] "!MEM_P (operands[0]) || REG_P (operands[1])" - "@ - v_mov_b32\t%0, %1 - v_mov_b32\t%0, %1 - v_cndmask_b32\t%0, %2, %1, vcc - v_cndmask_b32\t%0, %2, %1, %3 - # - #" - [(set_attr "type" "vop1,vop1,vop2,vop3a,*,*") - (set_attr "length" "4,8,4,8,16,16")]) + {@ [cons: =0, 1, 2, 3, =4; attrs: type, length] + [v,vA,U0,e ,X ;vop1 ,4 ] v_mov_b32\t%0, %1 + [v,B ,U0,e ,X ;vop1 ,8 ] v_mov_b32\t%0, %1 + [v,v ,vA,cV,X ;vop2 ,4 ] v_cndmask_b32\t%0, %2, %1, vcc + [v,vA,vA,Sv,X ;vop3a,8 ] v_cndmask_b32\t%0, %2, %1, %3 + [v,m ,U0,e ,&v;*,16] # + [m,v ,U0,e ,&v;*,16] # + }) ; This variant does not accept an unspec, but does permit MEM ; read/modify/write which is necessary for maskstore. 
@@ -644,19 +642,18 @@ (define_insn "mov_exec" ; flat_load v, vT (define_insn "mov_sgprbase" - [(set (match_operand:V_1REG 0 "nonimmediate_operand" "= v, v, v, m") + [(set (match_operand:V_1REG 0 "nonimmediate_operand") (unspec:V_1REG - [(match_operand:V_1REG 1 "general_operand" " vA,vB, m, v")] + [(match_operand:V_1REG 1 "general_operand")] UNSPEC_SGPRBASE)) - (clobber (match_operand: 2 "register_operand" "=&v,&v,&v,&v"))] + (clobber (match_operand: 2 "register_operand"))] "lra_in_progress || reload_completed" - "@ - v_mov_b32\t%0, %1 - v_mov_b32\t%0, %1 - # - #" - [(set_attr "type" "vop1,vop1,*,*") - (set_attr "length" "4,8,12,12")]) + {@ [cons: =0, 1, =2; attrs: type, length] + [v,vA,&v;vop1,4 ] v_mov_b32\t%0, %1 + [v,vB,&v;vop1,8 ] ^ + [v,m ,&v;* ,12] # + [m,v ,&v;* ,12] # + }) (define_insn "mov_sgprbase" [(set (match_operand:V_2REG 0 "nonimmediate_operand" "= v, v, m") @@ -676,17 +673,17 @@ (define_insn "mov_sgprbase" (set_attr "length" "8,12,12")]) (define_insn "mov_sgprbase" - [(set (match_operand:V_4REG 0 "nonimmediate_operand" "= v, v, m") + [(set (match_operand:V_4REG 0 "nonimmediate_operand") (unspec:V_4REG - [(match_operand:V_4REG 1 "general_operand" "vDB, m, v")] + [(match_operand:V_4REG 1 "general_operand")] UNSPEC_SGPRBASE)) - (clobber (match_operand: 2 "register_operand" "=&v,&v,&v"))] + (clobber (match_operand: 2 "register_operand"))] "lra_in_progress || reload_completed" - "v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1\;v_mov_b32\t%J0, %J1\;v_mov_b32\t%K0, %K1 - # - #" - [(set_attr "type" "vmult,*,*") - (set_attr "length" "8,12,12")]) + {@ [cons: =0, 1, =2; attrs: type, length] + [v,vDB,&v;vmult,8 ] v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1\;v_mov_b32\t%J0, %J1\;v_mov_b32\t%K0, %K1 + [v,m ,&v;*,12] # + [m,v ,&v;*,12] # + }) ; reload_in was once a standard name, but here it's only referenced by ; gcn_secondary_reload. It allows a reload with a scratch register. diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md index 7065acf402b..30fe9e34a35 100644 --- a/gcc/config/gcn/gcn.md +++ b/gcc/config/gcn/gcn.md @@ -542,87 +542,76 @@ (define_insn "*movbi" ; 32bit move pattern (define_insn "*mov_insn" - [(set (match_operand:SISF 0 "nonimmediate_operand" - "=SD,SD,SD,SD,RB,Sm,RS,v,Sg, v, v,RF,v,RLRG, v,SD, v,RM") - (match_operand:SISF 1 "gcn_load_operand" - "SSA, J, B,RB,Sm,RS,Sm,v, v,S
Re: [PATCH] test: Isolate slp-1.c check of target supports vect_strided5
On 15/09/2023 10:16, Juzhe-Zhong wrote: This test failed in RISC-V: FAIL: gcc.dg/vect/slp-1.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorizing stmts using SLP" 4 FAIL: gcc.dg/vect/slp-1.c scan-tree-dump-times vect "vectorizing stmts using SLP" 4 Because this loop: /* SLP with unrolling by 8. */ for (i = 0; i < N; i++) { out[i*5] = 8; out[i*5 + 1] = 7; out[i*5 + 2] = 81; out[i*5 + 3] = 28; out[i*5 + 4] = 18; } is using vect_load_lanes with array size = 5. instead of SLP. When we adjust the COST of LANES load store, then it will use SLP. gcc/testsuite/ChangeLog: * gcc.dg/vect/slp-1.c: Add vect_stried5. --- gcc/testsuite/gcc.dg/vect/slp-1.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/slp-1.c b/gcc/testsuite/gcc.dg/vect/slp-1.c index 82e4f6469fb..d4a13f12df6 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-1.c +++ b/gcc/testsuite/gcc.dg/vect/slp-1.c @@ -122,5 +122,5 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 4 loops" 1 "vect" } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" } } */ - +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target {! vect_strided5 } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_strided5 } } } */ This patch causes a test regression on amdgcn because vect_strided5 is true (because check_effective_target_vect_fully_masked is true), but the testcase still gives the message 4 times. Perhaps because amdgcn uses masking and not vect_load_lanes? Andrew
Re: [PATCH]middle-end match.pd: optimize fneg (fabs (x)) to x | (1 << signbit(x)) [PR109154]
> >>>> --- a/gcc/match.pd > > >>>> +++ b/gcc/match.pd > > >>>> @@ -1074,45 +1074,43 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > >>>> > > >>>> /* cos(copysign(x, y)) -> cos(x). Similarly for cosh. */ > > >>>> (for coss (COS COSH) > > >>>> - copysigns (COPYSIGN) > > >>>> - (simplify > > >>>> - (coss (copysigns @0 @1)) > > >>>> - (coss @0))) > > >>>> + (for copysigns (COPYSIGN_ALL) > > >>> > > >>> So this ends up generating for example the match > > >>> (cosf (copysignl ...)) which doesn't make much sense. > > >>> > > >>> The lock-step iteration did > > >>> (cosf (copysignf ..)) ... (ifn_cos (ifn_copysign ...)) > > >>> which is leaner but misses the case of > > >>> (cosf (ifn_copysign ..)) - that's probably what you are > > >>> after with this change. > > >>> > > >>> That said, there isn't a nice solution (without altering the match.pd > > >>> IL). There's the explicit solution, spelling out all combinations. > > >>> > > >>> So if we want to go with yout pragmatic solution changing this > > >>> to use COPYSIGN_ALL isn't necessary, only changing the lock-step > > >>> for iteration to a cross product for iteration is. > > >>> > > >>> Changing just this pattern to > > >>> > > >>> (for coss (COS COSH) > > >>> (for copysigns (COPYSIGN) > > >>> (simplify > > >>> (coss (copysigns @0 @1)) > > >>> (coss @0 > > >>> > > >>> increases the total number of gimple-match-x.cc lines from > > >>> 234988 to 235324. > > >> > > >> I guess the difference between this and the later suggestions is that > > >> this one allows builtin copysign to be paired with ifn cos, which would > > >> be potentially useful in other situations. (It isn't here because > > >> ifn_cos is rarely provided.) How much of the growth is due to that, > > >> and much of it is from nonsensical combinations like > > >> (builtin_cosf (builtin_copysignl ...))? > > >> > > >> If it's mostly from nonsensical combinations then would it be possible > > >> to make genmatch drop them? > > >> > > >>> The alternative is to do > > >>> > > >>> (for coss (COS COSH) > > >>> copysigns (COPYSIGN) > > >>> (simplify > > >>> (coss (copysigns @0 @1)) > > >>> (coss @0)) > > >>> (simplify > > >>> (coss (IFN_COPYSIGN @0 @1)) > > >>> (coss @0))) > > >>> > > >>> which properly will diagnose a duplicate pattern. Ther are > > >>> currently no operator lists with just builtins defined (that > > >>> could be fixed, see gencfn-macros.cc), supposed we'd have > > >>> COS_C we could do > > >>> > > >>> (for coss (COS_C COSH_C IFN_COS IFN_COSH) > > >>> copysigns (COPYSIGN_C COPYSIGN_C IFN_COPYSIGN IFN_COPYSIGN > > >>> IFN_COPYSIGN IFN_COPYSIGN IFN_COPYSIGN IFN_COPYSIGN IFN_COPYSIGN > > >>> IFN_COPYSIGN) > > >>> (simplify > > >>> (coss (copysigns @0 @1)) > > >>> (coss @0))) > > >>> > > >>> which of course still looks ugly ;) (some syntax extension like > > >>> allowing to specify IFN_COPYSIGN*8 would be nice here and easy > > >>> enough to do) > > >>> > > >>> Can you split out the part changing COPYSIGN to COPYSIGN_ALL, > > >>> re-do it to only split the fors, keeping COPYSIGN and provide > > >>> some statistics on the gimple-match-* size? I think this might > > >>> be the pragmatic solution for now. > > >>> > > >>> Richard - can you think of a clever way to express the desired > > >>> iteration? How do RTL macro iterations address cases like this? > > >> > > >> I don't think .md files have an equivalent construct, unfortunately. > > >> (I also regret some of the choices I made for .md iterators, but that's > > >> another story.) 
> > >> > > >> Perhaps an alternative to the *8 thing would be "IFN
Re: [PATCH] test: Isolate slp-1.c check of target supports vect_strided5
On 07/10/2023 02:04, juzhe.zh...@rivai.ai wrote: Thanks for reporting it. I think we may need to change it into: + /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target {! vect_load_lanes } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_strided5 && vect_load_lanes } } } */ Could you verify it whether it work for you ? You need an additional set of curly braces in the second line to avoid a syntax error message, but I get a pass with that change. Thanks Andrew
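For reference, the second directive with the additional set of curly braces described above would read roughly as follows (a sketch of the suggested fix, not necessarily the exact committed line):

/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target { vect_strided5 && vect_load_lanes } } } } */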
[COMMITTED] Remove unused get_identity_relation.
I added this routine for Aldy when he thought we were going to have to add explicit versions for unordered relations. It seems that with accurate tracking of NANs, we do not need the explicit versions in the oracle, so we will not need this identity routine to pick the appropriate version of VREL_EQ... as there is only one. As it stands, it always returns VREL_EQ, so simply use VREL_EQ in the 2 calling locations. Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed. Andrew From 5ee51119d1345f3f13af784455a4ae466766912b Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Mon, 9 Oct 2023 10:01:11 -0400 Subject: [PATCH 1/2] Remove unused get_identity_relation. Turns out we didn't need this, as there are no unordered relations managed by the oracle. * gimple-range-gori.cc (gori_compute::compute_operand1_range): Do not call get_identity_relation. (gori_compute::compute_operand2_range): Ditto. * value-relation.cc (get_identity_relation): Remove. * value-relation.h (get_identity_relation): Remove prototype. --- gcc/gimple-range-gori.cc | 10 ++ gcc/value-relation.cc| 14 -- gcc/value-relation.h | 3 --- 3 files changed, 2 insertions(+), 25 deletions(-) diff --git a/gcc/gimple-range-gori.cc b/gcc/gimple-range-gori.cc index 1b5eda43390..887da0ff094 100644 --- a/gcc/gimple-range-gori.cc +++ b/gcc/gimple-range-gori.cc @@ -1146,10 +1146,7 @@ gori_compute::compute_operand1_range (vrange &r, // If op1 == op2, create a new trio for just this call. if (op1 == op2 && gimple_range_ssa_p (op1)) - { - relation_kind k = get_identity_relation (op1, op1_range); - trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), k); - } + trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), VREL_EQ); if (!handler.calc_op1 (r, lhs, op2_range, trio)) return false; } @@ -1225,10 +1222,7 @@ gori_compute::compute_operand2_range (vrange &r, // If op1 == op2, create a new trio for this stmt. if (op1 == op2 && gimple_range_ssa_p (op1)) -{ - relation_kind k = get_identity_relation (op1, op1_range); - trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), k); -} +trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), VREL_EQ); // Intersect with range for op2 based on lhs and op1. if (!handler.calc_op2 (r, lhs, op1_range, trio)) return false; diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc index 8fea4aad345..a2ae39692a6 100644 --- a/gcc/value-relation.cc +++ b/gcc/value-relation.cc @@ -183,20 +183,6 @@ relation_transitive (relation_kind r1, relation_kind r2) return relation_kind (rr_transitive_table[r1][r2]); } -// When operands of a statement are identical ssa_names, return the -// approriate relation between operands for NAME == NAME, given RANGE. -// -relation_kind -get_identity_relation (tree name, vrange &range ATTRIBUTE_UNUSED) -{ - // Return VREL_UNEQ when it is supported for floats as appropriate. - if (frange::supports_p (TREE_TYPE (name))) -return VREL_EQ; - - // Otherwise return VREL_EQ. - return VREL_EQ; -} - // This vector maps a relation to the equivalent tree code. static const tree_code relation_to_code [VREL_LAST] = { diff --git a/gcc/value-relation.h b/gcc/value-relation.h index f00f84f93b6..be6e277421b 100644 --- a/gcc/value-relation.h +++ b/gcc/value-relation.h @@ -91,9 +91,6 @@ inline bool relation_equiv_p (relation_kind r) void print_relation (FILE *f, relation_kind rel); -// Return relation for NAME == NAME with RANGE. -relation_kind get_identity_relation (tree name, vrange &range); - class relation_oracle { public: -- 2.41.0
[COMMITTED] PR tree-optimization/111694 - Ensure float equivalences include + and - zero.
When ranger propagates ranges in the on-entry cache, it also check for equivalences and incorporates the equivalence into the range for a name if it is known. With floating point values, the equivalence that is generated by comparison must also take into account that if the equivalence contains zero, both positive and negative zeros could be in the range. This PR demonstrates that once we establish an equivalence, even though we know one value may only have a positive zero, the equivalence may have been formed earlier and included a negative zero This patch pessimistically assumes that if the equivalence contains zero, we should include both + and - 0 in the equivalence that we utilize. I audited the other places, and found no other place where this issue might arise. Cache propagation is the only place where we augment the range with random equivalences. Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed. Andrew From b0892b1fc637fadf14d7016858983bc5776a1e69 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Mon, 9 Oct 2023 10:15:07 -0400 Subject: [PATCH 2/2] Ensure float equivalences include + and - zero. A floating point equivalence may not properly reflect both signs of zero, so be pessimsitic and ensure both signs are included. PR tree-optimization/111694 gcc/ * gimple-range-cache.cc (ranger_cache::fill_block_cache): Adjust equivalence range. * value-relation.cc (adjust_equivalence_range): New. * value-relation.h (adjust_equivalence_range): New prototype. gcc/testsuite/ * gcc.dg/pr111694.c: New. --- gcc/gimple-range-cache.cc | 3 +++ gcc/testsuite/gcc.dg/pr111694.c | 19 +++ gcc/value-relation.cc | 19 +++ gcc/value-relation.h| 3 +++ 4 files changed, 44 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/pr111694.c diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc index 3c819933c4e..89c0845457d 100644 --- a/gcc/gimple-range-cache.cc +++ b/gcc/gimple-range-cache.cc @@ -1470,6 +1470,9 @@ ranger_cache::fill_block_cache (tree name, basic_block bb, basic_block def_bb) { if (rel != VREL_EQ) range_cast (equiv_range, type); + else + adjust_equivalence_range (equiv_range); + if (block_result.intersect (equiv_range)) { if (DEBUG_RANGE_CACHE) diff --git a/gcc/testsuite/gcc.dg/pr111694.c b/gcc/testsuite/gcc.dg/pr111694.c new file mode 100644 index 000..a70b03069dc --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr111694.c @@ -0,0 +1,19 @@ +/* PR tree-optimization/111009 */ +/* { dg-do run } */ +/* { dg-options "-O2" } */ + +#define signbit(x) __builtin_signbit(x) + +static void test(double l, double r) +{ + if (l == r && (signbit(l) || signbit(r))) +; + else +__builtin_abort(); +} + +int main() +{ + test(0.0, -0.0); +} + diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc index a2ae39692a6..0326fe7cde6 100644 --- a/gcc/value-relation.cc +++ b/gcc/value-relation.cc @@ -183,6 +183,25 @@ relation_transitive (relation_kind r1, relation_kind r2) return relation_kind (rr_transitive_table[r1][r2]); } +// When one name is an equivalence of another, ensure the equivalence +// range is correct. Specifically for floating point, a +0 is also +// equivalent to a -0 which may not be reflected. See PR 111694. + +void +adjust_equivalence_range (vrange &range) +{ + if (range.undefined_p () || !is_a (range)) +return; + + frange fr = as_a (range); + // If range includes 0 make sure both signs of zero are included. 
+ if (fr.contains_p (dconst0) || fr.contains_p (dconstm0)) +{ + frange zeros (range.type (), dconstm0, dconst0); + range.union_ (zeros); +} + } + // This vector maps a relation to the equivalent tree code. static const tree_code relation_to_code [VREL_LAST] = { diff --git a/gcc/value-relation.h b/gcc/value-relation.h index be6e277421b..31d48908678 100644 --- a/gcc/value-relation.h +++ b/gcc/value-relation.h @@ -91,6 +91,9 @@ inline bool relation_equiv_p (relation_kind r) void print_relation (FILE *f, relation_kind rel); +// Adjust range as an equivalence. +void adjust_equivalence_range (vrange &range); + class relation_oracle { public: -- 2.41.0
[PATCH] MATCH: [PR111679] Add alternative simplification of `a | ((~a) ^ b)`
So currently we have a simplification for `a | ~(a ^ b)` but that does not match the case where we had originally `(~a) | (a ^ b)` so we need to add a new pattern that matches that and uses bitwise_inverted_equal_p that also catches comparisons too. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR tree-optimization/111679 gcc/ChangeLog: * match.pd (`a | ((~a) ^ b)`): New pattern. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/bitops-5.c: New test. --- gcc/match.pd | 8 +++ gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c | 27 2 files changed, 35 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c diff --git a/gcc/match.pd b/gcc/match.pd index 31bfd8b6b68..49740d189a7 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -1350,6 +1350,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) && TYPE_PRECISION (TREE_TYPE (@0)) == 1) (bit_ior @0 (bit_xor @1 { build_one_cst (type); } +/* a | ((~a) ^ b) --> a | (~b) (alt version of the above 2) */ +(simplify + (bit_ior:c @0 (bit_xor:cs @1 @2)) + (with { bool wascmp; } + (if (bitwise_inverted_equal_p (@0, @1, wascmp) + && (!wascmp || element_precision (type) == 1)) + (bit_ior @0 (bit_not @2) + /* (a | b) | (a &^ b) --> a | b */ (for op (bit_and bit_xor) (simplify diff --git a/gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c b/gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c new file mode 100644 index 000..990610e3002 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */ +/* PR tree-optimization/111679 */ + +int f1(int a, int b) +{ +return (~a) | (a ^ b); // ~(a & b) or (~a) | (~b) +} + +_Bool fb(_Bool c, _Bool d) +{ +return (!c) | (c ^ d); // ~(c & d) or (~c) | (~d) +} + +_Bool fb1(int x, int y) +{ +_Bool a = x == 10, b = y > 100; +return (!a) | (a ^ b); // ~(a & b) or (~a) | (~b) +// or (x != 10) | (y <= 100) +} + +/* { dg-final { scan-tree-dump-not "bit_xor_expr, " "optimized" } } */ +/* { dg-final { scan-tree-dump-times "bit_not_expr, " 2 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "bit_and_expr, " 2 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "bit_ior_expr, " 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "ne_expr, _\[0-9\]+, x_\[0-9\]+" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "le_expr, _\[0-9\]+, y_\[0-9\]+" 1 "optimized" } } */ -- 2.39.3
Re: [PATCH] use get_range_query to replace get_global_range_query
On Tue, Oct 10, 2023 at 12:02 AM Richard Biener wrote: > > On Tue, 10 Oct 2023, Jiufu Guo wrote: > > > Hi, > > > > For "get_global_range_query" SSA_NAME_RANGE_INFO can be queried. > > For "get_range_query", it could get more context-aware range info. > > And look at the implementation of "get_range_query", it returns > > global range if no local fun info. > > > > So, if not quering for SSA_NAME, it would be ok to use get_range_query > > to replace get_global_range_query. > > > > Patch https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630389.html, > > Uses get_range_query could handle more cases. > > > > This patch replaces get_global_range_query by get_range_query for > > most possible code pieces (but deoes not draft new test cases). > > > > Pass bootstrap & regtest on ppc64{,le} and x86_64. > > Is this ok for trunk. > > See below > > > > > BR, > > Jeff (Jiufu Guo) > > > > gcc/ChangeLog: > > > > * builtins.cc (expand_builtin_strnlen): Replace get_global_range_query > > by get_range_query. > > * fold-const.cc (expr_not_equal_to): Likewise. > > * gimple-fold.cc (size_must_be_zero_p): Likewise. > > * gimple-range-fold.cc (fur_source::fur_source): Likewise. > > * gimple-ssa-warn-access.cc (check_nul_terminated_array): Likewise. > > * tree-dfa.cc (get_ref_base_and_extent): Likewise. > > * tree-ssa-loop-split.cc (split_at_bb_p): Likewise. > > * tree-ssa-loop-unswitch.cc > > (evaluate_control_stmt_using_entry_checks): > > Likewise. > > > > --- > > gcc/builtins.cc | 2 +- > > gcc/fold-const.cc | 6 +- > > gcc/gimple-fold.cc| 6 ++ > > gcc/gimple-range-fold.cc | 4 +--- > > gcc/gimple-ssa-warn-access.cc | 2 +- > > gcc/tree-dfa.cc | 5 + > > gcc/tree-ssa-loop-split.cc| 2 +- > > gcc/tree-ssa-loop-unswitch.cc | 2 +- > > 8 files changed, 9 insertions(+), 20 deletions(-) > > > > diff --git a/gcc/builtins.cc b/gcc/builtins.cc > > index cb90bd03b3e..4e0a77ff8e0 100644 > > --- a/gcc/builtins.cc > > +++ b/gcc/builtins.cc > > @@ -3477,7 +3477,7 @@ expand_builtin_strnlen (tree exp, rtx target, > > machine_mode target_mode) > > > >wide_int min, max; > >value_range r; > > - get_global_range_query ()->range_of_expr (r, bound); > > + get_range_query (cfun)->range_of_expr (r, bound); > > expand doesn't have a ranger instance so this is a no-op. I'm unsure > if it would be safe given we're half GIMPLE, half RTL. Please leave it > out. It definitely does not work and can't as I tried to enable a ranger instance and it didn't work. I wrote up my experience here: https://gcc.gnu.org/pipermail/gcc/2023-September/242407.html Thanks, Andrew Pinski > > >if (r.varying_p () || r.undefined_p ()) > > return NULL_RTX; > >min = r.lower_bound (); > > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc > > index 4f8561509ff..15134b21b9f 100644 > > --- a/gcc/fold-const.cc > > +++ b/gcc/fold-const.cc > > @@ -11056,11 +11056,7 @@ expr_not_equal_to (tree t, const wide_int &w) > >if (!INTEGRAL_TYPE_P (TREE_TYPE (t))) > > return false; > > > > - if (cfun) > > - get_range_query (cfun)->range_of_expr (vr, t); > > - else > > - get_global_range_query ()->range_of_expr (vr, t); > > - > > + get_range_query (cfun)->range_of_expr (vr, t); > > These kind of changes look obvious. 
> > >if (!vr.undefined_p () && !vr.contains_p (w)) > > return true; > >/* If T has some known zero bits and W has any of those bits set, > > diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc > > index dc89975270c..853edd9e5d4 100644 > > --- a/gcc/gimple-fold.cc > > +++ b/gcc/gimple-fold.cc > > @@ -876,10 +876,8 @@ size_must_be_zero_p (tree size) > >wide_int zero = wi::zero (TYPE_PRECISION (type)); > >value_range valid_range (type, zero, ssize_max); > >value_range vr; > > - if (cfun) > > -get_range_query (cfun)->range_of_expr (vr, size); > > - else > > -get_global_range_query ()->range_of_expr (vr, size); > > + get_range_query (cfun)->range_of_expr (vr, size); > > + > >if (vr.undefined_p ()) > > vr.set_varying (TREE_TYPE (size)); > >vr.intersect (valid_range); > > diff --git a/gcc/gimple-range-fold.cc b/gcc/gimple-ran
Re: [PATCH] RISC-V Regression: Fix FAIL of bb-slp-pr65935.c for RVV
On 10/10/2023 02:39, Juzhe-Zhong wrote: Here is the reference comparing dump IR between ARM SVE and RVV. https://godbolt.org/z/zqess8Gss We can see RVV has one more dump IR: optimized: basic block part vectorized using 128 byte vectors since RVV has 1024 bit vectors. The codegen is reasonable good. However, I saw GCN also has 1024 bit vector. This patch may cause this case FAIL in GCN port ? Hi, GCN folk, could you check this patch in GCN port for me ? This patch *fixes* an existing test fail on GCN. :) It's probably one of the many I've never had time to analyze (and optimizing more than expected makes it low priority). LGTM Andrew
Re: [committed] [PR target/93062] RISC-V: Handle long conditional branches for RISC-V
I remembered another concern since we discussed this patch privately. Using ra for long calls results in a sequence that will corrupt the return-address stack. Corrupting the RAS is potentially more costly than mispredicting a branch, since it can result in a cascading sequence of mispredictions as the program returns up the stack. Of course, if these long calls are dynamically quite rare, this isn't the end of the world. But it's always preferable to use a register other than ra or t0 to avoid this performance reduction. I know nothing about the complexity of register scavenging, but it would be nice to opportunistically use a scratch register (other than t0), falling back to ra only when necessary. Tangentially, I noticed the patch uses `jump label, ra' for far branches but uses `call label' for far jumps. These corrupt the RAS in opposite ways (the former pops the RAS and the latter pushes it. Any reason for using a different sequence in one than the other? On Tue, Oct 10, 2023 at 3:11 PM Jeff Law wrote: > > > Ventana has had a variant of this patch from Andrew W. in its tree for > at least a year. I'm dusting it off and submitting it on Andrew's behalf. > > There's multiple approaches we could be using here. > > First we could make $ra fixed and use it as the scratch register for the > long branch sequences. > > Second, we could add a match_scratch to all the conditional branch > patterns and allow the register allocator to assign the scratch register > from the pool of GPRs. > > Third we could do register scavenging. This can usually work, though it > can get complex in some scenarios. > > Forth we could use trampolines for extended reach. > > Andrew's original patch did a bit of the first approach (make $ra fixed) > and mostly the second approach. The net is it was probably the worst in > terms of impacting code generation -- we lost a register *and* forced > every branch instruction to get a scratch register allocated. > > I had expected the second approach to produce better code than the > first, but that wasn't actually the case in practice. It's probably a > combination of allocating a GPR at every branch point (even with a life > of a single insn, there's a cost) and perhaps the additional operands on > conditional branches spoiling simplistic pattern matching in one or more > passes. > > In addition to performing better based on dynamic instruction counts, > the first approach is significantly simpler to implement. Given those > two positives, that's what I've chosen to go with. Yes it does remove > $ra from the set of registers available, but the impact of that is *tiny*. > > If someone wanted to dive into one of the other approaches to address a > real world impact, that's great. If that happens I would strongly > suggest also evaluating perlbench from spec2017. It seems particularly > sensitive to this issue in terms of approach #2's impact on code generation. > > I've built & regression tested this variant on the vt1 configuration > without regressions. Earlier versions have been bootstrapped as well. > > Pushed to the trunk, > > Jeff >
[PATCH] MATCH: [PR111282] Simplify `a & (b ^ ~a)` to `a & b`
While `a & (b ^ ~a)` is optimized to `a & b` on the rtl level, it is always good to optimize this at the gimple level and allows us to match a few extra things including where a is a comparison. Note I had to update/change the testcase and-1.c to avoid matching this case as we can match -2 and 1 as bitwise inversions. PR tree-optimization/111282 gcc/ChangeLog: * match.pd (`a & ~(a ^ b)`, `a & (a == b)`, `a & ((~a) ^ b)`): New patterns. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/and-1.c: Update testcase to avoid matching `~1 & (a ^ 1)` simplification. * gcc.dg/tree-ssa/bitops-6.c: New test. --- gcc/match.pd | 20 ++ gcc/testsuite/gcc.dg/tree-ssa/and-1.c| 6 ++--- gcc/testsuite/gcc.dg/tree-ssa/bitops-6.c | 33 3 files changed, 56 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/bitops-6.c diff --git a/gcc/match.pd b/gcc/match.pd index 49740d189a7..26b05c157c1 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -1358,6 +1358,26 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) && (!wascmp || element_precision (type) == 1)) (bit_ior @0 (bit_not @2) +/* a & ~(a ^ b) --> a & b */ +(simplify + (bit_and:c @0 (bit_not (bit_xor:c @0 @1))) + (bit_and @0 @1)) + +/* a & (a == b) --> a & b (boolean version of the above). */ +(simplify + (bit_and:c @0 (nop_convert? (eq:c @0 @1))) + (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) + && TYPE_PRECISION (TREE_TYPE (@0)) == 1) + (bit_and @0 @1))) + +/* a & ((~a) ^ b) --> a & b (alt version of the above 2) */ +(simplify + (bit_and:c @0 (bit_xor:c @1 @2)) + (with { bool wascmp; } + (if (bitwise_inverted_equal_p (@0, @1, wascmp) + && (!wascmp || element_precision (type) == 1)) + (bit_and @0 @2 + /* (a | b) | (a &^ b) --> a | b */ (for op (bit_and bit_xor) (simplify diff --git a/gcc/testsuite/gcc.dg/tree-ssa/and-1.c b/gcc/testsuite/gcc.dg/tree-ssa/and-1.c index 276c2b9bd8a..27d38907eea 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/and-1.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/and-1.c @@ -2,10 +2,10 @@ /* { dg-options "-O -fdump-tree-optimized-raw" } */ int f(int in) { - in = in | 3; - in = in ^ 1; + in = in | 7; + in = in ^ 3; in = (in & ~(unsigned long)1); return in; } -/* { dg-final { scan-tree-dump-not "bit_and_expr" "optimized" } } */ +/* { dg-final { scan-tree-dump-not "bit_and_expr, " "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/bitops-6.c b/gcc/testsuite/gcc.dg/tree-ssa/bitops-6.c new file mode 100644 index 000..e6ab2fd6c71 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/bitops-6.c @@ -0,0 +1,33 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized-raw" } */ +/* PR tree-optimization/111282 */ + + +int f(int a, int b) +{ + return a & (b ^ ~a); // a & b +} + +_Bool fb(_Bool x, _Bool y) +{ + return x & (y ^ !x); // x & y +} + +int fa(int w, int z) +{ + return (~w) & (w ^ z); // ~w & z +} + +int fcmp(int x, int y) +{ + _Bool a = x == 2; + _Bool b = y == 1; + return a & (b ^ !a); // (x == 2) & (y == 1) +} + +/* { dg-final { scan-tree-dump-not "bit_xor_expr, " "optimized" } } */ +/* { dg-final { scan-tree-dump-times "bit_and_expr, " 4 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "bit_not_expr, " 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-not "ne_expr, ""optimized" } } */ +/* { dg-final { scan-tree-dump-times "eq_expr, " 2 "optimized" } } */ + -- 2.39.3
Re: [committed] [PR target/93062] RISC-V: Handle long conditional branches for RISC-V
On Tue, Oct 10, 2023 at 8:26 PM Jeff Law wrote: > > > > On 10/10/23 18:24, Andrew Waterman wrote: > > I remembered another concern since we discussed this patch privately. > > Using ra for long calls results in a sequence that will corrupt the > > return-address stack. > Yup. We've actually got data on that internally, it's not showing up in > a significant way in practice. > > >I know nothing > > about the complexity of register scavenging, but it would be nice to > > opportunistically use a scratch register (other than t0), falling back > > to ra only when necessary. > The nice thing about making $ra fixed is some can add a register > scavenging approach, then fall back to $ra if they're unable to find a > register to reuse. > > > > > Tangentially, I noticed the patch uses `jump label, ra' for far > > branches but uses `call label' for far jumps. These corrupt the RAS > > in opposite ways (the former pops the RAS and the latter pushes it. > > Any reason for using a different sequence in one than the other? > I'd noticed it as well -- that's the way it was in the patch that was > already in Ventana's tree ;-) My plan was to address that separately > after dropping in enough infrastructure to allow me to force everything > to be far branches for testing purposes. Sounds like we're thinking many of the same thoughts... thanks for dragging this patch towards the finish line! > > jeff
[COMMITTED][GCC13] PR tree-optimization/111694 - Ensure float equivalences include + and - zero.
Similar patch which was checked into trunk last week. slight tweak needed as dconstm0 was not exported in gcc 13, otherwise functionally the same Bootstrapped on x86_64-pc-linux-gnu. pushed. Andrew commit f0efc4b25cba1bd35b08b7dfbab0f8fc81b55c66 Author: Andrew MacLeod Date: Mon Oct 9 13:40:15 2023 -0400 Ensure float equivalences include + and - zero. A floating point equivalence may not properly reflect both signs of zero, so be pessimsitic and ensure both signs are included. PR tree-optimization/111694 gcc/ * gimple-range-cache.cc (ranger_cache::fill_block_cache): Adjust equivalence range. * value-relation.cc (adjust_equivalence_range): New. * value-relation.h (adjust_equivalence_range): New prototype. gcc/testsuite/ * gcc.dg/pr111694.c: New. diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc index 2314478d558..e4e75943632 100644 --- a/gcc/gimple-range-cache.cc +++ b/gcc/gimple-range-cache.cc @@ -1258,6 +1258,9 @@ ranger_cache::fill_block_cache (tree name, basic_block bb, basic_block def_bb) { if (rel != VREL_EQ) range_cast (equiv_range, type); + else + adjust_equivalence_range (equiv_range); + if (block_result.intersect (equiv_range)) { if (DEBUG_RANGE_CACHE) diff --git a/gcc/testsuite/gcc.dg/pr111694.c b/gcc/testsuite/gcc.dg/pr111694.c new file mode 100644 index 000..a70b03069dc --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr111694.c @@ -0,0 +1,19 @@ +/* PR tree-optimization/111009 */ +/* { dg-do run } */ +/* { dg-options "-O2" } */ + +#define signbit(x) __builtin_signbit(x) + +static void test(double l, double r) +{ + if (l == r && (signbit(l) || signbit(r))) +; + else +__builtin_abort(); +} + +int main() +{ + test(0.0, -0.0); +} + diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc index 30a02d3c9d3..fc792a4d5bc 100644 --- a/gcc/value-relation.cc +++ b/gcc/value-relation.cc @@ -183,6 +183,25 @@ relation_transitive (relation_kind r1, relation_kind r2) return relation_kind (rr_transitive_table[r1][r2]); } +// When one name is an equivalence of another, ensure the equivalence +// range is correct. Specifically for floating point, a +0 is also +// equivalent to a -0 which may not be reflected. See PR 111694. + +void +adjust_equivalence_range (vrange &range) +{ + if (range.undefined_p () || !is_a (range)) +return; + + frange fr = as_a (range); + REAL_VALUE_TYPE dconstm0 = dconst0; + dconstm0.sign = 1; + frange zeros (range.type (), dconstm0, dconst0); + // If range includes a 0 make sure both signs of zero are included. + if (fr.intersect (zeros) && !fr.undefined_p ()) +range.union_ (zeros); + } + // This vector maps a relation to the equivalent tree code. static const tree_code relation_to_code [VREL_LAST] = { diff --git a/gcc/value-relation.h b/gcc/value-relation.h index 3177ecb1ad0..6412cbbe98b 100644 --- a/gcc/value-relation.h +++ b/gcc/value-relation.h @@ -91,6 +91,9 @@ inline bool relation_equiv_p (relation_kind r) void print_relation (FILE *f, relation_kind rel); +// Adjust range as an equivalence. +void adjust_equivalence_range (vrange &range); + class relation_oracle { public:
Re: RISC-V: Support CORE-V XCVMAC and XCVALU extensions
On Wed, Oct 11, 2023 at 6:01 PM juzhe.zh...@rivai.ai wrote: > > ../../../../gcc/gcc/doc/extend.texi:21708: warning: node next `RISC-V Vector > Intrinsics' in menu `CORE-V Built-in Functions' and in sectioning `RX > Built-in Functions' differ > ../../../../gcc/gcc/doc/extend.texi:21716: warning: node `RX Built-in > Functions' is next for `CORE-V Built-in Functions' in menu but not in > sectioning > ../../../../gcc/gcc/doc/extend.texi:21716: warning: node `RISC-V Vector > Intrinsics' is prev for `CORE-V Built-in Functions' in menu but not in > sectioning > ../../../../gcc/gcc/doc/extend.texi:21716: warning: node up `CORE-V Built-in > Functions' in menu `Target Builtins' and in sectioning `RISC-V Vector > Intrinsics' differ > ../../../../gcc/gcc/doc/extend.texi:21708: node `RISC-V Vector Intrinsics' > lacks menu item for `CORE-V Built-in Functions' despite being its Up target > ../../../../gcc/gcc/doc/extend.texi:21889: warning: node prev `RX Built-in > Functions' in menu `CORE-V Built-in Functions' and in sectioning `RISC-V > Vector Intrinsics' differ > In file included from ../../../../gcc/gcc/gensupport.cc:26:0: > ../../../../gcc/gcc/rtl.h:66:26: warning: ‘rtx_def::code’ is too small to > hold all values of ‘enum rtx_code’ > #define RTX_CODE_BITSIZE 8 > ^ > ../../../../gcc/gcc/rtl.h:318:33: note: in expansion of macro > ‘RTX_CODE_BITSIZE’ >ENUM_BITFIELD(rtx_code) code: RTX_CODE_BITSIZE; > ^~~~ > > make[2]: *** [Makefile:3534: doc/gcc.info] Error 1 > make[2]: *** Waiting for unfinished jobs > rm gfdl.pod gcc.pod gcov-dump.pod gcov-tool.pod fsf-funding.pod gpl.pod > cpp.pod gcov.pod lto-dump.pod > make[2]: Leaving directory > '/work/home/jzzhong/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/build-gcc-newlib-stage1/gcc' > make[1]: *** [Makefile:4648: all-gcc] Error 2 > make[1]: Leaving directory > '/work/home/jzzhong/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/build-gcc-newlib-stage1' > make: *** [Makefile:590: stamps/build-gcc-newlib-stage1] Error 2 This is also recorded as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111777 . It breaks more than just RISCV; it depends on the version of texinfo that is installed too. Thanks, Andrew > > > juzhe.zh...@rivai.ai
[COMMITTED] PR tree-optimization/111622 - Do not add partial equivalences with no uses.
Technically PR 111622 exposes a bug in GCC 13, but it's been papered over on trunk by this: commit 9ea74d235c7e7816b996a17c61288f02ef767985 Author: Richard Biener Date: Thu Sep 14 09:31:23 2023 +0200 tree-optimization/111294 - better DCE after forwprop This removes a lot of dead statements, but those statements were being added to the list of partial equivalences and causing some serious compile time issues. Ranger's cache loops through equivalences when it's propagating on-entry values, so if the partial equivalence list is very large, it can consume a lot of time. Typically, partial equivalence lists are small. In this case, a lot of dead stmts were not removed, so there was no redundancy elimination and it was causing an issue. This patch actually speeds things up a hair in the normal case too. Bootstrapped on x86_64-pc-linux-gnu with no regressions. pushed. Andrew
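For illustration only, a hypothetical reduction in the spirit of the PR 111622 testcase (not the actual reproducer): a function containing many casts whose results are never used. Each such cast can register a partial equivalence between its result and its operand, so the per-name equivalence lists grow with the number of dead statements and on-entry cache propagation ends up walking all of them.

void
f (long long x)
{
  /* Each result below has zero uses; with the fix, no partial
     equivalence is registered for these names.  */
  short s0 = (short) x;
  char c0 = (char) x;
  int i0 = (int) x;
  /* ...repeated many more times in the original testcase...  */
}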
[COMMITTED] [GCC13] PR tree-optimization/111622 - Do not add partial equivalences with no uses.
There are a lot of dead statements in this testcase which are casts. These were being added to the list of partial equivalences and causing some serious compile time issues. Ranger's cache loops through equivalences when it's propagating on-entry values, so if the partial equivalence list is very large, it can consume a lot of time. Typically, partial equivalence lists are small. In this case, a lot of dead stmts were not removed, so there was no redundancy elimination and it was causing an issue. Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed. Andrew From 425964b77ab5b9631e914965a7397303215c77a1 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Thu, 12 Oct 2023 17:06:36 -0400 Subject: [PATCH] Do not add partial equivalences with no uses. PR tree-optimization/111622 * value-relation.cc (equiv_oracle::add_partial_equiv): Do not register a partial equivalence if an operand has no uses. --- gcc/value-relation.cc | 9 + 1 file changed, 9 insertions(+) diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc index fc792a4d5bc..0ed5f93d184 100644 --- a/gcc/value-relation.cc +++ b/gcc/value-relation.cc @@ -389,6 +389,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) // In either case, if PE2 has an entry, we simply do nothing. if (pe2.members) return; + // If there are no uses of op2, do not register. + if (has_zero_uses (op2)) + return; // PE1 is the LHS and already has members, so everything in the set // should be a slice of PE2 rather than PE1. pe2.code = pe_min (r, pe1.code); @@ -406,6 +409,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) } if (pe2.members) { + // If there are no uses of op1, do not register. + if (has_zero_uses (op1)) + return; pe1.ssa_base = pe2.ssa_base; // If pe2 is a 16 bit value, but only an 8 bit copy, we can't be any // more than an 8 bit equivalence here, so choose MIN value. @@ -415,6 +421,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) } else { + // If there are no uses of either operand, do not register. + if (has_zero_uses (op1) || has_zero_uses (op2)) + return; // Neither name has an entry, simply create op1 as slice of op2. pe2.code = bits_to_pe (TYPE_PRECISION (TREE_TYPE (op2))); if (pe2.code == VREL_VARYING) -- 2.41.0
Re: [COMMITTED] PR tree-optimization/111622 - Do not add partial equivalences with no uses.
of course the patch would be handy... On 10/13/23 09:23, Andrew MacLeod wrote: Technically PR 111622 exposes a bug in GCC 13, but its been papered over on trunk by this: commit 9ea74d235c7e7816b996a17c61288f02ef767985 Author: Richard Biener Date: Thu Sep 14 09:31:23 2023 +0200 tree-optimization/111294 - better DCE after forwprop This removes a lot of dead statements, but those statements were being added to the list of partial equivalences and causing some serious compile time issues. Rangers cache loops through equivalences when its propagating on-entry values, so if the partial equivalence list is very large, it can consume a lot of time. Typically, partial equivalence lists are small. In this case, a lot of dead stmts were not removed, so there was no redundancy elimination and it was causing an issue. This patch actually speeds things up a hair in the normal case too. Bootstrapped on x86_64-pc-linux-gnu with no regressions. pushed. Andrew From 4eea3c1872a941089cafa105a11d8e40b1a55929 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod Date: Thu, 12 Oct 2023 17:06:36 -0400 Subject: [PATCH] Do not add partial equivalences with no uses. PR tree-optimization/111622 * value-relation.cc (equiv_oracle::add_partial_equiv): Do not register a partial equivalence if an operand has no uses. --- gcc/value-relation.cc | 9 + 1 file changed, 9 insertions(+) diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc index 0326fe7cde6..c0f513a0eb1 100644 --- a/gcc/value-relation.cc +++ b/gcc/value-relation.cc @@ -392,6 +392,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) // In either case, if PE2 has an entry, we simply do nothing. if (pe2.members) return; + // If there are no uses of op2, do not register. + if (has_zero_uses (op2)) + return; // PE1 is the LHS and already has members, so everything in the set // should be a slice of PE2 rather than PE1. pe2.code = pe_min (r, pe1.code); @@ -409,6 +412,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) } if (pe2.members) { + // If there are no uses of op1, do not register. + if (has_zero_uses (op1)) + return; pe1.ssa_base = pe2.ssa_base; // If pe2 is a 16 bit value, but only an 8 bit copy, we can't be any // more than an 8 bit equivalence here, so choose MIN value. @@ -418,6 +424,9 @@ equiv_oracle::add_partial_equiv (relation_kind r, tree op1, tree op2) } else { + // If there are no uses of either operand, do not register. + if (has_zero_uses (op1) || has_zero_uses (op2)) + return; // Neither name has an entry, simply create op1 as slice of op2. pe2.code = bits_to_pe (TYPE_PRECISION (TREE_TYPE (op2))); if (pe2.code == VREL_VARYING) -- 2.41.0
[PATCH] MATCH: [PR111432] Simplify `a & (x | CST)` to a when we know that (a & ~CST) == 0
This adds the simplification `a & (x | CST)` to a when we know that `(a & ~CST) == 0`, in a similar fashion to how `a & CST` is handled. I looked into handling `a | (x & CST)` but I don't see any decent simplifications happening. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR tree-optimization/111432 gcc/ChangeLog: * match.pd (`a & (x | CST)`): New pattern. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/bitops-7.c: New test. --- gcc/match.pd | 8  gcc/testsuite/gcc.dg/tree-ssa/bitops-7.c | 24  2 files changed, 32 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/bitops-7.c diff --git a/gcc/match.pd b/gcc/match.pd index 51e5065d086..45624f3dcb4 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -1550,6 +1550,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) && wi::bit_and_not (get_nonzero_bits (@0), wi::to_wide (@1)) == 0) @0)) + +/* `a & (x | CST)` -> a if we know that (a & ~CST) == 0 */ +(simplify + (bit_and:c SSA_NAME@0 (bit_ior @1 INTEGER_CST@2)) + (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) + && wi::bit_and_not (get_nonzero_bits (@0), wi::to_wide (@2)) == 0) + @0)) + /* x | C -> C if we know that x & ~C == 0. */ (simplify (bit_ior SSA_NAME@0 INTEGER_CST@1) diff --git a/gcc/testsuite/gcc.dg/tree-ssa/bitops-7.c b/gcc/testsuite/gcc.dg/tree-ssa/bitops-7.c new file mode 100644 index 000..7fb18db3a11 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/bitops-7.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-options "-O1 -fdump-tree-optimized-raw" } */ +/* PR tree-optimization/111432 */ + +int +foo3(int c, int bb) +{ + if ((bb & ~3)!=0) __builtin_unreachable(); + return (bb & (c|3)); +} + +int +foo_bool(int c, _Bool bb) +{ + return (bb & (c|7)); +} + +/* Both of these functions should be able to remove the `IOR` and `AND` + as the only bits that are non-zero for bb is set on the other side + of the `AND`. + */ + +/* { dg-final { scan-tree-dump-not "bit_ior_expr, " "optimized" } } */ +/* { dg-final { scan-tree-dump-not "bit_and_expr, " "optimized" } } */ -- 2.39.3
[PATCH 2/2] [c] Fix PR 101364: ICE after error due to diagnose_arglist_conflict not checking for error
When checking to see if a function declaration has a conflict due to promotions, there is no test to see whether the type was an error mark before c_type_promotes_to is called. c_type_promotes_to is not ready for error_mark and causes an ICE. This adds a check for error before the call to c_type_promotes_to. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR c/101364 gcc/c/ChangeLog: * c-decl.cc (diagnose_arglist_conflict): Test for error mark before calling c_type_promotes_to. gcc/testsuite/ChangeLog: * gcc.dg/pr101364-1.c: New test. --- gcc/c/c-decl.cc | 3 ++- gcc/testsuite/gcc.dg/pr101364-1.c | 8 2 files changed, 10 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.dg/pr101364-1.c diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc index 5822faf01b4..eb2df08c0a7 100644 --- a/gcc/c/c-decl.cc +++ b/gcc/c/c-decl.cc @@ -1899,7 +1899,8 @@ diagnose_arglist_conflict (tree newdecl, tree olddecl, break; } - if (c_type_promotes_to (type) != type) + if (!error_operand_p (type) + && c_type_promotes_to (type) != type) { inform (input_location, "an argument type that has a default " "promotion cannot match an empty parameter name list " diff --git a/gcc/testsuite/gcc.dg/pr101364-1.c b/gcc/testsuite/gcc.dg/pr101364-1.c new file mode 100644 index 000..e7c94a05553 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr101364-1.c @@ -0,0 +1,8 @@ +/* { dg-do compile } */ +/* { dg-options "-std=c90 "} */ + +void fruit(); /* { dg-message "previous declaration" } */ +void fruit( /* { dg-error "conflicting types for" } */ +int b[x], /* { dg-error "undeclared " } */ +short c) +{} /* { dg-message "an argument type that has a" } */ -- 2.39.3
[PATCH 1/2] Fix ICE due to c_safe_arg_type_equiv_p not checking for error_mark node
This is a simple error recovery issue when c_safe_arg_type_equiv_p was added in r8-5312-gc65e18d3331aa999. The issue is that after an error, an argument type (of a function type) might turn into an error mark node and c_safe_arg_type_equiv_p was not ready for that. So this just adds a check for error operand for its arguments before getting the main variant. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR c/101285 gcc/c/ChangeLog: * c-typeck.cc (c_safe_arg_type_equiv_p): Return true for error operands early. gcc/testsuite/ChangeLog: * gcc.dg/pr101285-1.c: New test. --- gcc/c/c-typeck.cc | 3 +++ gcc/testsuite/gcc.dg/pr101285-1.c | 10 ++ 2 files changed, 13 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/pr101285-1.c diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc index e55e887da14..6e044b4afbc 100644 --- a/gcc/c/c-typeck.cc +++ b/gcc/c/c-typeck.cc @@ -5960,6 +5960,9 @@ handle_warn_cast_qual (location_t loc, tree type, tree otype) static bool c_safe_arg_type_equiv_p (tree t1, tree t2) { + if (error_operand_p (t1) || error_operand_p (t2)) +return true; + t1 = TYPE_MAIN_VARIANT (t1); t2 = TYPE_MAIN_VARIANT (t2); diff --git a/gcc/testsuite/gcc.dg/pr101285-1.c b/gcc/testsuite/gcc.dg/pr101285-1.c new file mode 100644 index 000..831e35f7662 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr101285-1.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-W -Wall" } */ +const int b; +typedef void (*ft1)(int[b++]); /* { dg-error "read-only variable" } */ +void bar(int * z); +void baz() +{ +(ft1) bar; /* { dg-warning "statement with no effect" } */ +} + -- 2.39.3
[PATCH] MATCH: Improve `A CMP 0 ? A : -A` set of patterns to use bitwise_equal_p.
This improves the `A CMP 0 ? A : -A` set of match patterns to use bitwise_equal_p which allows an nop cast between signed and unsigned. This allows catching a few extra cases which were not being caught before. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. gcc/ChangeLog: PR tree-optimization/101541 * match.pd (A CMP 0 ? A : -A): Improve using bitwise_equal_p. gcc/testsuite/ChangeLog: PR tree-optimization/101541 * gcc.dg/tree-ssa/phi-opt-36.c: New test. * gcc.dg/tree-ssa/phi-opt-37.c: New test. --- gcc/match.pd | 49 - gcc/testsuite/gcc.dg/tree-ssa/phi-opt-36.c | 51 ++ gcc/testsuite/gcc.dg/tree-ssa/phi-opt-37.c | 24 ++ 3 files changed, 104 insertions(+), 20 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-36.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-37.c diff --git a/gcc/match.pd b/gcc/match.pd index 45624f3dcb4..142e2dfbeb1 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -5668,42 +5668,51 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) /* A == 0 ? A : -Asame as -A */ (for cmp (eq uneq) (simplify - (cnd (cmp @0 zerop) @0 (negate@1 @0)) -(if (!HONOR_SIGNED_ZEROS (type)) + (cnd (cmp @0 zerop) @2 (negate@1 @2)) +(if (!HONOR_SIGNED_ZEROS (type) +&& bitwise_equal_p (@0, @2)) @1)) (simplify - (cnd (cmp @0 zerop) zerop (negate@1 @0)) -(if (!HONOR_SIGNED_ZEROS (type)) + (cnd (cmp @0 zerop) zerop (negate@1 @2)) +(if (!HONOR_SIGNED_ZEROS (type) +&& bitwise_equal_p (@0, @2)) @1)) ) /* A != 0 ? A : -Asame as A */ (for cmp (ne ltgt) (simplify - (cnd (cmp @0 zerop) @0 (negate @0)) -(if (!HONOR_SIGNED_ZEROS (type)) - @0)) + (cnd (cmp @0 zerop) @1 (negate @1)) +(if (!HONOR_SIGNED_ZEROS (type) +&& bitwise_equal_p (@0, @1)) + @1)) (simplify - (cnd (cmp @0 zerop) @0 integer_zerop) -(if (!HONOR_SIGNED_ZEROS (type)) - @0)) + (cnd (cmp @0 zerop) @1 integer_zerop) +(if (!HONOR_SIGNED_ZEROS (type) +&& bitwise_equal_p (@0, @1)) + @1)) ) /* A >=/> 0 ? A : -Asame as abs (A) */ (for cmp (ge gt) (simplify - (cnd (cmp @0 zerop) @0 (negate @0)) -(if (!HONOR_SIGNED_ZEROS (type) -&& !TYPE_UNSIGNED (type)) - (abs @0 + (cnd (cmp @0 zerop) @1 (negate @1)) +(if (!HONOR_SIGNED_ZEROS (TREE_TYPE(@0)) +&& !TYPE_UNSIGNED (TREE_TYPE(@0)) +&& bitwise_equal_p (@0, @1)) + (if (TYPE_UNSIGNED (type)) + (absu:type @0) + (abs @0) /* A <=/< 0 ? A : -Asame as -abs (A) */ (for cmp (le lt) (simplify - (cnd (cmp @0 zerop) @0 (negate @0)) -(if (!HONOR_SIGNED_ZEROS (type) -&& !TYPE_UNSIGNED (type)) - (if (ANY_INTEGRAL_TYPE_P (type) - && !TYPE_OVERFLOW_WRAPS (type)) + (cnd (cmp @0 zerop) @1 (negate @1)) +(if (!HONOR_SIGNED_ZEROS (TREE_TYPE(@0)) +&& !TYPE_UNSIGNED (TREE_TYPE(@0)) +&& bitwise_equal_p (@0, @1)) + (if ((ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0)) + && !TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))) + || TYPE_UNSIGNED (type)) (with { - tree utype = unsigned_type_for (type); + tree utype = unsigned_type_for (TREE_TYPE(@0)); } (convert (negate (absu:utype @0 (negate (abs @0) diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-36.c b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-36.c new file mode 100644 index 000..4baf9f82a22 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-36.c @@ -0,0 +1,51 @@ +/* { dg-options "-O2 -fdump-tree-phiopt" } */ + +unsigned f0(int A) +{ + unsigned t = A; +// A == 0? A : -Asame as -A + if (A == 0) return t; + return -t; +} + +unsigned f1(int A) +{ + unsigned t = A; +// A != 0? A : -Asame as A + if (A != 0) return t; + return -t; +} +unsigned f2(int A) +{ + unsigned t = A; +// A >= 0? 
A : -Asame as abs (A) + if (A >= 0) return t; + return -t; +} +unsigned f3(int A) +{ + unsigned t = A; +// A > 0? A : -Asame as abs (A) + if (A > 0) return t; + return -t; +} +unsigned f4(int A) +{ + unsigned t = A; +// A <= 0? A : -Asame as -abs (A) + if (A <= 0) return t; + return -t; +} +unsigned f5(int A) +{ + unsigned t = A; +// A < 0? A : -Asame as -abs (A) + if (A < 0) return t; + return -t; +} + +/* f4 and f5 are not allowed to be optimized in early phi-opt. */ +/* { dg-final { scan-tree-dump-times "if " 2 "phiopt1" } } */ +/* { dg-final { scan-tree-dump-not "if " "phiopt2" } } */ + + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-37.c b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-37.c new file mode 100644 index 000..f1ff472aaff --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-37.c @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-options "-O1 -fdump-tree-phiopt1" } */ + +unsigned abs_with_convert0 (int x) +{ +unsigned int y = x
[PATCH] Improve factor_out_conditional_operation for conversions and constants
In the case of a NOP conversion (precisions of the 2 types are equal), factoring out the conversion can be done even if int_fits_type_p returns false and even when the conversion is defined by a statement inside the conditional. Since it is a NOP conversion there is no zero/sign extending happening which is why it is ok to be done here. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. gcc/ChangeLog: PR tree-optimization/104376 PR tree-optimization/101541 * tree-ssa-phiopt.cc (factor_out_conditional_operation): Allow nop conversions even if it is defined by a statement inside the conditional. gcc/testsuite/ChangeLog: PR tree-optimization/101541 * gcc.dg/tree-ssa/phi-opt-38.c: New test. --- gcc/testsuite/gcc.dg/tree-ssa/phi-opt-38.c | 44 ++ gcc/tree-ssa-phiopt.cc | 8 +++- 2 files changed, 50 insertions(+), 2 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-38.c diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-38.c b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-38.c new file mode 100644 index 000..ca04d1619e6 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-38.c @@ -0,0 +1,44 @@ +/* { dg-options "-O2 -fdump-tree-phiopt" } */ + +unsigned f0(int A) +{ +// A == 0? A : -Asame as -A + if (A == 0) return A; + return -A; +} + +unsigned f1(int A) +{ +// A != 0? A : -Asame as A + if (A != 0) return A; + return -A; +} +unsigned f2(int A) +{ +// A >= 0? A : -Asame as abs (A) + if (A >= 0) return A; + return -A; +} +unsigned f3(int A) +{ +// A > 0? A : -Asame as abs (A) + if (A > 0) return A; + return -A; +} +unsigned f4(int A) +{ +// A <= 0? A : -Asame as -abs (A) + if (A <= 0) return A; + return -A; +} +unsigned f5(int A) +{ +// A < 0? A : -Asame as -abs (A) + if (A < 0) return A; + return -A; +} + +/* f4 and f5 are not allowed to be optimized in early phi-opt. */ +/* { dg-final { scan-tree-dump-times "if" 2 "phiopt1" } } */ +/* { dg-final { scan-tree-dump-not "if" "phiopt2" } } */ + diff --git a/gcc/tree-ssa-phiopt.cc b/gcc/tree-ssa-phiopt.cc index 312a6f9082b..0ab8fad5898 100644 --- a/gcc/tree-ssa-phiopt.cc +++ b/gcc/tree-ssa-phiopt.cc @@ -310,7 +310,9 @@ factor_out_conditional_operation (edge e0, edge e1, gphi *phi, return NULL; /* If arg1 is an INTEGER_CST, fold it to new type. */ if (INTEGRAL_TYPE_P (TREE_TYPE (new_arg0)) - && int_fits_type_p (arg1, TREE_TYPE (new_arg0))) + && (int_fits_type_p (arg1, TREE_TYPE (new_arg0)) + || TYPE_PRECISION (TREE_TYPE (new_arg0)) + == TYPE_PRECISION (TREE_TYPE (arg1 { if (gimple_assign_cast_p (arg0_def_stmt)) { @@ -323,7 +325,9 @@ factor_out_conditional_operation (edge e0, edge e1, gphi *phi, its basic block, because then it is possible this could enable further optimizations (minmax replacement etc.). See PR71016. */ - if (new_arg0 != gimple_cond_lhs (cond_stmt) + if (TYPE_PRECISION (TREE_TYPE (new_arg0)) + != TYPE_PRECISION (TREE_TYPE (arg1)) + && new_arg0 != gimple_cond_lhs (cond_stmt) && new_arg0 != gimple_cond_rhs (cond_stmt) && gimple_bb (arg0_def_stmt) == e0->src) { -- 2.34.1
[PATCH] [PR31531] MATCH: Improve ~a < ~b and ~a < CST, allow a nop cast inbetween ~ and a/b
Currently we able to simplify `~a CMP ~b` to `b CMP a` but we should allow a nop conversion in between the `~` and the `a` which can show up. A similarly thing should be done for `~a CMP CST`. I had originally submitted the `~a CMP CST` case as https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585088.html; I noticed we should do the same thing for the `~a CMP ~b` case and combined it with that one here. OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR tree-optimization/31531 gcc/ChangeLog: * match.pd (~X op ~Y): Allow for an optional nop convert. (~X op C): Likewise. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/pr31531-1.c: New test. * gcc.dg/tree-ssa/pr31531-2.c: New test. --- gcc/match.pd | 10 --- gcc/testsuite/gcc.dg/tree-ssa/pr31531-1.c | 19 + gcc/testsuite/gcc.dg/tree-ssa/pr31531-2.c | 34 +++ 3 files changed, 59 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr31531-1.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr31531-2.c diff --git a/gcc/match.pd b/gcc/match.pd index 51e5065d086..e76ec1ec034 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -5944,18 +5944,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) /* Fold ~X op ~Y as Y op X. */ (for cmp (simple_comparison) (simplify - (cmp (bit_not@2 @0) (bit_not@3 @1)) + (cmp (nop_convert1?@4 (bit_not@2 @0)) (nop_convert2? (bit_not@3 @1))) (if (single_use (@2) && single_use (@3)) - (cmp @1 @0 + (with { tree otype = TREE_TYPE (@4); } +(cmp (convert:otype @1) (convert:otype @0)) /* Fold ~X op C as X op' ~C, where op' is the swapped comparison. */ (for cmp (simple_comparison) scmp (swapped_simple_comparison) (simplify - (cmp (bit_not@2 @0) CONSTANT_CLASS_P@1) + (cmp (nop_convert? (bit_not@2 @0)) CONSTANT_CLASS_P@1) (if (single_use (@2) && (TREE_CODE (@1) == INTEGER_CST || TREE_CODE (@1) == VECTOR_CST)) - (scmp @0 (bit_not @1) + (with { tree otype = TREE_TYPE (@1); } +(scmp (convert:otype @0) (bit_not @1)) (for cmp (simple_comparison) (simplify diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr31531-1.c b/gcc/testsuite/gcc.dg/tree-ssa/pr31531-1.c new file mode 100644 index 000..c27299151eb --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr31531-1.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ +/* PR tree-optimization/31531 */ + +int f(int a) +{ + int b = ~a; + return b<0; +} + + +int f1(unsigned a) +{ + int b = ~a; + return b<0; +} +/* We should convert the above two functions from b <0 to ((int)a) >= 0. */ +/* { dg-final { scan-tree-dump-times ">= 0" 2 "optimized"} } */ +/* { dg-final { scan-tree-dump-times "~" 0 "optimized"} } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr31531-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr31531-2.c new file mode 100644 index 000..865ea292215 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr31531-2.c @@ -0,0 +1,34 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ +/* PR tree-optimization/31531 */ + +int f0(unsigned x, unsigned t) +{ +x = ~x; +t = ~t; +int xx = x; +int tt = t; +return tt < xx; +} + +int f1(unsigned x, int t) +{ +x = ~x; +t = ~t; +int xx = x; +int tt = t; +return tt < xx; +} + +int f2(int x, unsigned t) +{ +x = ~x; +t = ~t; +int xx = x; +int tt = t; +return tt < xx; +} + + +/* We should be able to remove all ~ from the above functions. */ +/* { dg-final { scan-tree-dump-times "~" 0 "optimized"} } */ -- 2.39.3
Re: [PATCH] Add files to discourage submissions of PRs to the GitHub mirror.
On Mon, Oct 16, 2023, 16:39 Eric Gallager wrote: > Currently there is an unofficial mirror of GCC on GitHub that people > sometimes submit pull requests to: > https://github.com/gcc-mirror/gcc > However, this is not the proper way to contribute to GCC, so that means > that someone (usually Jonathan Wakely) has to go through the PRs and > manually tell people that they're sending their PRs to the wrong place. > One thing that would help mitigate this problem would be files in a > special .github directory that GitHub would automatically open when > contributors attempt to open a PR, that would then tell them the proper > way to contribute instead. This patch attempts to add two such files. > They are written in Markdown, which I'm realizing might require some > special handling in this repository, since the ".md" extension is also > used for GCC's "Machine Description" files here, but I'm not quite sure > how to go about handling that. Also note that I adapted these files from > equivalent files in the git repository for Git itself: > https://github.com/git/git/blob/master/.github/CONTRIBUTING.md > https://github.com/git/git/blob/master/.github/PULL_REQUEST_TEMPLATE.md > What do people think? > I think this is a great idea. Is a similar one for opening issues too? Thanks, Andrew ChangeLog: > > * .github/CONTRIBUTING.md: New file. > * .github/PULL_REQUEST_TEMPLATE.md: New file. > --- > .github/CONTRIBUTING.md | 18 ++ > .github/PULL_REQUEST_TEMPLATE.md | 5 + > 2 files changed, 23 insertions(+) > create mode 100644 .github/CONTRIBUTING.md > create mode 100644 .github/PULL_REQUEST_TEMPLATE.md > > diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md > new file mode 100644 > index ..4f7b3abca5f4 > --- /dev/null > +++ b/.github/CONTRIBUTING.md > @@ -0,0 +1,18 @@ > +## Contributing to GCC > + > +Thanks for taking the time to contribute to GCC! Please be advised that > if you are > +viewing this on `github.com`, that the mirror there is unofficial and > unmonitored. > +The GCC community does not use `github.com` for their contributions. > Instead, we use > +a mailing list (`gcc-patches@gcc.gnu.org`) for code submissions, code > +reviews, and bug reports. > + > +Perhaps one day it will be possible to use [GitGitGadget]( > https://gitgitgadget.github.io/) to > +conveniently send Pull Requests commits to GCC's mailing list, the way > that the Git project currently allows it to be used to send PRs to their > mailing list, but until that day arrives, please send your patches to the > mailing list manually. > + > +Please read ["Contributing to GCC"](https://gcc.gnu.org/contribute.html) > on the main GCC website > +to learn how the GCC project is managed, and how you can work with it. > +In addition, we highly recommend you to read [our guidelines for > read-write Git access](https://gcc.gnu.org/gitwrite.html). > + > +Or, you can follow the ["Contributing to GCC in 10 easy steps"]( > https://gcc.gnu.org/wiki/GettingStarted#Basics:_Contributing_to_GCC_in_10_easy_steps) > section of the ["Getting Started" page]( > https://gcc.gnu.org/wiki/GettingStarted) on [the wiki]( > https://gcc.gnu.org/wiki) for another example of the contribution process. > + > +Your friendly GCC community! > diff --git a/.github/PULL_REQUEST_TEMPLATE.md > b/.github/PULL_REQUEST_TEMPLATE.md > new file mode 100644 > index ..6417392c8cf3 > --- /dev/null > +++ b/.github/PULL_REQUEST_TEMPLATE.md > @@ -0,0 +1,5 @@ > +Thanks for taking the time to contribute to GCC! 
Please be advised that > if you are > +viewing this on `github.com`, that the mirror there is unofficial and > unmonitored. > +The GCC community does not use `github.com` for their contributions. > Instead, we use > +a mailing list (`gcc-patches@gcc.gnu.org`) for code submissions, code > reviews, and > +bug reports. Please send patches there instead. >
Re: [PATCH 11/11] aarch64: Add new load/store pair fusion pass.
On Tue, Oct 17, 2023 at 1:52 PM Alex Coplan wrote: > > This adds a new aarch64-specific RTL-SSA pass dedicated to forming load > and store pairs (LDPs and STPs). > > As a motivating example for the kind of thing this improves, take the > following testcase: > > extern double c[20]; > > double f(double x) > { > double y = x*x; > y += c[16]; > y += c[17]; > y += c[18]; > y += c[19]; > return y; > } > > for which we currently generate (at -O2): > > f: > adrpx0, c > add x0, x0, :lo12:c > ldp d31, d29, [x0, 128] > ldr d30, [x0, 144] > fmadd d0, d0, d0, d31 > ldr d31, [x0, 152] > faddd0, d0, d29 > faddd0, d0, d30 > faddd0, d0, d31 > ret > > but with the pass, we generate: > > f: > .LFB0: > adrpx0, c > add x0, x0, :lo12:c > ldp d31, d29, [x0, 128] > fmadd d0, d0, d0, d31 > ldp d30, d31, [x0, 144] > faddd0, d0, d29 > faddd0, d0, d30 > faddd0, d0, d31 > ret > > The pass is local (only considers a BB at a time). In theory, it should > be possible to extend it to run over EBBs, at least in the case of pure > (MEM_READONLY_P) loads, but this is left for future work. > > The pass works by identifying two kinds of bases: tree decls obtained > via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos. > If a candidate memory access has a MEM_EXPR base, then we track it via > this base, and otherwise if it is of a simple reg + form, we track > it via the RTL-SSA def_info for the register. > > For each BB, for a given kind of base, we build up a hash table mapping > the base to an access_group. The access_group data structure holds a > list of accesses at each offset relative to the same base. It uses a > splay tree to support efficient insertion (while walking the bb), and > the nodes are chained using a linked list to support efficient > iteration (while doing the transformation). > > For each base, we then iterate over the access_group to identify > adjacent accesses, and try to form load/store pairs for those insns that > access adjacent memory. > > The pass is currently run twice, both before and after register > allocation. The first copy of the pass is run late in the pre-RA RTL > pipeline, immediately after sched1, since it was found that sched1 was > increasing register pressure when the pass was run before. The second > copy of the pass runs immediately before peephole2, so as to get any > opportunities that the existing ldp/stp peepholes can handle. > > There are some cases that we punt on before RA, e.g. > accesses relative to eliminable regs (such as the soft frame pointer). > We do this since we can't know the elimination offset before RA, and we > want to avoid the RA reloading the offset (due to being out of ldp/stp > immediate range) as this can generate worse code. > > The post-RA copy of the pass is there to pick up the crumbs that were > left behind / things we punted on in the pre-RA pass. Among other > things, it's needed to handle accesses relative to the stack pointer > (see the previous patch in the series for an example). It can also > handle code that didn't exist at the time the pre-RA pass was run (spill > code, prologue/epilogue code). 
> > The following table shows the effect of the passes on code size in > SPEC CPU 2017 with -Os -flto=auto -mcpu=neoverse-v1: > > +-+-+--+-+ > |Benchmark| Pre-RA pass | Post-RA pass | Overall | > +-+-+--+-+ > | 541.leela_r | 0.04% | -0.03% | 0.01% | > | 502.gcc_r | -0.07% | -0.02% | -0.09% | > | 510.parest_r| -0.06% | -0.04% | -0.10% | > | 505.mcf_r | -0.12% | 0.00%| -0.12% | > | 500.perlbench_r | -0.12% | -0.02% | -0.15% | > | 520.omnetpp_r | -0.13% | -0.03% | -0.16% | > | 538.imagick_r | -0.17% | -0.02% | -0.19% | > | 525.x264_r | -0.17% | -0.02% | -0.19% | > | 544.nab_r | -0.22% | -0.01% | -0.23% | > | 557.xz_r| -0.27% | -0.01% | -0.28% | > | 507.cactuBSSN_r | -0.26% | -0.03% | -0.29% | > | 526.blender_r | -0.37% | -0.02% | -0.38% | > | 523.xalancbmk_r | -0.41% | -0.01% | -0.42% | > | 531.deepsjeng_r | -0.41% | -0.05% | -0.46% | > | 511.povray_r| -0.60% | -0.05% | -0.65% | > | 548.exchange2_r | -0.55% | -0.32% | -0.86% | > | 527.cam4_r | -0.82% | -0.16% | -0.98% | > | 503.bwaves_r| -0.63% | -0.41% | -1.04% | > | 521.wrf_r | -1.04% | -0.06% | -1.10% | > | 549.fotonik3d_r | -0.91% | -0.35% | -1.26% | > | 554.roms_r | -1.20% | -0.20% | -1.40% | > | 519.lbm_r | -1.91% | 0.00%| -1
aarch64: Replace duplicated selftests
Pushed as obvious. gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_test_fractional_cost): Test <= instead of testing < twice. diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 2b0de7ca0389be6698c329b54f9501b8ec09183f..9c3c0e705e2e6ea3b55b4a5f1e7d3360f91eb51d 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -27529,18 +27529,18 @@ aarch64_test_fractional_cost () ASSERT_EQ (cf (2, 3) * 5, cf (10, 3)); ASSERT_EQ (14 * cf (11, 21), cf (22, 3)); - ASSERT_TRUE (cf (4, 15) < cf (5, 15)); - ASSERT_FALSE (cf (5, 15) < cf (5, 15)); - ASSERT_FALSE (cf (6, 15) < cf (5, 15)); - ASSERT_TRUE (cf (1, 3) < cf (2, 5)); - ASSERT_TRUE (cf (1, 12) < cf (1, 6)); - ASSERT_FALSE (cf (5, 3) < cf (5, 3)); - ASSERT_TRUE (cf (239, 240) < 1); - ASSERT_FALSE (cf (240, 240) < 1); - ASSERT_FALSE (cf (241, 240) < 1); - ASSERT_FALSE (2 < cf (207, 104)); - ASSERT_FALSE (2 < cf (208, 104)); - ASSERT_TRUE (2 < cf (209, 104)); + ASSERT_TRUE (cf (4, 15) <= cf (5, 15)); + ASSERT_TRUE (cf (5, 15) <= cf (5, 15)); + ASSERT_FALSE (cf (6, 15) <= cf (5, 15)); + ASSERT_TRUE (cf (1, 3) <= cf (2, 5)); + ASSERT_TRUE (cf (1, 12) <= cf (1, 6)); + ASSERT_TRUE (cf (5, 3) <= cf (5, 3)); + ASSERT_TRUE (cf (239, 240) <= 1); + ASSERT_TRUE (cf (240, 240) <= 1); + ASSERT_FALSE (cf (241, 240) <= 1); + ASSERT_FALSE (2 <= cf (207, 104)); + ASSERT_TRUE (2 <= cf (208, 104)); + ASSERT_TRUE (2 <= cf (209, 104)); ASSERT_TRUE (cf (4, 15) < cf (5, 15)); ASSERT_FALSE (cf (5, 15) < cf (5, 15));
[0/3] target_version and aarch64 function multiversioning
This series adds support for function multiversioning on aarch64. There are a few minor issues in patch 2/3, that I intend to fix in future versions or follow-up patches. I also have some open questions about the correctness of existing function multiversioning implementations [1], that could affect some details of this patch series. Patches 1/3 and 2/3 both pass regression testing on x86. Patch 2/3 requires adding function multiversioning tests to aarch64, which I haven't included yet. Patch 3/3 demonstrates a potential approach for improving consistency of symbol naming between target_clones and target/target_version multiversioning, but would require agreement on how to resolve some of the issues discussed in [1]. Thanks, Andrew [1] https://gcc.gnu.org/pipermail/gcc/2023-October/242686.html
[1/3] Add support for target_version attribute
This patch adds support for the "target_version" attribute to the middle end and the C++ frontend, which will be used to implement function multiversioning in the aarch64 backend. Note that C++ is currently the only frontend which supports multiversioning using the "target" attribute, whereas the "target_clones" attribute is additionally supported in C, D and Ada. Support for the target_version attribute will be extended to C at a later date. Targets that currently use the "target" attribute for function multiversioning (i.e. i386 and rs6000) are not affected by this patch. I could have implemented the target hooks slightly differently, by reusing the valid_attribute_p hook and adding attribute name checks to each backend implementation (c.f. the aarch64 implementation in patch 2/3). Would this be preferable? Otherwise, is this ok for master? gcc/c-family/ChangeLog: * c-attribs.cc (handle_target_version_attribute): New. (c_common_attribute_table): Add target_version. (handle_target_clones_attribute): Add conflict with target_version attribute. gcc/ChangeLog: * attribs.cc (is_function_default_version): Update comment to specify incompatibility with target_version attributes. * cgraphclones.cc (cgraph_node::create_version_clone_with_body): Call valid_version_attribute_p for target_version attributes. * target.def (valid_version_attribute_p): New hook. (expanded_clones_attribute): New hook. * doc/tm.texi.in: Add new hooks. * doc/tm.texi: Regenerate. * multiple_target.cc (create_dispatcher_calls): Remove redundant is_function_default_version check. (expand_target_clones): Use target hook for attribute name. * targhooks.cc (default_target_option_valid_version_attribute_p): New. * targhooks.h (default_target_option_valid_version_attribute_p): New. * tree.h (DECL_FUNCTION_VERSIONED): Update comment to include target_version attributes. gcc/cp/ChangeLog: * decl2.cc (check_classfn): Update comment to include target_version attributes. diff --git a/gcc/attribs.cc b/gcc/attribs.cc index b1300018d1e8ed8e02ded1ea721dc192a6d32a49..a3c4a81e8582ea4fd06b9518bf51fad7c998ddd6 100644 --- a/gcc/attribs.cc +++ b/gcc/attribs.cc @@ -1233,8 +1233,9 @@ make_dispatcher_decl (const tree decl) return func_decl; } -/* Returns true if decl is multi-versioned and DECL is the default function, - that is it is not tagged with target specific optimization. */ +/* Returns true if DECL is multi-versioned using the target attribute, and this + is the default version. This function can only be used for targets that do + not support the "target_version" attribute. 
*/ bool is_function_default_version (const tree decl) diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc index 072cfb69147bd6b314459c0bd48a0c1fb92d3e4d..1a224c036277d51ab4dc0d33a403177bd226e48a 100644 --- a/gcc/c-family/c-attribs.cc +++ b/gcc/c-family/c-attribs.cc @@ -148,6 +148,7 @@ static tree handle_alloc_align_attribute (tree *, tree, tree, int, bool *); static tree handle_assume_aligned_attribute (tree *, tree, tree, int, bool *); static tree handle_assume_attribute (tree *, tree, tree, int, bool *); static tree handle_target_attribute (tree *, tree, tree, int, bool *); +static tree handle_target_version_attribute (tree *, tree, tree, int, bool *); static tree handle_target_clones_attribute (tree *, tree, tree, int, bool *); static tree handle_optimize_attribute (tree *, tree, tree, int, bool *); static tree ignore_attribute (tree *, tree, tree, int, bool *); @@ -480,6 +481,8 @@ const struct attribute_spec c_common_attribute_table[] = handle_error_attribute, NULL }, { "target", 1, -1, true, false, false, false, handle_target_attribute, NULL }, + { "target_version", 1, -1, true, false, false, false, + handle_target_version_attribute, NULL }, { "target_clones", 1, -1, true, false, false, false, handle_target_clones_attribute, NULL }, { "optimize", 1, -1, true, false, false, false, @@ -5569,6 +5572,45 @@ handle_target_attribute (tree *node, tree name, tree args, int flags, return NULL_TREE; } +/* Handle a "target_version" attribute. */ + +static tree +handle_target_version_attribute (tree *node, tree name, tree args, int flags, + bool *no_add_attrs) +{ + /* Ensure we have a function type. */ + if (TREE_CODE (*node) != FUNCTION_DECL) +{ + warning (OPT_Wattributes, "%qE attribute ignored", name); + *no_add_attrs = true; +} + else if (lookup_attribute ("target_clones", DECL_ATTRIBUTES (*node))) +{ + warning (OPT_Wattributes, "%qE attribute ignored due to conflict " + "with %qs attribute
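As a usage sketch to accompany the description above (my own illustration, not taken from the patch; it requires the C++ front end, the only one wired up here, and the "sve" feature string is a placeholder from the ACLE draft rather than something this middle-end patch defines):

/* Each annotated definition is one version; "default" marks the
   fallback.  Calls to get_lane_count are dispatched through an
   ifunc resolver generated by the target.  */
__attribute__ ((target_version ("default")))
int get_lane_count (void) { return 4; }

__attribute__ ((target_version ("sve")))
int get_lane_count (void) { return 8; }

int use (void) { return get_lane_count (); }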
[2/3] [aarch64] Add function multiversioning support
This adds initial support for function multiversion on aarch64 using the target_version and target_clones attributes. This mostly follows the Beta specification in the ACLE [1], with a few diffences that remain to be fixed: - Symbol mangling for target_clones differs from that for target_version and does not match the mangling specified in the ACLE. This inconsistency is also present in i386 and rs6000 mangling. - The target_clones attribute does not currently support an implicit "default" version. - Unrecognised target names in a target_clones attribute should be ignored (with an optional warning), but currently cause an error to be raised instead. - There is no option to disable function multiversioning at compile time. - There is no support for function multiversioning in C, since this is not yet enabled in the frontend. On the other hand, this patch happens to enable multiversioning in Ada and D as well, using their existing frontend support. This patch relies on adding functionality to libgcc, to support: - struct { unsigned long long features; } __aarch64_cpu_features; - void __init_cpu_features (void); - void __init_cpu_features_resolver (unsigned long hwcap, const __ifunc_arg_t *arg); This support matches the interface currently used in LLVM's compiler-rt, and will be implemented in a future patch (which will be merged before merging this patch). This version of the patch incorrectly uses __init_cpu_features in the ifunc resolvers, which could lead to invalid library calls at load time. I will fix this to use __init_cpu_features_resolver in a future version of the patch. [1] https://github.com/ARM-software/acle/blob/main/main/acle.md#function-multi-versioning gcc/ChangeLog: * attribs.cc (decl_attributes): Pass attribute name to target hook. * config/aarch64/aarch64.cc (aarch64_process_target_version_attr): New. (aarch64_option_valid_attribute_p): Add check and support for target_version attribute. (enum CPUFeatures): New list of for bitmask positions. (aarch64_fmv_feature_data): New. (get_feature_bit): New. (get_feature_mask_for_version): New. (compare_feature_masks): New. (aarch64_compare_version_priority): New. (make_resolver_func): New. (add_condition_to_bb): New. (compare_feature_version_info): New. (dispatch_function_versions): New. (aarch64_generate_version_dispatcher_body): New. (aarch64_get_function_versions_dispatcher): New. (aarch64_common_function_versions): New. (aarch64_mangle_decl_assembler_name): New. (TARGET_OPTION_VALID_VERSION_ATTRIBUTE_P): New implementation. (TARGET_OPTION_EXPANDED_CLONES_ATTRIBUTE): New implementation. (TARGET_OPTION_FUNCTION_VERSIONS): New implementation. (TARGET_COMPARE_VERSION_PRIORITY): New implementation. (TARGET_GENERATE_VERSION_DISPATCHER_BODY): New implementation. (TARGET_GET_FUNCTION_VERSIONS_DISPATCHER): New implementation. (TARGET_MANGLE_DECL_ASSEMBLER_NAME): New implementation. diff --git a/gcc/attribs.cc b/gcc/attribs.cc index a3c4a81e8582ea4fd06b9518bf51fad7c998ddd6..cc935b502028392ebdc105f940900f01f79196a7 100644 --- a/gcc/attribs.cc +++ b/gcc/attribs.cc @@ -657,7 +657,8 @@ decl_attributes (tree *node, tree attributes, int flags, options to the attribute((target(...))) list. 
*/ if (TREE_CODE (*node) == FUNCTION_DECL && current_target_pragma - && targetm.target_option.valid_attribute_p (*node, NULL_TREE, + && targetm.target_option.valid_attribute_p (*node, + get_identifier("target"), current_target_pragma, 0)) { tree cur_attr = lookup_attribute ("target", attributes); diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 9c3c0e705e2e6ea3b55b4a5f1e7d3360f91eb51d..ca0e2a2507ffdbf99e17b77240504bf2d175b9c0 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -19088,11 +19088,70 @@ aarch64_process_target_attr (tree args) return true; } +/* Parse the tree in ARGS that contains the targeti_version attribute + information and update the global target options space. */ + +bool +aarch64_process_target_version_attr (tree args) +{ + if (TREE_CODE (args) == TREE_LIST) +{ + if (TREE_CHAIN (args)) + { + error ("attribute % has multiple values"); + return false; + } + args = TREE_VALUE (args); +} + + if (!args || TREE_CODE (args) != STRING_CST) +{ + error ("attribute % argument not a string"); + return false; +} + + const char *str = TREE_STRING_POINTER (args); + if (strcmp (str, "default") == 0) +return true; + + auto with_plus = std::string ("+") + str; + enum aarch_parse_opt_result parse_res; + auto isa_flags
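For comparison with the target_version example in 1/3, a target_clones usage sketch (again my own illustration; the feature names follow the ACLE document linked above, and "default" is spelled out explicitly because, as noted, an implicit default clone is not yet supported):

/* One definition, cloned per feature set; the generated ifunc resolver
   consults __aarch64_cpu_features (initialised via the libgcc hooks
   described above) to pick a clone at load time.  */
__attribute__ ((target_clones ("default", "dotprod", "sve")))
int dot (const signed char *a, const signed char *b, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += a[i] * b[i];
  return s;
}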
[3/3] WIP/RFC: Fix name mangling for target_clones
This is a partial patch to make the mangling of function version names for target_clones match those generated using the target or target_version attributes. It modifies the name of function versions, but does not yet rename the resolved symbol, resulting in a duplicate symbol name (and an error at assembly time). Is this sort of approach ok? Should I create an extra target hook to be called here, so that the target_clones mangling can be target-specific but not necessarily the same as for target attribute versioning? diff --git a/gcc/cgraphclones.cc b/gcc/cgraphclones.cc index 8af6b23d8c0306920e0fdcb3559ef047a16689f4..15672c02c6f9d6043a36bf081067f08d1ab834e5 100644 --- a/gcc/cgraphclones.cc +++ b/gcc/cgraphclones.cc @@ -1033,11 +1033,6 @@ cgraph_node::create_version_clone_with_body else new_decl = copy_node (old_decl); - /* Generate a new name for the new version. */ - tree fnname = (version_decl ? clone_function_name_numbered (old_decl, suffix) - : clone_function_name (old_decl, suffix)); - DECL_NAME (new_decl) = fnname; - SET_DECL_ASSEMBLER_NAME (new_decl, fnname); SET_DECL_RTL (new_decl, NULL); DECL_VIRTUAL_P (new_decl) = 0; @@ -1065,6 +1060,24 @@ cgraph_node::create_version_clone_with_body return NULL; } + /* Generate a new name for the new version. */ + if (version_decl) +{ + tree fnname = (clone_function_name_numbered (old_decl, suffix)); + DECL_NAME (new_decl) = fnname; + SET_DECL_ASSEMBLER_NAME (new_decl, fnname); +} + else +{ + /* Add target version mangling. We assume that the target hook will +produce the same mangled name as it would have produced if the decl +had already been versioned when the hook was previously called. */ + tree fnname = DECL_ASSEMBLER_NAME (old_decl); + DECL_NAME (new_decl) = fnname; + fnname = targetm.mangle_decl_assembler_name (new_decl, fnname); + SET_DECL_ASSEMBLER_NAME (new_decl, fnname); +} + /* When the old decl was a con-/destructor make sure the clone isn't. */ DECL_STATIC_CONSTRUCTOR (new_decl) = 0; DECL_STATIC_DESTRUCTOR (new_decl) = 0; diff --git a/gcc/multiple_target.cc b/gcc/multiple_target.cc index 3db57c2b13d612a37240d9dcf58ad21b2286633c..d9aec9a5ab532701b4a1877b440f3a553ffa28e2 100644 --- a/gcc/multiple_target.cc +++ b/gcc/multiple_target.cc @@ -162,7 +162,12 @@ create_dispatcher_calls (struct cgraph_node *node) } } - tree fname = clone_function_name (node->decl, "default"); + /* Add version mangling to default decl name. We assume that the target + hook will produce the same mangled name as it would have produced if the + decl had already been versioned when the hook was previously called. */ + tree fname = DECL_ASSEMBLER_NAME (node->decl); + DECL_NAME (node->decl) = fname; + fname = targetm.mangle_decl_assembler_name (node->decl, fname); symtab->change_decl_assembler_name (node->decl, fname); if (node->definition)
[COMMITTED] Fix expansion of `(a & 2) != 1`
I had a thinko in r14-1600-ge60593f3881c72a96a3fa4844d73e8a2cd14f670 where we would remove the `& CST` part if we ended up not calling expand_single_bit_test. This fixes the problem by introducing a new variable that will be used for calling expand_single_bit_test. As far as I know this can only show up when disabling optimization passes, since the above form would otherwise have been optimized away. Committed as obvious after a bootstrap/test on x86_64-linux-gnu. PR middle-end/111863 gcc/ChangeLog: * expr.cc (do_store_flag): Don't overwrite arg0 when stripping off `& POW2`. gcc/testsuite/ChangeLog: * gcc.c-torture/execute/pr111863-1.c: New test. --- gcc/expr.cc | 9 + gcc/testsuite/gcc.c-torture/execute/pr111863-1.c | 16 2 files changed, 21 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr111863-1.c diff --git a/gcc/expr.cc b/gcc/expr.cc index 8aed3fc6cbe..763bd82c59f 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -13206,14 +13206,15 @@ do_store_flag (sepops ops, rtx target, machine_mode mode) || integer_pow2p (arg1)) && (TYPE_PRECISION (ops->type) != 1 || TYPE_UNSIGNED (ops->type))) { - wide_int nz = tree_nonzero_bits (arg0); - gimple *srcstmt = get_def_for_expr (arg0, BIT_AND_EXPR); + tree narg0 = arg0; + wide_int nz = tree_nonzero_bits (narg0); + gimple *srcstmt = get_def_for_expr (narg0, BIT_AND_EXPR); /* If the defining statement was (x & POW2), then use that instead of the non-zero bits. */ if (srcstmt && integer_pow2p (gimple_assign_rhs2 (srcstmt))) { nz = wi::to_wide (gimple_assign_rhs2 (srcstmt)); - arg0 = gimple_assign_rhs1 (srcstmt); + narg0 = gimple_assign_rhs1 (srcstmt); } if (wi::popcount (nz) == 1 @@ -13227,7 +13228,7 @@ do_store_flag (sepops ops, rtx target, machine_mode mode) type = lang_hooks.types.type_for_mode (mode, unsignedp); return expand_single_bit_test (loc, tcode, -arg0, +narg0, bitnum, type, target, mode); } } diff --git a/gcc/testsuite/gcc.c-torture/execute/pr111863-1.c b/gcc/testsuite/gcc.c-torture/execute/pr111863-1.c new file mode 100644 index 000..4e27fe631b2 --- /dev/null +++ b/gcc/testsuite/gcc.c-torture/execute/pr111863-1.c @@ -0,0 +1,16 @@ +/* { dg-options " -fno-tree-ccp -fno-tree-dominator-opts -fno-tree-vrp" } */ + +__attribute__((noipa)) +int f(int a) +{ +a &= 2; +return a != 1; +} +int main(void) +{ +int t = f(1); +if (!t) +__builtin_abort(); +__builtin_printf("%d\n",t); +return 0; +} -- 2.39.3
[PATCH] aarch64: [PR110986] Emit csinv again for `a ? ~b : b`
After r14-3110-g7fb65f10285, the canonical form for `a ? ~b : b` changed to be `-(a) ^ b` that means for aarch64 we need to add a few new insn patterns to be able to catch this and change it to be what is the canonical form for the aarch64 backend. A secondary pattern was needed to support a zero_extended form too; this adds a testcase for all 3 cases. Bootstrapped and tested on aarch64-linux-gnu with no regressions. PR target/110986 gcc/ChangeLog: * config/aarch64/aarch64.md (*cmov_insn_insv): New pattern. (*cmov_uxtw_insn_insv): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/cond_op-1.c: New test. --- gcc/config/aarch64/aarch64.md| 46 gcc/testsuite/gcc.target/aarch64/cond_op-1.c | 20 + 2 files changed, 66 insertions(+) create mode 100644 gcc/testsuite/gcc.target/aarch64/cond_op-1.c diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 32c7adc8928..59cd0415937 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -4413,6 +4413,52 @@ (define_insn "*csinv3_uxtw_insn3" [(set_attr "type" "csel")] ) +;; There are two canonical forms for `cmp ? ~a : a`. +;; This is the second form and is here to help combine. +;; Support `-(cmp) ^ a` into `cmp ? ~a : a` +;; The second pattern is to support the zero extend'ed version. + +(define_insn_and_split "*cmov_insn_insv" + [(set (match_operand:GPI 0 "register_operand" "=r") +(xor:GPI +(neg:GPI + (match_operator:GPI 1 "aarch64_comparison_operator" + [(match_operand 2 "cc_register" "") (const_int 0)])) +(match_operand:GPI 3 "general_operand" "r")))] + "can_create_pseudo_p ()" + "#" + "&& true" + [(set (match_dup 0) + (if_then_else:GPI (match_dup 1) + (not:GPI (match_dup 3)) + (match_dup 3)))] + { +operands[3] = force_reg (mode, operands[3]); + } + [(set_attr "type" "csel")] +) + +(define_insn_and_split "*cmov_uxtw_insn_insv" + [(set (match_operand:DI 0 "register_operand" "=r") +(zero_extend:DI +(xor:SI + (neg:SI + (match_operator:SI 1 "aarch64_comparison_operator" + [(match_operand 2 "cc_register" "") (const_int 0)])) + (match_operand:SI 3 "general_operand" "r"] + "can_create_pseudo_p ()" + "#" + "&& true" + [(set (match_dup 0) + (if_then_else:DI (match_dup 1) + (zero_extend:DI (not:SI (match_dup 3))) + (zero_extend:DI (match_dup 3] + { +operands[3] = force_reg (SImode, operands[3]); + } + [(set_attr "type" "csel")] +) + ;; If X can be loaded by a single CNT[BHWD] instruction, ;; ;;A = UMAX (B, X) diff --git a/gcc/testsuite/gcc.target/aarch64/cond_op-1.c b/gcc/testsuite/gcc.target/aarch64/cond_op-1.c new file mode 100644 index 000..e6c7821127e --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/cond_op-1.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ +/* PR target/110986 */ + + +long long full(unsigned a, unsigned b) +{ + return a ? ~b : b; +} +unsigned fuu(unsigned a, unsigned b) +{ + return a ? ~b : b; +} +long long f(unsigned long long a, unsigned long long b) +{ + return a ? ~b : b; +} + +/* { dg-final { scan-assembler-times "csinv\tw\[0-9\]*" 2 } } */ +/* { dg-final { scan-assembler-times "csinv\tx\[0-9\]*" 1 } } */ -- 2.39.3
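A short check of the equivalence the split relies on (my own sketch, not part of the patch): for a condition c that is 0 or 1, -(c) is either all zeros or all ones, so xor'ing it with b yields b or ~b respectively, which is exactly the select that csinv performs:

/* Sketch: the canonical form after r14-3110 versus the conditional
   select the new patterns turn it back into.  */
int forms_agree (unsigned a, unsigned b)
{
  unsigned c = (a != 0);            /* one-bit condition             */
  unsigned via_xor = (0u - c) ^ b;  /* -(a != 0) ^ b, canonical form */
  unsigned via_sel = c ? ~b : b;    /* what csinv implements         */
  return via_xor == via_sel;        /* always 1                      */
}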
[committed] amdgcn: deprecate Fiji device and multilib
The build has been failing for the last few days because LLVM removed support for the HSACOv3 binary metadata format, which we were still using for the Fiji multilib. The LLVM commit has now been reverted (thank you Pierre van Houtryve), but it's only a temporary repreive. This patch removes Fiji from the default configuration, and updates the documentation accordingly, but no more. Those that still use Fiji devices can re-enable it by configuring using --with-arch=fiji. Why not remove Fiji support entirely? This is simply because about one third of our test farm conists of Fiji devices and we can't replace them quickly. Andrewamdgcn: deprecate Fiji device and multilib LLVM wants to remove it, which breaks our build. This patch means that most users won't notice that change, when it comes, and those that do will have chosen to enable Fiji explicitly. I'm selecting gfx900 as the new default as that's the least likely for users to want, which means most users will specify -march explicitly, which means we'll be free to change the default again, when we need to, without breaking anybody's makefiles. gcc/ChangeLog: * config.gcc (amdgcn): Switch default to --with-arch=gfx900. Implement support for --with-multilib-list. * config/gcn/t-gcn-hsa: Likewise. * doc/install.texi: Likewise. * doc/invoke.texi: Mark Fiji deprecated. diff --git a/gcc/config.gcc b/gcc/config.gcc index 37311fcd075..9c397156868 100644 --- a/gcc/config.gcc +++ b/gcc/config.gcc @@ -4538,7 +4538,19 @@ case "${target}" in ;; esac done - [ "x$with_arch" = x ] && with_arch=fiji + [ "x$with_arch" = x ] && with_arch=gfx900 + + case "x${with_multilib_list}" in + x | xno) + TM_MULTILIB_CONFIG= + ;; + xdefault | xyes) + TM_MULTILIB_CONFIG=`echo "gfx900,gfx906,gfx908,gfx90a" | sed "s/${with_arch},\?//;s/,$//"` + ;; + *) + TM_MULTILIB_CONFIG="${with_multilib_list}" + ;; + esac ;; hppa*-*-*) diff --git a/gcc/config/gcn/t-gcn-hsa b/gcc/config/gcn/t-gcn-hsa index ea27122e484..18db7075356 100644 --- a/gcc/config/gcn/t-gcn-hsa +++ b/gcc/config/gcn/t-gcn-hsa @@ -42,8 +42,12 @@ ALL_HOST_OBJS += gcn-run.o gcn-run$(exeext): gcn-run.o +$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ $< -ldl -MULTILIB_OPTIONS = march=gfx900/march=gfx906/march=gfx908/march=gfx90a -MULTILIB_DIRNAMES = gfx900 gfx906 gfx908 gfx90a +empty := +space := $(empty) $(empty) +comma := , +multilib_list := $(subst $(comma),$(space),$(TM_MULTILIB_CONFIG)) +MULTILIB_OPTIONS = $(subst $(space),/,$(addprefix march=,$(multilib_list))) +MULTILIB_DIRNAMES = $(multilib_list) gcn-tree.o: $(srcdir)/config/gcn/gcn-tree.cc $(COMPILE) $< diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi index 31f2234640f..4035e8020b2 100644 --- a/gcc/doc/install.texi +++ b/gcc/doc/install.texi @@ -1236,8 +1236,8 @@ sysv, aix. @itemx --without-multilib-list Specify what multilibs to build. @var{list} is a comma separated list of values, possibly consisting of a single value. Currently only implemented -for aarch64*-*-*, arm*-*-*, loongarch*-*-*, riscv*-*-*, sh*-*-* and -x86-64-*-linux*. The accepted values and meaning for each target is given +for aarch64*-*-*, amdgcn*-*-*, arm*-*-*, loongarch*-*-*, riscv*-*-*, sh*-*-* +and x86-64-*-linux*. The accepted values and meaning for each target is given below. @table @code @@ -1250,6 +1250,15 @@ default run-time library will be built. If @var{list} is default set of libraries is selected based on the value of @option{--target}. 
+@item amdgcn*-*-* +@var{list} is a comma separated list of ISA names (allowed values: @code{fiji}, +@code{gfx900}, @code{gfx906}, @code{gfx908}, @code{gfx90a}). It ought not +include the name of the default ISA, specified via @option{--with-arch}. If +@var{list} is empty, then there will be no multilibs and only the default +run-time library will be built. If @var{list} is @code{default} or +@option{--with-multilib-list=} is not specified, then the default set of +libraries is selected. + @item arm*-*-* @var{list} is a comma separated list of @code{aprofile} and @code{rmprofile} to build multilibs for A or R and M architecture @@ -3922,6 +3931,12 @@ To run the binaries, install the HSA Runtime from the @file{libexec/gcc/amdhsa-amdhsa/@var{version}/gcn-run} to launch them on the GPU. +To enable support for GCN3 Fiji devices (gfx803), GCC has to be configured with +@option{--with-arch=@code{fiji}} or +@option{--with-multilib-list=@code{fiji},...}. Note that support for Fiji +devices has been removed in ROCm 4.0 and support in LLVM is deprecated and will +be removed in the future. + @html @end html diff --git a/gcc/doc/i
[PATCH] wwwdocs: gcc-14: mark amdgcn fiji deprecated
OK to commit? Andrew gcc-14: mark amdgcn fiji deprecated diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html index c817dde4..91ab8132 100644 --- a/htdocs/gcc-14/changes.html +++ b/htdocs/gcc-14/changes.html @@ -178,6 +178,16 @@ a work-in-progress. +AMD Radeon (GCN) + + + The Fiji device support is now deprecated and will be removed from a + future release. The default compiler configuration no longer uses Fiji + as the default device, and no longer includes the Fiji libraries. Both + can be restored by configuring with --with-arch=fiji. + The default device architecture is now gfx900 (Vega). + +
[PATCH] c: [PR104822] Don't warn about converting NULL to different sso endian
In a similar way to how we don't warn about NULL pointer constant conversion to a different named address space, we should not warn about conversion to a different sso endianness either. This adds the simple check. Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR c/104822 gcc/c/ChangeLog: * c-typeck.cc (convert_for_assignment): Check for null pointer before warning about an incompatible scalar storage order. gcc/testsuite/ChangeLog: * gcc.dg/sso-18.c: New test. --- gcc/c/c-typeck.cc | 1 + gcc/testsuite/gcc.dg/sso-18.c | 16 2 files changed, 17 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/sso-18.c diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc index 6e044b4afbc..f39dc71d593 100644 --- a/gcc/c/c-typeck.cc +++ b/gcc/c/c-typeck.cc @@ -7449,6 +7449,7 @@ convert_for_assignment (location_t location, location_t expr_loc, tree type, /* See if the pointers point to incompatible scalar storage orders. */ if (warn_scalar_storage_order + && !null_pointer_constant_p (rhs) && (AGGREGATE_TYPE_P (ttl) && TYPE_REVERSE_STORAGE_ORDER (ttl)) != (AGGREGATE_TYPE_P (ttr) && TYPE_REVERSE_STORAGE_ORDER (ttr))) { diff --git a/gcc/testsuite/gcc.dg/sso-18.c b/gcc/testsuite/gcc.dg/sso-18.c new file mode 100644 index 000..799a0c858f2 --- /dev/null +++ b/gcc/testsuite/gcc.dg/sso-18.c @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* PR c/104822 */ + +#include + +struct Sb { + int i; +} __attribute__((scalar_storage_order("big-endian"))); +struct Sl { + int i; +} __attribute__((scalar_storage_order("little-endian"))); + +/* Neither of these should warn about incompatible scalar storage order + as NULL pointers are compatible with either endianness. */ +struct Sb *pb = NULL; /* { dg-bogus "" } */ +struct Sl *pl = NULL; /* { dg-bogus "" } */ -- 2.39.3
[PATCH] c: [PR100532] Fix ICE when an argument was an error mark
In the case of convert_argument, we would return the same expression back rather than error_mark_node after the error message about trying to convert to an incomplete type. This causes issues in the gimplifier trying to see if another conversion is needed. The code here dates back to before the revision history, so it may be that nobody ever noticed we should return error_mark_node. Bootstrapped and tested on x86_64-linux-gnu with no regressions. PR c/100532 gcc/c/ChangeLog: * c-typeck.cc (convert_argument): After erroring out about an incomplete type return error_mark_node. gcc/testsuite/ChangeLog: * gcc.dg/pr100532-1.c: New test. --- gcc/c/c-typeck.cc | 2 +- gcc/testsuite/gcc.dg/pr100532-1.c | 7 +++ 2 files changed, 8 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.dg/pr100532-1.c diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc index 6e044b4afbc..8f8562936dc 100644 --- a/gcc/c/c-typeck.cc +++ b/gcc/c/c-typeck.cc @@ -3367,7 +3367,7 @@ convert_argument (location_t ploc, tree function, tree fundecl, { error_at (ploc, "type of formal parameter %d is incomplete", parmnum + 1); - return val; + return error_mark_node; } /* Optionally warn about conversions that differ from the default diff --git a/gcc/testsuite/gcc.dg/pr100532-1.c b/gcc/testsuite/gcc.dg/pr100532-1.c new file mode 100644 index 000..81e37c60415 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr100532-1.c @@ -0,0 +1,7 @@ +/* { dg-do compile } */ +/* PR c/100532 */ + +typedef __SIZE_TYPE__ size_t; +void *memcpy(void[], const void *, size_t); /* { dg-error "declaration of type name" } */ +void c(void) { memcpy(c, "a", 2); } /* { dg-error "type of formal parameter" } */ + -- 2.34.1
Re: [1/3] Add support for target_version attribute
On Thu, Oct 19, 2023 at 07:04:09AM +, Richard Biener wrote: > On Wed, 18 Oct 2023, Andrew Carlotti wrote: > > > This patch adds support for the "target_version" attribute to the middle > > end and the C++ frontend, which will be used to implement function > > multiversioning in the aarch64 backend. > > > > Note that C++ is currently the only frontend which supports > > multiversioning using the "target" attribute, whereas the > > "target_clones" attribute is additionally supported in C, D and Ada. > > Support for the target_version attribute will be extended to C at a > > later date. > > > > Targets that currently use the "target" attribute for function > > multiversioning (i.e. i386 and rs6000) are not affected by this patch. > > > > > > I could have implemented the target hooks slightly differently, by reusing > > the > > valid_attribute_p hook and adding attribute name checks to each backend > > implementation (c.f. the aarch64 implementation in patch 2/3). Would this > > be > > preferable? > > > > Otherwise, is this ok for master? > > This lacks user-level documentation in doc/extend.texi (where > target_clones is documented). Good point. I'll add documentation updates as a separate patch in the series (rather than documenting the state after this patch, in which the attribute is supported on zero targets). I think the existing documentation for target and target_clones needs some improvement as well. > Was there any discussion/description of why target_clones cannot > be made work for aarch64? > > Richard. The second patch in this series does include support for target_clones on aarch64. However, the support in that patch is not fully compliant with our ACLE specification. I also have some unresolved questions about the correctness of current function multiversioning implementations using ifuncs across translation units, which could affect how we want to implement it for aarch64. Andrew > > > > gcc/c-family/ChangeLog: > > > > * c-attribs.cc (handle_target_version_attribute): New. > > (c_common_attribute_table): Add target_version. > > (handle_target_clones_attribute): Add conflict with > > target_version attribute. > > > > gcc/ChangeLog: > > > > * attribs.cc (is_function_default_version): Update comment to > > specify incompatibility with target_version attributes. > > * cgraphclones.cc (cgraph_node::create_version_clone_with_body): > > Call valid_version_attribute_p for target_version attributes. > > * target.def (valid_version_attribute_p): New hook. > > (expanded_clones_attribute): New hook. > > * doc/tm.texi.in: Add new hooks. > > * doc/tm.texi: Regenerate. > > * multiple_target.cc (create_dispatcher_calls): Remove redundant > > is_function_default_version check. > > (expand_target_clones): Use target hook for attribute name. > > * targhooks.cc (default_target_option_valid_version_attribute_p): > > New. > > * targhooks.h (default_target_option_valid_version_attribute_p): > > New. > > * tree.h (DECL_FUNCTION_VERSIONED): Update comment to include > > target_version attributes. > > > > gcc/cp/ChangeLog: > > > > * decl2.cc (check_classfn): Update comment to include > > target_version attributes. 
> > > > > > diff --git a/gcc/attribs.cc b/gcc/attribs.cc > > index > > b1300018d1e8ed8e02ded1ea721dc192a6d32a49..a3c4a81e8582ea4fd06b9518bf51fad7c998ddd6 > > 100644 > > --- a/gcc/attribs.cc > > +++ b/gcc/attribs.cc > > @@ -1233,8 +1233,9 @@ make_dispatcher_decl (const tree decl) > >return func_decl; > > } > > > > -/* Returns true if decl is multi-versioned and DECL is the default > > function, > > - that is it is not tagged with target specific optimization. */ > > +/* Returns true if DECL is multi-versioned using the target attribute, and > > this > > + is the default version. This function can only be used for targets > > that do > > + not support the "target_version" attribute. */ > > > > bool > > is_function_default_version (const tree decl) > > diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc > > index > > 072cfb69147bd6b314459c0bd48a0c1fb92d3e4d..1a224c036277d51ab4dc0d33a403177bd226e48a > > 100644 > > --- a/gcc/c-family/c-attribs.cc > > +++ b/gcc/c-family/c-attribs.cc > > @@ -148,6 +148,7 @@ static tree handle_alloc_align_attribute (tree *, tree,
Re: [PATCH] [ARC] Add support for HS4x cpus.
* Claudiu Zissulescu [2018-06-13 12:09:18 +0300]: > From: Claudiu Zissulescu > > This patch adds support for two ARCHS variations. > > Ok to apply? > Claudiu Sorry for the delay, this looks fine. Thanks, Andrew > > gcc/ > 2017-03-10 Claudiu Zissulescu > > * config/arc/arc-arch.h (arc_tune_attr): Add new tune parameters > for ARCHS4x. > * config/arc/arc-cpus.def (hs4x): New cpu. > (hs4xd): Likewise. > * config/arc/arc-tables.opt: Regenerate. > * config/arc/arc.c (arc_sched_issue_rate): New function. > (TARGET_SCHED_ISSUE_RATE): Define. > (TARGET_SCHED_EXPOSED_PIPELINE): Likewise. > * config/arc/arc.md (attr type): Add fpu_fuse, fpu_sdiv, fpu_ddiv, > fpu_cvt. > (attr tune): Add ARCHS4x tune values. > (attr tune_dspmpy): Define. > (*tst): Correct instruction type. > * config/arc/arcHS.md: Don't use this automaton for ARCHS4x cpus. > * config/arc/arcHS4x.md: New file. > * config/arc/fpu.md: Update instruction type attributes. > * config/arc/t-multilib: Regenerate. > --- > gcc/config/arc/arc-arch.h | 5 +- > gcc/config/arc/arc-cpus.def | 8 +- > gcc/config/arc/arc-tables.opt | 6 + > gcc/config/arc/arc.c | 19 +++ > gcc/config/arc/arc.md | 24 +++- > gcc/config/arc/arcHS.md | 6 + > gcc/config/arc/arcHS4x.md | 221 ++ > gcc/config/arc/fpu.md | 16 +-- > 8 files changed, 289 insertions(+), 16 deletions(-) > create mode 100644 gcc/config/arc/arcHS4x.md > > diff --git a/gcc/config/arc/arc-arch.h b/gcc/config/arc/arc-arch.h > index 64866dd529b..01f95946623 100644 > --- a/gcc/config/arc/arc-arch.h > +++ b/gcc/config/arc/arc-arch.h > @@ -73,7 +73,10 @@ enum arc_tune_attr > ARC_TUNE_ARC600, > ARC_TUNE_ARC700_4_2_STD, > ARC_TUNE_ARC700_4_2_XMAC, > -ARC_TUNE_CORE_3 > +ARC_TUNE_CORE_3, > +ARC_TUNE_ARCHS4X, > +ARC_TUNE_ARCHS4XD, > +ARC_TUNE_ARCHS4XD_SLOW >}; > > /* CPU specific properties. 
*/ > diff --git a/gcc/config/arc/arc-cpus.def b/gcc/config/arc/arc-cpus.def > index 1fce81f6933..4aa422f1a39 100644 > --- a/gcc/config/arc/arc-cpus.def > +++ b/gcc/config/arc/arc-cpus.def > @@ -59,10 +59,12 @@ ARC_CPU (archs,hs, FL_MPYOPT_2|FL_DIVREM|FL_LL64, > NONE) > ARC_CPU (hs34,hs, FL_MPYOPT_2, NONE) > ARC_CPU (hs38,hs, FL_MPYOPT_9|FL_DIVREM|FL_LL64, NONE) > ARC_CPU (hs38_linux, hs, FL_MPYOPT_9|FL_DIVREM|FL_LL64|FL_FPU_FPUD_ALL, NONE) > +ARC_CPU (hs4x, hs, FL_MPYOPT_9|FL_DIVREM|FL_LL64, ARCHS4X) > +ARC_CPU (hs4xd, hs, FL_MPYOPT_9|FL_DIVREM|FL_LL64, ARCHS4XD) > > -ARC_CPU (arc600, 6xx, FL_BS, ARC600) > -ARC_CPU (arc600_norm, 6xx, FL_BS|FL_NORM, ARC600) > -ARC_CPU (arc600_mul64, 6xx, FL_BS|FL_NORM|FL_MUL64, ARC600) > +ARC_CPU (arc600, 6xx, FL_BS, ARC600) > +ARC_CPU (arc600_norm, 6xx, FL_BS|FL_NORM, ARC600) > +ARC_CPU (arc600_mul64,6xx, FL_BS|FL_NORM|FL_MUL64, ARC600) > ARC_CPU (arc600_mul32x16, 6xx, FL_BS|FL_NORM|FL_MUL32x16, ARC600) > ARC_CPU (arc601, 6xx, 0, ARC600) > ARC_CPU (arc601_norm, 6xx, FL_NORM, ARC600) > diff --git a/gcc/config/arc/arc-tables.opt b/gcc/config/arc/arc-tables.opt > index 3b17b3de7d5..2afaf5bd83c 100644 > --- a/gcc/config/arc/arc-tables.opt > +++ b/gcc/config/arc/arc-tables.opt > @@ -63,6 +63,12 @@ Enum(processor_type) String(hs38) Value(PROCESSOR_hs38) > EnumValue > Enum(processor_type) String(hs38_linux) Value(PROCESSOR_hs38_linux) > > +EnumValue > +Enum(processor_type) String(hs4x) Value(PROCESSOR_hs4x) > + > +EnumValue > +Enum(processor_type) String(hs4xd) Value(PROCESSOR_hs4xd) > + > EnumValue > Enum(processor_type) String(arc600) Value(PROCESSOR_arc600) > > diff --git a/gcc/config/arc/arc.c b/gcc/config/arc/arc.c > index 2bedc9af37e..03a2f4223c0 100644 > --- a/gcc/config/arc/arc.c > +++ b/gcc/config/arc/arc.c > @@ -483,6 +483,22 @@ arc_autovectorize_vector_sizes (vector_sizes *sizes) > } > } > > + > +/* Implements target hook TARGET_SCHED_ISSUE_RATE. */ > +static int > +arc_sched_issue_rate (void) > +{ > + switch (arc_tune) > +{ > +case TUNE_ARCHS4X: > +case TUNE_ARCHS4XD: > + return 3; > +default: > + break; > +} > + return 1; > +} > + > /* TARGET_PRESERVE_RELOAD_P is still awaiting patch re-evaluation / review. > */ > static bool arc_preserve_reload_p (rtx in) ATTRIBUTE_UNUSED; > static rtx arc_delegitimize_address (rtx); > @@ -565,6 +581,9 @@ static rtx arc_legitimize_address_0 (rtx, rtx, > machine_mode mode); > #undef TARGET_SCHE
Re: [PATCH, GCC, AARCH64] Add support for +profile extension
On Mon, Jul 9, 2018 at 6:21 AM Andre Vieira (lists) wrote: > > Hi, > > This patch adds support for the Statistical Profiling Extension (SPE) on > AArch64. Even though the compiler will not generate code any differently > given this extension, it will need to pass it on to the assembler in > order to let it correctly assemble inline asm containing accesses to the > extension's system registers. The same applies when using the > preprocessor on an assembly file as this first must pass through cc1. > > I left the hwcaps string for SPE empty as the kernel does not define a > feature string for this extension. The current effect of this is that > driver will disable profile feature bit in GCC. This is OK though > because we don't, nor do we ever, enable this feature bit, as codegen is > not affect by the SPE support and more importantly the driver will still > pass the extension down to the assembler regardless. > > Boostrapped aarch64-none-linux-gnu and ran regression tests. > > Is it OK for trunk? I use a similar patch for the last year and half. Thanks, Andrew > > gcc/ChangeLog: > 2018-07-09 Andre Vieira > > * config/aarch64/aarch64-option-extensions.def: New entry for profile > extension. > * config/aarch64/aarch64.h (AARCH64_FL_PROFILE): New. > * doc/invoke.texi (aarch64-feature-modifiers): New entry for profile > extension. > > gcc/testsuite/ChangeLog: > 2018-07-09 Andre Vieira > > * gcc.target/aarch64/profile.c: New test.
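To make the use case concrete, a sketch of the kind of inline asm this enables (my own example, not from the patch; the register name comes from the Armv8.2 SPE specification and the -march string is only illustrative):

/* Compile with e.g. -march=armv8.2-a+profile so cc1/gas accept the SPE
   system register used in the asm; code generation is otherwise
   unchanged by the extension.  */
static inline unsigned long read_spe_buffer_ptr (void)
{
  unsigned long val;
  __asm__ volatile ("mrs %0, pmbptr_el1" : "=r" (val));
  return val;
}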
Re: [RFC] Fix recent popcount change is breaking
On Tue, Jul 10, 2018 at 6:14 PM Kugan Vivekanandarajah wrote: > > On 10 July 2018 at 23:17, Richard Biener wrote: > > On Tue, Jul 10, 2018 at 3:06 PM Kugan Vivekanandarajah > > wrote: > >> > >> Hi, > >> > >> Jeff told me that the recent popcount built-in detection is causing > >> kernel build issues as > >> ERROR: "__popcountsi2" > >> [drivers/net/wireless/broadcom/brcm80211/brcmfmac/brcmfmac.ko] undefined! > >> > >> I could also reproduce this. AFIK, we should check if the libfunc is > >> defined while checking popcount? > >> > >> I am testing the attached RFC patch. Is this reasonable? > > > > It doesn't work that way, all targets have this libfunc in libgcc. This > > means > > the kernel has to provide it. The only thing you could do is restrict > > replacement of CALL_EXPRs (in SCEV cprop) to those the target > > natively supports. > > How about restricting it in expression_expensive_p ? Is that what you > wanted. Attached patch does this. > Bootstrap and regression testing progressing. Seems like that should go into is_inexpensive_builtin instead which is just tested right below. Thanks, Andrew > > Thanks, > Kugan > > > > > Richard. > > > >> Thanks, > >> Kugan > >> > >> gcc/ChangeLog: > >> > >> 2018-07-10 Kugan Vivekanandarajah > >> > >> * tree-ssa-loop-niter.c (number_of_iterations_popcount): Check > >> if libfunc for popcount is available.
Re: [RFC] Fix recent popcount change is breaking
On Tue, Jul 10, 2018 at 6:35 PM Kugan Vivekanandarajah wrote: > > Hi Andrew, > > On 11 July 2018 at 11:19, Andrew Pinski wrote: > > On Tue, Jul 10, 2018 at 6:14 PM Kugan Vivekanandarajah > > wrote: > >> > >> On 10 July 2018 at 23:17, Richard Biener > >> wrote: > >> > On Tue, Jul 10, 2018 at 3:06 PM Kugan Vivekanandarajah > >> > wrote: > >> >> > >> >> Hi, > >> >> > >> >> Jeff told me that the recent popcount built-in detection is causing > >> >> kernel build issues as > >> >> ERROR: "__popcountsi2" > >> >> [drivers/net/wireless/broadcom/brcm80211/brcmfmac/brcmfmac.ko] > >> >> undefined! > >> >> > >> >> I could also reproduce this. AFIK, we should check if the libfunc is > >> >> defined while checking popcount? > >> >> > >> >> I am testing the attached RFC patch. Is this reasonable? > >> > > >> > It doesn't work that way, all targets have this libfunc in libgcc. This > >> > means > >> > the kernel has to provide it. The only thing you could do is restrict > >> > replacement of CALL_EXPRs (in SCEV cprop) to those the target > >> > natively supports. > >> > >> How about restricting it in expression_expensive_p ? Is that what you > >> wanted. Attached patch does this. > >> Bootstrap and regression testing progressing. > > > > Seems like that should go into is_inexpensive_builtin instead which > > is just tested right below. > > I hought about that. is_inexpensive_builtin is used in various other > places including some inlining decision so wasn't sure if it is the > right thing. Happy to change it if that is the right thing to do. I audited all of the users (and their users if it is used in a wrapper) and found that is_inexpensive_builtin should return false for this builtin if it is a function call in the end; there are other builtins which should be checked the similar way but I think we should not going to force you to do the similar thing for those builtins. Thanks, Andrew > > Thanks, > Kugan > > > > Thanks, > > Andrew > > > >> > >> Thanks, > >> Kugan > >> > >> > > >> > Richard. > >> > > >> >> Thanks, > >> >> Kugan > >> >> > >> >> gcc/ChangeLog: > >> >> > >> >> 2018-07-10 Kugan Vivekanandarajah > >> >> > >> >> * tree-ssa-loop-niter.c (number_of_iterations_popcount): Check > >> >> if libfunc for popcount is available.
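For context, the kind of source that triggers the problem (my own reduction, not from the thread): the new niter analysis recognises the classic bit-clearing loop and rewrites it as __builtin_popcount, which on targets without a native popcount instruction expands to a __popcountsi2 libcall, the very symbol the kernel modules are missing:

/* Sketch of the idiom number_of_iterations_popcount detects.  */
int count_bits (unsigned int x)
{
  int n = 0;
  while (x)
    {
      x &= x - 1;   /* clears the lowest set bit each iteration */
      n++;
    }
  return n;
}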
Re: [PATCH 1/4] [ARC] Add more additional register names
All the patches in this series look fine. Thanks, Andrew * Claudiu Zissulescu [2018-07-16 15:29:42 +0300]: > From: claziss > > gcc/ > 2017-06-14 Claudiu Zissulescu > > * config/arc/arc.h (ADDITIONAL_REGISTER_NAMES): Add additional > register names. > --- > gcc/config/arc/arc.h | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/gcc/config/arc/arc.h b/gcc/config/arc/arc.h > index 1780034aabe..3648314eaca 100644 > --- a/gcc/config/arc/arc.h > +++ b/gcc/config/arc/arc.h > @@ -1215,7 +1215,15 @@ extern char rname56[], rname57[], rname58[], rname59[]; > {\ >{"ilink", 29},\ >{"r29",29},\ > - {"r30",30} \ > + {"r30",30},\ > + {"r40",40},\ > + {"r41",41},\ > + {"r42",42},\ > + {"r43",43},\ > + {"r56",56},\ > + {"r57",57},\ > + {"r58",58},\ > + {"r59",59} \ > } > > /* Entry to the insn conditionalizer. */ > -- > 2.17.1 >
Re: [PATCH][AARCH64] PR target/84521 Fix frame pointer corruption with -fomit-frame-pointer with __builtin_setjmp
On Tue, Jul 31, 2018 at 2:43 PM James Greenhalgh wrote: > > On Thu, Jul 12, 2018 at 12:01:09PM -0500, Sudakshina Das wrote: > > Hi Eric > > > > On 27/06/18 12:22, Wilco Dijkstra wrote: > > > Eric Botcazou wrote: > > > > > >>> This test can easily be changed not to use optimize since it doesn't > > >>> look > > >>> like it needs it. We really need to tests these builtins properly, > > >>> otherwise they will continue to fail on most targets. > > >> > > >> As far as I can see PR target/84521 has been reported only for Aarch64 > > >> so I'd > > >> just leave the other targets alone (and avoid propagating FUD if > > >> possible). > > > > > > It's quite obvious from PR84521 that this is an issue affecting all > > > targets. > > > Adding better generic tests for __builtin_setjmp can only be a good thing. > > > > > > Wilco > > > > > > > This conversation seems to have died down and I would like to > > start it again. I would agree with Wilco's suggestion about > > keeping the test in the generic folder. I have removed the > > optimize attribute and the effect is still the same. It passes > > on AArch64 with this patch and it currently fails on x86 > > trunk (gcc version 9.0.0 20180712 (experimental) (GCC)) > > on -O1 and above. > > > I don't see where the FUD comes in here; either this builtin has a defined > semantics across targets and they are adhered to, or the builtin doesn't have > well defined semantics, or the targets fail to implement those semantics. The problem comes from the fact the builtins are not documented at all. See PR59039 for the issue on them not being documented. Thanks, Andrew > > I think this should go in as is. If other targets are unhappy with the > failing test they should fix their target or skip the test if it is not > appropriate. > > You may want to CC some of the maintainers of platforms you know to fail as > a courtesy on the PR (add your testcase, and add failing targets and their > maintainers to that PR) before committing so it doesn't come as a complete > surprise. > > This is OK with some attempt to get target maintainers involved in the > conversation before commit. > > Thanks, > James > > > diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h > > index f284e74..9792d28 100644 > > --- a/gcc/config/aarch64/aarch64.h > > +++ b/gcc/config/aarch64/aarch64.h > > @@ -473,7 +473,9 @@ extern unsigned aarch64_architecture_version; > > #define EH_RETURN_STACKADJ_RTX gen_rtx_REG (Pmode, R4_REGNUM) > > #define EH_RETURN_HANDLER_RTX aarch64_eh_return_handler_rtx () > > > > -/* Don't use __builtin_setjmp until we've defined it. */ > > +/* Don't use __builtin_setjmp until we've defined it. > > + CAUTION: This macro is only used during exception unwinding. > > + Don't fall for its name. */ > > #undef DONT_USE_BUILTIN_SETJMP > > #define DONT_USE_BUILTIN_SETJMP 1 > > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > > index 01f35f8..4266a3d 100644 > > --- a/gcc/config/aarch64/aarch64.c > > +++ b/gcc/config/aarch64/aarch64.c > > @@ -3998,7 +3998,7 @@ static bool > > aarch64_needs_frame_chain (void) > > { > >/* Force a frame chain for EH returns so the return address is at FP+8. > > */ > > - if (frame_pointer_needed || crtl->calls_eh_return) > > + if (frame_pointer_needed || crtl->calls_eh_return || > > cfun->has_nonlocal_label) > > return true; > > > >/* A leaf function cannot have calls or write LR. 
*/ > > @@ -12218,6 +12218,13 @@ aarch64_expand_builtin_va_start (tree valist, rtx > > nextarg ATTRIBUTE_UNUSED) > >expand_expr (t, const0_rtx, VOIDmode, EXPAND_NORMAL); > > } > > > > +/* Implement TARGET_BUILTIN_SETJMP_FRAME_VALUE. */ > > +static rtx > > +aarch64_builtin_setjmp_frame_value (void) > > +{ > > + return hard_frame_pointer_rtx; > > +} > > + > > /* Implement TARGET_GIMPLIFY_VA_ARG_EXPR. */ > > > > static tree > > @@ -17744,6 +17751,9 @@ aarch64_run_selftests (void) > > #undef TARGET_FOLD_BUILTIN > > #define TARGET_FOLD_BUILTIN aarch64_fold_builtin > > > > +#undef TARGET_BUILTIN_SETJMP_FRAME_VALUE > > +#define TARGET_BUILTIN_SETJMP_FRAME_VALUE > > aarch64_builtin_setjmp_frame_value > > + > > #undef TARGET_FUNCTION_ARG > > #define TARGET_FUNCTION_ARG aarc
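Since the thread notes that these builtins are essentially undocumented (PR59039), a minimal usage sketch for reference (my own example, matching what the PR 84521 test exercises): the buffer is an array of five pointer-sized words and the second argument to __builtin_longjmp must be the constant 1.

/* Sketch: non-local jump via the internal setjmp/longjmp builtins.  */
static void *buf[5];

__attribute__ ((noinline)) static void do_jump (void)
{
  __builtin_longjmp (buf, 1);
}

int roundtrip (void)
{
  if (__builtin_setjmp (buf) == 0)
    {
      do_jump ();
      return -1;    /* never reached */
    }
  return 0;         /* reached via the longjmp */
}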
[PATCH] Add COMPLEX_VECTOR_INT modes
Hi all, I want to implement a vector DIVMOD libfunc for amdgcn, but I can't just do it because the GCC middle-end models DIVMOD's return value as "complex int" type, and there are no vector equivalents of that type. Therefore, this patch adds minimal support for "complex vector int" modes. I have not attempted to provide any means to use these modes from C, so they're really only useful for DIVMOD. The actual libfunc implementation will pack the data into wider vector modes manually. A knock-on effect of this is that I needed to increase the range of "mode_unit_size" (several of the vector modes supported by amdgcn exceed the previous 255-byte limit). Since this change would add a large number of new, unused modes to many architectures, I have elected to *not* enable them, by default, in machmode.def (where the other complex modes are created). The new modes are therefore inactive on all architectures but amdgcn, for now. OK for mainline? (I've not done a full test yet, but I will.) Thanks AndrewAdd COMPLEX_VECTOR_INT modes for amdgcn This enables only minimal support for complex types containing integer vectors with the intention of allowing vectorized divmod libfunc operations (these return a pair of integers modelled as a complex number). There's no way to declare variables of this mode in the front-end, and no attempt to support it everywhere that complex modes can exist; the only use-case, at present, is the implicit use by divmod calls generated by the middle-end. In order to prevent unexpected problems with other architectures these modes are only enabled for amdgcn. gcc/ChangeLog: * config/gcn/gcn-modes.def: Initialize COMPLEX_VECTOR_INT modes. * genmodes.cc (complex_class): Support MODE_COMPLEX_VECTOR_INT. (complete_mode): Likewise. (emit_mode_unit_size): Upgrade mode_unit_size type to short. (emit_mode_adjustments): Support MODE_COMPLEX_VECTOR_INT. * machmode.def: Mention MODE_COMPLEX_VECTOR_INT. * machmode.h (mode_to_unit_size): Upgrade type to short. * mode-classes.def: Add MODE_COMPLEX_VECTOR_INT. * stor-layout.cc (int_mode_for_mode): Support MODE_COMPLEX_VECTOR_INT. * tree.cc (build_complex_type): Allow VECTOR_INTEGER_TYPE_P. diff --git a/gcc/config/gcn/gcn-modes.def b/gcc/config/gcn/gcn-modes.def index 1357bec825d..486168fbeb3 100644 --- a/gcc/config/gcn/gcn-modes.def +++ b/gcc/config/gcn/gcn-modes.def @@ -121,3 +121,6 @@ ADJUST_ALIGNMENT (V2TI, 16); ADJUST_ALIGNMENT (V2HF, 2); ADJUST_ALIGNMENT (V2SF, 4); ADJUST_ALIGNMENT (V2DF, 8); + +/* These are used for vectorized divmod. */ +COMPLEX_MODES (VECTOR_INT); diff --git a/gcc/genmodes.cc b/gcc/genmodes.cc index 715787b8f48..d472ee5a9a3 100644 --- a/gcc/genmodes.cc +++ b/gcc/genmodes.cc @@ -125,6 +125,7 @@ complex_class (enum mode_class c) case MODE_INT: return MODE_COMPLEX_INT; case MODE_PARTIAL_INT: return MODE_COMPLEX_INT; case MODE_FLOAT: return MODE_COMPLEX_FLOAT; +case MODE_VECTOR_INT: return MODE_COMPLEX_VECTOR_INT; default: error ("no complex class for class %s", mode_class_names[c]); return MODE_RANDOM; @@ -382,6 +383,7 @@ complete_mode (struct mode_data *m) case MODE_COMPLEX_INT: case MODE_COMPLEX_FLOAT: +case MODE_COMPLEX_VECTOR_INT: /* Complex modes should have a component indicated, but no more. 
*/ validate_mode (m, UNSET, UNSET, SET, UNSET, UNSET); m->ncomponents = 2; @@ -1173,10 +1175,10 @@ inline __attribute__((__always_inline__))\n\ #else\n\ extern __inline__ __attribute__((__always_inline__, __gnu_inline__))\n\ #endif\n\ -unsigned char\n\ +unsigned short\n\ mode_unit_size_inline (machine_mode mode)\n\ {\n\ - extern CONST_MODE_UNIT_SIZE unsigned char mode_unit_size[NUM_MACHINE_MODES];\ + extern CONST_MODE_UNIT_SIZE unsigned short mode_unit_size[NUM_MACHINE_MODES];\ \n\ gcc_assert (mode >= 0 && mode < NUM_MACHINE_MODES);\n\ switch (mode)\n\ @@ -1683,7 +1685,7 @@ emit_mode_unit_size (void) int c; struct mode_data *m; - print_maybe_const_decl ("%sunsigned char", "mode_unit_size", + print_maybe_const_decl ("%sunsigned short", "mode_unit_size", "NUM_MACHINE_MODES", adj_bytesize); for_all_modes (c, m) @@ -1873,6 +1875,7 @@ emit_mode_adjustments (void) { case MODE_COMPLEX_INT: case MODE_COMPLEX_FLOAT: +case MODE_COMPLEX_VECTOR_INT: printf (" mode_size[E_%smode] = 2*s;\n", m->name); printf (" mode_unit_size[E_%smode] = s;\n", m->name); printf (" mode_base_align[E_%smode] = s & (~s + 1);\n", @@ -1920,6 +1923,7 @@ emit_mode_adjustments (void) { case MODE_COMPLEX_INT: case MODE_COMPLEX_FLOAT: + case MODE_COMPLEX_VECTOR_INT: printf (" mode_base_align[E_%smode] = s;\n", m->name); break; diff --git a/gcc/machmode.def b/gcc/machmode
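For reference, here is a minimal illustration (my own, not taken from the patch) of the kind of loop this targets: because both results of the same division are wanted, the middle-end merges the two operations into a single DIVMOD call whose result is modelled as a complex int, so the vectorized libcall needs a complex *vector* int mode to describe its return value.

/* Illustration only: the quotient and remainder of the same division are
   both needed, so the middle-end emits one DIVMOD libcall per vector.  */
void
divmod_loop (int *restrict quot, int *restrict rem,
	     const int *restrict a, const int *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    {
      quot[i] = a[i] / b[i];
      rem[i] = a[i] % b[i];
    }
}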
Re: [patch] amdgcn: Change -m(no-)xnack to -mxnack=(on,off,any)
OK. Andrew
On 26/05/2023 15:58, Tobias Burnus wrote: (Update the syntax of the amdgcn commandline option in anticipation of later patches; while -m(no-)xnack is in mainline since r12-2396-gaad32a00b7d2b6 (for PR100208), -mxnack (contrary to -msram-ecc) is currently mostly a stub for later patches and is documented as such in invoke.texi. Thus, this change should have no (or only a minimal) effect on users.) For GCN, GCC currently supports -mxnack / -mno-xnack arguments, matching +xnack and -xnack when passed to the LLVM linker. However, since V4 the latter supports three states: besides on/off there is now also unspecified. That matches the semantics of sram(-)ecc, which GCC already implements as 'on'/'off' and 'any'. Cf. https://llvm.org/docs/AMDGPUUsage.html#target-features and https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc/AMD-GCN-Options.html The attached patch uses the sram-ecc flag syntax now also for xnack. Note that currently only 'no' is supported, which is ensured via a 'sorry'. Hence, the default is 'no'. I assume we want to change the default once XNACK is working - therefore, the documentation only states the current default as a comment. The changes were picked from the patch "amdgcn: Support XNACK mode" at - https://gcc.gnu.org/pipermail/gcc-patches/2022-July/597991.html - OG12 0229066ecb24421d48e3e0d56f31c30cc1affdab - OG13 cbc3dd01de8788587a2b641efcb838058303b5ab but only includes the changes related to the commandline option, excluding the other changes like those to insns. It additionally updates invoke.texi (using the wording from -msram-ecc). (I actually encountered this issue because of the non-updated manual.) Tested with full bootstrap, regtesting running, but not expecting surprises. OK for mainline? Tobias PS: For FIJI, "" is passed – that's ensured by NO_XNACK in the ASM_SPEC and the 'switch' later in output_file_start (unchanged), otherwise 'xnack-' is used (via the default in gcn.opt for the compiler and via XNACKOPT for the command line.)
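For illustration only (my own invocations, not from the patch): after this change a GCN compile would spell the option as, e.g., "amdgcn-amdhsa-gcc -march=gfx908 -mxnack=off -c test.c", while "-mxnack=on" or "-mxnack=any" would presumably still be rejected with the 'sorry' until XNACK support lands.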
Re: [PATCH] Add COMPLEX_VECTOR_INT modes
On 30/05/2023 07:26, Richard Biener wrote: On Fri, May 26, 2023 at 4:35 PM Andrew Stubbs wrote: Hi all, I want to implement a vector DIVMOD libfunc for amdgcn, but I can't just do it because the GCC middle-end models DIVMOD's return value as "complex int" type, and there are no vector equivalents of that type. Therefore, this patch adds minimal support for "complex vector int" modes. I have not attempted to provide any means to use these modes from C, so they're really only useful for DIVMOD. The actual libfunc implementation will pack the data into wider vector modes manually. A knock-on effect of this is that I needed to increase the range of "mode_unit_size" (several of the vector modes supported by amdgcn exceed the previous 255-byte limit). Since this change would add a large number of new, unused modes to many architectures, I have elected to *not* enable them, by default, in machmode.def (where the other complex modes are created). The new modes are therefore inactive on all architectures but amdgcn, for now. OK for mainline? (I've not done a full test yet, but I will.) I think it makes more sense to map vector CSImode to vector SImode with the double number of lanes. In fact since divmod is a libgcc function I wonder where your vector variant would reside and how GCC decides to emit calls to it? That is, there's no way to OMP simd declare this function? The divmod implementation lives in libgcc. It's not too difficult to write using vector extensions and some asm tricks. I did try an OMP simd declare implementation, but it didn't vectorize well, and that's a yack I don't wish to shave right now. In any case, the OMP simd declare will not help us here, directly, because the DIVMOD transformation happens too late in the pass pipeline, long after ifcvt and vect. My implementation (not yet posted), uses a libfunc and the TARGET_EXPAND_DIVMOD_LIBFUNC hook in the standard way. It just needs the complex vector modes to exist. Using vectors twice the length is problematic also. If I create a new V128SImode that spans across two 64-lane vector registers then that will probably have the desired effect ("real" quotient in v8, "imaginary" remainder in v9), but if I use V64SImode to represent two V32SImode vectors then that's a one-register mode, and I'll have to use a permutation (a memory operation) to extract lanes 32-63 into lanes 0-31, and if we ever want to implement instructions that operate on these modes (as opposed to the odd/even add/sub complex patterns we have now) then the masking will be all broken and we'd need to constantly disassemble the double length vectors to operate on them. The implementation I proposed is essentially a struct containing two vectors placed in consecutive registers. This is the natural representation for the architecture. Anyway, you don't like this patch and I see that AArch64 is picking apart BLKmode to see if there's complex inside, so maybe I can make something like that work here? AArch64 doesn't seem to use TARGET_EXPAND_DIVMOD_LIBFUNC though, and I'm pretty sure the problem I was trying to solve was in the way the expand pass handles the BLKmode complex, outside the control of the backend hook (I'm still paging this stuff back in, post vacation). Thanks Andrew
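To make the shape of that concrete, here is a rough sketch (mine, not the unposted patch) of what a TARGET_EXPAND_DIVMOD_LIBFUNC implementation could look like once the complex vector modes exist; it is modelled on how existing targets expand scalar divmod libcalls, and the names are illustrative only.

/* Sketch: expand a vector divmod libcall whose return value is in the new
   complex vector int mode; the quotient and remainder are the two halves
   of the returned value.  */
static void
gcn_expand_divmod_libfunc (rtx libfunc, machine_mode mode,
			   rtx op0, rtx op1, rtx *quot, rtx *rem)
{
  machine_mode cmode = GET_MODE_COMPLEX_MODE (mode);
  rtx libval = emit_library_call_value (libfunc, NULL_RTX, LCT_NORMAL,
					cmode, op0, mode, op1, mode);
  *quot = simplify_gen_subreg (mode, libval, cmode, 0);
  *rem = simplify_gen_subreg (mode, libval, cmode, GET_MODE_SIZE (mode));
}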
Re: [Patch] libgomp: plugin-gcn - support 'unified_address'
On 06/06/2023 16:33, Tobias Burnus wrote: Andrew: Does the GCN change look okay to you? This patch permits using GCN devices with 'omp requires unified_address', which in principle works already, except that the requirement handling disabled it. (It also updates libgomp.texi for this change and likewise for an older nvptx change.) I will later add a testcase → https://gcc.gnu.org/PR109837 However, the patch was tested with the respective sollve_vv testcase with an additional fix applied on top → https://github.com/SOLLVE/sollve_vv/pull/737 (I do note that with the USM patches for OG12/OG13, unified_address is accepted, cf. OG13 https://gcc.gnu.org/g:3ddf3565faee70e8c910d90ab0c80e71813a0ba1 , but USM itself goes much beyond what we need here.)
OK, I think this is fine. I was going to do this with the patch series soon anyway. Andrew
Re: [PATCH] Add COMPLEX_VECTOR_INT modes
On 07/06/2023 20:42, Richard Sandiford wrote: I don't know if this helps (probably not), but we have a similar situation on AArch64: a 64-bit mode like V8QI can be doubled to a 128-bit vector or to a pair of 64-bit vectors. We used V16QI for the former and "V2x8QI" for the latter. V2x8QI is forced to come after V16QI in the mode list, and so it is only ever used through explicit choice. But both modes are functionally vectors of 16 QIs. OK, that's interesting, but how do you map "complex int" vectors to that mode? I tried to figure it out, but there's no DIVMOD support so I couldn't just do a straight comparison. Thanks Andrew
Re: [PATCH] Add COMPLEX_VECTOR_INT modes
On 09/06/2023 10:02, Richard Sandiford wrote: Andrew Stubbs writes: On 07/06/2023 20:42, Richard Sandiford wrote: I don't know if this helps (probably not), but we have a similar situation on AArch64: a 64-bit mode like V8QI can be doubled to a 128-bit vector or to a pair of 64-bit vectors. We used V16QI for the former and "V2x8QI" for the latter. V2x8QI is forced to come after V16QI in the mode list, and so it is only ever used through explicit choice. But both modes are functionally vectors of 16 QIs. OK, that's interesting, but how do you map "complex int" vectors to that mode? I tried to figure it out, but there's no DIVMOD support so I couldn't just do a straight comparison. Yeah, we don't do that currently. Instead we make TARGET_ARRAY_MODE return V2x8QI for an array of 2 V8QIs (which is OK, since V2x8QI has 64-bit rather than 128-bit alignment). So we should use it for a complex-y type like: struct { res_type res[2]; }; In principle we should be able to do the same for: struct { res_type a, b; }; but that isn't supported yet. I think it would need a new target hook along the lines of TARGET_ARRAY_MODE, but for structs rather than arrays. The advantage of this from AArch64's PoV is that it extends to 3x and 4x tuples as well, whereas complex is obviously for pairs only. I don't know if it would be acceptable to use that kind of struct wrapper for the divmod code though (for the vector case only). Looking again, I don't think this will help because GCN does not have an instruction that loads vectors that are back-to-back, hence there's little benefit in adding the tuple mode. However, GCN does have instructions that effectively load 2, 3, or 4 vectors that are *interleaved*, which would be the likely case for complex numbers (or pixel colour data!) I need to figure out how to move forward with this patch, please; if the new complex modes are not acceptable then I think I need to reimplement DIVMOD (maybe the scalars can remain as-is), but it's not clear to me what that would look like. Andrew
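For what it's worth, the AArch64 arrangement described above boils down to a target hook of roughly this shape (a simplified sketch of mine, not the actual aarch64 implementation):

/* Sketch: map an array of two V8QI vectors to the V2x8QI tuple mode, so
   the pair is handled as one value while keeping 64-bit alignment.  */
static opt_machine_mode
example_array_mode (machine_mode mode, unsigned HOST_WIDE_INT nelems)
{
  if (mode == V8QImode && nelems == 2)
    return V2x8QImode;
  return opt_machine_mode ();
}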
[PATCH] vect: Vectorize via libfuncs
This patch allows vectorization when operators are available as libfuncs, rather than only as insns. This will be useful for amdgcn where we plan to vectorize loops that contain integer division or modulus, but don't want to generate inline instructions for the division algorithm every time. The change should not affect architectures that do not define vector-mode libfuncs. OK for mainline? Andrew
vect: vectorize via libfuncs This patch allows vectorization when the libfuncs are defined. gcc/ChangeLog: * tree-vect-generic.cc: Include optabs-libfuncs.h. (get_compute_type): Check optab_libfunc. * tree-vect-stmts.cc: Include optabs-libfuncs.h. (vectorizable_operation): Check optab_libfunc. diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc index b7d4a919c55..4d784a70c0d 100644 --- a/gcc/tree-vect-generic.cc +++ b/gcc/tree-vect-generic.cc @@ -44,6 +44,7 @@ along with GCC; see the file COPYING3. If not see #include "gimple-fold.h" #include "gimple-match.h" #include "recog.h" /* FIXME: for insn_data */ +#include "optabs-libfuncs.h" /* Build a ternary operation and gimplify it. Emit code before GSI. @@ -1714,7 +1715,8 @@ get_compute_type (enum tree_code code, optab op, tree type) machine_mode compute_mode = TYPE_MODE (compute_type); if (VECTOR_MODE_P (compute_mode)) { - if (op && optab_handler (op, compute_mode) != CODE_FOR_nothing) + if (op && (optab_handler (op, compute_mode) != CODE_FOR_nothing +|| optab_libfunc (op, compute_mode))) return compute_type; if (code == MULT_HIGHPART_EXPR && can_mult_highpart_p (compute_mode, diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index a7acc032d47..71a8cf2c6d4 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -56,6 +56,7 @@ along with GCC; see the file COPYING3. If not see #include "gimple-fold.h" #include "regs.h" #include "attribs.h" +#include "optabs-libfuncs.h" /* For lang_hooks.types.type_for_mode. */ #include "langhooks.h" @@ -6528,8 +6529,8 @@ vectorizable_operation (vec_info *vinfo, "no optab.\n"); return false; } - target_support_p = (optab_handler (optab, vec_mode) - != CODE_FOR_nothing); + target_support_p = (optab_handler (optab, vec_mode) != CODE_FOR_nothing + || optab_libfunc (optab, vec_mode)); } bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);
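As a usage sketch (mine, not part of the patch): for the optab_libfunc checks above to find anything, the target has to register named libfuncs for the vector modes, e.g. from its TARGET_INIT_LIBFUNCS hook. The mode and symbol names below are illustrative only.

/* Sketch: advertise vector division/modulus libfuncs so that
   get_compute_type/vectorizable_operation treat the operation as
   supported and expansion emits a libcall instead of open-coding it.  */
static void
example_init_libfuncs (void)
{
  set_optab_libfunc (sdiv_optab, V64SImode, "__divv64si3");
  set_optab_libfunc (udiv_optab, V64SImode, "__udivv64si3");
  set_optab_libfunc (smod_optab, V64SImode, "__modv64si3");
  set_optab_libfunc (umod_optab, V64SImode, "__umodv64si3");
}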
Re: [PATCH 3/3] AVX512 fully masked vectorization
RISC-V here since they are going to get both masks and lengths registered I think. The vect_prepare_for_masked_peels hunk might run into issues with SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE looked odd. Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run the testsuite with --param vect-partial-vector-usage=2 with and without -fno-vect-cost-model and filed two bugs, one ICE (PR110221) and one latent wrong-code (PR110237). There's followup work to be done to try enabling masked epilogues for x86-64 by default (when AVX512 is enabled, possibly only when -mprefer-vector-width=512). Getting cost modeling and decision right is going to be challenging. Any comments? OK? Btw, testing on GCN would be welcome - the _avx512 paths could work for it so in case the while_ult path fails (not sure if it ever does) it could get _avx512 style masking. Likewise testing on ARM just to see I didn't break anything here.
I don't have SVE hardware so testing is probably meaningless. I can set some tests going. Is vect.exp enough? Andrew
Re: [PATCH 3/3] AVX512 fully masked vectorization
On 14/06/2023 15:29, Richard Biener wrote: Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: This implemens fully masked vectorization or a masked epilog for AVX512 style masks which single themselves out by representing each lane with a single bit and by using integer modes for the mask (both is much like GCN). AVX512 is also special in that it doesn't have any instruction to compute the mask from a scalar IV like SVE has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar IV (mainly to avoid dependences on mask generation, a suitable mask test instruction is available). This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well. Like RVV code generation prefers a decrementing IV though IVOPTs messes things up in some cases removing that IV to eliminate it with an incrementing one used for address generation. One of the motivating testcases is from PR108410 which in turn is extracted from x264 where large size vectorization shows issues with small trip loops. Execution time there improves compared to classic AVX512 with AVX2 epilogues for the cases of less than 32 iterations. size scalar 128 256 512512e512f 19.42 11.329.35 11.17 15.13 16.89 25.726.536.666.667.628.56 34.495.105.105.745.085.73 44.104.334.295.213.794.25 63.783.853.864.762.542.85 83.641.893.764.501.922.16 123.562.213.754.261.261.42 163.360.831.064.160.951.07 203.391.421.334.070.750.85 243.230.661.724.220.620.70 283.181.092.044.200.540.61 323.160.470.410.410.470.53 343.160.670.610.560.440.50 383.190.950.950.820.400.45 423.090.581.211.130.360.40 'size' specifies the number of actual iterations, 512e is for a masked epilog and 512f for the fully masked loop. From 4 scalar iterations on the AVX512 masked epilog code is clearly the winner, the fully masked variant is clearly worse and it's size benefit is also tiny. Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right? Yes. This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path. GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps). This patch does not enable using fully masked loops or masked epilogues by default. More work on cost modeling and vectorization kind selection on x86_64 is necessary for this. Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE which could be exploited further to unify some of the flags we have right now but there didn't seem to be many easy things to merge, so I'm leaving this for followups. 
Mask requirements as registered by vect_record_loop_mask are kept in their original form and recorded in a hash_set now instead of being processed to a vector of rgroup_controls. Instead that's now left to the final analysis phase which tries forming the rgroup_controls vector using while_ult and if that fails now tries AVX512 style which needs a different organization and instead fills a hash_map with the relevant info. vect_get_loop_mask now has two implementations, one for the two mask styles we then have. I have decided against interweaving vect_set_loop_condition_partial_vectors with conditions to do AVX512 style masking and instead opted to "duplicate" this to vect_set_loop_condition_partial_vectors_avx512. Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512. I was split between making 'vec_loop_masks' a class with methods, possibly merging in the _len stuff into a single registry. It seemed to be too many changes for the purpose of getti
Re: [PATCH 3/3] AVX512 fully masked vectorization
On 15/06/2023 10:58, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 14/06/2023 15:29, Richard Biener wrote: Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: This implemens fully masked vectorization or a masked epilog for AVX512 style masks which single themselves out by representing each lane with a single bit and by using integer modes for the mask (both is much like GCN). AVX512 is also special in that it doesn't have any instruction to compute the mask from a scalar IV like SVE has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar IV (mainly to avoid dependences on mask generation, a suitable mask test instruction is available). This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well. Like RVV code generation prefers a decrementing IV though IVOPTs messes things up in some cases removing that IV to eliminate it with an incrementing one used for address generation. One of the motivating testcases is from PR108410 which in turn is extracted from x264 where large size vectorization shows issues with small trip loops. Execution time there improves compared to classic AVX512 with AVX2 epilogues for the cases of less than 32 iterations. size scalar 128 256 512512e512f 19.42 11.329.35 11.17 15.13 16.89 25.726.536.666.667.628.56 34.495.105.105.745.085.73 44.104.334.295.213.794.25 63.783.853.864.762.542.85 83.641.893.764.501.922.16 123.562.213.754.261.261.42 163.360.831.064.160.951.07 203.391.421.334.070.750.85 243.230.661.724.220.620.70 283.181.092.044.200.540.61 323.160.470.410.410.470.53 343.160.670.610.560.440.50 383.190.950.950.820.400.45 423.090.581.211.130.360.40 'size' specifies the number of actual iterations, 512e is for a masked epilog and 512f for the fully masked loop. From 4 scalar iterations on the AVX512 masked epilog code is clearly the winner, the fully masked variant is clearly worse and it's size benefit is also tiny. Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right? Yes. This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path. GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps). This patch does not enable using fully masked loops or masked epilogues by default. More work on cost modeling and vectorization kind selection on x86_64 is necessary for this. Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE which could be exploited further to unify some of the flags we have right now but there didn't seem to be many easy things to merge, so I'm leaving this for followups. 
Mask requirements as registered by vect_record_loop_mask are kept in their original form and recorded in a hash_set now instead of being processed to a vector of rgroup_controls. Instead that's now left to the final analysis phase which tries forming the rgroup_controls vector using while_ult and if that fails now tries AVX512 style which needs a different organization and instead fills a hash_map with the relevant info. vect_get_loop_mask now has two implementations, one for the two mask styles we then have. I have decided against interweaving vect_set_loop_condition_partial_vectors with conditions to do AVX512 style masking and instead opted to "duplicate" this to vect_set_loop_condition_partial_vectors_avx512. Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512. I was split between making 'vec_loop_masks' a class with methods, possibly merging in the _len
Re: [PATCH 3/3] AVX512 fully masked vectorization
On 15/06/2023 12:06, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 10:58, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 14/06/2023 15:29, Richard Biener wrote: Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: This implemens fully masked vectorization or a masked epilog for AVX512 style masks which single themselves out by representing each lane with a single bit and by using integer modes for the mask (both is much like GCN). AVX512 is also special in that it doesn't have any instruction to compute the mask from a scalar IV like SVE has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar IV (mainly to avoid dependences on mask generation, a suitable mask test instruction is available). This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well. Like RVV code generation prefers a decrementing IV though IVOPTs messes things up in some cases removing that IV to eliminate it with an incrementing one used for address generation. One of the motivating testcases is from PR108410 which in turn is extracted from x264 where large size vectorization shows issues with small trip loops. Execution time there improves compared to classic AVX512 with AVX2 epilogues for the cases of less than 32 iterations. size scalar 128 256 512512e512f 19.42 11.329.35 11.17 15.13 16.89 25.726.536.666.667.628.56 34.495.105.105.745.085.73 44.104.334.295.213.794.25 63.783.853.864.762.542.85 83.641.893.764.501.922.16 123.562.213.754.261.261.42 163.360.831.064.160.951.07 203.391.421.334.070.750.85 243.230.661.724.220.620.70 283.181.092.044.200.540.61 323.160.470.410.410.470.53 343.160.670.610.560.440.50 383.190.950.950.820.400.45 423.090.581.211.130.360.40 'size' specifies the number of actual iterations, 512e is for a masked epilog and 512f for the fully masked loop. From 4 scalar iterations on the AVX512 masked epilog code is clearly the winner, the fully masked variant is clearly worse and it's size benefit is also tiny. Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right? Yes. This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path. GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps). This patch does not enable using fully masked loops or masked epilogues by default. More work on cost modeling and vectorization kind selection on x86_64 is necessary for this. 
Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE which could be exploited further to unify some of the flags we have right now but there didn't seem to be many easy things to merge, so I'm leaving this for followups. Mask requirements as registered by vect_record_loop_mask are kept in their original form and recorded in a hash_set now instead of being processed to a vector of rgroup_controls. Instead that's now left to the final analysis phase which tries forming the rgroup_controls vector using while_ult and if that fails now tries AVX512 style which needs a different organization and instead fills a hash_map with the relevant info. vect_get_loop_mask now has two implementations, one for the two mask styles we then have. I have decided against interweaving vect_set_loop_condition_partial_vectors with conditions to do AVX512 style masking and instead opted to "duplicate" this to vect_set_loop_condition_partial_vectors_avx512. Likewise for vect_verify_full_masking vs vect_verify_full_masking_av
Re: [PATCH 3/3] AVX512 fully masked vectorization
On 15/06/2023 14:34, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 12:06, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 10:58, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 14/06/2023 15:29, Richard Biener wrote: Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: This implemens fully masked vectorization or a masked epilog for AVX512 style masks which single themselves out by representing each lane with a single bit and by using integer modes for the mask (both is much like GCN). AVX512 is also special in that it doesn't have any instruction to compute the mask from a scalar IV like SVE has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar IV (mainly to avoid dependences on mask generation, a suitable mask test instruction is available). This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well. Like RVV code generation prefers a decrementing IV though IVOPTs messes things up in some cases removing that IV to eliminate it with an incrementing one used for address generation. One of the motivating testcases is from PR108410 which in turn is extracted from x264 where large size vectorization shows issues with small trip loops. Execution time there improves compared to classic AVX512 with AVX2 epilogues for the cases of less than 32 iterations. size scalar 128 256 512512e512f 19.42 11.329.35 11.17 15.13 16.89 25.726.536.666.667.628.56 34.495.105.105.745.085.73 44.104.334.295.213.794.25 63.783.853.864.762.542.85 83.641.893.764.501.922.16 123.562.213.754.261.261.42 163.360.831.064.160.951.07 203.391.421.334.070.750.85 243.230.661.724.220.620.70 283.181.092.044.200.540.61 323.160.470.410.410.470.53 343.160.670.610.560.440.50 383.190.950.950.820.400.45 423.090.581.211.130.360.40 'size' specifies the number of actual iterations, 512e is for a masked epilog and 512f for the fully masked loop. From 4 scalar iterations on the AVX512 masked epilog code is clearly the winner, the fully masked variant is clearly worse and it's size benefit is also tiny. Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right? Yes. This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path. GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps). This patch does not enable using fully masked loops or masked epilogues by default. More work on cost modeling and vectorization kind selection on x86_64 is necessary for this. 
Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE which could be exploited further to unify some of the flags we have right now but there didn't seem to be many easy things to merge, so I'm leaving this for followups. Mask requirements as registered by vect_record_loop_mask are kept in their original form and recorded in a hash_set now instead of being processed to a vector of rgroup_controls. Instead that's now left to the final analysis phase which tries forming the rgroup_controls vector using while_ult and if that fails now tries AVX512 style which needs a different organization and instead fills a hash_map with the relevant info. vect_get_loop_mask now has two implementations, one for the two mask styles we then have. I have decided against interweaving vect_set_loop_condition_partial_vectors with conditions to do AVX512 style masking and instead opted to "duplicate" this to vect_set_loop_cond
Re: [PATCH 3/3] AVX512 fully masked vectorization
On 15/06/2023 15:00, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 14:34, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 12:06, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 15/06/2023 10:58, Richard Biener wrote: On Thu, 15 Jun 2023, Andrew Stubbs wrote: On 14/06/2023 15:29, Richard Biener wrote: Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: This implemens fully masked vectorization or a masked epilog for AVX512 style masks which single themselves out by representing each lane with a single bit and by using integer modes for the mask (both is much like GCN). AVX512 is also special in that it doesn't have any instruction to compute the mask from a scalar IV like SVE has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar IV (mainly to avoid dependences on mask generation, a suitable mask test instruction is available). This is also sounds like GCN. We currently use WHILE_ULT in the middle end which expands to a vector compare against a vector of stepped values. This requires an additional instruction to prepare the comparison vector (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so it works reasonably well. Like RVV code generation prefers a decrementing IV though IVOPTs messes things up in some cases removing that IV to eliminate it with an incrementing one used for address generation. One of the motivating testcases is from PR108410 which in turn is extracted from x264 where large size vectorization shows issues with small trip loops. Execution time there improves compared to classic AVX512 with AVX2 epilogues for the cases of less than 32 iterations. size scalar 128 256 512512e512f 19.42 11.329.35 11.17 15.13 16.89 25.726.536.666.667.628.56 34.495.105.105.745.085.73 44.104.334.295.213.794.25 63.783.853.864.762.542.85 83.641.893.764.501.922.16 123.562.213.754.261.261.42 163.360.831.064.160.951.07 203.391.421.334.070.750.85 243.230.661.724.220.620.70 283.181.092.044.200.540.61 323.160.470.410.410.470.53 343.160.670.610.560.440.50 383.190.950.950.820.400.45 423.090.581.211.130.360.40 'size' specifies the number of actual iterations, 512e is for a masked epilog and 512f for the fully masked loop. From 4 scalar iterations on the AVX512 masked epilog code is clearly the winner, the fully masked variant is clearly worse and it's size benefit is also tiny. Let me check I understand correctly. In the fully masked case, there is a single loop in which a new mask is generated at the start of each iteration. In the masked epilogue case, the main loop uses no masking whatsoever, thus avoiding the need for generating a mask, carrying the mask, inserting vec_merge operations, etc, and then the epilogue looks much like the fully masked case, but unlike smaller mode epilogues there is no loop because the eplogue vector size is the same. Is that right? Yes. This scheme seems like it might also benefit GCN, in so much as it simplifies the hot code path. GCN does not actually have smaller vector sizes, so there's no analogue to AVX2 (we pretend we have some smaller sizes, but that's because the middle end can't do masking everywhere yet, and it helps make some vector constants smaller, perhaps). This patch does not enable using fully masked loops or masked epilogues by default. 
More work on cost modeling and vectorization kind selection on x86_64 is necessary for this. Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE which could be exploited further to unify some of the flags we have right now but there didn't seem to be many easy things to merge, so I'm leaving this for followups. Mask requirements as registered by vect_record_loop_mask are kept in their original form and recorded in a hash_set now instead of being processed to a vector of rgroup_controls. Instead that's now left to the final analysis phase which tries forming the rgroup_controls vector using while_ult and if that fails now tries AVX512 style which needs a different organization and instead fills a hash_map with the relevant info. vect_get_loop_mask now has two implementations, one for the two mask styles we then have. I have decided against interweaving vect_set_loop_condition_partial_vectors wit
[PATCH 00/17] openmp, nvptx, amdgcn: 5.0 Memory Allocators
This patch series implements OpenMP allocators for low-latency memory on nvptx, unified shared memory on both nvptx and amdgcn, and generic pinned memory support for all Linux hosts (an nvptx-specific implementation using Cuda pinned memory is planned for the future, as is low-latency memory on amdgcn). Patches 01 to 14 are reposts of patches previously submitted, now forward ported to the current master branch and with the various follow-up patches folded in. Where it conflicts with the new memkind implementation the memkind takes precedence (but there's currently no way to implement memory that's both high-bandwidth and pinned anyway). Patches 15 to 17 are new work. I can probably approve these myself, but they can't be committed until the rest of the series is approved. Andrew Andrew Stubbs (11): libgomp, nvptx: low-latency memory allocator libgomp: pinned memory libgomp, openmp: Add ompx_pinned_mem_alloc openmp, nvptx: low-lat memory access traits openmp, nvptx: ompx_unified_shared_mem_alloc openmp: Add -foffload-memory openmp: allow requires unified_shared_memory openmp: -foffload-memory=pinned amdgcn: Support XNACK mode amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK amdgcn: libgomp plugin USM implementation Hafiz Abid Qadeer (6): openmp: Use libgomp memory allocation functions with unified shared memory. Add parsing support for allocate directive (OpenMP 5.0) Translate allocate directive (OpenMP 5.0). Handle cleanup of omp allocated variables (OpenMP 5.0). Gimplify allocate directive (OpenMP 5.0). Lower allocate directive (OpenMP 5.0). gcc/c/c-parser.cc | 22 +- gcc/common.opt| 16 + gcc/config/gcn/gcn-hsa.h | 3 +- gcc/config/gcn/gcn-opts.h | 10 +- gcc/config/gcn/gcn-valu.md| 29 +- gcc/config/gcn/gcn.cc | 62 ++- gcc/config/gcn/gcn.md | 113 +++-- gcc/config/gcn/gcn.opt| 18 +- gcc/config/gcn/mkoffload.cc | 56 ++- gcc/coretypes.h | 7 + gcc/cp/parser.cc | 22 +- gcc/doc/gimple.texi | 38 +- gcc/doc/invoke.texi | 16 +- gcc/fortran/dump-parse-tree.cc| 3 + gcc/fortran/gfortran.h| 5 +- gcc/fortran/match.h | 1 + gcc/fortran/openmp.cc | 242 ++- gcc/fortran/parse.cc | 10 +- gcc/fortran/resolve.cc| 1 + gcc/fortran/st.cc | 1 + gcc/fortran/trans-decl.cc | 20 + gcc/fortran/trans-openmp.cc | 50 +++ gcc/fortran/trans.cc | 1 + gcc/gimple-pretty-print.cc| 37 ++ gcc/gimple.cc | 12 + gcc/gimple.def| 6 + gcc/gimple.h | 60 ++- gcc/gimplify.cc | 19 + gcc/gsstruct.def | 1 + gcc/omp-builtins.def | 3 + gcc/omp-low.cc| 383 + gcc/passes.def| 1 + .../c-c++-common/gomp/alloc-pinned-1.c| 28 ++ gcc/testsuite/c-c++-common/gomp/usm-1.c | 4 + gcc/testsuite/c-c++-common/gomp/usm-2.c | 46 +++ gcc/testsuite/c-c++-common/gomp/usm-3.c | 44 ++ gcc/testsuite/c-c++-common/gomp/usm-4.c | 4 + gcc/testsuite/g++.dg/gomp/usm-1.C | 32 ++ gcc/testsuite/g++.dg/gomp/usm-2.C | 30 ++ gcc/testsuite/g++.dg/gomp/usm-3.C | 38 ++ gcc/testsuite/gfortran.dg/gomp/allocate-4.f90 | 112 + gcc/testsuite/gfortran.dg/gomp/allocate-5.f90 | 73 gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 | 84 gcc/testsuite/gfortran.dg/gomp/allocate-7.f90 | 13 + gcc/testsuite/gfortran.dg/gomp/allocate-8.f90 | 15 + gcc/testsuite/gfortran.dg/gomp/usm-1.f90 | 6 + gcc/testsuite/gfortran.dg/gomp/usm-2.f90 | 16 + gcc/testsuite/gfortran.dg/gomp/usm-3.f90 | 13 + gcc/testsuite/gfortran.dg/gomp/usm-4.f90 | 6 + gcc/tree-core.h | 9 + gcc/tree-pass.h | 1 + gcc/tree-pretty-print.cc | 23 ++ gcc/tree.cc | 1 + gcc/tree.def | 4 + gcc/tree.h| 15 + include/cuda/cuda.h | 12 + libgomp/allocator.c | 304 ++ libgomp/config/linux/allocator.c | 137 +++ libgomp/config/nvptx/allocator.c | 387 
++ libgomp/conf
[PATCH 02/17] libgomp: pinned memory
Implement the OpenMP pinned memory trait on Linux hosts using the mlock syscall. Pinned allocations are performed using mmap, not malloc, to ensure that they can be unpinned safely when freed. libgomp/ChangeLog: * allocator.c (MEMSPACE_ALLOC): Add PIN. (MEMSPACE_CALLOC): Add PIN. (MEMSPACE_REALLOC): Add PIN. (MEMSPACE_FREE): Add PIN. (xmlock): New function. (omp_init_allocator): Don't disallow the pinned trait. (omp_aligned_alloc): Add pinning to all MEMSPACE_* calls. (omp_aligned_calloc): Likewise. (omp_realloc): Likewise. (omp_free): Likewise. * config/linux/allocator.c: New file. * config/nvptx/allocator.c (MEMSPACE_ALLOC): Add PIN. (MEMSPACE_CALLOC): Add PIN. (MEMSPACE_REALLOC): Add PIN. (MEMSPACE_FREE): Add PIN. * testsuite/libgomp.c/alloc-pinned-1.c: New test. * testsuite/libgomp.c/alloc-pinned-2.c: New test. * testsuite/libgomp.c/alloc-pinned-3.c: New test. * testsuite/libgomp.c/alloc-pinned-4.c: New test. --- libgomp/allocator.c | 67 ++ libgomp/config/linux/allocator.c | 99 ++ libgomp/config/nvptx/allocator.c | 8 +- libgomp/testsuite/libgomp.c/alloc-pinned-1.c | 95 + libgomp/testsuite/libgomp.c/alloc-pinned-2.c | 101 ++ libgomp/testsuite/libgomp.c/alloc-pinned-3.c | 130 ++ libgomp/testsuite/libgomp.c/alloc-pinned-4.c | 132 +++ 7 files changed, 602 insertions(+), 30 deletions(-) create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-1.c create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-2.c create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-3.c create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-4.c diff --git a/libgomp/allocator.c b/libgomp/allocator.c index 9b33bcf529b..54310ab93ca 100644 --- a/libgomp/allocator.c +++ b/libgomp/allocator.c @@ -39,16 +39,20 @@ /* These macros may be overridden in config//allocator.c. */ #ifndef MEMSPACE_ALLOC -#define MEMSPACE_ALLOC(MEMSPACE, SIZE) malloc (SIZE) +#define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \ + (PIN ? NULL : malloc (SIZE)) #endif #ifndef MEMSPACE_CALLOC -#define MEMSPACE_CALLOC(MEMSPACE, SIZE) calloc (1, SIZE) +#define MEMSPACE_CALLOC(MEMSPACE, SIZE, PIN) \ + (PIN ? NULL : calloc (1, SIZE)) #endif #ifndef MEMSPACE_REALLOC -#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) realloc (ADDR, SIZE) +#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE, OLDPIN, PIN) \ + ((PIN) || (OLDPIN) ? NULL : realloc (ADDR, SIZE)) #endif #ifndef MEMSPACE_FREE -#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) free (ADDR) +#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \ + (PIN ? NULL : free (ADDR)) #endif /* Map the predefined allocators to the correct memory space. @@ -351,10 +355,6 @@ omp_init_allocator (omp_memspace_handle_t memspace, int ntraits, break; } - /* No support for this so far. */ - if (data.pinned) -return omp_null_allocator; - ret = gomp_malloc (sizeof (struct omp_allocator_data)); *ret = data; #ifndef HAVE_SYNC_BUILTINS @@ -481,7 +481,8 @@ retry: } else #endif - ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size); + ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size, + allocator_data->pinned); if (ptr == NULL) { #ifdef HAVE_SYNC_BUILTINS @@ -511,7 +512,8 @@ retry: = (allocator_data ? 
allocator_data->memspace : predefined_alloc_mapping[allocator]); - ptr = MEMSPACE_ALLOC (memspace, new_size); + ptr = MEMSPACE_ALLOC (memspace, new_size, +allocator_data && allocator_data->pinned); } if (ptr == NULL) goto fail; @@ -542,9 +544,9 @@ fail: #ifdef LIBGOMP_USE_MEMKIND || memkind #endif - || (allocator_data - && allocator_data->pool_size < ~(uintptr_t) 0) - || !allocator_data) + || !allocator_data + || allocator_data->pool_size < ~(uintptr_t) 0 + || allocator_data->pinned) { allocator = omp_default_mem_alloc; goto retry; @@ -596,6 +598,7 @@ omp_free (void *ptr, omp_allocator_handle_t allocator) struct omp_mem_header *data; omp_memspace_handle_t memspace __attribute__((unused)) = omp_default_mem_space; + int pinned __attribute__((unused)) = false; if (ptr == NULL) return; @@ -627,6 +630,7 @@ omp_free (void *ptr, omp_allocator_handle_t allocator) #endif memspace = allocator_data->memspace; + pinned = allocator_data->pinned; } else { @@ -651,7 +655,7 @@ omp_free (void *ptr, omp_allocator_handle_t allocator) memspace = predefined_alloc_mapping[data->allocator]; } - MEMSPACE_FREE (memspace, data->ptr, data->size); + MEMSPACE_FREE (memspace, data->ptr, data->size, pinned); } ialias (omp_free) @@ -767,7 +771,8 @@ retry: } else #endif - ptr = MEMSPACE_CALLOC (allocator_data->memspace, new_size); + ptr =
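For context, here is a minimal usage sketch of the trait this patch implements (plain OpenMP 5.x allocator API; nothing here is specific to the patch):

#include <omp.h>

int
main (void)
{
  /* Request host memory that the new allocator code pins with mlock.  */
  omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
  omp_allocator_handle_t pinned
    = omp_init_allocator (omp_default_mem_space, 1, traits);
  int *p = (int *) omp_alloc (1024 * sizeof (int), pinned);
  if (!p)
    return 1;
  omp_free (p, pinned);
  omp_destroy_allocator (pinned);
  return 0;
}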
[PATCH 01/17] libgomp, nvptx: low-latency memory allocator
This patch adds support for allocating low-latency ".shared" memory on NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc. The memory can be allocated, reallocated, and freed using a basic but fast algorithm, is thread safe and the size of the low-latency heap can be configured using the GOMP_NVPTX_LOWLAT_POOL environment variable. The use of the PTX dynamic_smem_size feature means that low-latency allocator will not work with the PTX 3.1 multilib. libgomp/ChangeLog: * allocator.c (MEMSPACE_ALLOC): New macro. (MEMSPACE_CALLOC): New macro. (MEMSPACE_REALLOC): New macro. (MEMSPACE_FREE): New macro. (dynamic_smem_size): New constants. (omp_alloc): Use MEMSPACE_ALLOC. Implement fall-backs for predefined allocators. (omp_free): Use MEMSPACE_FREE. (omp_calloc): Use MEMSPACE_CALLOC. Implement fall-backs for predefined allocators. (omp_realloc): Use MEMSPACE_REALLOC and MEMSPACE_ALLOC.. Implement fall-backs for predefined allocators. * config/nvptx/team.c (__nvptx_lowlat_heap_root): New variable. (__nvptx_lowlat_pool): New asm varaible. (gomp_nvptx_main): Initialize the low-latency heap. * plugin/plugin-nvptx.c (lowlat_pool_size): New variable. (GOMP_OFFLOAD_init_device): Read the GOMP_NVPTX_LOWLAT_POOL envvar. (GOMP_OFFLOAD_run): Apply lowlat_pool_size. * config/nvptx/allocator.c: New file. * testsuite/libgomp.c/allocators-1.c: New test. * testsuite/libgomp.c/allocators-2.c: New test. * testsuite/libgomp.c/allocators-3.c: New test. * testsuite/libgomp.c/allocators-4.c: New test. * testsuite/libgomp.c/allocators-5.c: New test. * testsuite/libgomp.c/allocators-6.c: New test. co-authored-by: Kwok Cheung Yeung --- libgomp/allocator.c| 235 - libgomp/config/nvptx/allocator.c | 370 + libgomp/config/nvptx/team.c| 28 ++ libgomp/plugin/plugin-nvptx.c | 23 +- libgomp/testsuite/libgomp.c/allocators-1.c | 56 libgomp/testsuite/libgomp.c/allocators-2.c | 64 libgomp/testsuite/libgomp.c/allocators-3.c | 42 +++ libgomp/testsuite/libgomp.c/allocators-4.c | 196 +++ libgomp/testsuite/libgomp.c/allocators-5.c | 63 libgomp/testsuite/libgomp.c/allocators-6.c | 117 +++ 10 files changed, 1110 insertions(+), 84 deletions(-) create mode 100644 libgomp/config/nvptx/allocator.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-1.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-2.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-3.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-4.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-5.c create mode 100644 libgomp/testsuite/libgomp.c/allocators-6.c diff --git a/libgomp/allocator.c b/libgomp/allocator.c index b04820b8cf9..9b33bcf529b 100644 --- a/libgomp/allocator.c +++ b/libgomp/allocator.c @@ -37,6 +37,34 @@ #define omp_max_predefined_alloc omp_thread_mem_alloc +/* These macros may be overridden in config//allocator.c. */ +#ifndef MEMSPACE_ALLOC +#define MEMSPACE_ALLOC(MEMSPACE, SIZE) malloc (SIZE) +#endif +#ifndef MEMSPACE_CALLOC +#define MEMSPACE_CALLOC(MEMSPACE, SIZE) calloc (1, SIZE) +#endif +#ifndef MEMSPACE_REALLOC +#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) realloc (ADDR, SIZE) +#endif +#ifndef MEMSPACE_FREE +#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) free (ADDR) +#endif + +/* Map the predefined allocators to the correct memory space. + The index to this table is the omp_allocator_handle_t enum value. */ +static const omp_memspace_handle_t predefined_alloc_mapping[] = { + omp_default_mem_space, /* omp_null_allocator. */ + omp_default_mem_space, /* omp_default_mem_alloc. 
*/ + omp_large_cap_mem_space, /* omp_large_cap_mem_alloc. */ + omp_default_mem_space, /* omp_const_mem_alloc. */ + omp_high_bw_mem_space, /* omp_high_bw_mem_alloc. */ + omp_low_lat_mem_space, /* omp_low_lat_mem_alloc. */ + omp_low_lat_mem_space, /* omp_cgroup_mem_alloc. */ + omp_low_lat_mem_space, /* omp_pteam_mem_alloc. */ + omp_low_lat_mem_space, /* omp_thread_mem_alloc. */ +}; + enum gomp_memkind_kind { GOMP_MEMKIND_NONE = 0, @@ -453,7 +481,7 @@ retry: } else #endif - ptr = malloc (new_size); + ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size); if (ptr == NULL) { #ifdef HAVE_SYNC_BUILTINS @@ -478,7 +506,13 @@ retry: } else #endif - ptr = malloc (new_size); + { + omp_memspace_handle_t memspace __attribute__((unused)) + = (allocator_data + ? allocator_data->memspace + : predefined_alloc_mapping[allocator]); + ptr = MEMSPACE_ALLOC (memspace, new_size); + } if (ptr == NULL) goto fail; } @@ -496,35 +530,38 @@ retry: return ret; fail: - if (allocator_data) + int fallback = (allocator_da
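A usage sketch for reference (mine, not taken from the patch): the low-latency space is reached through the predefined omp_low_lat_mem_alloc allocator inside a target region, and the pool can be enlarged with, e.g., GOMP_NVPTX_LOWLAT_POOL=65536 in the environment.

#include <omp.h>

int
main (void)
{
  int ok = 0;
#pragma omp target map(from:ok)
  {
    /* Allocate from the .shared low-latency pool; if it is exhausted the
       allocator falls back according to its fallback trait.  */
    int *p = (int *) omp_alloc (64 * sizeof (int), omp_low_lat_mem_alloc);
    ok = (p != 0);
    omp_free (p, omp_low_lat_mem_alloc);
  }
  return ok ? 0 : 1;
}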
[PATCH 04/17] openmp, nvptx: low-lat memory access traits
The NVPTX low latency memory is not accessible outside the team that allocates it, and therefore should be unavailable for allocators with the access trait "all". This change means that the omp_low_lat_mem_alloc predefined allocator now implicitly implies the "pteam" trait. libgomp/ChangeLog: * allocator.c (MEMSPACE_VALIDATE): New macro. (omp_aligned_alloc): Use MEMSPACE_VALIDATE. (omp_aligned_calloc): Likewise. (omp_realloc): Likewise. * config/nvptx/allocator.c (nvptx_memspace_validate): New function. (MEMSPACE_VALIDATE): New macro. * testsuite/libgomp.c/allocators-4.c (main): Add access trait. * testsuite/libgomp.c/allocators-6.c (main): Add access trait. * testsuite/libgomp.c/allocators-7.c: New test. --- libgomp/allocator.c| 15 + libgomp/config/nvptx/allocator.c | 11 libgomp/testsuite/libgomp.c/allocators-4.c | 7 ++- libgomp/testsuite/libgomp.c/allocators-6.c | 7 ++- libgomp/testsuite/libgomp.c/allocators-7.c | 68 ++ 5 files changed, 102 insertions(+), 6 deletions(-) create mode 100644 libgomp/testsuite/libgomp.c/allocators-7.c diff --git a/libgomp/allocator.c b/libgomp/allocator.c index 029d0d40a36..48ab0782e6b 100644 --- a/libgomp/allocator.c +++ b/libgomp/allocator.c @@ -54,6 +54,9 @@ #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \ (PIN ? NULL : free (ADDR)) #endif +#ifndef MEMSPACE_VALIDATE +#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) 1 +#endif /* Map the predefined allocators to the correct memory space. The index to this table is the omp_allocator_handle_t enum value. */ @@ -438,6 +441,10 @@ retry: if (__builtin_add_overflow (size, new_size, &new_size)) goto fail; + if (allocator_data + && !MEMSPACE_VALIDATE (allocator_data->memspace, allocator_data->access)) +goto fail; + if (__builtin_expect (allocator_data && allocator_data->pool_size < ~(uintptr_t) 0, 0)) { @@ -733,6 +740,10 @@ retry: if (__builtin_add_overflow (size_temp, new_size, &new_size)) goto fail; + if (allocator_data + && !MEMSPACE_VALIDATE (allocator_data->memspace, allocator_data->access)) +goto fail; + if (__builtin_expect (allocator_data && allocator_data->pool_size < ~(uintptr_t) 0, 0)) { @@ -964,6 +975,10 @@ retry: goto fail; old_size = data->size; + if (allocator_data + && !MEMSPACE_VALIDATE (allocator_data->memspace, allocator_data->access)) +goto fail; + if (__builtin_expect (allocator_data && allocator_data->pool_size < ~(uintptr_t) 0, 0)) { diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c index f740b97f6ac..0102680b717 100644 --- a/libgomp/config/nvptx/allocator.c +++ b/libgomp/config/nvptx/allocator.c @@ -358,6 +358,15 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, return realloc (addr, size); } +static inline int +nvptx_memspace_validate (omp_memspace_handle_t memspace, unsigned access) +{ + /* Disallow use of low-latency memory when it must be accessible by + all threads. 
*/ + return (memspace != omp_low_lat_mem_space + || access != omp_atv_all); +} + #define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \ nvptx_memspace_alloc (MEMSPACE, SIZE) #define MEMSPACE_CALLOC(MEMSPACE, SIZE, PIN) \ @@ -366,5 +375,7 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, nvptx_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE) #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \ nvptx_memspace_free (MEMSPACE, ADDR, SIZE) +#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \ + nvptx_memspace_validate (MEMSPACE, ACCESS) #include "../../allocator.c" diff --git a/libgomp/testsuite/libgomp.c/allocators-4.c b/libgomp/testsuite/libgomp.c/allocators-4.c index 9fa6aa1624f..cae27ea33c1 100644 --- a/libgomp/testsuite/libgomp.c/allocators-4.c +++ b/libgomp/testsuite/libgomp.c/allocators-4.c @@ -23,10 +23,11 @@ main () #pragma omp target { /* Ensure that the memory we get *is* low-latency with a null-fallback. */ -omp_alloctrait_t traits[1] - = { { omp_atk_fallback, omp_atv_null_fb } }; +omp_alloctrait_t traits[2] + = { { omp_atk_fallback, omp_atv_null_fb }, + { omp_atk_access, omp_atv_pteam } }; omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space, - 1, traits); + 2, traits); int size = 4; diff --git a/libgomp/testsuite/libgomp.c/allocators-6.c b/libgomp/testsuite/libgomp.c/allocators-6.c index 90bf73095ef..c03233df582 100644 --- a/libgomp/testsuite/libgomp.c/allocators-6.c +++ b/libgomp/testsuite/libgomp.c/allocators-6.c @@ -23,10 +23,11 @@ main () #pragma omp target { /* Ensure that the memory we get *is* low-latency with a null-fallback. */ -omp_alloctrait_t traits[1] - = { { omp_atk_fallback, omp_atv_null_fb } }; +omp_alloctrait_t traits[2] + = { { omp_atk_fallback, omp_atv_null_fb },
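In other words (an illustrative snippet of mine, not from the patch): an allocator over omp_low_lat_mem_space whose access trait is omp_atv_all now fails MEMSPACE_VALIDATE, so with a null fallback the allocation returns NULL instead of handing out team-local memory that other teams could not see.

#include <omp.h>

/* Sketch: on the device, low-latency memory with access 'all' is refused;
   with the null fallback this returns NULL.  */
void *
try_low_lat_all (void)
{
  omp_alloctrait_t traits[] = { { omp_atk_fallback, omp_atv_null_fb },
				{ omp_atk_access, omp_atv_all } };
  omp_allocator_handle_t a
    = omp_init_allocator (omp_low_lat_mem_space, 2, traits);
  return omp_alloc (64, a);
}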
[PATCH 03/17] libgomp, openmp: Add ompx_pinned_mem_alloc
This creates a new predefined allocator as a shortcut for using pinned memory with OpenMP. The name uses the OpenMP extension space and is intended to be consistent with other OpenMP implementations currently in development. The allocator is equivalent to using a custom allocator with the pinned trait and the null fallback trait. libgomp/ChangeLog: * allocator.c (omp_max_predefined_alloc): Update. (omp_aligned_alloc): Support ompx_pinned_mem_alloc. (omp_free): Likewise. (omp_aligned_calloc): Likewise. (omp_realloc): Likewise. * omp.h.in (omp_allocator_handle_t): Add ompx_pinned_mem_alloc. * omp_lib.f90.in: Add ompx_pinned_mem_alloc. * testsuite/libgomp.c/alloc-pinned-5.c: New test. * testsuite/libgomp.c/alloc-pinned-6.c: New test. * testsuite/libgomp.fortran/alloc-pinned-1.f90: New test. --- libgomp/allocator.c | 60 +++ libgomp/omp.h.in | 1 + libgomp/omp_lib.f90.in| 2 + libgomp/testsuite/libgomp.c/alloc-pinned-5.c | 90 libgomp/testsuite/libgomp.c/alloc-pinned-6.c | 101 ++ .../libgomp.fortran/alloc-pinned-1.f90| 16 +++ 6 files changed, 252 insertions(+), 18 deletions(-) create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-5.c create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-6.c create mode 100644 libgomp/testsuite/libgomp.fortran/alloc-pinned-1.f90 diff --git a/libgomp/allocator.c b/libgomp/allocator.c index 54310ab93ca..029d0d40a36 100644 --- a/libgomp/allocator.c +++ b/libgomp/allocator.c @@ -35,7 +35,7 @@ #include #endif -#define omp_max_predefined_alloc omp_thread_mem_alloc +#define omp_max_predefined_alloc ompx_pinned_mem_alloc /* These macros may be overridden in config//allocator.c. */ #ifndef MEMSPACE_ALLOC @@ -67,6 +67,7 @@ static const omp_memspace_handle_t predefined_alloc_mapping[] = { omp_low_lat_mem_space, /* omp_cgroup_mem_alloc. */ omp_low_lat_mem_space, /* omp_pteam_mem_alloc. */ omp_low_lat_mem_space, /* omp_thread_mem_alloc. */ + omp_default_mem_space, /* ompx_pinned_mem_alloc. */ }; enum gomp_memkind_kind @@ -512,8 +513,11 @@ retry: = (allocator_data ? allocator_data->memspace : predefined_alloc_mapping[allocator]); - ptr = MEMSPACE_ALLOC (memspace, new_size, -allocator_data && allocator_data->pinned); + int pinned __attribute__((unused)) + = (allocator_data + ? allocator_data->pinned + : allocator == ompx_pinned_mem_alloc); + ptr = MEMSPACE_ALLOC (memspace, new_size, pinned); } if (ptr == NULL) goto fail; @@ -534,7 +538,8 @@ retry: fail: int fallback = (allocator_data ? allocator_data->fallback - : allocator == omp_default_mem_alloc + : (allocator == omp_default_mem_alloc + || allocator == ompx_pinned_mem_alloc) ? omp_atv_null_fb : omp_atv_default_mem_fb); switch (fallback) @@ -653,6 +658,7 @@ omp_free (void *ptr, omp_allocator_handle_t allocator) #endif memspace = predefined_alloc_mapping[data->allocator]; + pinned = (data->allocator == ompx_pinned_mem_alloc); } MEMSPACE_FREE (memspace, data->ptr, data->size, pinned); @@ -802,8 +808,11 @@ retry: = (allocator_data ? allocator_data->memspace : predefined_alloc_mapping[allocator]); - ptr = MEMSPACE_CALLOC (memspace, new_size, - allocator_data && allocator_data->pinned); + int pinned __attribute__((unused)) + = (allocator_data + ? allocator_data->pinned + : allocator == ompx_pinned_mem_alloc); + ptr = MEMSPACE_CALLOC (memspace, new_size, pinned); } if (ptr == NULL) goto fail; @@ -824,7 +833,8 @@ retry: fail: int fallback = (allocator_data ? allocator_data->fallback - : allocator == omp_default_mem_alloc + : (allocator == omp_default_mem_alloc + || allocator == ompx_pinned_mem_alloc) ? 
omp_atv_null_fb : omp_atv_default_mem_fb); switch (fallback) @@ -1026,11 +1036,15 @@ retry: else #endif if (prev_size) - new_ptr = MEMSPACE_REALLOC (allocator_data->memspace, data->ptr, -data->size, new_size, -(free_allocator_data - && free_allocator_data->pinned), -allocator_data->pinned); + { + int was_pinned __attribute__((unused)) + = (free_allocator_data + ? free_allocator_data->pinned + : free_allocator == ompx_pinned_mem_alloc); + new_ptr = MEMSPACE_REALLOC (allocator_data->memspace, data->ptr, + data->size, new_size, was_pinned, + allocator_data->pinned); + } else new_ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size, allocator_data->pinned); @@ -1079,10 +1093,16 @@ retry: = (allocator_data ? allocator_data->memspace : predefined_alloc_mapping[allocator]); + int was_pinned __attribute__((unused)) + = (free_allocator_data
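For illustration (not part of the patch), a minimal C sketch of how the new allocator is expected to be used, in the spirit of the alloc-pinned-5/6 tests added above; because the allocator carries the null fallback trait, a failed pinned allocation yields NULL rather than falling back to ordinary heap memory:

#include <omp.h>
#include <stdio.h>

int
main (void)
{
  /* Request 1 MiB of pinned (page-locked) host memory.  */
  char *p = (char *) omp_alloc (1 << 20, ompx_pinned_mem_alloc);
  if (p == NULL)
    {
      /* Null fallback: pinning failed, e.g. because the locked-memory
         ulimit is too low.  */
      fprintf (stderr, "pinned allocation failed\n");
      return 1;
    }
  p[0] = 1;
  omp_free (p, ompx_pinned_mem_alloc);
  return 0;
}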
[PATCH 09/17] openmp: Use libgomp memory allocation functions with unified shared memory.
This patches changes calls to malloc/free/calloc/realloc and operator new to memory allocation functions in libgomp with allocator=ompx_unified_shared_mem_alloc. This helps existing code to benefit from the unified shared memory. The libgomp does the correct thing with all the mapping constructs and there is no memory copies if the pointer is pointing to unified shared memory. We only replace replacable new operator and not the class member or placement new. gcc/ChangeLog: * omp-low.cc (usm_transform): New function. (make_pass_usm_transform): Likewise. (class pass_usm_transform): New. * passes.def: Add pass_usm_transform. * tree-pass.h (make_pass_usm_transform): New declaration. gcc/testsuite/ChangeLog: * c-c++-common/gomp/usm-2.c: New test. * c-c++-common/gomp/usm-3.c: New test. * g++.dg/gomp/usm-1.C: New test. * g++.dg/gomp/usm-2.C: New test. * g++.dg/gomp/usm-3.C: New test. * gfortran.dg/gomp/usm-2.f90: New test. * gfortran.dg/gomp/usm-3.f90: New test. libgomp/ChangeLog: * testsuite/libgomp.c/usm-6.c: New test. * testsuite/libgomp.c++/usm-1.C: Likewise. co-authored-by: Andrew Stubbs --- gcc/omp-low.cc | 174 +++ gcc/passes.def | 1 + gcc/testsuite/c-c++-common/gomp/usm-2.c | 46 ++ gcc/testsuite/c-c++-common/gomp/usm-3.c | 44 ++ gcc/testsuite/g++.dg/gomp/usm-1.C| 32 + gcc/testsuite/g++.dg/gomp/usm-2.C| 30 gcc/testsuite/g++.dg/gomp/usm-3.C| 38 + gcc/testsuite/gfortran.dg/gomp/usm-2.f90 | 16 +++ gcc/testsuite/gfortran.dg/gomp/usm-3.f90 | 13 ++ gcc/tree-pass.h | 1 + libgomp/testsuite/libgomp.c++/usm-1.C| 54 +++ libgomp/testsuite/libgomp.c/usm-6.c | 92 12 files changed, 541 insertions(+) create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-2.c create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-3.c create mode 100644 gcc/testsuite/g++.dg/gomp/usm-1.C create mode 100644 gcc/testsuite/g++.dg/gomp/usm-2.C create mode 100644 gcc/testsuite/g++.dg/gomp/usm-3.C create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-2.f90 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-3.f90 create mode 100644 libgomp/testsuite/libgomp.c++/usm-1.C create mode 100644 libgomp/testsuite/libgomp.c/usm-6.c diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index ba612e5c67d..cdadd6f0c96 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -15097,6 +15097,180 @@ make_pass_diagnose_omp_blocks (gcc::context *ctxt) { return new pass_diagnose_omp_blocks (ctxt); } + +/* Provide transformation required for using unified shared memory + by replacing calls to standard memory allocation functions with + function provided by the libgomp. */ + +static tree +usm_transform (gimple_stmt_iterator *gsi_p, bool *, + struct walk_stmt_info *wi) +{ + gimple *stmt = gsi_stmt (*gsi_p); + /* ompx_unified_shared_mem_alloc is 10. 
*/ + const unsigned int unified_shared_mem_alloc = 10; + + switch (gimple_code (stmt)) +{ +case GIMPLE_CALL: + { + gcall *gs = as_a (stmt); + tree fndecl = gimple_call_fndecl (gs); + if (fndecl) + { + tree allocator = build_int_cst (pointer_sized_int_node, + unified_shared_mem_alloc); + const char *name = IDENTIFIER_POINTER (DECL_NAME (fndecl)); + if ((strcmp (name, "malloc") == 0) + || (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL) + && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_MALLOC) + || DECL_IS_REPLACEABLE_OPERATOR_NEW_P (fndecl) + || strcmp (name, "omp_target_alloc") == 0) + { + tree omp_alloc_type + = build_function_type_list (ptr_type_node, size_type_node, + pointer_sized_int_node, + NULL_TREE); + tree repl = build_fn_decl ("omp_alloc", omp_alloc_type); + tree size = gimple_call_arg (gs, 0); + gimple *g = gimple_build_call (repl, 2, size, allocator); + gimple_call_set_lhs (g, gimple_call_lhs (gs)); + gimple_set_location (g, gimple_location (stmt)); + gsi_replace (gsi_p, g, true); + } + else if (strcmp (name, "aligned_alloc") == 0) + { + /* May be we can also use this for new operator with + std::align_val_t parameter. */ + tree omp_alloc_type + = build_function_type_list (ptr_type_node, size_type_node, + size_type_node, + pointer_sized_int_node, + NULL_TREE); + tree repl = build_fn_decl ("omp_aligned_alloc", + omp_alloc_type); + tree align = gimple_call_arg (gs, 0); + tree size = gimple_call_arg (gs, 1); + gimple *g = gimple_build_call (repl, 3, align, size, + allocator); + gimple_call_set_lhs (g, gimple_call_lhs (gs)); + gimple_set_location (g, gimple_location (stmt)); + gsi_replace (gsi_p, g, true); + } + else if ((strcmp (name, "calloc&
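In effect the pass rewrites ordinary allocation calls as if the user had requested the unified-shared-memory allocator directly. A hedged before/after sketch (the wrapper functions below are purely illustrative; only the redirection of malloc to omp_alloc, and of aligned_alloc to omp_aligned_alloc, is taken from the patch):

#include <stdlib.h>
#include <omp.h>

/* Original user code, compiled with unified shared memory enabled.  */
void *
alloc_buffer (size_t n)
{
  return malloc (n);
}

/* What usm_transform effectively turns the call into; allocator 10 is
   ompx_unified_shared_mem_alloc.  aligned_alloc is redirected to
   omp_aligned_alloc in the same way.  */
void *
alloc_buffer_transformed (size_t n)
{
  return omp_alloc (n, ompx_unified_shared_mem_alloc);
}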
[PATCH 06/17] openmp: Add -foffload-memory
Add a new option. It's inactive until I add some follow-up patches. gcc/ChangeLog: * common.opt: Add -foffload-memory and its enum values. * coretypes.h (enum offload_memory): New. * doc/invoke.texi: Document -foffload-memory. --- gcc/common.opt | 16 gcc/coretypes.h | 7 +++ gcc/doc/invoke.texi | 16 +++- 3 files changed, 38 insertions(+), 1 deletion(-) diff --git a/gcc/common.opt b/gcc/common.opt index e7a51e882ba..8d76980fbbb 100644 --- a/gcc/common.opt +++ b/gcc/common.opt @@ -2213,6 +2213,22 @@ Enum(offload_abi) String(ilp32) Value(OFFLOAD_ABI_ILP32) EnumValue Enum(offload_abi) String(lp64) Value(OFFLOAD_ABI_LP64) +foffload-memory= +Common Joined RejectNegative Enum(offload_memory) Var(flag_offload_memory) Init(OFFLOAD_MEMORY_NONE) +-foffload-memory=[none|unified|pinned] Use an offload memory optimization. + +Enum +Name(offload_memory) Type(enum offload_memory) UnknownError(Unknown offload memory option %qs) + +EnumValue +Enum(offload_memory) String(none) Value(OFFLOAD_MEMORY_NONE) + +EnumValue +Enum(offload_memory) String(unified) Value(OFFLOAD_MEMORY_UNIFIED) + +EnumValue +Enum(offload_memory) String(pinned) Value(OFFLOAD_MEMORY_PINNED) + fomit-frame-pointer Common Var(flag_omit_frame_pointer) Optimization When possible do not generate stack frames. diff --git a/gcc/coretypes.h b/gcc/coretypes.h index 08b9ac9094c..dd52d5bb113 100644 --- a/gcc/coretypes.h +++ b/gcc/coretypes.h @@ -206,6 +206,13 @@ enum offload_abi { OFFLOAD_ABI_ILP32 }; +/* Types of memory optimization for an offload device. */ +enum offload_memory { + OFFLOAD_MEMORY_NONE, + OFFLOAD_MEMORY_UNIFIED, + OFFLOAD_MEMORY_PINNED +}; + /* Types of profile update methods. */ enum profile_update { PROFILE_UPDATE_SINGLE, diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index d5ff1018372..3df39bb06e3 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -202,7 +202,7 @@ in the following sections. -fno-builtin -fno-builtin-@var{function} -fcond-mismatch @gol -ffreestanding -fgimple -fgnu-tm -fgnu89-inline -fhosted @gol -flax-vector-conversions -fms-extensions @gol --foffload=@var{arg} -foffload-options=@var{arg} @gol +-foffload=@var{arg} -foffload-options=@var{arg} -foffload-memory=@var{arg} @gol -fopenacc -fopenacc-dim=@var{geom} @gol -fopenmp -fopenmp-simd @gol -fpermitted-flt-eval-methods=@var{standard} @gol @@ -2708,6 +2708,20 @@ Typical command lines are -foffload-options=amdgcn-amdhsa=-march=gfx906 -foffload-options=-lm @end smallexample +@item -foffload-memory=none +@itemx -foffload-memory=unified +@itemx -foffload-memory=pinned +@opindex foffload-memory +@cindex OpenMP offloading memory modes +Enable a memory optimization mode to use with OpenMP. The default behavior, +@option{-foffload-memory=none}, is to do nothing special (unless enabled via +a requires directive in the code). @option{-foffload-memory=unified} is +equivalent to @code{#pragma omp requires unified_shared_memory}. +@option{-foffload-memory=pinned} forces all host memory to be pinned (this +mode may require the user to increase the ulimit setting for locked memory). +All translation units must select the same setting to avoid undefined +behavior. + @item -fopenacc @opindex fopenacc @cindex OpenACC accelerator programming
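As a concrete example of the documented equivalence (a sketch, not part of the patch): building a file with 'gcc -fopenmp -foffload-memory=unified test.c' is intended to behave as if each translation unit contained

#pragma omp requires unified_shared_memory

whereas -foffload-memory=pinned has no directive equivalent and simply pins all host memory at start-up (see patch 08 later in this series).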
[PATCH 05/17] openmp, nvptx: ompx_unified_shared_mem_alloc
This adds support for using Cuda Managed Memory with omp_alloc. It will be used as the underpinnings for "requires unified_shared_memory" in a later patch. There are two new predefined allocators, ompx_unified_shared_mem_alloc and ompx_host_mem_alloc, plus corresponding memory spaces, which can be used to allocate memory in the "managed" space and explicitly on the host (it is intended that "malloc" will be intercepted by the compiler). The nvptx plugin is modified to make the necessary Cuda calls, and libgomp is modified to switch to shared-memory mode for USM allocated mappings. include/ChangeLog: * cuda/cuda.h (CUdevice_attribute): Add definitions for CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR and CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR. (CUmemAttach_flags): New. (CUpointer_attribute): New. (cuMemAllocManaged): New prototype. (cuPointerGetAttribute): New prototype. libgomp/ChangeLog: * allocator.c (omp_max_predefined_alloc): Update. (omp_aligned_alloc): Don't fallback ompx_host_mem_alloc. (omp_aligned_calloc): Likewise. (omp_realloc): Likewise. * config/linux/allocator.c (linux_memspace_alloc): Handle USM. (linux_memspace_calloc): Handle USM. (linux_memspace_free): Handle USM. (linux_memspace_realloc): Handle USM. * config/nvptx/allocator.c (nvptx_memspace_alloc): Reject ompx_host_mem_alloc. (nvptx_memspace_calloc): Likewise. (nvptx_memspace_realloc): Likewise. * libgomp-plugin.h (GOMP_OFFLOAD_usm_alloc): New prototype. (GOMP_OFFLOAD_usm_free): New prototype. (GOMP_OFFLOAD_is_usm_ptr): New prototype. * libgomp.h (gomp_usm_alloc): New prototype. (gomp_usm_free): New prototype. (gomp_is_usm_ptr): New prototype. (struct gomp_device_descr): Add USM functions. * omp.h.in (omp_memspace_handle_t): Add ompx_unified_shared_mem_space and ompx_host_mem_space. (omp_allocator_handle_t): Add ompx_unified_shared_mem_alloc and ompx_host_mem_alloc. * omp_lib.f90.in: Likewise. * plugin/cuda-lib.def (cuMemAllocManaged): Add new call. (cuPointerGetAttribute): Likewise. * plugin/plugin-nvptx.c (nvptx_alloc): Add "usm" parameter. Call cuMemAllocManaged as appropriate. (GOMP_OFFLOAD_get_num_devices): Allow GOMP_REQUIRES_UNIFIED_ADDRESS and GOMP_REQUIRES_UNIFIED_SHARED_MEMORY. (GOMP_OFFLOAD_alloc): Move internals to ... (GOMP_OFFLOAD_alloc_1): ... this, and add usm parameter. (GOMP_OFFLOAD_usm_alloc): New function. (GOMP_OFFLOAD_usm_free): New function. (GOMP_OFFLOAD_is_usm_ptr): New function. * target.c (gomp_map_vars_internal): Add USM support. (gomp_usm_alloc): New function. (gomp_usm_free): New function. (gomp_load_plugin_for_device): New function. * testsuite/libgomp.c/usm-1.c: New test. * testsuite/libgomp.c/usm-2.c: New test. * testsuite/libgomp.c/usm-3.c: New test. * testsuite/libgomp.c/usm-4.c: New test. * testsuite/libgomp.c/usm-5.c: New test. co-authored-by: Kwok Cheung Yeung squash! 
openmp, nvptx: ompx_unified_shared_mem_alloc --- include/cuda/cuda.h | 12 ++ libgomp/allocator.c | 13 -- libgomp/config/linux/allocator.c| 48 ++ libgomp/config/nvptx/allocator.c| 6 +++ libgomp/libgomp-plugin.h| 3 ++ libgomp/libgomp.h | 6 +++ libgomp/omp.h.in| 4 ++ libgomp/omp_lib.f90.in | 8 libgomp/plugin/cuda-lib.def | 2 + libgomp/plugin/plugin-nvptx.c | 47 ++--- libgomp/target.c| 64 + libgomp/testsuite/libgomp.c/usm-1.c | 24 +++ libgomp/testsuite/libgomp.c/usm-2.c | 32 +++ libgomp/testsuite/libgomp.c/usm-3.c | 35 libgomp/testsuite/libgomp.c/usm-4.c | 36 libgomp/testsuite/libgomp.c/usm-5.c | 28 + 16 files changed, 340 insertions(+), 28 deletions(-) create mode 100644 libgomp/testsuite/libgomp.c/usm-1.c create mode 100644 libgomp/testsuite/libgomp.c/usm-2.c create mode 100644 libgomp/testsuite/libgomp.c/usm-3.c create mode 100644 libgomp/testsuite/libgomp.c/usm-4.c create mode 100644 libgomp/testsuite/libgomp.c/usm-5.c diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h index 3938d05d150..8135e7c9247 100644 --- a/include/cuda/cuda.h +++ b/include/cuda/cuda.h @@ -77,9 +77,19 @@ typedef enum { CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39, CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40, + CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75, + CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
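A rough C sketch of the kind of program the new libgomp.c/usm-*.c tests exercise (the exact test contents are not reproduced here; the zero-copy behaviour for USM mappings is as described above):

#include <omp.h>
#include <stdlib.h>

int
main (void)
{
  int n = 1024;
  int *a = omp_alloc (n * sizeof (int), ompx_unified_shared_mem_alloc);
  if (a == NULL)
    return 0;  /* No managed-memory allocation available.  */

  for (int i = 0; i < n; i++)
    a[i] = i;

  /* The buffer lives in managed (unified) memory, so libgomp switches the
     mapping to shared-memory mode and no host/device copy is performed.  */
  #pragma omp target map(tofrom: a[0:n])
  for (int i = 0; i < n; i++)
    a[i] *= 2;

  for (int i = 0; i < n; i++)
    if (a[i] != 2 * i)
      abort ();

  omp_free (a, ompx_unified_shared_mem_alloc);
  return 0;
}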
[PATCH 12/17] Handle cleanup of omp allocated variables (OpenMP 5.0).
Currently we are only handling omp allocate directive that is associated with an allocate statement. This statement results in malloc and free calls. The malloc calls are easy to get to as they are in the same block as allocate directive. But the free calls come in a separate cleanup block. To help any later passes finding them, an allocate directive is generated in the cleanup block with kind=free. The normal allocate directive is given kind=allocate. gcc/fortran/ChangeLog: * gfortran.h (struct access_ref): Declare new members omp_allocated and omp_allocated_end. * openmp.cc (gfc_match_omp_allocate): Set new_st.resolved_sym to NULL. (prepare_omp_allocated_var_list_for_cleanup): New function. (gfc_resolve_omp_allocate): Call it. * trans-decl.cc (gfc_trans_deferred_vars): Process omp_allocated. * trans-openmp.cc (gfc_trans_omp_allocate): Set kind for the stmt generated for allocate directive. gcc/ChangeLog: * tree-core.h (struct tree_base): Add comments. * tree-pretty-print.cc (dump_generic_node): Handle allocate directive kind. * tree.h (OMP_ALLOCATE_KIND_ALLOCATE): New define. (OMP_ALLOCATE_KIND_FREE): Likewise. gcc/testsuite/ChangeLog: * gfortran.dg/gomp/allocate-6.f90: Test kind of allocate directive. --- gcc/fortran/gfortran.h| 1 + gcc/fortran/openmp.cc | 30 +++ gcc/fortran/trans-decl.cc | 20 + gcc/fortran/trans-openmp.cc | 6 gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 | 3 +- gcc/tree-core.h | 6 gcc/tree-pretty-print.cc | 4 +++ gcc/tree.h| 4 +++ 8 files changed, 73 insertions(+), 1 deletion(-) diff --git a/gcc/fortran/gfortran.h b/gcc/fortran/gfortran.h index 755469185a6..c6f58341cf3 100644 --- a/gcc/fortran/gfortran.h +++ b/gcc/fortran/gfortran.h @@ -1829,6 +1829,7 @@ typedef struct gfc_symbol gfc_array_spec *as; struct gfc_symbol *result; /* function result symbol */ gfc_component *components; /* Derived type components */ + gfc_omp_namelist *omp_allocated, *omp_allocated_end; /* Defined only for Cray pointees; points to their pointer. 
*/ struct gfc_symbol *cp_pointer; diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc index 38003890bb0..4c94bc763b5 100644 --- a/gcc/fortran/openmp.cc +++ b/gcc/fortran/openmp.cc @@ -6057,6 +6057,7 @@ gfc_match_omp_allocate (void) new_st.op = EXEC_OMP_ALLOCATE; new_st.ext.omp_clauses = c; + new_st.resolved_sym = NULL; gfc_free_expr (allocator); return MATCH_YES; } @@ -9548,6 +9549,34 @@ gfc_resolve_oacc_routines (gfc_namespace *ns) } } +static void +prepare_omp_allocated_var_list_for_cleanup (gfc_omp_namelist *cn, locus loc) +{ + gfc_symbol *proc = cn->sym->ns->proc_name; + gfc_omp_namelist *p, *n; + + for (n = cn; n; n = n->next) +{ + if (n->sym->attr.allocatable && !n->sym->attr.save + && !n->sym->attr.result && !proc->attr.is_main_program) + { + p = gfc_get_omp_namelist (); + p->sym = n->sym; + p->expr = gfc_copy_expr (n->expr); + p->where = loc; + p->next = NULL; + if (proc->omp_allocated == NULL) + proc->omp_allocated_end = proc->omp_allocated = p; + else + { + proc->omp_allocated_end->next = p; + proc->omp_allocated_end = p; + } + + } +} +} + static void check_allocate_directive_restrictions (gfc_symbol *sym, gfc_expr *omp_al, gfc_namespace *ns, locus loc) @@ -9678,6 +9707,7 @@ gfc_resolve_omp_allocate (gfc_code *code, gfc_namespace *ns) code->loc); } } + prepare_omp_allocated_var_list_for_cleanup (cn, code->loc); } diff --git a/gcc/fortran/trans-decl.cc b/gcc/fortran/trans-decl.cc index 6493cc2f6b1..326365f22fc 100644 --- a/gcc/fortran/trans-decl.cc +++ b/gcc/fortran/trans-decl.cc @@ -4588,6 +4588,26 @@ gfc_trans_deferred_vars (gfc_symbol * proc_sym, gfc_wrapped_block * block) } } + /* Generate a dummy allocate pragma with free kind so that cleanup + of those variables which were allocated using the allocate statement + associated with an allocate clause happens correctly. */ + + if (proc_sym->omp_allocated) +{ + gfc_clear_new_st (); + new_st.op = EXEC_OMP_ALLOCATE; + gfc_omp_clauses *c = gfc_get_omp_clauses (); + c->lists[OMP_LIST_ALLOCATOR] = proc_sym->omp_allocated; + new_st.ext.omp_clauses = c; + /* This is just a hacky way to convey to handler that we are + dealing with cleanup here. Saves us from using another field + for it. */ + new_st.resolved_sym = proc_sym->omp_allocated->sym; + gfc_add_init_cleanup (block, NULL, + gfc_trans_omp_directive (&new_st)); + gfc_free_omp_clauses (c); + proc_sym->omp_allocated = NULL; +} /* Initialize the INTENT(OUT) derived type dummy argu
[PATCH 07/17] openmp: allow requires unified_shared_memory
This is the front-end portion of the Unified Shared Memory implementation. It removes the "sorry, unimplemented message" in C, C++, and Fortran, and sets flag_offload_memory, but is otherwise inactive, for now. It also checks that -foffload-memory isn't set to an incompatible mode. gcc/c/ChangeLog: * c-parser.cc (c_parser_omp_requires): Allow "requires unified_share_memory" and "unified_address". gcc/cp/ChangeLog: * parser.cc (cp_parser_omp_requires): Allow "requires unified_share_memory" and "unified_address". gcc/fortran/ChangeLog: * openmp.cc (gfc_match_omp_requires): Allow "requires unified_share_memory" and "unified_address". gcc/testsuite/ChangeLog: * c-c++-common/gomp/usm-1.c: New test. * c-c++-common/gomp/usm-4.c: New test. * gfortran.dg/gomp/usm-1.f90: New test. * gfortran.dg/gomp/usm-4.f90: New test. --- gcc/c/c-parser.cc| 22 -- gcc/cp/parser.cc | 22 -- gcc/fortran/openmp.cc| 13 + gcc/testsuite/c-c++-common/gomp/usm-1.c | 4 gcc/testsuite/c-c++-common/gomp/usm-4.c | 4 gcc/testsuite/gfortran.dg/gomp/usm-1.f90 | 6 ++ gcc/testsuite/gfortran.dg/gomp/usm-4.f90 | 6 ++ 7 files changed, 73 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-1.c create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-4.c create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-1.f90 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-4.f90 diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc index 9c02141e2c6..c30f67cd2da 100644 --- a/gcc/c/c-parser.cc +++ b/gcc/c/c-parser.cc @@ -22726,9 +22726,27 @@ c_parser_omp_requires (c_parser *parser) enum omp_requires this_req = (enum omp_requires) 0; if (!strcmp (p, "unified_address")) - this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + { + this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "% is incompatible with the " + "selected %<-foffload-memory%> option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "unified_shared_memory")) - this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + { + this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "% is incompatible with the " + "selected %<-foffload-memory%> option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "dynamic_allocators")) this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS; else if (!strcmp (p, "reverse_offload")) diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc index df657a3fb2b..3deafc7c928 100644 --- a/gcc/cp/parser.cc +++ b/gcc/cp/parser.cc @@ -46860,9 +46860,27 @@ cp_parser_omp_requires (cp_parser *parser, cp_token *pragma_tok) enum omp_requires this_req = (enum omp_requires) 0; if (!strcmp (p, "unified_address")) - this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + { + this_req = OMP_REQUIRES_UNIFIED_ADDRESS; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "% is incompatible with the " + "selected %<-foffload-memory%> option"); + flag_offload_memory = OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "unified_shared_memory")) - this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + { + this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + error_at (cloc, + "% is incompatible with the " + "selected %<-foffload-memory%> option"); + flag_offload_memory = 
OFFLOAD_MEMORY_UNIFIED; + } else if (!strcmp (p, "dynamic_allocators")) this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS; else if (!strcmp (p, "reverse_offload")) diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc index bd4ff259fe0..91bf8a3c50d 100644 --- a/gcc/fortran/openmp.cc +++ b/gcc/fortran/openmp.cc @@ -29,6 +29,7 @@ along with GCC; see the file COPYING3. If not see #include "diagnostic.h" #include "gomp-constants.h" #include "target-memory.h" /* For gfc_encode_character. */ +#include "options.h" /* Match an end of OpenMP directive. End of OpenMP directive is optional whitespace, followed by '\n' or comment '!'. */ @@ -5556,6 +5557,12 @@ gfc_match_omp_requires (void) requires_clause = OMP_REQ_UNIFIED_ADDRESS; if (requires_clauses & OMP_REQ_UNIFIED_ADDRESS) goto duplicate_clause; + + if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED + && flag_offload_memory != OFFLOAD_MEMORY_NONE) + gfc_error_now ("unified_address
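For illustration, the user-visible effect of this patch (a sketch; the diagnostic text is taken from the error calls above):

/* Previously rejected with "sorry, unimplemented"; now accepted and
   switches the compilation to unified-shared-memory mode.  */
#pragma omp requires unified_shared_memory

int x;

/* Compiling the same file with -foffload-memory=pinned is diagnosed:
   "unified_shared_memory is incompatible with the selected
   -foffload-memory option".  */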
[PATCH 11/17] Translate allocate directive (OpenMP 5.0).
gcc/fortran/ChangeLog: * trans-openmp.cc (gfc_trans_omp_clauses): Handle OMP_LIST_ALLOCATOR. (gfc_trans_omp_allocate): New function. (gfc_trans_omp_directive): Handle EXEC_OMP_ALLOCATE. gcc/ChangeLog: * tree-pretty-print.cc (dump_omp_clause): Handle OMP_CLAUSE_ALLOCATOR. (dump_generic_node): Handle OMP_ALLOCATE. * tree.def (OMP_ALLOCATE): New. * tree.h (OMP_ALLOCATE_CLAUSES): Likewise. (OMP_ALLOCATE_DECL): Likewise. (OMP_ALLOCATE_ALLOCATOR): Likewise. * tree.cc (omp_clause_num_ops): Add entry for OMP_CLAUSE_ALLOCATOR. gcc/testsuite/ChangeLog: * gfortran.dg/gomp/allocate-6.f90: New test. --- gcc/fortran/trans-openmp.cc | 44 gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 | 72 +++ gcc/tree-core.h | 3 + gcc/tree-pretty-print.cc | 19 + gcc/tree.cc | 1 + gcc/tree.def | 4 ++ gcc/tree.h| 11 +++ 7 files changed, 154 insertions(+) create mode 100644 gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 diff --git a/gcc/fortran/trans-openmp.cc b/gcc/fortran/trans-openmp.cc index de27ed52c02..3ee63e416ed 100644 --- a/gcc/fortran/trans-openmp.cc +++ b/gcc/fortran/trans-openmp.cc @@ -2728,6 +2728,28 @@ gfc_trans_omp_clauses (stmtblock_t *block, gfc_omp_clauses *clauses, } } break; + case OMP_LIST_ALLOCATOR: + for (; n != NULL; n = n->next) + if (n->sym->attr.referenced) + { + tree t = gfc_trans_omp_variable (n->sym, false); + if (t != error_mark_node) + { + tree node = build_omp_clause (input_location, + OMP_CLAUSE_ALLOCATOR); + OMP_ALLOCATE_DECL (node) = t; + if (n->expr) + { + tree allocator_; + gfc_init_se (&se, NULL); + gfc_conv_expr (&se, n->expr); + allocator_ = gfc_evaluate_now (se.expr, block); + OMP_ALLOCATE_ALLOCATOR (node) = allocator_; + } + omp_clauses = gfc_trans_add_clause (node, omp_clauses); + } + } + break; case OMP_LIST_LINEAR: { gfc_expr *last_step_expr = NULL; @@ -4982,6 +5004,26 @@ gfc_trans_omp_atomic (gfc_code *code) return gfc_finish_block (&block); } +static tree +gfc_trans_omp_allocate (gfc_code *code) +{ + stmtblock_t block; + tree stmt; + + gfc_omp_clauses *clauses = code->ext.omp_clauses; + gcc_assert (clauses); + + gfc_start_block (&block); + stmt = make_node (OMP_ALLOCATE); + TREE_TYPE (stmt) = void_type_node; + OMP_ALLOCATE_CLAUSES (stmt) = gfc_trans_omp_clauses (&block, clauses, + code->loc, false, + true); + gfc_add_expr_to_block (&block, stmt); + gfc_merge_block_scope (&block); + return gfc_finish_block (&block); +} + static tree gfc_trans_omp_barrier (void) { @@ -7488,6 +7530,8 @@ gfc_trans_omp_directive (gfc_code *code) { switch (code->op) { +case EXEC_OMP_ALLOCATE: + return gfc_trans_omp_allocate (code); case EXEC_OMP_ATOMIC: return gfc_trans_omp_atomic (code); case EXEC_OMP_BARRIER: diff --git a/gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 b/gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 new file mode 100644 index 000..2de2b52ee44 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 @@ -0,0 +1,72 @@ +! { dg-do compile } +! 
{ dg-additional-options "-fdump-tree-original" } + +module omp_lib_kinds + use iso_c_binding, only: c_int, c_intptr_t + implicit none + private :: c_int, c_intptr_t + integer, parameter :: omp_allocator_handle_kind = c_intptr_t + + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_null_allocator = 0 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_default_mem_alloc = 1 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_large_cap_mem_alloc = 2 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_const_mem_alloc = 3 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_high_bw_mem_alloc = 4 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_low_lat_mem_alloc = 5 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_cgroup_mem_alloc = 6 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_pteam_mem_alloc = 7 + integer (kind=omp_allocator_handle_kind), & + parameter :: omp_thread_mem_alloc = 8 +end module + + +subroutine foo(x, y, al) + use omp_lib_kinds + implicit none + +type :: my_type + integer :: i + integer :: j + real :: x +end type + + integer :: x + integer :: y + integer (kind=omp_allocator_handle_kind) :: al + + integer, allocatable :: var1 + integer, allocatable :: var2 + real, allocatable :: var3(:,:) + type (my_type), allocatable :: var4 + integer, pointer :: pii, parr(:) + + character, allocatable :: str1a, str1aarr(:) + character(len=5), allocatable :: str5a, str5aarr(:) + + !$
[PATCH 14/17] Lower allocate directive (OpenMP 5.0).
This patch looks for malloc/free calls that were generated by allocate statement that is associated with allocate directive and replaces them with GOMP_alloc and GOMP_free. gcc/ChangeLog: * omp-low.cc (scan_sharing_clauses): Handle OMP_CLAUSE_ALLOCATOR. (scan_omp_allocate): New. (scan_omp_1_stmt): Call it. (lower_omp_allocate): New function. (lower_omp_1): Call it. gcc/testsuite/ChangeLog: * gfortran.dg/gomp/allocate-6.f90: Add tests. * gfortran.dg/gomp/allocate-7.f90: New test. * gfortran.dg/gomp/allocate-8.f90: New test. libgomp/ChangeLog: * testsuite/libgomp.fortran/allocate-2.f90: New test. --- gcc/omp-low.cc| 139 ++ gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 | 9 ++ gcc/testsuite/gfortran.dg/gomp/allocate-7.f90 | 13 ++ gcc/testsuite/gfortran.dg/gomp/allocate-8.f90 | 15 ++ .../testsuite/libgomp.fortran/allocate-2.f90 | 48 ++ 5 files changed, 224 insertions(+) create mode 100644 gcc/testsuite/gfortran.dg/gomp/allocate-7.f90 create mode 100644 gcc/testsuite/gfortran.dg/gomp/allocate-8.f90 create mode 100644 libgomp/testsuite/libgomp.fortran/allocate-2.f90 diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index cdadd6f0c96..7d1a2a0d795 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -1746,6 +1746,7 @@ scan_sharing_clauses (tree clauses, omp_context *ctx) case OMP_CLAUSE_FINALIZE: case OMP_CLAUSE_TASK_REDUCTION: case OMP_CLAUSE_ALLOCATE: + case OMP_CLAUSE_ALLOCATOR: break; case OMP_CLAUSE_ALIGNED: @@ -1963,6 +1964,7 @@ scan_sharing_clauses (tree clauses, omp_context *ctx) case OMP_CLAUSE_FINALIZE: case OMP_CLAUSE_FILTER: case OMP_CLAUSE__CONDTEMP_: + case OMP_CLAUSE_ALLOCATOR: break; case OMP_CLAUSE__CACHE_: @@ -3033,6 +3035,16 @@ scan_omp_simd_scan (gimple_stmt_iterator *gsi, gomp_for *stmt, maybe_lookup_ctx (new_stmt)->for_simd_scan_phase = true; } +/* Scan an OpenMP allocate directive. */ + +static void +scan_omp_allocate (gomp_allocate *stmt, omp_context *outer_ctx) +{ + omp_context *ctx; + ctx = new_omp_context (stmt, outer_ctx); + scan_sharing_clauses (gimple_omp_allocate_clauses (stmt), ctx); +} + /* Scan an OpenMP sections directive. */ static void @@ -4332,6 +4344,9 @@ scan_omp_1_stmt (gimple_stmt_iterator *gsi, bool *handled_ops_p, insert_decl_map (&ctx->cb, var, var); } break; +case GIMPLE_OMP_ALLOCATE: + scan_omp_allocate (as_a (stmt), ctx); + break; default: *handled_ops_p = false; break; @@ -8768,6 +8783,125 @@ lower_omp_single_simple (gomp_single *single_stmt, gimple_seq *pre_p) gimple_seq_add_stmt (pre_p, gimple_build_label (flabel)); } +static void +lower_omp_allocate (gimple_stmt_iterator *gsi_p, omp_context *ctx) +{ + gomp_allocate *st = as_a (gsi_stmt (*gsi_p)); + tree clauses = gimple_omp_allocate_clauses (st); + int kind = gimple_omp_allocate_kind (st); + gcc_assert (kind == GF_OMP_ALLOCATE_KIND_ALLOCATE + || kind == GF_OMP_ALLOCATE_KIND_FREE); + + for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) +{ + if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_ALLOCATOR) + continue; + + bool allocate = (kind == GF_OMP_ALLOCATE_KIND_ALLOCATE); + /* The allocate directives that appear in a target region must specify + an allocator clause unless a requires directive with the + dynamic_allocators clause is present in the same compilation unit. 
*/ + if (OMP_ALLOCATE_ALLOCATOR (c) == NULL_TREE + && ((omp_requires_mask & OMP_REQUIRES_DYNAMIC_ALLOCATORS) == 0) + && omp_maybe_offloaded_ctx (ctx)) + error_at (OMP_CLAUSE_LOCATION (c), "% directive must" + " specify an allocator here"); + + tree var = OMP_ALLOCATE_DECL (c); + + gimple_stmt_iterator gsi = *gsi_p; + for (gsi_next (&gsi); !gsi_end_p (gsi); gsi_next (&gsi)) + { + gimple *stmt = gsi_stmt (gsi); + + if (gimple_code (stmt) != GIMPLE_CALL + || (allocate && gimple_call_fndecl (stmt) + != builtin_decl_explicit (BUILT_IN_MALLOC)) + || (!allocate && gimple_call_fndecl (stmt) + != builtin_decl_explicit (BUILT_IN_FREE))) + continue; + const gcall *gs = as_a (stmt); + tree allocator = OMP_ALLOCATE_ALLOCATOR (c) + ? OMP_ALLOCATE_ALLOCATOR (c) + : integer_zero_node; + if (allocate) + { + tree lhs = gimple_call_lhs (gs); + if (lhs && TREE_CODE (lhs) == SSA_NAME) + { + gimple_stmt_iterator gsi2 = gsi; + gsi_next (&gsi2); + gimple *assign = gsi_stmt (gsi2); + if (gimple_code (assign) == GIMPLE_ASSIGN) + { + lhs = gimple_assign_lhs (as_a (assign)); + if (lhs == NULL_TREE + || TREE_CODE (lhs) != COMPONENT_REF) + continue; + lhs = TREE_OPERAND (lhs, 0); + } + } + + if (lhs == var) + { + unsigned HOST_WIDE_INT ialign = 0; + tree align; + if (TYPE_P (var)) + ialign = TYPE_ALIGN_UNIT (var); + else + ialign = DECL_ALIGN_UNIT (var);
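At the C level the rewrite amounts to the following sketch (the wrapper functions are purely illustrative; the GOMP_alloc/GOMP_free argument order is as in the existing libgomp entry points):

#include <stdlib.h>
#include <stdint.h>

extern void *GOMP_alloc (size_t alignment, size_t size, uintptr_t allocator);
extern void GOMP_free (void *ptr, uintptr_t allocator);

/* Shape of what the Fortran front end emits for the ALLOCATE statement
   associated with the directive ...  */
void
allocate_before (void **p, size_t size)
{
  *p = malloc (size);
  /* ... */
  free (*p);
}

/* ... and what lower_omp_allocate turns it into, using the allocator from
   the directive's clause and the alignment of the allocated variable.  */
void
allocate_after (void **p, size_t size, size_t align, uintptr_t allocator)
{
  *p = GOMP_alloc (align, size, allocator);
  /* ... */
  GOMP_free (*p, allocator);
}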
[PATCH 08/17] openmp: -foffload-memory=pinned
Implement the -foffload-memory=pinned option such that libgomp is instructed to enable fully-pinned memory at start-up. The option is intended to provide a performance boost to certain offload programs without modifying the code. This feature only works on Linux, at present, and simply calls mlockall to enable always-on memory pinning. It requires that the ulimit feature is set high enough to accommodate all the program's memory usage. In this mode the ompx_pinned_memory_alloc feature is disabled as it is not needed and may conflict. gcc/ChangeLog: * omp-builtins.def (BUILT_IN_GOMP_ENABLE_PINNED_MODE): New. * omp-low.cc (omp_enable_pinned_mode): New function. (execute_lower_omp): Call omp_enable_pinned_mode. libgomp/ChangeLog: * config/linux/allocator.c (always_pinned_mode): New variable. (GOMP_enable_pinned_mode): New function. (linux_memspace_alloc): Disable pinning when always_pinned_mode set. (linux_memspace_calloc): Likewise. (linux_memspace_free): Likewise. (linux_memspace_realloc): Likewise. * libgomp.map: Add GOMP_enable_pinned_mode. * testsuite/libgomp.c/alloc-pinned-7.c: New test. gcc/testsuite/ChangeLog: * c-c++-common/gomp/alloc-pinned-1.c: New test. --- gcc/omp-builtins.def | 3 + gcc/omp-low.cc| 66 +++ .../c-c++-common/gomp/alloc-pinned-1.c| 28 libgomp/config/linux/allocator.c | 26 libgomp/libgomp.map | 1 + libgomp/target.c | 4 +- libgomp/testsuite/libgomp.c/alloc-pinned-7.c | 63 ++ 7 files changed, 190 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-7.c diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def index ee5213eedcf..276dd7812f2 100644 --- a/gcc/omp-builtins.def +++ b/gcc/omp-builtins.def @@ -470,3 +470,6 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_WARNING, "GOMP_warning", BT_FN_VOID_CONST_PTR_SIZE, ATTR_NOTHROW_LEAF_LIST) DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ERROR, "GOMP_error", BT_FN_VOID_CONST_PTR_SIZE, ATTR_COLD_NORETURN_NOTHROW_LEAF_LIST) +DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ENABLE_PINNED_MODE, + "GOMP_enable_pinned_mode", + BT_FN_VOID, ATTR_NOTHROW_LIST) diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index d73c165f029..ba612e5c67d 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -14620,6 +14620,68 @@ lower_omp (gimple_seq *body, omp_context *ctx) input_location = saved_location; } +/* Emit a constructor function to enable -foffload-memory=pinned + at runtime. Libgomp handles the OS mode setting, but we need to trigger + it by calling GOMP_enable_pinned mode before the program proper runs. 
*/ + +static void +omp_enable_pinned_mode () +{ + static bool visited = false; + if (visited) +return; + visited = true; + + /* Create a new function like this: + + static void __attribute__((constructor)) + __set_pinned_mode () + { + GOMP_enable_pinned_mode (); + } + */ + + tree name = get_identifier ("__set_pinned_mode"); + tree voidfntype = build_function_type_list (void_type_node, NULL_TREE); + tree decl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL, name, voidfntype); + + TREE_STATIC (decl) = 1; + TREE_USED (decl) = 1; + DECL_ARTIFICIAL (decl) = 1; + DECL_IGNORED_P (decl) = 0; + TREE_PUBLIC (decl) = 0; + DECL_UNINLINABLE (decl) = 1; + DECL_EXTERNAL (decl) = 0; + DECL_CONTEXT (decl) = NULL_TREE; + DECL_INITIAL (decl) = make_node (BLOCK); + BLOCK_SUPERCONTEXT (DECL_INITIAL (decl)) = decl; + DECL_STATIC_CONSTRUCTOR (decl) = 1; + DECL_ATTRIBUTES (decl) = tree_cons (get_identifier ("constructor"), + NULL_TREE, NULL_TREE); + + tree t = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL_TREE, + void_type_node); + DECL_ARTIFICIAL (t) = 1; + DECL_IGNORED_P (t) = 1; + DECL_CONTEXT (t) = decl; + DECL_RESULT (decl) = t; + + push_struct_function (decl); + init_tree_ssa (cfun); + + tree calldecl = builtin_decl_explicit (BUILT_IN_GOMP_ENABLE_PINNED_MODE); + gcall *call = gimple_build_call (calldecl, 0); + + gimple_seq seq = NULL; + gimple_seq_add_stmt (&seq, call); + gimple_set_body (decl, gimple_build_bind (NULL_TREE, seq, NULL)); + + cfun->function_end_locus = UNKNOWN_LOCATION; + cfun->curr_properties |= PROP_gimple_any; + pop_cfun (); + cgraph_node::add_new_function (decl, true); +} + /* Main entry point. */ static unsigned int @@ -14676,6 +14738,10 @@ execute_lower_omp (void) for (auto task_stmt : task_cpyfns) finalize_task_copyfn (task_stmt); task_cpyfns.release (); + + if (flag_offload_memory == OFFLOAD_MEMORY_PINNED) +omp_enable_pinned_mode (); + return 0; } diff --git a/gcc/testsuite/c-c++-common/gomp/alloc-pinned-1.c b/gcc/testsuite/c-c++-common/gomp/alloc-pi
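Spelled out, the constructor built by omp_enable_pinned_mode corresponds to the C code already quoted in its comment; running a program built this way normally also needs the locked-memory limit raised (e.g. 'ulimit -l unlimited'), as noted above:

extern void GOMP_enable_pinned_mode (void);

static void __attribute__((constructor))
__set_pinned_mode (void)
{
  GOMP_enable_pinned_mode ();
}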
[PATCH 13/17] Gimplify allocate directive (OpenMP 5.0).
gcc/ChangeLog: * doc/gimple.texi: Describe GIMPLE_OMP_ALLOCATE. * gimple-pretty-print.cc (dump_gimple_omp_allocate): New function. (pp_gimple_stmt_1): Call it. * gimple.cc (gimple_build_omp_allocate): New function. * gimple.def (GIMPLE_OMP_ALLOCATE): New node. * gimple.h (enum gf_mask): Add GF_OMP_ALLOCATE_KIND_MASK, GF_OMP_ALLOCATE_KIND_ALLOCATE and GF_OMP_ALLOCATE_KIND_FREE. (struct gomp_allocate): New. (is_a_helper ::test): New. (is_a_helper ::test): New. (gimple_build_omp_allocate): Declare. (gimple_omp_subcode): Replace GIMPLE_OMP_TEAMS with GIMPLE_OMP_ALLOCATE. (gimple_omp_allocate_set_clauses): New. (gimple_omp_allocate_set_kind): Likewise. (gimple_omp_allocate_clauses): Likewise. (gimple_omp_allocate_kind): Likewise. (CASE_GIMPLE_OMP): Add GIMPLE_OMP_ALLOCATE. * gimplify.cc (gimplify_omp_allocate): New. (gimplify_expr): Call it. * gsstruct.def (GSS_OMP_ALLOCATE): Define. gcc/testsuite/ChangeLog: * gfortran.dg/gomp/allocate-6.f90: Add tests. --- gcc/doc/gimple.texi | 38 +++- gcc/gimple-pretty-print.cc| 37 gcc/gimple.cc | 12 gcc/gimple.def| 6 ++ gcc/gimple.h | 60 ++- gcc/gimplify.cc | 19 ++ gcc/gsstruct.def | 1 + gcc/testsuite/gfortran.dg/gomp/allocate-6.f90 | 4 +- 8 files changed, 173 insertions(+), 4 deletions(-) diff --git a/gcc/doc/gimple.texi b/gcc/doc/gimple.texi index dd9149377f3..67b9061f3a7 100644 --- a/gcc/doc/gimple.texi +++ b/gcc/doc/gimple.texi @@ -420,6 +420,9 @@ kinds, along with their relationships to @code{GSS_} values (layouts) and + gomp_continue |layout: GSS_OMP_CONTINUE, code: GIMPLE_OMP_CONTINUE | + + gomp_allocate + |layout: GSS_OMP_ALLOCATE, code: GIMPLE_OMP_ALLOCATE + | + gomp_atomic_load |layout: GSS_OMP_ATOMIC_LOAD, code: GIMPLE_OMP_ATOMIC_LOAD | @@ -454,6 +457,7 @@ The following table briefly describes the GIMPLE instruction set. @item @code{GIMPLE_GOTO} @tab x @tab x @item @code{GIMPLE_LABEL} @tab x @tab x @item @code{GIMPLE_NOP} @tab x @tab x +@item @code{GIMPLE_OMP_ALLOCATE} @tab x @tab x @item @code{GIMPLE_OMP_ATOMIC_LOAD} @tab x @tab x @item @code{GIMPLE_OMP_ATOMIC_STORE} @tab x @tab x @item @code{GIMPLE_OMP_CONTINUE} @tab x @tab x @@ -1029,6 +1033,7 @@ Return a deep copy of statement @code{STMT}. * @code{GIMPLE_LABEL}:: * @code{GIMPLE_GOTO}:: * @code{GIMPLE_NOP}:: +* @code{GIMPLE_OMP_ALLOCATE}:: * @code{GIMPLE_OMP_ATOMIC_LOAD}:: * @code{GIMPLE_OMP_ATOMIC_STORE}:: * @code{GIMPLE_OMP_CONTINUE}:: @@ -1729,6 +1734,38 @@ Build a @code{GIMPLE_NOP} statement. Returns @code{TRUE} if statement @code{G} is a @code{GIMPLE_NOP}. @end deftypefn +@node @code{GIMPLE_OMP_ALLOCATE} +@subsection @code{GIMPLE_OMP_ALLOCATE} +@cindex @code{GIMPLE_OMP_ALLOCATE} + +@deftypefn {GIMPLE function} gomp_allocate *gimple_build_omp_allocate ( @ +tree clauses, int kind) +Build a @code{GIMPLE_OMP_ALLOCATE} statement. @code{CLAUSES} is the clauses +associated with this node. @code{KIND} is the enumeration value +@code{GF_OMP_ALLOCATE_KIND_ALLOCATE} if this directive allocates memory +or @code{GF_OMP_ALLOCATE_KIND_FREE} if it de-allocates. +@end deftypefn + +@deftypefn {GIMPLE function} void gimple_omp_allocate_set_clauses ( @ +gomp_allocate *g, tree clauses) +Set the @code{CLAUSES} for a @code{GIMPLE_OMP_ALLOCATE}. +@end deftypefn + +@deftypefn {GIMPLE function} tree gimple_omp_aallocate_clauses ( @ +const gomp_allocate *g) +Get the @code{CLAUSES} of a @code{GIMPLE_OMP_ALLOCATE}. +@end deftypefn + +@deftypefn {GIMPLE function} void gimple_omp_allocate_set_kind ( @ +gomp_allocate *g, int kind) +Set the @code{KIND} for a @code{GIMPLE_OMP_ALLOCATE}. 
+@end deftypefn + +@deftypefn {GIMPLE function} tree gimple_omp_allocate_kind ( @ +const gomp_atomic_load *g) +Get the @code{KIND} of a @code{GIMPLE_OMP_ALLOCATE}. +@end deftypefn + @node @code{GIMPLE_OMP_ATOMIC_LOAD} @subsection @code{GIMPLE_OMP_ATOMIC_LOAD} @cindex @code{GIMPLE_OMP_ATOMIC_LOAD} @@ -1760,7 +1797,6 @@ const gomp_atomic_load *g) Get the @code{RHS} of an atomic set. @end deftypefn - @node @code{GIMPLE_OMP_ATOMIC_STORE} @subsection @code{GIMPLE_OMP_ATOMIC_STORE} @cindex @code{GIMPLE_OMP_ATOMIC_STORE} diff --git a/gcc/gimple-pretty-print.cc b/gcc/gimple-pretty-print.cc index ebd87b20a0a..bb961a900df 100644 --- a/gcc/gimple-pretty-print.cc +++ b/gcc/gimple-pretty-print.cc @@ -1967,6 +1967,38 @@ dump_gimple_omp_critical (pretty_printer *buffer, const gomp_critical *gs, } } +static void +dump_gimple_omp_allocate (pretty_printer *buffer, const gomp_allocate *gs, + int spc, dump_flags_t fl
[PATCH 10/17] Add parsing support for allocate directive (OpenMP 5.0)
Currently we only make use of this directive when it is associated with an allocate statement. gcc/fortran/ChangeLog: * dump-parse-tree.cc (show_omp_node): Handle EXEC_OMP_ALLOCATE. (show_code_node): Likewise. * gfortran.h (enum gfc_statement): Add ST_OMP_ALLOCATE. (OMP_LIST_ALLOCATOR): New enum value. (enum gfc_exec_op): Add EXEC_OMP_ALLOCATE. * match.h (gfc_match_omp_allocate): New function. * openmp.cc (enum omp_mask1): Add OMP_CLAUSE_ALLOCATOR. (OMP_ALLOCATE_CLAUSES): New define. (gfc_match_omp_allocate): New function. (resolve_omp_clauses): Add ALLOCATOR in clause_names. (omp_code_to_statement): Handle EXEC_OMP_ALLOCATE. (EMPTY_VAR_LIST): New define. (check_allocate_directive_restrictions): New function. (gfc_resolve_omp_allocate): Likewise. (gfc_resolve_omp_directive): Handle EXEC_OMP_ALLOCATE. * parse.cc (decode_omp_directive): Handle ST_OMP_ALLOCATE. (next_statement): Likewise. (gfc_ascii_statement): Likewise. * resolve.cc (gfc_resolve_code): Handle EXEC_OMP_ALLOCATE. * st.cc (gfc_free_statement): Likewise. * trans.cc (trans_code): Likewise --- gcc/fortran/dump-parse-tree.cc| 3 + gcc/fortran/gfortran.h| 4 +- gcc/fortran/match.h | 1 + gcc/fortran/openmp.cc | 199 +- gcc/fortran/parse.cc | 10 +- gcc/fortran/resolve.cc| 1 + gcc/fortran/st.cc | 1 + gcc/fortran/trans.cc | 1 + gcc/testsuite/gfortran.dg/gomp/allocate-4.f90 | 112 ++ gcc/testsuite/gfortran.dg/gomp/allocate-5.f90 | 73 +++ 10 files changed, 400 insertions(+), 5 deletions(-) create mode 100644 gcc/testsuite/gfortran.dg/gomp/allocate-4.f90 create mode 100644 gcc/testsuite/gfortran.dg/gomp/allocate-5.f90 diff --git a/gcc/fortran/dump-parse-tree.cc b/gcc/fortran/dump-parse-tree.cc index 5352008a63d..e0c6c0d9d96 100644 --- a/gcc/fortran/dump-parse-tree.cc +++ b/gcc/fortran/dump-parse-tree.cc @@ -2003,6 +2003,7 @@ show_omp_node (int level, gfc_code *c) case EXEC_OACC_CACHE: name = "CACHE"; is_oacc = true; break; case EXEC_OACC_ENTER_DATA: name = "ENTER DATA"; is_oacc = true; break; case EXEC_OACC_EXIT_DATA: name = "EXIT DATA"; is_oacc = true; break; +case EXEC_OMP_ALLOCATE: name = "ALLOCATE"; break; case EXEC_OMP_ATOMIC: name = "ATOMIC"; break; case EXEC_OMP_BARRIER: name = "BARRIER"; break; case EXEC_OMP_CANCEL: name = "CANCEL"; break; @@ -2204,6 +2205,7 @@ show_omp_node (int level, gfc_code *c) || c->op == EXEC_OMP_TARGET_UPDATE || c->op == EXEC_OMP_TARGET_ENTER_DATA || c->op == EXEC_OMP_TARGET_EXIT_DATA || c->op == EXEC_OMP_SCAN || c->op == EXEC_OMP_DEPOBJ || c->op == EXEC_OMP_ERROR + || c->op == EXEC_OMP_ALLOCATE || (c->op == EXEC_OMP_ORDERED && c->block == NULL)) return; if (c->op == EXEC_OMP_SECTIONS || c->op == EXEC_OMP_PARALLEL_SECTIONS) @@ -3329,6 +3331,7 @@ show_code_node (int level, gfc_code *c) case EXEC_OACC_CACHE: case EXEC_OACC_ENTER_DATA: case EXEC_OACC_EXIT_DATA: +case EXEC_OMP_ALLOCATE: case EXEC_OMP_ATOMIC: case EXEC_OMP_CANCEL: case EXEC_OMP_CANCELLATION_POINT: diff --git a/gcc/fortran/gfortran.h b/gcc/fortran/gfortran.h index 696aadd7db6..755469185a6 100644 --- a/gcc/fortran/gfortran.h +++ b/gcc/fortran/gfortran.h @@ -259,7 +259,7 @@ enum gfc_statement ST_OACC_CACHE, ST_OACC_KERNELS_LOOP, ST_OACC_END_KERNELS_LOOP, ST_OACC_SERIAL_LOOP, ST_OACC_END_SERIAL_LOOP, ST_OACC_SERIAL, ST_OACC_END_SERIAL, ST_OACC_ENTER_DATA, ST_OACC_EXIT_DATA, ST_OACC_ROUTINE, - ST_OACC_ATOMIC, ST_OACC_END_ATOMIC, + ST_OACC_ATOMIC, ST_OACC_END_ATOMIC, ST_OMP_ALLOCATE, ST_OMP_ATOMIC, ST_OMP_BARRIER, ST_OMP_CRITICAL, ST_OMP_END_ATOMIC, ST_OMP_END_CRITICAL, ST_OMP_END_DO, ST_OMP_END_MASTER, ST_OMP_END_ORDERED, ST_OMP_END_PARALLEL, 
ST_OMP_END_PARALLEL_DO, ST_OMP_END_PARALLEL_SECTIONS, @@ -1398,6 +1398,7 @@ enum OMP_LIST_USE_DEVICE_ADDR, OMP_LIST_NONTEMPORAL, OMP_LIST_ALLOCATE, + OMP_LIST_ALLOCATOR, OMP_LIST_HAS_DEVICE_ADDR, OMP_LIST_ENTER, OMP_LIST_NUM /* Must be the last. */ @@ -2908,6 +2909,7 @@ enum gfc_exec_op EXEC_OACC_DATA, EXEC_OACC_HOST_DATA, EXEC_OACC_LOOP, EXEC_OACC_UPDATE, EXEC_OACC_WAIT, EXEC_OACC_CACHE, EXEC_OACC_ENTER_DATA, EXEC_OACC_EXIT_DATA, EXEC_OACC_ATOMIC, EXEC_OACC_DECLARE, + EXEC_OMP_ALLOCATE, EXEC_OMP_CRITICAL, EXEC_OMP_DO, EXEC_OMP_FLUSH, EXEC_OMP_MASTER, EXEC_OMP_ORDERED, EXEC_OMP_PARALLEL, EXEC_OMP_PARALLEL_DO, EXEC_OMP_PARALLEL_SECTIONS, EXEC_OMP_PARALLEL_WORKSHARE, diff --git a/gcc/fortran/match.h b/gcc/fortran/match.h index 495c93e0b5c..fe43d4b3fd3 100644 --- a/gcc/fortran/match.h +++ b/gcc/fortran/match.h @@ -149,6 +149,7 @@ match gfc_match_oacc_routine (
[PATCH 17/17] amdgcn: libgomp plugin USM implementation
Implement the Unified Shared Memory API calls in the GCN plugin. The allocate and free are pretty straight-forward because all "target" memory allocations are compatible with USM, on the right hardware. However, there's no known way to check what memory region was used, after the fact, so we use a splay tree to record allocations so we can answer "is_usm_ptr" later. libgomp/ChangeLog: * plugin/plugin-gcn.c (GOMP_OFFLOAD_get_num_devices): Allow GOMP_REQUIRES_UNIFIED_ADDRESS and GOMP_REQUIRES_UNIFIED_SHARED_MEMORY. (struct usm_splay_tree_key_s): New. (usm_splay_compare): New. (splay_tree_prefix): New. (GOMP_OFFLOAD_usm_alloc): New. (GOMP_OFFLOAD_usm_free): New. (GOMP_OFFLOAD_is_usm_ptr): New. (GOMP_OFFLOAD_supported_features): Move into the OpenMP API fold. Add GOMP_REQUIRES_UNIFIED_ADDRESS and GOMP_REQUIRES_UNIFIED_SHARED_MEMORY. (gomp_fatal): New. (splay_tree_c): New. * testsuite/lib/libgomp.exp (check_effective_target_omp_usm): New. * testsuite/libgomp.c++/usm-1.C: Use dg-require-effective-target. * testsuite/libgomp.c-c++-common/requires-1.c: Likewise. * testsuite/libgomp.c/usm-1.c: Likewise. * testsuite/libgomp.c/usm-2.c: Likewise. * testsuite/libgomp.c/usm-3.c: Likewise. * testsuite/libgomp.c/usm-4.c: Likewise. * testsuite/libgomp.c/usm-5.c: Likewise. * testsuite/libgomp.c/usm-6.c: Likewise. --- libgomp/plugin/plugin-gcn.c | 104 +- libgomp/testsuite/lib/libgomp.exp | 22 libgomp/testsuite/libgomp.c++/usm-1.C | 2 +- .../libgomp.c-c++-common/requires-1.c | 1 + libgomp/testsuite/libgomp.c/usm-1.c | 1 + libgomp/testsuite/libgomp.c/usm-2.c | 1 + libgomp/testsuite/libgomp.c/usm-3.c | 1 + libgomp/testsuite/libgomp.c/usm-4.c | 1 + libgomp/testsuite/libgomp.c/usm-5.c | 2 +- libgomp/testsuite/libgomp.c/usm-6.c | 2 +- 10 files changed, 133 insertions(+), 4 deletions(-) diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c index ea327bf2ca0..6a9ff5cd93e 100644 --- a/libgomp/plugin/plugin-gcn.c +++ b/libgomp/plugin/plugin-gcn.c @@ -3226,7 +3226,11 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask) if (!init_hsa_context ()) return 0; /* Return -1 if no omp_requires_mask cannot be fulfilled but - devices were present. */ + devices were present. + Note: not all devices support USM, but the compiler refuses to create + binaries for those that don't anyway. */ + omp_requires_mask &= ~(GOMP_REQUIRES_UNIFIED_ADDRESS + | GOMP_REQUIRES_UNIFIED_SHARED_MEMORY); if (hsa_context.agent_count > 0 && omp_requires_mask != 0) return -1; return hsa_context.agent_count; @@ -3810,6 +3814,89 @@ GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars, GOMP_PLUGIN_target_task_completion, async_data); } +/* Use a splay tree to track USM allocations. */ + +typedef struct usm_splay_tree_node_s *usm_splay_tree_node; +typedef struct usm_splay_tree_s *usm_splay_tree; +typedef struct usm_splay_tree_key_s *usm_splay_tree_key; + +struct usm_splay_tree_key_s { + void *addr; + size_t size; +}; + +static inline int +usm_splay_compare (usm_splay_tree_key x, usm_splay_tree_key y) +{ + if ((x->addr <= y->addr && x->addr + x->size > y->addr) + || (y->addr <= x->addr && y->addr + y->size > x->addr)) +return 0; + + return (x->addr > y->addr ? 1 : -1); +} + +#define splay_tree_prefix usm +#include "../splay-tree.h" + +static struct usm_splay_tree_s usm_map = { NULL }; + +/* Allocate memory suitable for Unified Shared Memory. + + In fact, AMD memory need only be "coarse grained", which target + allocations already are. We do need to track allocations so that + GOMP_OFFLOAD_is_usm_ptr can look them up. 
*/ + +void * +GOMP_OFFLOAD_usm_alloc (int device, size_t size) +{ + void *ptr = GOMP_OFFLOAD_alloc (device, size); + + usm_splay_tree_node node = malloc (sizeof (struct usm_splay_tree_node_s)); + node->key.addr = ptr; + node->key.size = size; + node->left = NULL; + node->right = NULL; + usm_splay_tree_insert (&usm_map, node); + + return ptr; +} + +/* Free memory allocated via GOMP_OFFLOAD_usm_alloc. */ + +bool +GOMP_OFFLOAD_usm_free (int device, void *ptr) +{ + struct usm_splay_tree_key_s key = { ptr, 1 }; + usm_splay_tree_key node = usm_splay_tree_lookup (&usm_map, &key); + if (node) +{ + usm_splay_tree_remove (&usm_map, &key); + free (node); +} + + return GOMP_OFFLOAD_free (device, ptr); +} + +/* True if the memory was allocated via GOMP_OFFLOAD_usm_alloc. */ + +bool +GOMP_OFFLOAD_is_usm_ptr (void *ptr) +{ + struct usm_splay_tree_key_s key = { ptr, 1 }; + return usm_splay_tree_lookup (&usm_map, &key); +} + +/* Indicate which GOMP_REQUIRES_* features are supported. */ + +bool +GO
[PATCH 15/17] amdgcn: Support XNACK mode
The XNACK feature allows memory load instructions to restart safely following a page-miss interrupt. This is useful for shared-memory devices, like APUs, and to implement OpenMP Unified Shared Memory. To support the feature we must be able to set the appropriate meta-data and set the load instructions to early-clobber. When the port supports scheduling of s_waitcnt instructions there will be further requirements. gcc/ChangeLog: * config/gcn/gcn-hsa.h (XNACKOPT): New macro. (ASM_SPEC): Use XNACKOPT. * config/gcn/gcn-opts.h (enum sram_ecc_type): Rename to ... (enum hsaco_attr_type): ... this, and generalize the names. (TARGET_XNACK): New macro. * config/gcn/gcn-valu.md (gather_insn_1offset): Add xnack compatible alternatives. (gather_insn_2offsets): Likewise. * config/gcn/gcn.c (gcn_option_override): Permit -mxnack for devices other than Fiji. (gcn_expand_epilogue): Remove early-clobber problems. (output_file_start): Emit xnack attributes. (gcn_hsa_declare_function_name): Obey -mxnack setting. * config/gcn/gcn.md (xnack): New attribute. (enabled): Rework to include "xnack" attribute. (*movbi): Add xnack compatible alternatives. (*mov_insn): Likewise. (*mov_insn): Likewise. (*mov_insn): Likewise. (*movti_insn): Likewise. * config/gcn/gcn.opt (-mxnack): Add the "on/off/any" syntax. (sram_ecc_type): Rename to ... (hsaco_attr_type: ... this.) * config/gcn/mkoffload.c (SET_XNACK_ANY): New macro. (TEST_XNACK): Delete. (TEST_XNACK_ANY): New macro. (TEST_XNACK_ON): New macro. (main): Support the new -mxnack=on/off/any syntax. --- gcc/config/gcn/gcn-hsa.h| 3 +- gcc/config/gcn/gcn-opts.h | 10 ++-- gcc/config/gcn/gcn-valu.md | 29 - gcc/config/gcn/gcn.cc | 34 ++- gcc/config/gcn/gcn.md | 113 +++- gcc/config/gcn/gcn.opt | 18 +++--- gcc/config/gcn/mkoffload.cc | 19 -- 7 files changed, 140 insertions(+), 86 deletions(-) diff --git a/gcc/config/gcn/gcn-hsa.h b/gcc/config/gcn/gcn-hsa.h index b3079cebb43..fd08947574f 100644 --- a/gcc/config/gcn/gcn-hsa.h +++ b/gcc/config/gcn/gcn-hsa.h @@ -81,12 +81,13 @@ extern unsigned int gcn_local_sym_hash (const char *name); /* In HSACOv4 no attribute setting means the binary supports "any" hardware configuration. The name of the attribute also changed. */ #define SRAMOPT "msram-ecc=on:-mattr=+sramecc;msram-ecc=off:-mattr=-sramecc" +#define XNACKOPT "mxnack=on:-mattr=+xnack;mxnack=off:-mattr=-xnack" /* Use LLVM assembler and linker options. 
*/ #define ASM_SPEC "-triple=amdgcn--amdhsa " \ "%:last_arg(%{march=*:-mcpu=%*}) " \ "%{!march=*|march=fiji:--amdhsa-code-object-version=3} " \ - "%{" NO_XNACK "mxnack:-mattr=+xnack;:-mattr=-xnack} " \ + "%{" NO_XNACK XNACKOPT "}" \ "%{" NO_SRAM_ECC SRAMOPT "} " \ "-filetype=obj" #define LINK_SPEC "--pie --export-dynamic" diff --git a/gcc/config/gcn/gcn-opts.h b/gcc/config/gcn/gcn-opts.h index b62dfb45f59..07ddc79cda3 100644 --- a/gcc/config/gcn/gcn-opts.h +++ b/gcc/config/gcn/gcn-opts.h @@ -48,11 +48,13 @@ extern enum gcn_isa { #define TARGET_M0_LDS_LIMIT (TARGET_GCN3) #define TARGET_PACKED_WORK_ITEMS (TARGET_CDNA2_PLUS) -enum sram_ecc_type +#define TARGET_XNACK (flag_xnack != HSACO_ATTR_OFF) + +enum hsaco_attr_type { - SRAM_ECC_OFF, - SRAM_ECC_ON, - SRAM_ECC_ANY + HSACO_ATTR_OFF, + HSACO_ATTR_ON, + HSACO_ATTR_ANY }; #endif diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md index abe46201344..ec114db9dd1 100644 --- a/gcc/config/gcn/gcn-valu.md +++ b/gcc/config/gcn/gcn-valu.md @@ -741,13 +741,13 @@ (define_expand "gather_expr" {}) (define_insn "gather_insn_1offset" - [(set (match_operand:V_ALL 0 "register_operand" "=v") + [(set (match_operand:V_ALL 0 "register_operand" "=v,&v") (unspec:V_ALL - [(plus: (match_operand: 1 "register_operand" " v") + [(plus: (match_operand: 1 "register_operand" " v, v") (vec_duplicate: - (match_operand 2 "immediate_operand" " n"))) - (match_operand 3 "immediate_operand" " n") - (match_operand 4 "immediate_operand" " n") + (match_operand 2 "immediate_operand" " n, n"))) + (match_operand 3 "immediate_operand" " n, n") + (match_operand 4 "immediate_operand" " n, n") (mem:BLK (scratch))] UNSPEC_GATHER))] "(AS_FLAT_P (INTVAL (operands[3])) @@ -777,7 +777,8 @@ (define_insn "gather_insn_1offset" return buf; } [(set_attr "type" "flat") - (set_attr "length" "12")]) + (set_attr "length" "12") + (set_attr "xnack" "off,on")]) (define_insn "gather_insn_1offset_ds" [(set (match_operand:V_ALL 0 "register_operand" "=v") @@ -802,17 +803,18 @@ (define_insn "gather_insn_1offset_ds" (set_attr "length" "12")]) (define_insn "gather_insn_2o
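On the command line the new tri-state option can be passed through to the offload compiler in the usual way, e.g. (a sketch reusing the -foffload-options spelling from the invoke.texi example earlier in this series):

gcc -fopenmp -foffload=amdgcn-amdhsa \
    -foffload-options=amdgcn-amdhsa=-mxnack=on test.c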
[PATCH 16/17] amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK
The AMD GCN runtime must be set to the correct mode for Unified Shared Memory to work, but this is not always clear at compile and link time due to the split nature of the offload compilation pipeline. This patch sets a new attribute on OpenMP offload functions to ensure that the information is passed all the way to the backend. The backend then places a marker in the assembler code for mkoffload to find. Finally mkoffload places a constructor function into the final program to ensure that the HSA_XNACK environment variable passes the correct mode to the GPU. The HSA_XNACK variable must be set before the HSA runtime is even loaded, so it makes more sense to have this set within the constructor than at some point later within libgomp or the GCN plugin. gcc/ChangeLog: * config/gcn/gcn.c (unified_shared_memory_enabled): New variable. (gcn_init_cumulative_args): Handle attribute "omp unified memory". (gcn_hsa_declare_function_name): Emit "MKOFFLOAD OPTIONS: USM+". * config/gcn/mkoffload.c (TEST_XNACK_OFF): New macro. (process_asm): Detect "MKOFFLOAD OPTIONS: USM+". Emit configure_xnack constructor, as required. * omp-low.c (create_omp_child_function): Add attribute "omp unified memory". --- gcc/config/gcn/gcn.cc | 28 +++- gcc/config/gcn/mkoffload.cc | 37 - gcc/omp-low.cc | 4 3 files changed, 67 insertions(+), 2 deletions(-) diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc index 4df05453604..88cc505597e 100644 --- a/gcc/config/gcn/gcn.cc +++ b/gcc/config/gcn/gcn.cc @@ -68,6 +68,11 @@ static bool ext_gcn_constants_init = 0; enum gcn_isa gcn_isa = ISA_GCN3; /* Default to GCN3. */ +/* Record whether the host compiler added "omp unifed memory" attributes to + any functions. We can then pass this on to mkoffload to ensure xnack is + compatible there too. */ +static bool unified_shared_memory_enabled = false; + /* Reserve this much space for LDS (for propagating variables from worker-single mode to worker-partitioned mode), per workgroup. Global analysis could calculate an exact bound, but we don't do that yet. @@ -2542,6 +2547,25 @@ gcn_init_cumulative_args (CUMULATIVE_ARGS *cum /* Argument info to init */ , if (!caller && cfun->machine->normal_function) gcn_detect_incoming_pointer_arg (fndecl); + if (fndecl && lookup_attribute ("omp unified memory", + DECL_ATTRIBUTES (fndecl))) +{ + unified_shared_memory_enabled = true; + + switch (gcn_arch) + { + case PROCESSOR_FIJI: + case PROCESSOR_VEGA10: + case PROCESSOR_VEGA20: + error ("GPU architecture does not support Unified Shared Memory"); + default: + ; + } + + if (flag_xnack == HSACO_ATTR_OFF) + error ("Unified Shared Memory is enabled, but XNACK is disabled"); +} + reinit_regs (); } @@ -5458,12 +5482,14 @@ gcn_hsa_declare_function_name (FILE *file, const char *name, tree) assemble_name (file, name); fputs (":\n", file); - /* This comment is read by mkoffload. */ + /* These comments are read by mkoffload. */ if (flag_openacc) fprintf (file, "\t;; OPENACC-DIMS: %d, %d, %d : %s\n", oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_GANG), oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_WORKER), oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_VECTOR), name); + if (unified_shared_memory_enabled) +fprintf (asm_out_file, "\t;; MKOFFLOAD OPTIONS: USM+\n"); } /* Implement TARGET_ASM_SELECT_SECTION. 
diff --git a/gcc/config/gcn/mkoffload.cc b/gcc/config/gcn/mkoffload.cc index cb8903c27cb..5741d0a917b 100644 --- a/gcc/config/gcn/mkoffload.cc +++ b/gcc/config/gcn/mkoffload.cc @@ -80,6 +80,8 @@ == EF_AMDGPU_FEATURE_XNACK_ANY_V4) #define TEST_XNACK_ON(VAR) ((VAR & EF_AMDGPU_FEATURE_XNACK_V4) \ == EF_AMDGPU_FEATURE_XNACK_ON_V4) +#define TEST_XNACK_OFF(VAR) ((VAR & EF_AMDGPU_FEATURE_XNACK_V4) \ + == EF_AMDGPU_FEATURE_XNACK_OFF_V4) #define SET_SRAM_ECC_ON(VAR) VAR = ((VAR & ~EF_AMDGPU_FEATURE_SRAMECC_V4) \ | EF_AMDGPU_FEATURE_SRAMECC_ON_V4) @@ -474,6 +476,7 @@ static void process_asm (FILE *in, FILE *out, FILE *cfile) { int fn_count = 0, var_count = 0, dims_count = 0, regcount_count = 0; + bool unified_shared_memory_enabled = false; struct obstack fns_os, dims_os, regcounts_os; obstack_init (&fns_os); obstack_init (&dims_os); @@ -498,6 +501,7 @@ process_asm (FILE *in, FILE *out, FILE *cfile) fn_count += 2; char buf[1000]; + char dummy; enum { IN_CODE, IN_METADATA, @@ -517,6 +521,9 @@ process_asm (FILE *in, FILE *out, FILE *cfile) dims_count++; } + if (sscanf (buf, " ;; MKOFFLOAD OPTIONS: USM+%c", &dummy) > 0) + unified_shared_memory_enabled = true; + break; } case IN_METADATA: @@ -565,7 +572,6 @@ process_asm (FILE *in, FILE *out, FILE *cfile) } } - char dummy; if (sscanf (buf, " .section
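For illustration only: the constructor that mkoffload emits (named configure_xnack in the ChangeLog above) is not visible in the truncated diff, but its job can be sketched as follows. The body below is an assumption written for this summary, not code from the patch; only the diagnostic string matches the one quoted in the review discussion that follows. getenv/setenv are standard POSIX calls.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: force HSA_XNACK into the environment before the HSA runtime is
   loaded, and refuse to run if the user already set an incompatible value.
   "1" assumes the binary was built with XNACK (USM) enabled.  */
static void __attribute__((constructor))
configure_xnack (void)
{
  const char *val = getenv ("HSA_XNACK");
  if (val == NULL || val[0] == '\0')
    setenv ("HSA_XNACK", "1", 1);
  else if (strcmp (val, "1") != 0)
    {
      fprintf (stderr, "error: HSA_XNACK=%s is incompatible; please unset\n",
               val);
      exit (1);
    }
}

Because this runs as an ELF constructor in the host program, it executes before libgomp opens the GCN plugin and before the HSA runtime reads the variable, which is the ordering requirement described above.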
Re: [PATCH 08/17] openmp: -foffload-memory=pinned
On 07/07/2022 12:54, Tobias Burnus wrote: Hi Andrew, On 07.07.22 12:34, Andrew Stubbs wrote: Implement the -foffload-memory=pinned option such that libgomp is instructed to enable fully-pinned memory at start-up. The option is intended to provide a performance boost to certain offload programs without modifying the code. ... gcc/ChangeLog: * omp-builtins.def (BUILT_IN_GOMP_ENABLE_PINNED_MODE): New. * omp-low.cc (omp_enable_pinned_mode): New function. (execute_lower_omp): Call omp_enable_pinned_mode. libgomp/ChangeLog: * config/linux/allocator.c (always_pinned_mode): New variable. (GOMP_enable_pinned_mode): New function. (linux_memspace_alloc): Disable pinning when always_pinned_mode set. (linux_memspace_calloc): Likewise. (linux_memspace_free): Likewise. (linux_memspace_realloc): Likewise. * libgomp.map: Add GOMP_enable_pinned_mode. * testsuite/libgomp.c/alloc-pinned-7.c: New test. ... ... --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -14620,6 +14620,68 @@ lower_omp (gimple_seq *body, omp_context *ctx) input_location = saved_location; } +/* Emit a constructor function to enable -foffload-memory=pinned + at runtime. Libgomp handles the OS mode setting, but we need to trigger + it by calling GOMP_enable_pinned mode before the program proper runs. */ + +static void +omp_enable_pinned_mode () Is there a reason not to use the mechanism of OpenMP's 'requires' directive for this? (Okay, I have to admit that the final patch was only committed on Monday. But still ...) Possibly, I had most of this done before then. I'll have a look next time I visit this patch. The Cuda-specific solution can't work this way anyway, because there's no mlockall equivalent, so I will make conditional adjustments anyway. Likewise, the 'requires' mechanism could then also be used in '[PATCH 16/17] amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK'. No, I don't think so; that environment variable needs to be set before the libraries are loaded or it's too late. There are other ways to achieve the same thing, by leaving messages for the libgomp plugin to pick up, perhaps, but it's all extra complexity for no real gain. Andrew
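The mlockall remark above hints at how the Linux implementation achieves fully-pinned memory at start-up. The following is a sketch of that idea only: the names always_pinned_mode and GOMP_enable_pinned_mode come from the ChangeLog, but the body is an assumption, not the libgomp code from the patch.

#include <stdbool.h>
#include <sys/mman.h>

static bool always_pinned_mode = false;

/* Sketch: called once, from code the compiler emits when
   -foffload-memory=pinned is in force.  After locking every current and
   future page of the process, the memspace hooks can skip their
   per-allocation pinning work.  */
void
GOMP_enable_pinned_mode (void)
{
  if (mlockall (MCL_CURRENT | MCL_FUTURE) == 0)
    always_pinned_mode = true;
}

A CUDA-based implementation has no equivalent one-shot call, which is why the reply above anticipates conditional adjustments for that case.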
Re: [PATCH 08/17] openmp: -foffload-memory=pinned
On 08/07/2022 10:00, Tobias Burnus wrote: On 08.07.22 00:18, Andrew Stubbs wrote: Likewise, the 'requires' mechanism could then also be used in '[PATCH 16/17] amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK'. No, I don't think so; that environment variable needs to be set before the libraries are loaded or it's too late. There are other ways to achieve the same thing, by leaving messages for the libgomp plugin to pick up, perhaps, but it's all extra complexity for no real gain. I think we talk about two different things: (a) where (and when) to check/set the environment variable. I think this part is fine. You could consider moving the generated code for 'configure_xnack' code into the existing 'init' constructor function, but it does not really matter. (Nor does the order in which the constructor function runs.) (I also do not see any benefit of moving it to libgomp. The message could then be suppressed if no device available or similar tricky, but I do not see any real advantage of moving it.) Longer side note: I think the message "error: HSA_XNACK=%%s is incompatible; please unset" could be clearer. Both in terms who issues it and that it talks about an environment variable. Maybe: "libgomp: fatal error: Environment variable HSA_XNACK=%s is incompatible with GCN offloading; please unset" or something like that. (I did misuse 'libgomp:' for this; I am not sure that makes sense or is even more misleading.) – I am also not sure GCN fits that well, given that CDNA is not GCN. But that is a general problem. But in any case, adding "fatal", "environment variable" and ... offloading makes surely sense, IMHO. It's not incompatible with GCN offloading, only with the XNACK mode in which the binary was compiled (i.e. USM on or off). The message could be less terse, indeed. I went through a variety of messages for this and couldn't find one that I liked. How about... fatal error: HSA_XNACK=%s is set but this program was compiled for HSA_XNACK=%s; please unset your environment variable. (b) How the value is made available inside both gcc/config/gcn/gcn.cc and in mkoffload.cc. I was talking about (b). Namely: omp_requires_mask is already available in gcc/config/gcn/gcn.cc and mkoffload.cc. Thus, there is no reason to reinvent the wheel and coming up with another means to pass the same kind of data to the very same files. (You still might want to add another flag to it (assuming 'omp requires unified_shared_memory' alias OMP_REQUIRES_UNIFIED_SHARED_MEMORY is insufficient.) OK, this is a new feature that I probably should investigate. Thanks Andrew
[PATCH] openmp: fix max_vf setting for amdgcn offloading
This patch ensures that the maximum vectorization factor used to set the "safelen" attribute on "omp simd" constructs is suitable for all the configured offload devices. Right now it makes the proper adjustment for NVPTX, but otherwise just uses a value suitable for the host system (always x86_64 in the case of amdgcn). This typically ends up being 16 where 64 is the minimum for vectorization to work properly on GCN. There is a potential problem that one "safelen" must be set for *all* offload devices, which means it can't be perfect for all devices. However I believe that too big is always OK (at least for powers of two?) whereas too small is not OK, so this code always selects the largest value of max_vf, regardless of where it comes from. The existing target VF function, omp_max_simt_vf, is tangled up with the notion of whether SIMT is available or not, so I couldn't add amdgcn in there. It's tempting to have omp_max_vf do some kind of autodetection of what VF to choose, but the current implementation in omp-general.cc doesn't have access to the context in a convenient way, and nor do all the callers, so I couldn't easily do that. Instead, I have opted to add a new function, omp_max_simd_vf, which can check for the presence of amdgcn. While reviewing the callers of omp_max_vf I found one other case that looks like it ought to be tuned for the device, not just the host. In that case it's not clear how to achieve that and in fact, at least on x86_64, the way it is coded the actual value from omp_max_vf is always ignored in favour of a much larger "minimum", so I have added a comment for the next person to touch that spot and left it alone. This change gives a 10x performance improvement on the BabelStream "dot" benchmark on amdgcn and is not harmful on nvptx. OK for mainline? I will commit a backport to OG12 shortly.

Andrew

openmp: fix max_vf setting for amdgcn offloading

Ensure that the "max_vf" figure used for the "safelen" attribute is large enough for the largest configured offload device. This change gives ~10x speed improvement on the BabelStream "dot" benchmark for AMD GCN.

gcc/ChangeLog: * gimple-loop-versioning.cc (loop_versioning::loop_versioning): Add comment. * omp-general.cc (omp_max_simd_vf): New function. * omp-general.h (omp_max_simd_vf): New prototype. * omp-low.cc (lower_rec_simd_input_clauses): Select largest from omp_max_vf, omp_max_simt_vf, and omp_max_simd_vf. gcc/testsuite/ChangeLog: * lib/target-supports.exp (check_effective_target_amdgcn_offloading_enabled): New. (check_effective_target_nvptx_offloading_enabled): New. * gcc.dg/gomp/target-vf.c: New test.

diff --git a/gcc/gimple-loop-versioning.cc b/gcc/gimple-loop-versioning.cc index 6bcf6eba691..e908c27fc44 100644 --- a/gcc/gimple-loop-versioning.cc +++ b/gcc/gimple-loop-versioning.cc @@ -555,7 +555,10 @@ loop_versioning::loop_versioning (function *fn) unvectorizable code, since it is the largest size that can be handled efficiently by scalar code. omp_max_vf calculates the maximum number of bytes in a vector, when such a value is relevant - to loop optimization. */ + to loop optimization. + FIXME: this probably needs to use omp_max_simd_vf when in a target + region, but how to tell? (And MAX_FIXED_MODE_SIZE is large enough that + it doesn't actually matter.)
*/ m_maximum_scale = estimated_poly_value (omp_max_vf ()); m_maximum_scale = MAX (m_maximum_scale, MAX_FIXED_MODE_SIZE); } diff --git a/gcc/omp-general.cc b/gcc/omp-general.cc index a406c578f33..8c6fcebc4b3 100644 --- a/gcc/omp-general.cc +++ b/gcc/omp-general.cc @@ -994,6 +994,24 @@ omp_max_simt_vf (void) return 0; } +/* Return maximum SIMD width if offloading may target SIMD hardware. */ + +int +omp_max_simd_vf (void) +{ + if (!optimize) +return 0; + if (ENABLE_OFFLOADING) +for (const char *c = getenv ("OFFLOAD_TARGET_NAMES"); c;) + { + if (startswith (c, "amdgcn")) + return 64; + else if ((c = strchr (c, ':'))) + c++; + } + return 0; +} + /* Store the construct selectors as tree codes from last to first, return their number. */ diff --git a/gcc/omp-general.h b/gcc/omp-general.h index 74e90e1a71a..410343e45fa 100644 --- a/gcc/omp-general.h +++ b/gcc/omp-general.h @@ -104,6 +104,7 @@ extern gimple *omp_build_barrier (tree lhs); extern tree find_combined_omp_for (tree *, int *, void *); extern poly_uint64 omp_max_vf (void); extern int omp_max_simt_vf (void); +extern int omp_max_simd_vf (void); extern int omp_constructor_traits_to_codes (tree, enum tree_code *); extern tree omp_check_context_selector (location_t loc, tree ctx); extern void omp_mark_declare_variant (location_t loc, tree variant, diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index d73c165f029..1a9a509adb9 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -4646,7 +4646,14 @@ lowe
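As an illustration (this example is not part of the patch), the BabelStream "dot" kernel mentioned above is essentially the loop below. The safelen derived from max_vf caps the vectorization factor the GCN back end may assume for the reduction: a host-derived value of 16 falls short of the 64 lanes of a GCN vector, whereas taking the largest of omp_max_vf, omp_max_simt_vf and omp_max_simd_vf yields at least 64 when amdgcn offloading is configured.

/* Illustrative reduction loop of the BabelStream "dot" shape.  */
double
dot (int n, const double *restrict a, const double *restrict b)
{
  double sum = 0.0;
#pragma omp target teams distribute parallel for simd \
            reduction(+:sum) map(to: a[0:n], b[0:n])
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}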
[committed] amdgcn: 64-bit not
I've committed this patch to enable DImode one's-complement on amdgcn. The hardware doesn't have 64-bit not, and this isn't needed by expand which is happy to use two SImode operations, but the vectorizer isn't so clever. Vector condition masks are DImode on amdgcn, so this has been causing lots of conditional code to fail to vectorize.

Andrew

amdgcn: 64-bit not

This makes the auto-vectorizer happier when handling masks.

gcc/ChangeLog: * config/gcn/gcn.md (one_cmpldi2): New.

diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md index 033c1708e88..70a769babc4 100644 --- a/gcc/config/gcn/gcn.md +++ b/gcc/config/gcn/gcn.md @@ -1676,6 +1676,26 @@ (define_expand "si3_scc" ;; }}} ;; {{{ ALU: generic 64-bit +(define_insn_and_split "one_cmpldi2" + [(set (match_operand:DI 0 "register_operand""=Sg,v") + (not:DI (match_operand:DI 1 "gcn_alu_operand" "SgA,vSvDB"))) + (clobber (match_scratch:BI 2 "=cs,X"))] + "" + "#" + "reload_completed" + [(parallel [(set (match_dup 3) (not:SI (match_dup 4))) + (clobber (match_dup 2))]) + (parallel [(set (match_dup 5) (not:SI (match_dup 6))) + (clobber (match_dup 2))])] + { +operands[3] = gcn_operand_part (DImode, operands[0], 0); +operands[4] = gcn_operand_part (DImode, operands[1], 0); +operands[5] = gcn_operand_part (DImode, operands[0], 1); +operands[6] = gcn_operand_part (DImode, operands[1], 1); + } + [(set_attr "type" "mult")] +) + (define_code_iterator vec_and_scalar64_com [and ior xor]) (define_insn_and_split "di3"
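Illustrative only (no testcase accompanies the commit above): loops of roughly the following shape are where the vectorizer works with lane masks on amdgcn. The condition becomes a DImode mask, one bit per lane of the 64-lane vector, and operations on that mask, including one's-complement, need corresponding patterns for vectorization to proceed.

void
masked_add (int *restrict out, const int *restrict a, const int *restrict b,
            int n)
{
  for (int i = 0; i < n; i++)
    out[i] = (a[i] > 0 && b[i] > 0) ? a[i] + b[i] : 0;
}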
[committed] amdgcn: 64-bit vector shifts
I've committed this patch to implement V64DImode vector-vector and vector-scalar shifts. In particular, these are used by the SIMD "inbranch" clones that I'm working on right now, but it's an omission that ought to have been fixed anyway.

Andrew

amdgcn: 64-bit vector shifts

Enable 64-bit vector-vector and vector-scalar shifts.

gcc/ChangeLog: * config/gcn/gcn-valu.md (V_INT_noHI): New iterator. (3): Use V_INT_noHI. (v3): Likewise.

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md index abe46201344..8c33ae0c717 100644 --- a/gcc/config/gcn/gcn-valu.md +++ b/gcc/config/gcn/gcn-valu.md @@ -60,6 +60,8 @@ (define_mode_iterator V_noHI (define_mode_iterator V_INT_noQI [V64HI V64SI V64DI]) +(define_mode_iterator V_INT_noHI + [V64SI V64DI]) ; All of above (define_mode_iterator V_ALL @@ -2086,10 +2088,10 @@ (define_expand "3" }) (define_insn "3" - [(set (match_operand:V_SI 0 "register_operand" "= v") - (shiftop:V_SI - (match_operand:V_SI 1 "gcn_alu_operand" " v") - (vec_duplicate:V_SI + [(set (match_operand:V_INT_noHI 0 "register_operand" "= v") + (shiftop:V_INT_noHI + (match_operand:V_INT_noHI 1 "gcn_alu_operand" " v") + (vec_duplicate: (match_operand:SI 2 "gcn_alu_operand" "SvB"] "" "v_0\t%0, %2, %1" [(set_attr "type" "vop2") @@ -2117,10 +2119,10 @@ (define_expand "v3" }) (define_insn "v3" - [(set (match_operand:V_SI 0 "register_operand" "=v") - (shiftop:V_SI - (match_operand:V_SI 1 "gcn_alu_operand" " v") - (match_operand:V_SI 2 "gcn_alu_operand" "vB")))] + [(set (match_operand:V_INT_noHI 0 "register_operand" "=v") + (shiftop:V_INT_noHI + (match_operand:V_INT_noHI 1 "gcn_alu_operand" " v") + (match_operand: 2 "gcn_alu_operand" "vB")))] "" "v_0\t%0, %2, %1" [(set_attr "type" "vop2")
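Again illustrative rather than taken from the commit: with the V64DImode patterns present, 64-bit shift loops such as the following can use the vector-scalar form (a single shift count) and the vector-vector form (per-lane shift counts) respectively.

void
shift64 (long long *restrict out, const long long *restrict in,
         const int *restrict amount, int n, int s)
{
  for (int i = 0; i < n; i++)
    out[i] = in[i] << s;          /* vector-scalar shift */

  for (int i = 0; i < n; i++)
    out[i] = in[i] >> amount[i];  /* vector-vector shift */
}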
[PATCH] openmp-simd-clone: Match shift type
This patch adjusts the generation of SIMD "inbranch" clones that use integer masks so that the resulting code vectorizes on amdgcn. The problem was only that an amdgcn mask is DImode and the shift amount was SImode, and the difference causes vectorization to fail. OK for mainline?

Andrew

openmp-simd-clone: Match shift types

Ensure that both parameters to vector shifts use the same mode. This is most important for amdgcn where the masks are DImode.

gcc/ChangeLog: * omp-simd-clone.cc (simd_clone_adjust): Convert shift_cnt to match the mask type.

diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc index 32649bc3f9a..5d3a90730e7 100644 --- a/gcc/omp-simd-clone.cc +++ b/gcc/omp-simd-clone.cc @@ -1305,8 +1305,12 @@ simd_clone_adjust (struct cgraph_node *node) build_int_cst (TREE_TYPE (iter1), c)); gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); } + tree shift_cnt_conv = make_ssa_name (TREE_TYPE (mask)); + g = gimple_build_assign (shift_cnt_conv, + fold_convert (TREE_TYPE (mask), shift_cnt)); + gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); g = gimple_build_assign (make_ssa_name (TREE_TYPE (mask)), - RSHIFT_EXPR, mask, shift_cnt); + RSHIFT_EXPR, mask, shift_cnt_conv); gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); mask = gimple_assign_lhs (g); g = gimple_build_assign (make_ssa_name (TREE_TYPE (mask)),
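For context, an assumed example (not taken from the patch) of where such a clone arises: the "inbranch" clone of a function like add_one below receives an extra mask argument. As the diff shows, simd_clone_adjust extracts each lane's bit by right-shifting that mask, and on amdgcn the mask is a 64-bit integer while the lane count is SImode; converting the shift count to the mask's type is what removes the mismatch.

#pragma omp declare simd inbranch
int
add_one (int x)
{
  return x + 1;
}

/* A call under a condition inside a simd loop is what uses the masked
   (inbranch) clone.  */
void
apply (int *restrict a, const int *restrict flag, int n)
{
#pragma omp simd
  for (int i = 0; i < n; i++)
    if (flag[i])
      a[i] = add_one (a[i]);
}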
Re: [PATCH] openmp-simd-clone: Match shift type
On 29/07/2022 16:59, Jakub Jelinek wrote: Doing the fold_convert seems to be a wasted effort to me. Can't this be done conditional on whether some change is needed at all and just using gimple_build_assign with NOP_EXPR, so something like: I'm just not familiar enough with this stuff to run fold_convert in my head with confidence. tree shift_cvt_conv = shift_cnt; if (!useless_type_conversion_p (TREE_TYPE (mask), TREE_TYPE (shift_cnt))) { shift_cnt_conv = make_ssa_name (TREE_TYPE (mask)); g = gimple_build_assign (shift_cnt_conv, NOP_EXPR, shift_cnt); gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); } Your version gives the same output mine does, at least on amdgcn anyway. Am I OK to commit this version? Andrew openmp-simd-clone: Match shift types Ensure that both parameters to vector shifts use the same mode. This is most important for amdgcn where the masks are DImode. gcc/ChangeLog: * omp-simd-clone.cc (simd_clone_adjust): Convert shift_cnt to match the mask type. Co-authored-by: Jakub Jelinek diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc index 32649bc3f9a..58bd68b129b 100644 --- a/gcc/omp-simd-clone.cc +++ b/gcc/omp-simd-clone.cc @@ -1305,8 +1305,16 @@ simd_clone_adjust (struct cgraph_node *node) build_int_cst (TREE_TYPE (iter1), c)); gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); } + tree shift_cnt_conv = shift_cnt; + if (!useless_type_conversion_p (TREE_TYPE (mask), + TREE_TYPE (shift_cnt))) + { + shift_cnt_conv = make_ssa_name (TREE_TYPE (mask)); + g = gimple_build_assign (shift_cnt_conv, NOP_EXPR, shift_cnt); + gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); + } g = gimple_build_assign (make_ssa_name (TREE_TYPE (mask)), - RSHIFT_EXPR, mask, shift_cnt); + RSHIFT_EXPR, mask, shift_cnt_conv); gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING); mask = gimple_assign_lhs (g); g = gimple_build_assign (make_ssa_name (TREE_TYPE (mask)),