Hi,

The attached RFC patch attempts to use a 32-bit WHILELO in LP64 mode to fix
the PR.  Bootstrap and regression testing are ongoing.  In earlier testing,
I ran into an issue related to fwprop; I will tackle that based on the
feedback for the patch.

Thanks,
Kugan
From 4e9837ff9c0c080923f342e83574a6fdba2b3d92 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandara...@linaro.org>
Date: Tue, 5 Mar 2019 10:01:45 +1100
Subject: [PATCH v2] Use 32-bit WHILELO in LP64 mode (PR88838)

As mentioned in PR88838, this patch avoids the SXTW by using WHILELO on W
registers instead of X registers.

As mentioned in the PR, vect_verify_full_masking checks which IV widths
are supported for WHILELO but prefers to go to Pmode width.  This is
because using Pmode allows ivopts to reuse the IV for indices (as in the
loads and stores in the PR's example).  However, it would be better to
use a 32-bit WHILELO with a truncated 64-bit IV if:

(a) the limit is extended from 32 bits, and
(b) the detection loop in vect_verify_full_masking detects that using a
    32-bit IV would be correct.
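
As a hand-written C sketch of the intent (illustrative only: f_sketch and
VF are made-up names, unsigned long stands in for a Pmode-sized IV under
LP64, and the vectoriser really emits the equivalent in GIMPLE, not C),
the IV stays Pmode-sized so that ivopts can still reuse it for addressing,
while only the WHILELO comparison sees the 32-bit truncation:

  #define VF 4  /* stand-in for the (possibly runtime) vectorization factor */

  void
  f_sketch (int *restrict x, int *restrict y, int *restrict z, int n)
  {
    if (n <= 0)
      return;
    /* The limit is extended from 32 bits, so its 32-bit value is exact.  */
    unsigned int limit = (unsigned int) n;
    for (unsigned long iv = 0; (unsigned int) iv < limit; iv += VF)
      {
        /* Truncated use: WHILELO compares 32-bit values, so it can use
           W registers and needs no SXTW.  */
        unsigned int iv32 = (unsigned int) iv;
        /* mask = WHILE_ULT (iv32, limit); the inner loop stands in for one
           masked vector operation.  Addressing still uses the 64-bit IV.  */
        for (unsigned int k = 0; k < VF && iv32 + k < limit; k++)
          x[iv + k] = y[iv + k] + z[iv + k];
      }
  }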

gcc/ChangeLog:

2019-05-22  Kugan Vivekanandarajah  <kugan.vivekanandara...@linaro.org>

	* fwprop.c (forward_propagate_and_simplify): Bail out when the use
	is inside an UNSPEC and its mode is not compatible with the mode
	of the propagated register.
	* tree-vect-loop-manip.c (vect_set_loop_masks_directly): If the
	compare type is narrower than Pmode, create the IV in Pmode and
	truncate its uses to the compare type.
	* tree-vect-loop.c (vect_verify_full_masking): Check which IV
	widths are supported for WHILELO and allow a compare type narrower
	than Pmode when the IV cannot wrap.

gcc/testsuite/ChangeLog:

2019-05-22  Kugan Vivekanandarajah  <kugan.vivekanandara...@linaro.org>

	* gcc.target/aarch64/pr88838.c: New test.
	* gcc.target/aarch64/sve/while_1.c: Adjust.

---
 gcc/fwprop.c                                   | 11 ++++++
 gcc/testsuite/gcc.target/aarch64/pr88838.c     | 11 ++++++
 gcc/testsuite/gcc.target/aarch64/sve/while_1.c | 16 ++++----
 gcc/tree-vect-loop-manip.c                     | 52 ++++++++++++++++++++++++--
 gcc/tree-vect-loop.c                           | 41 ++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 117 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr88838.c

diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index cf2c9de..5275ad3 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1358,6 +1358,17 @@ forward_propagate_and_simplify (df_ref use, rtx_insn *def_insn, rtx def_set)
   else
     mode = GET_MODE (*loc);
 
+  /* We cannot recover the mode of the use for patterns like
+     (set (reg:VNx16BI 109)
+          (unspec:VNx16BI [
+	    (reg:SI 131)
+	    (reg:SI 106)
+           ] UNSPEC_WHILE_LO))
+     so bail out when the use is inside an UNSPEC and the mode of the
+     use is not compatible with the mode of REG.  */
+  if (GET_MODE_CLASS (mode) != GET_MODE_CLASS (GET_MODE (reg))
+      && GET_CODE (SET_SRC (use_set)) == UNSPEC)
+    return false;
   new_rtx = propagate_rtx (*loc, mode, reg, src,
   			   optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn)));
 
diff --git a/gcc/testsuite/gcc.target/aarch64/pr88838.c b/gcc/testsuite/gcc.target/aarch64/pr88838.c
new file mode 100644
index 0000000..9d03c0a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr88838.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-S -O3 -march=arm8.2-a+sve" } */
+
+void
+f (int *restrict x, int *restrict y, int *restrict z, int n)
+{
+  for (int i = 0; i < n; i += 1)
+    x[i] = y[i] + z[i];
+}
+
+/* { dg-final { scan-assembler-not "sxtw" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/while_1.c b/gcc/testsuite/gcc.target/aarch64/sve/while_1.c
index a93a04b..05a4860 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/while_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/while_1.c
@@ -26,14 +26,14 @@
 TEST_ALL (ADD_LOOP)
 
 /* { dg-final { scan-assembler-not {\tuqdec} } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, xzr,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, x[0-9]+,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, xzr,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, x[0-9]+,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, xzr,} 3 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, x[0-9]+,} 3 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, xzr,} 3 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, x[0-9]+,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, wzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, w[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, wzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, w[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, wzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, w[0-9]+,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, wzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, w[0-9]+,} 3 } } */
 /* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]/z, \[x0, x[0-9]+\]\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7], \[x0, x[0-9]+\]\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tld1h\tz[0-9]+\.h, p[0-7]/z, \[x0, x[0-9]+, lsl 1\]\n} 2 } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 77d3dac..d6452a1 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -418,7 +418,20 @@ vect_set_loop_masks_directly (struct loop *loop, loop_vec_info loop_vinfo,
   tree mask_type = rgm->mask_type;
   unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
   poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
-
+  bool convert = false;
+  tree iv_type = NULL_TREE;
+
+  /* If COMPARE_TYPE is narrower than Pmode, create the IV in Pmode
+     and truncate each use to COMPARE_TYPE (i.e. convert it to the
+     correct type).  Using a Pmode IV allows ivopts to reuse the IV
+     for the indices in the loads and stores.  */
+  if (known_lt (GET_MODE_BITSIZE (TYPE_MODE (compare_type)),
+		GET_MODE_BITSIZE (Pmode)))
+    {
+      iv_type = build_nonstandard_integer_type (GET_MODE_BITSIZE (Pmode),
+						true);
+      convert = true;
+    }
   /* Calculate the maximum number of scalar values that the rgroup
      handles in total, the number that it handles for each iteration
      of the vector loop, and the number that it should skip during the
@@ -444,12 +457,43 @@ vect_set_loop_masks_directly (struct loop *loop, loop_vec_info loop_vinfo,
      processed.  */
   tree index_before_incr, index_after_incr;
   gimple_stmt_iterator incr_gsi;
+  gimple_stmt_iterator incr_gsi2;
   bool insert_after;
-  tree zero_index = build_int_cst (compare_type, 0);
+  tree zero_index;
   standard_iv_increment_position (loop, &incr_gsi, &insert_after);
-  create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
-	     insert_after, &index_before_incr, &index_after_incr);
 
+  if (convert)
+    {
+      /* Create a Pmode IV whose uses are truncated to COMPARE_TYPE.  */
+      zero_index = build_int_cst (iv_type, 0);
+      tree step = build_int_cst (iv_type,
+				 LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+      /* Create the IV in Pmode.  */
+      create_iv (zero_index, step, NULL_TREE, loop, &incr_gsi,
+		 insert_after, &index_before_incr, &index_after_incr);
+      /* Create truncated copies of the index before and after increment.  */
+      tree index_before_incr_trunc = make_ssa_name (compare_type);
+      tree index_after_incr_trunc = make_ssa_name (compare_type);
+      gimple *incr_before_stmt = gimple_build_assign (index_before_incr_trunc,
+						      NOP_EXPR,
+						      index_before_incr);
+      gimple *incr_after_stmt = gimple_build_assign (index_after_incr_trunc,
+						     NOP_EXPR,
+						     index_after_incr);
+      incr_gsi2 = incr_gsi;
+      gsi_insert_before (&incr_gsi2, incr_before_stmt, GSI_NEW_STMT);
+      gsi_insert_after (&incr_gsi, incr_after_stmt, GSI_NEW_STMT);
+      index_before_incr = index_before_incr_trunc;
+      index_after_incr = index_after_incr_trunc;
+      zero_index = build_int_cst (compare_type, 0);
+    }
+  else
+    {
+      /* COMPARE_TYPE is already Pmode-sized, so no conversion is needed.  */
+      zero_index = build_int_cst (compare_type, 0);
+      create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
+		 insert_after, &index_before_incr, &index_after_incr);
+    }
   tree test_index, test_limit, first_limit;
   gimple_stmt_iterator *test_gsi;
   if (might_wrap_p)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index bd81193..2769c86 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1035,6 +1035,32 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   /* Find a scalar mode for which WHILE_ULT is supported.  */
   opt_scalar_int_mode cmp_mode_iter;
   tree cmp_type = NULL_TREE;
+  tree niters_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
+  tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
+  unsigned HOST_WIDE_INT max_vf = vect_max_vf (loop_vinfo);
+  widest_int iv_limit;
+  bool known_max_iters = max_loop_iterations (loop, &iv_limit);
+  if (known_max_iters)
+    {
+      if (niters_skip)
+	{
+	  /* Add the maximum number of skipped iterations to the
+	     maximum iteration count.  */
+	  if (TREE_CODE (niters_skip) == INTEGER_CST)
+	    iv_limit += wi::to_widest (niters_skip);
+	  else
+	    iv_limit += max_vf - 1;
+	}
+      /* IV_LIMIT is the maximum number of latch iterations, which is also
+	 the maximum in-range IV value.  Round this value down to the previous
+	 vector alignment boundary and then add an extra full iteration.  */
+      poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
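+      /* For instance, an IV_LIMIT of 10 with VF == 4 and MAX_VF == 4
+	 gives (10 & -4) + 4 == 12.  */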
+      iv_limit = (iv_limit & -(int) known_alignment (vf)) + max_vf;
+    }
+
+  /* Prefer Pmode, but stop at a narrower mode if the IV cannot wrap.  */
   FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
     {
       unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
@@ -1045,12 +1069,23 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
 	  if (this_type
 	      && can_produce_all_loop_masks_p (loop_vinfo, this_type))
 	    {
+	      /* See whether zero-based IV would ever generate all-false masks
+		 before wrapping around.  */
+	      bool might_wrap_p
+		= (!known_max_iters
+		   || (wi::min_precision
+		       (iv_limit
+			* vect_get_max_nscalars_per_iter (loop_vinfo),
+			UNSIGNED) > cmp_bits));
 	      /* Although we could stop as soon as we find a valid mode,
 		 it's often better to continue until we hit Pmode, since the
 		 operands to the WHILE are more likely to be reusable in
-		 address calculations.  */
+		 address calculations.  Stop early, however, if this_type
+		 matches the niters width and the IV cannot wrap.  */
 	      cmp_type = this_type;
-	      if (cmp_bits >= GET_MODE_BITSIZE (Pmode))
+	      if (cmp_bits >= GET_MODE_BITSIZE (Pmode)
+		  || (!might_wrap_p
+		      && (cmp_bits == TYPE_PRECISION (niters_type))))
 		break;
 	    }
 	}
-- 
2.7.4
