On 20/04/15 16:06, Jeff Law wrote:
On 04/20/2015 05:09 AM, Kyrill Tkachov wrote:
Hi Jeff,
On 17/04/15 20:38, Jeff Law wrote:
On 04/14/2015 02:11 AM, Kyrill Tkachov wrote:
Of course the effect of this patch on codegen depends a lot on the rtx
costs in the backend.
On aarch64 with -mcpu=cortex-a57 tuning I see the cost limit being
exceeded in more cases, with the expansion code instead choosing to do
a move-immediate and a mul instruction (see the illustrative sketch
after the examples below).
No regressions on SPEC2006 on a Cortex-A57.
For example, for code:
long f0 (int x, int y)
{
return (long)x * 6L;
}
int f1(int x)
{
return x * 10;
}
int f2(int x)
{
return x * 100;
}
int f3(int x)
{
return x * 20;
}
int f4(int x)
{
return x * 25;
}
int f5(int x)
{
return x * 11;
}
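To illustrate the trade-off (a sketch only; the sequence actually chosen
depends on the backend rtx costs), the multiplication in f5 can either be
synthesized from shifts and adds or left as a move-immediate plus mul.
The synthesized form below is just one possible decomposition, not
necessarily the one synth_mult would pick:

/* One possible shift-and-add synthesis of f5: 8x + 2x + x == 11x.  */
int f5_synth (int x)
{
  return (x << 3) + (x << 1) + x;
}

/* With Cortex-A57 tuning the cost limit is exceeded and the multiply
   stays as a move-immediate plus mul.  */
int f5_mul (int x)
{
  return x * 11;
}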
Please turn this into a test for the testsuite. It's fine if the
test is specific to AArch64. You may need to break it into 6 individual
tests since what you want to check for in each one may be significantly
different. For example, f0, f4 and f5 you'd probably check for the
constant load & multiply instructions. Not sure how to best test for
what you want in f1-f3.
f1/f3 still end up synthesising the mult, but prefer a different
algorithm. I don't think the algorithm chosen in f1/f3 is worse or
better than what it was producing before, so I don't think there's
much point in testing for it.
Yea, when I looked at the differences, it wasn't immediately clear if
there was a real improvement or not. The new sequences use a single
register, so one might claim they're marginally better due to that.
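For example, here is a sketch of the two decompositions of x * 10,
purely illustrative (both functions are of course equivalent C and the
compiler will treat them alike; the point is only the shape of the
intermediate values):

/* Sum of two shifts: the 8x intermediate wants its own register,
   e.g. something like "lsl w1, w0, 3; add w0, w1, w0, lsl 1".  */
int mul10_sum_of_shifts (int x)
{
  return (x << 3) + (x << 1);
}

/* Factored form 2*(x + 4x): each step overwrites the same value,
   e.g. "add w0, w0, w0, lsl 2; lsl w0, w0, 1", which is why the tests
   for f1/f3 below simply check that 'w1' never appears.  */
int mul10_factored (int x)
{
  return (x + (x << 2)) << 1;
}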
If you think it's really better to
test for something, I propose testing that only two instructions are
generated, and neither of them is a 'mul'. I'll repost a patch with
my proposed testcases for f0,f2,f4,f5.
Your call on f1/f3. IIRC f2 didn't change at all, so I'm not sure if
you need a test for that (perhaps you should make sure it continues to
use a mul rather than a synthesized sequence).
Pre-approved with whatever you decide on the testing side.
Thanks,
I could've sworn I had sent this version out a couple hours ago.
My mail client has been playing up.
Here it is with 6 tests. For the tests corresponding to f1/f3 in my
example above I scan that we don't use the 'w1' reg.
I'll give the AArch64 maintainers a day or two to comment on the tests
before committing.
Thanks,
Kyrill
2015-04-20 Kyrylo Tkachov <kyrylo.tkac...@arm.com>
* expmed.c (synth_mult): Only assume overlapping
shift with previous steps in the alg_sub_t_m2 case.
2015-04-20 Kyrylo Tkachov <kyrylo.tkac...@arm.com>
* gcc.target/aarch64/mult-synth_1.c: New test.
* gcc.target/aarch64/mult-synth_2.c: Likewise.
* gcc.target/aarch64/mult-synth_3.c: Likewise.
* gcc.target/aarch64/mult-synth_4.c: Likewise.
* gcc.target/aarch64/mult-synth_5.c: Likewise.
* gcc.target/aarch64/mult-synth_6.c: Likewise.
jeff
commit 6dddd785f2839811c9f8e59e3e36f64cdb3a4ebe
Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com>
Date: Thu Mar 12 17:38:20 2015 +0000
[expmed] Properly account for the cost and latency of shift+sub ops when synthesizing mults
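To make the accounting concrete, below is a minimal sketch of the
alg_sub_t_m2 logic this patch introduces, using invented cost numbers
rather than real Cortex-A57 rtx costs:

#include <stdio.h>

/* Toy costs, for illustration only.  */
#define ADD_COST       4
#define SHIFT_COST     4
#define SHIFTSUB1_COST 9

int
main (void)
{
  int op_cost = ADD_COST + SHIFT_COST;  /* separate shift insn + sub insn  */
  int op_latency;

  if (SHIFTSUB1_COST <= op_cost)
    {
      /* A combined shift-and-sub insn is treated as atomic:
         its latency equals its cost.  */
      op_cost = SHIFTSUB1_COST;
      op_latency = op_cost;
    }
  else
    /* Otherwise the shift acts on the original operand, so on
       superscalar hardware it can overlap the earlier steps and only
       the sub contributes to the extra latency.  */
    op_latency = ADD_COST;

  printf ("op_cost=%d op_latency=%d\n", op_cost, op_latency);
  return 0;
}

In the alg_add_factor and alg_sub_factor hunks the overlap assumption is
dropped (the shift there cannot run ahead of the previous step), so
op_latency is simply set to op_cost.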
diff --git a/gcc/expmed.c b/gcc/expmed.c
index f570612..e890a1e 100644
--- a/gcc/expmed.c
+++ b/gcc/expmed.c
@@ -2664,14 +2664,28 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
m = exact_log2 (-orig_t + 1);
if (m >= 0 && m < maxm)
{
- op_cost = shiftsub1_cost (speed, mode, m);
+ op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
+ /* If the target has a cheap shift-and-subtract insn use
+ that in preference to a shift insn followed by a sub insn.
+ Assume that the shift-and-sub is "atomic" with a latency
+ equal to its cost, otherwise assume that on superscalar
+ hardware the shift may be executed concurrently with the
+ earlier steps in the algorithm. */
+ if (shiftsub1_cost (speed, mode, m) <= op_cost)
+ {
+ op_cost = shiftsub1_cost (speed, mode, m);
+ op_latency = op_cost;
+ }
+ else
+ op_latency = add_cost (speed, mode);
+
new_limit.cost = best_cost.cost - op_cost;
- new_limit.latency = best_cost.latency - op_cost;
+ new_limit.latency = best_cost.latency - op_latency;
synth_mult (alg_in, (unsigned HOST_WIDE_INT) (-orig_t + 1) >> m,
&new_limit, mode);
alg_in->cost.cost += op_cost;
- alg_in->cost.latency += op_cost;
+ alg_in->cost.latency += op_latency;
if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
{
best_cost = alg_in->cost;
@@ -2704,20 +2718,12 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
if (t % d == 0 && t > d && m < maxm
&& (!cache_hit || cache_alg == alg_add_factor))
{
- /* If the target has a cheap shift-and-add instruction use
- that in preference to a shift insn followed by an add insn.
- Assume that the shift-and-add is "atomic" with a latency
- equal to its cost, otherwise assume that on superscalar
- hardware the shift may be executed concurrently with the
- earlier steps in the algorithm. */
op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
- if (shiftadd_cost (speed, mode, m) < op_cost)
- {
- op_cost = shiftadd_cost (speed, mode, m);
- op_latency = op_cost;
- }
- else
- op_latency = add_cost (speed, mode);
+ if (shiftadd_cost (speed, mode, m) <= op_cost)
+ op_cost = shiftadd_cost (speed, mode, m);
+
+ op_latency = op_cost;
+
new_limit.cost = best_cost.cost - op_cost;
new_limit.latency = best_cost.latency - op_latency;
@@ -2742,20 +2748,11 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
if (t % d == 0 && t > d && m < maxm
&& (!cache_hit || cache_alg == alg_sub_factor))
{
- /* If the target has a cheap shift-and-subtract insn use
- that in preference to a shift insn followed by a sub insn.
- Assume that the shift-and-sub is "atomic" with a latency
- equal to it's cost, otherwise assume that on superscalar
- hardware the shift may be executed concurrently with the
- earlier steps in the algorithm. */
op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
- if (shiftsub0_cost (speed, mode, m) < op_cost)
- {
- op_cost = shiftsub0_cost (speed, mode, m);
- op_latency = op_cost;
- }
- else
- op_latency = add_cost (speed, mode);
+ if (shiftsub0_cost (speed, mode, m) <= op_cost)
+ op_cost = shiftsub0_cost (speed, mode, m);
+
+ op_latency = op_cost;
new_limit.cost = best_cost.cost - op_cost;
new_limit.latency = best_cost.latency - op_latency;
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c
new file mode 100644
index 0000000..8b6ee70
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 100;
+}
+
+/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c
new file mode 100644
index 0000000..9278817
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 25;
+}
+
+/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c
new file mode 100644
index 0000000..6898e78
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 11;
+}
+
+/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c
new file mode 100644
index 0000000..1088a75
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+long
+foo (int x, int y)
+{
+ return (long)x * 6L;
+}
+
+/* { dg-final { scan-assembler "smull\tx\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c
new file mode 100644
index 0000000..8cf3314
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 10;
+}
+
+/* { dg-final { scan-assembler-not "\tw1" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c
new file mode 100644
index 0000000..e941b72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+ return x * 20;
+}
+
+/* { dg-final { scan-assembler-not "\tw1" } } */
+/* { dg-final { cleanup-saved-temps } } */