Re: [PATCH] Optimise sqrt reciprocal multiplications

Kyrill Tkachov Tue, 04 Sep 2018 09:53:24 -0700


On 04/09/18 15:31, Richard Biener wrote:

On Tue, 4 Sep 2018, Kyrill Tkachov wrote:

Hi Richard,

On 31/08/18 12:07, Richard Biener wrote:

On Thu, 30 Aug 2018, Kyrill Tkachov wrote:

Ping.

https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01496.html

Thanks,
Kyrill

On 23/08/18 18:09, Kyrill Tkachov wrote:

Hi Richard,

On 23/08/18 11:13, Richard Sandiford wrote:

Kyrill  Tkachov <kyrylo.tkac...@foss.arm.com> writes:

Hi all,

This patch aims to optimise sequences involving uses of 1.0 / sqrt (a)
under -freciprocal-math and -funsafe-math-optimizations.
In particular consider:

x = 1.0 / sqrt (a);
r1 = x * x;  // same as 1.0 / a
r2 = a * x; // same as sqrt (a)

If x, r1 and r2 are all used further on in the code, this can be
transformed into:
tmp1 = 1.0 / a
tmp2 = sqrt (a)
tmp3 = tmp1 * tmp2
x = tmp3
r1 = tmp1
r2 = tmp2

Nice optimisation :-)  Someone who knows the pass better should review,
but:

Thanks for the review.

There seems to be an implicit assumption that this is a win even
when the r1 and r2 assignments are only conditionally executed.
That's probably true, but it might be worth saying explicitly.

I'll admit I had not considered that case.
I think it won't make a difference in practice, as the really expensive
operations here
are the sqrt and the division and they are on the executed path in either
case and them
becoming independent should be a benefit of its own.

+/* Return TRUE if USE_STMT is a multiplication of DEF by A.  */
+
+static inline bool
+is_mult_by (gimple *use_stmt, tree def, tree a)
+{
+  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
+      && gimple_assign_rhs_code (use_stmt) == MULT_EXPR)
+    {
+      tree op0 = gimple_assign_rhs1 (use_stmt);
+      tree op1 = gimple_assign_rhs2 (use_stmt);
+
+      return (op0 == def && op1 == a)
+          || (op0 == a && op1 == def);
+    }
+  return 0;
+}

Seems like is_square_of could now be a light-weight wrapper around this.

Indeed, I've done the wrapping now.

@@ -652,6 +669,180 @@ execute_cse_reciprocals_1 (gimple_stmt_iterator
*def_gsi, tree def)
     occ_head = NULL;
   }
   +/* Transform sequences like
+   x = 1.0 / sqrt (a);
+   r1 = x * x;
+   r2 = a * x;
+   into:
+   tmp1 = 1.0 / a;
+   tmp2 = sqrt (a);
+   tmp3 = tmp1 * tmp2;
+   x = tmp3;
+   r1 = tmp1;
+   r2 = tmp2;
+   depending on the uses of x, r1, r2.  This removes one multiplication
and
+   allows the sqrt and division operations to execute in parallel.
+   DEF_GSI is the gsi of the initial division by sqrt that defines
+   DEF (x in the example abovs).  */
+
+static void
+optimize_recip_sqrt (gimple_stmt_iterator *def_gsi, tree def)
+{
+  use_operand_p use_p;
+  imm_use_iterator use_iter;
+  gimple *stmt = gsi_stmt (*def_gsi);
+  tree x = def;
+  tree orig_sqrt_ssa_name = gimple_assign_rhs2 (stmt);
+  tree div_rhs1 = gimple_assign_rhs1 (stmt);
+
+  if (TREE_CODE (orig_sqrt_ssa_name) != SSA_NAME
+      || TREE_CODE (div_rhs1) != REAL_CST
+      || !real_equal (&TREE_REAL_CST (div_rhs1), &dconst1))
+    return;
+
+  gimple *sqrt_stmt = SSA_NAME_DEF_STMT (orig_sqrt_ssa_name);
+  if (!is_gimple_call (sqrt_stmt)
+      || !gimple_call_lhs (sqrt_stmt))
+    return;
+
+  gcall *call = as_a <gcall *> (sqrt_stmt);

Very minor, but:

    gcall *sqrt_stmt
      = dyn_cast <gcall *> (SSA_NAME_DEF_STMT (orig_sqrt_ssa_name));
    if (!sqrt_stmt || !gimple_call_lhs (sqrt_stmt))
      return;

would avoid the need for the separate as_a<>, and would mean that
we only call gimple_call_* on gcalls.

Ok.

+  if (has_other_use)
+    {
+      /* Using the two temporaries tmp1, tmp2 from above
+     the original x is now:
+     x = tmp1 * tmp2.  */
+      gcc_assert (mult_ssa_name);
+      gcc_assert (sqr_ssa_name);
+      gimple_stmt_iterator gsi2 = gsi_for_stmt (stmt);
+
+      tree new_ssa_name
+    = make_temp_ssa_name (TREE_TYPE (a), NULL,
"recip_sqrt_transformed");
+      gimple *new_stmt
+    = gimple_build_assign (new_ssa_name, MULT_EXPR,
+                   mult_ssa_name, sqr_ssa_name);
+      gsi_insert_before (&gsi2, new_stmt, GSI_SAME_STMT);
+      gcc_assert (gsi_stmt (gsi2) == stmt);
+      gimple_assign_set_rhs_from_tree (&gsi2, new_ssa_name);
+      fold_stmt (&gsi2);
+      update_stmt (stmt);

In this case we're replacing the statement in its original position,
so there's no real need to use a temporary.  It seems better to
change the rhs_code, rhs1 and rhs2 of stmt in-place, with the same
lhs as before.

Yes, that's cleaner.

@@ -762,6 +953,23 @@ pass_cse_reciprocals::execute (function *fun)
         if (optimize_bb_for_size_p (bb))
           continue;

Seems unnecessary to skip the new optimisation when optimising for size.
Like you say, it saves a multiplication overall. Also:

Indeed.

+      if (flag_unsafe_math_optimizations)
+    {
+      for (gimple_stmt_iterator gsi = gsi_after_labels (bb);
+           !gsi_end_p (gsi);
+           gsi_next (&gsi))
+        {
+          gimple *stmt = gsi_stmt (gsi);
+
+          if (gimple_has_lhs (stmt)
+          && (def = SINGLE_SSA_TREE_OPERAND (stmt, SSA_OP_DEF)) != NULL
+          && FLOAT_TYPE_P (TREE_TYPE (def))
+          && TREE_CODE (def) == SSA_NAME
+          && is_gimple_assign (stmt)
+          && gimple_assign_rhs_code (stmt) == RDIV_EXPR)
+        optimize_recip_sqrt (&gsi, def);
+        }
+    }

It looks like this could safely be done in one of the existing walks
(e.g. the execute_cse_reciprocals_1 one, if we do this when optimising
for size).

You're right. I've moved this into one of the walks above this.

Bootstrapped and tested on aarch64-none-linux-gnu and
x86_64-unknown-linux-gnu.
CC'ing richi as he's reviewed these kinds of patches in the past.

Is this ok for trunk?

I wonder how it interacts with execute_cse_reciprocals_1 given it
introduces a division 1.0 / a which will be not processed by
execute_cse_reciprocals_1 given that operates by walking uses of 'a'.
That may be just a missed optimization of course.

Hmm, I believe right now it doesn't interact with execute_cse_reciprocals_1 as
it's either one or the other.
I've left it as it is for now, but would you like us to call
execute_cse_reciprocals_1 on the potentially transformed division?

I think that wouldn't "fix" it given execute_cse_reciprocals_1 works by
seeing all reciprocal uses of 'a' so it may have alrady seen two and
you introduced the third.  But we can leave "solving" this for
future enhacements (I'd avoid doing two passes over the IL just because
of said possibility right now).

+  if (has_other_use)
+    {
+      /* Using the two temporaries tmp1, tmp2 from above
+        the original x is now:
+        x = tmp1 * tmp2.  */
+      gcc_assert (mult_ssa_name);
+      gcc_assert (sqr_ssa_name);
+
+      gimple_assign_set_rhs1 (stmt, mult_ssa_name);
+      gimple_assign_set_rhs2 (stmt, sqr_ssa_name);
+      gimple_assign_set_rhs_code (stmt, MULT_EXPR);
+      fold_stmt_inplace (def_gsi);
+      update_stmt (stmt);
+    }

so you are leaving the original stmt in place unchanged even if it is
not used?  Why?  Note that with -fno-call-exceptions this stmt may
throw, so you should arrange to code-generate 1./a in place of the
original division to preserve EH behavior.  Watch out because then
with other uses you have to find a place to insert its computation.

Ok. These are oversights on my part. I've updated the patch to modify the
original division in place. The multiplication is placed after it.

If the division throws internally you can't insert after it, you
have to insert on the single non-EH edge outgoing from the BB
with the division.  Just try a C++ testcase with -fnon-call-exceptions
and a try {} catch(...) {} around.

Not sure if it's worth to handle so bailing out if
stmt_can_throw_internal (stmt) would be an option as well.

+      if (is_square_of (stmt2, x))
+       {
+         if (!sqr_stmts.contains (stmt2))
+           sqr_stmts.safe_push (stmt2);
+       }

this is quadratic in the number of square stmts...  please consider
making sqr_stmts a bitmap of SSA defs (so the stmt you have now
is then SSA_NAME_DEF_STMT (ssa_name (bitmap-element))).

Done. In practice I didn't see there being more than one such use, I expect
them
to be CSE'd, but maybe if they're in different basic blocks...

Oh, it just occured to me you could use FOR_EACH_IMM_USE_STMT ()
which will only get you each stmt once... ;)  Sorry for the misleading
bitmap suggestion.

You do not seem to restrict placement of the use stmts but insert
before the 1/sqrt(a) stmt.  That possibly hoists the multiplications
where they are not needed.  Consider

   x = 1./sqrt(a);
   if (l)
     l1 = x * 3.;
   else if (l2)
     l1 = x * x;
   else if (l3)
     l1 = a * x;

or similar where on the path to x * 3. you now perform two extra
multiplications.

Ok, I've restricted the optimisation somewhat.
Now if there's an other use it won't perform the transformation unless there
is already
a multiplication present on the main path. That way it won't introduce
multiplications
on paths where there aren't any.

OK.

Your testcases do not cover the case of other uses at all.  Or of
EH.

gcc.dg/recip_sqrt_mult_1.c should handle the other uses case as tmp is a
global
being written to. I've added more testcases involving uses in different basic
blocks
and a g++.dg testcase with -fnon-call-exceptions.
I'm not sure what functionality to test though apart from that it doesn't ICE
and does
the transformations I expect.

That's good enough I guess.  I see you do have a testcase with
try/catch so I wonder why foo4 doesn't ICE when you insert after
the throwing stmt...  maybe it doesn't throw?  Ah - they probably
fail your new mult_on_main_path test?


Yeah. I did manage to reproduce an ICE eventually by removing that check
and enabling -ftrapping-math.

I think I'll just not enter this at all if stmt_can_throw_internal (stmt) like 
you suggested.

Thanks,
Kyrill

Richard.

Bootstrapped and tested on aarch64-none-linux-gnu.
Is this version better?

Thanks,
Kyrill


2018-09-03  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

     * tree-ssa-math-opts.c (is_mult_by): New function.
     (is_square_of): Use the above.
     (optimize_recip_sqrt): New function.
     (pass_cse_reciprocals::execute): Use the above.

2018-09-03  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

     * gcc.dg/recip_sqrt_mult_1.c: New test.
     * gcc.dg/recip_sqrt_mult_2.c: Likewise.
     * gcc.dg/recip_sqrt_mult_3.c: Likewise.
     * gcc.dg/recip_sqrt_mult_4.c: Likewise.
     * gcc.dg/recip_sqrt_mult_5.c: Likewise.
     * g++.dg/recip_sqrt_mult_1.c: Likewise.

Re: [PATCH] Optimise sqrt reciprocal multiplications

Reply via email to