On Wed, 5 Sep 2018, Kyrill Tkachov wrote: > On 04/09/18 17:52, Kyrill Tkachov wrote: > > > > On 04/09/18 15:31, Richard Biener wrote: > >> On Tue, 4 Sep 2018, Kyrill Tkachov wrote: > >>> Hi Richard, > >>> > >>> On 31/08/18 12:07, Richard Biener wrote: > >>>> On Thu, 30 Aug 2018, Kyrill Tkachov wrote: > >>>>> Ping. > >>>>> > >>>>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01496.html > >>>>> > >>>>> Thanks, > >>>>> Kyrill > >>>>> > >>>>> On 23/08/18 18:09, Kyrill Tkachov wrote: > >>>>>> Hi Richard, > >>>>>> > >>>>>> On 23/08/18 11:13, Richard Sandiford wrote: > >>>>>>> Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> writes: > >>>>>>>> Hi all, > >>>>>>>> > >>>>>>>> This patch aims to optimise sequences involving uses of 1.0 / sqrt > (a) > >>>>>>>> under -freciprocal-math and -funsafe-math-optimizations. > >>>>>>>> In particular consider: > >>>>>>>> > >>>>>>>> x = 1.0 / sqrt (a); > >>>>>>>> r1 = x * x; // same as 1.0 / a > >>>>>>>> r2 = a * x; // same as sqrt (a) > >>>>>>>> > >>>>>>>> If x, r1 and r2 are all used further on in the code, this can be > >>>>>>>> transformed into: > >>>>>>>> tmp1 = 1.0 / a > >>>>>>>> tmp2 = sqrt (a) > >>>>>>>> tmp3 = tmp1 * tmp2 > >>>>>>>> x = tmp3 > >>>>>>>> r1 = tmp1 > >>>>>>>> r2 = tmp2 > >>>>>>> Nice optimisation :-) Someone who knows the pass better should > review, > >>>>>>> but: > >>>>>> Thanks for the review. > >>>>>>> There seems to be an implicit assumption that this is a win even > >>>>>>> when the r1 and r2 assignments are only conditionally executed. > >>>>>>> That's probably true, but it might be worth saying explicitly. > >>>>>> I'll admit I had not considered that case. > >>>>>> I think it won't make a difference in practice, as the really expensive > >>>>>> operations here > >>>>>> are the sqrt and the division and they are on the executed path in > either > >>>>>> case and them > >>>>>> becoming independent should be a benefit of its own. > >>>>>>>> +/* Return TRUE if USE_STMT is a multiplication of DEF by A. */ > >>>>>>>> + > >>>>>>>> +static inline bool > >>>>>>>> +is_mult_by (gimple *use_stmt, tree def, tree a) > >>>>>>>> +{ > >>>>>>>> + if (gimple_code (use_stmt) == GIMPLE_ASSIGN > >>>>>>>> + && gimple_assign_rhs_code (use_stmt) == MULT_EXPR) > >>>>>>>> + { > >>>>>>>> + tree op0 = gimple_assign_rhs1 (use_stmt); > >>>>>>>> + tree op1 = gimple_assign_rhs2 (use_stmt); > >>>>>>>> + > >>>>>>>> + return (op0 == def && op1 == a) > >>>>>>>> + || (op0 == a && op1 == def); > >>>>>>>> + } > >>>>>>>> + return 0; > >>>>>>>> +} > >>>>>>> Seems like is_square_of could now be a light-weight wrapper around > this. > >>>>>> Indeed, I've done the wrapping now. > >>>>>>>> @@ -652,6 +669,180 @@ execute_cse_reciprocals_1 (gimple_stmt_iterator > >>>>>>>> *def_gsi, tree def) > >>>>>>>> occ_head = NULL; > >>>>>>>> } > >>>>>>>> +/* Transform sequences like > >>>>>>>> + x = 1.0 / sqrt (a); > >>>>>>>> + r1 = x * x; > >>>>>>>> + r2 = a * x; > >>>>>>>> + into: > >>>>>>>> + tmp1 = 1.0 / a; > >>>>>>>> + tmp2 = sqrt (a); > >>>>>>>> + tmp3 = tmp1 * tmp2; > >>>>>>>> + x = tmp3; > >>>>>>>> + r1 = tmp1; > >>>>>>>> + r2 = tmp2; > >>>>>>>> + depending on the uses of x, r1, r2. This removes one > multiplication > >>>>>>>> and > >>>>>>>> + allows the sqrt and division operations to execute in parallel. > >>>>>>>> + DEF_GSI is the gsi of the initial division by sqrt that defines > >>>>>>>> + DEF (x in the example abovs). */ > >>>>>>>> + > >>>>>>>> +static void > >>>>>>>> +optimize_recip_sqrt (gimple_stmt_iterator *def_gsi, tree def) > >>>>>>>> +{ > >>>>>>>> + use_operand_p use_p; > >>>>>>>> + imm_use_iterator use_iter; > >>>>>>>> + gimple *stmt = gsi_stmt (*def_gsi); > >>>>>>>> + tree x = def; > >>>>>>>> + tree orig_sqrt_ssa_name = gimple_assign_rhs2 (stmt); > >>>>>>>> + tree div_rhs1 = gimple_assign_rhs1 (stmt); > >>>>>>>> + > >>>>>>>> + if (TREE_CODE (orig_sqrt_ssa_name) != SSA_NAME > >>>>>>>> + || TREE_CODE (div_rhs1) != REAL_CST > >>>>>>>> + || !real_equal (&TREE_REAL_CST (div_rhs1), &dconst1)) > >>>>>>>> + return; > >>>>>>>> + > >>>>>>>> + gimple *sqrt_stmt = SSA_NAME_DEF_STMT (orig_sqrt_ssa_name); > >>>>>>>> + if (!is_gimple_call (sqrt_stmt) > >>>>>>>> + || !gimple_call_lhs (sqrt_stmt)) > >>>>>>>> + return; > >>>>>>>> + > >>>>>>>> + gcall *call = as_a <gcall *> (sqrt_stmt); > >>>>>>> Very minor, but: > >>>>>>> > >>>>>>> gcall *sqrt_stmt > >>>>>>> = dyn_cast <gcall *> (SSA_NAME_DEF_STMT (orig_sqrt_ssa_name)); > >>>>>>> if (!sqrt_stmt || !gimple_call_lhs (sqrt_stmt)) > >>>>>>> return; > >>>>>>> > >>>>>>> would avoid the need for the separate as_a<>, and would mean that > >>>>>>> we only call gimple_call_* on gcalls. > >>>>>> Ok. > >>>>>>>> + if (has_other_use) > >>>>>>>> + { > >>>>>>>> + /* Using the two temporaries tmp1, tmp2 from above > >>>>>>>> + the original x is now: > >>>>>>>> + x = tmp1 * tmp2. */ > >>>>>>>> + gcc_assert (mult_ssa_name); > >>>>>>>> + gcc_assert (sqr_ssa_name); > >>>>>>>> + gimple_stmt_iterator gsi2 = gsi_for_stmt (stmt); > >>>>>>>> + > >>>>>>>> + tree new_ssa_name > >>>>>>>> + = make_temp_ssa_name (TREE_TYPE (a), NULL, > >>>>>>>> "recip_sqrt_transformed"); > >>>>>>>> + gimple *new_stmt > >>>>>>>> + = gimple_build_assign (new_ssa_name, MULT_EXPR, > >>>>>>>> + mult_ssa_name, sqr_ssa_name); > >>>>>>>> + gsi_insert_before (&gsi2, new_stmt, GSI_SAME_STMT); > >>>>>>>> + gcc_assert (gsi_stmt (gsi2) == stmt); > >>>>>>>> + gimple_assign_set_rhs_from_tree (&gsi2, new_ssa_name); > >>>>>>>> + fold_stmt (&gsi2); > >>>>>>>> + update_stmt (stmt); > >>>>>>> In this case we're replacing the statement in its original position, > >>>>>>> so there's no real need to use a temporary. It seems better to > >>>>>>> change the rhs_code, rhs1 and rhs2 of stmt in-place, with the same > >>>>>>> lhs as before. > >>>>>> Yes, that's cleaner. > >>>>>>>> @@ -762,6 +953,23 @@ pass_cse_reciprocals::execute (function *fun) > >>>>>>>> if (optimize_bb_for_size_p (bb)) > >>>>>>>> continue; > >>>>>>> Seems unnecessary to skip the new optimisation when optimising for > size. > >>>>>>> Like you say, it saves a multiplication overall. Also: > >>>>>> Indeed. > >>>>>>>> + if (flag_unsafe_math_optimizations) > >>>>>>>> + { > >>>>>>>> + for (gimple_stmt_iterator gsi = gsi_after_labels (bb); > >>>>>>>> + !gsi_end_p (gsi); > >>>>>>>> + gsi_next (&gsi)) > >>>>>>>> + { > >>>>>>>> + gimple *stmt = gsi_stmt (gsi); > >>>>>>>> + > >>>>>>>> + if (gimple_has_lhs (stmt) > >>>>>>>> + && (def = SINGLE_SSA_TREE_OPERAND (stmt, SSA_OP_DEF)) != > NULL > >>>>>>>> + && FLOAT_TYPE_P (TREE_TYPE (def)) > >>>>>>>> + && TREE_CODE (def) == SSA_NAME > >>>>>>>> + && is_gimple_assign (stmt) > >>>>>>>> + && gimple_assign_rhs_code (stmt) == RDIV_EXPR) > >>>>>>>> + optimize_recip_sqrt (&gsi, def); > >>>>>>>> + } > >>>>>>>> + } > >>>>>>> It looks like this could safely be done in one of the existing walks > >>>>>>> (e.g. the execute_cse_reciprocals_1 one, if we do this when optimising > >>>>>>> for size). > >>>>>> You're right. I've moved this into one of the walks above this. > >>>>>> > >>>>>> Bootstrapped and tested on aarch64-none-linux-gnu and > >>>>>> x86_64-unknown-linux-gnu. > >>>>>> CC'ing richi as he's reviewed these kinds of patches in the past. > >>>>>> > >>>>>> Is this ok for trunk? > >>>> I wonder how it interacts with execute_cse_reciprocals_1 given it > >>>> introduces a division 1.0 / a which will be not processed by > >>>> execute_cse_reciprocals_1 given that operates by walking uses of 'a'. > >>>> That may be just a missed optimization of course. > >>> Hmm, I believe right now it doesn't interact with > execute_cse_reciprocals_1 as > >>> it's either one or the other. > >>> I've left it as it is for now, but would you like us to call > >>> execute_cse_reciprocals_1 on the potentially transformed division? > >> I think that wouldn't "fix" it given execute_cse_reciprocals_1 works by > >> seeing all reciprocal uses of 'a' so it may have alrady seen two and > >> you introduced the third. But we can leave "solving" this for > >> future enhacements (I'd avoid doing two passes over the IL just because > >> of said possibility right now). > >>>> + if (has_other_use) > >>>> + { > >>>> + /* Using the two temporaries tmp1, tmp2 from above > >>>> + the original x is now: > >>>> + x = tmp1 * tmp2. */ > >>>> + gcc_assert (mult_ssa_name); > >>>> + gcc_assert (sqr_ssa_name); > >>>> + > >>>> + gimple_assign_set_rhs1 (stmt, mult_ssa_name); > >>>> + gimple_assign_set_rhs2 (stmt, sqr_ssa_name); > >>>> + gimple_assign_set_rhs_code (stmt, MULT_EXPR); > >>>> + fold_stmt_inplace (def_gsi); > >>>> + update_stmt (stmt); > >>>> + } > >>>> > >>>> so you are leaving the original stmt in place unchanged even if it is > >>>> not used? Why? Note that with -fno-call-exceptions this stmt may > >>>> throw, so you should arrange to code-generate 1./a in place of the > >>>> original division to preserve EH behavior. Watch out because then > >>>> with other uses you have to find a place to insert its computation. > >>> Ok. These are oversights on my part. I've updated the patch to modify the > >>> original division in place. The multiplication is placed after it. > >> If the division throws internally you can't insert after it, you > >> have to insert on the single non-EH edge outgoing from the BB > >> with the division. Just try a C++ testcase with -fnon-call-exceptions > >> and a try {} catch(...) {} around. > >> > >> Not sure if it's worth to handle so bailing out if > >> stmt_can_throw_internal (stmt) would be an option as well. > >>>> + if (is_square_of (stmt2, x)) > >>>> + { > >>>> + if (!sqr_stmts.contains (stmt2)) > >>>> + sqr_stmts.safe_push (stmt2); > >>>> + } > >>>> > >>>> this is quadratic in the number of square stmts... please consider > >>>> making sqr_stmts a bitmap of SSA defs (so the stmt you have now > >>>> is then SSA_NAME_DEF_STMT (ssa_name (bitmap-element))). > >>> Done. In practice I didn't see there being more than one such use, I > expect > >>> them > >>> to be CSE'd, but maybe if they're in different basic blocks... > >> Oh, it just occured to me you could use FOR_EACH_IMM_USE_STMT () > >> which will only get you each stmt once... ;) Sorry for the misleading > >> bitmap suggestion. > >>>> You do not seem to restrict placement of the use stmts but insert > >>>> before the 1/sqrt(a) stmt. That possibly hoists the multiplications > >>>> where they are not needed. Consider > >>>> > >>>> x = 1./sqrt(a); > >>>> if (l) > >>>> l1 = x * 3.; > >>>> else if (l2) > >>>> l1 = x * x; > >>>> else if (l3) > >>>> l1 = a * x; > >>>> > >>>> or similar where on the path to x * 3. you now perform two extra > >>>> multiplications. > >>> Ok, I've restricted the optimisation somewhat. > >>> Now if there's an other use it won't perform the transformation unless > there > >>> is already > >>> a multiplication present on the main path. That way it won't introduce > >>> multiplications > >>> on paths where there aren't any. > >> OK. > >>>> Your testcases do not cover the case of other uses at all. Or of > >>>> EH. > >>> gcc.dg/recip_sqrt_mult_1.c should handle the other uses case as tmp is a > >>> global > >>> being written to. I've added more testcases involving uses in different > basic > >>> blocks > >>> and a g++.dg testcase with -fnon-call-exceptions. > >>> I'm not sure what functionality to test though apart from that it doesn't > ICE > >>> and does > >>> the transformations I expect. > >> That's good enough I guess. I see you do have a testcase with > >> try/catch so I wonder why foo4 doesn't ICE when you insert after > >> the throwing stmt... maybe it doesn't throw? Ah - they probably > >> fail your new mult_on_main_path test? > > > > Yeah. I did manage to reproduce an ICE eventually by removing that check > > and enabling -ftrapping-math. > > > > I think I'll just not enter this at all if stmt_can_throw_internal (stmt) > like you suggested. > > > And here is the version that does that. It also reverts to using a vector for > the sqr_stmts for consistency > and uses FOR_EACH_IMM_USE_STMT to iterate over the use statements. > > Bootstrapped and tested on aarch64-none-linux-gnu. > > Ok?
OK. Thanks, Richard. > Thanks, > Kyrill > > 2018-09-03 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > * tree-ssa-math-opts.c (is_mult_by): New function. > (is_square_of): Use the above. > (optimize_recip_sqrt): New function. > (pass_cse_reciprocals::execute): Use the above. > > 2018-09-03 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > * gcc.dg/recip_sqrt_mult_1.c: New test. > * gcc.dg/recip_sqrt_mult_2.c: Likewise. > * gcc.dg/recip_sqrt_mult_3.c: Likewise. > * gcc.dg/recip_sqrt_mult_4.c: Likewise. > * gcc.dg/recip_sqrt_mult_5.c: Likewise. > * g++.dg/recip_sqrt_mult_1.C: Likewise. > * g++.dg/recip_sqrt_mult_2.C: Likewise.