On Wed, May 2, 2018 at 12:44 AM, Jeff Law <l...@redhat.com> wrote: > > Time to come back to this now that the trunk is open again... > > > -------- Forwarded Message -------- > Subject: [GCC 9][RFC][PATCH] Optimize PHIs with constant arguments better > Date: Thu, 30 Nov 2017 16:24:33 -0700 > From: Jeff Law <l...@redhat.com> > To: gcc-patches <gcc-patches@gcc.gnu.org> > > > > This addresses some code quality regressions with some work scheduled > for gcc-9. However, if anyone is aware of BZs that could be helped with > this patch, don't hesitate to pass them along. Depending on their > severity we could always re-evaluate the plan for this patch. > > The patch optimizes sequences like this: > > > # iftmp.4_16 = PHI <0(7), 1(8), 0(9)> > _30 = (unsigned char) iftmp.4_16; > > i.e., we have a PHI where all its arguments are known (differing) > constants. The PHI feeds an operation which can be compile-time > evaluated for all the PHI's arguments. > > This arises most often where the result of the PHI is type converted via > NOP_EXPR. But it also happens for boolean comparisons, arithmetic > and logicals, and other type conversions. > > We can optimize away the statement by creating a new PHI node holding > the result of the statement applied to each of the original PHI's > constant operands. So for the fragment above we'd generate: > > # iftmp.4_16 = PHI <0(7), 1(8), 0(9)> > # _30 = PHI <0(7), 1(8), 0(9)> > > These kinds of backward propagations can cascade. Here's an example I > saw during analysis of test results: > > # m_5 = PHI <10(5), 12(6), 14(7)> > <L13>: > _2 = (long unsigned int) m_5; > _3 = _2 * 32; > goto <bb 12>; [100.00%] > > After this patch the two statements have been eliminated in favor of > generating PHIs with constant arguments. After PHI propagation we have: > > # m_5 = PHI <10(5), 12(6), 14(7)> > # _2 = PHI <10(5), 12(6), 14(7)> > # _3 = PHI <320(5), 384(6), 448(7)> > <L13>: > goto <bb 12>; [100.00%] > > DCE will come along and wipe out m_5 and _2 if they are unused. > > I initially covered just NOP_EXPR in the proof of concept. But it's > trivial to extend to cover other unaries, and binary ops/comparisons where > one operand was already a constant, so I did that. This applies > surprisingly often during a bootstrap. An earlier version (which didn't > handle multiple uses of the result of the PHI) triggered over 6000 times > during a bootstrap: > > NOP_EXPR 5161 > LT_EXPR 658 > GE_EXPR 504 > NE_EXPR 310 > BIT_NOT_EXPR 295 > BIT_AND_EXPR 94 > PLUS_EXPR 77 > NEGATE_EXPR 48 > LSHIFT_EXPR 40 > EQ_EXPR 34 > GT_EXPR 16 > BIT_IOR_EXPR 10 > MINUS_EXPR 8 > > There are actually other cases we could handle, where > > x = PHI (a, 0); > y = a & x; > > turns into > > x = PHI (a, 0); > y = PHI (a, 0); > > > I'm not sure how often these non-constant cases happen -- I haven't > tried to support them or gather any data on how often they arise. > > FWIW, there are cases where the PHI arguments are constant, but we still > can't simplify. For example: > > > x = PHI (0, 1, 2) > > y = 1234 / x; > > > When we process this we'll try to simplify 1234 / 0 to a constant, which > fails. We have to gracefully remove the partially initialized PHI node > in that case. This is tested by existing tests in the suite. > > You'll also see that certain tests in the suite, such as pr61839_3 and > ssa-pre-4.c, are twiddled. I've suppressed phi backwards propagation in > the original test so that it's not compromised.
There's also a variant > of each test which verifies that the transformation is handled by phi > back propagation. > > Bootstrapped and regression tested on x86, both in isolation and in a > larger patch series. > > I see there are a couple of whitespace issues in the patch. I'll fix those. > I'd like to get comments/review on the meat of the patch though.
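For concreteness, a minimal source-level shape that produces the pattern the patch targets looks roughly like the following (a hand-written illustration, not one of the testcases in the patch):

  int f (int c)
  {
    int x = c ? 3 : 4;
    return x + 1;
  }

Here x becomes a PHI of the constants 3 and 4, and the x + 1 use can be rewritten as a PHI of 4 and 5; the addition and, if otherwise unused, the original PHI then become dead.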
As a general comment I'd note what this patch does is sth that PRE does as well (and quite aggressively so). So I'd say if we kind-of duplicate this transform we should tame it down (see comments below). Given it duplicates functionality I guess that you placed it inside phiprop was mostly due to pass ordering as you mention jump threading can happen earlier? phiopt would have been another natural choice. > > > Jeff > > > > * tree-ssa-phiprop.c (propagate_with_phi_1): Renamed from > propagate_with_phi. > (stmt_likely_produces_constant_result): New function. > (fold_use_into_phi, propagate_with_phi_2): Likewise. > (pass_phiprop::execute): Corresponding changes. Call > propagate_with_phi_2. > > * gcc.dg/tree-ssa/pr61743-1.c: Use -fno-tree-phiprop. > * gcc.dg/tree-ssa/pr61839_1.c: Likewise. > * gcc.dg/tree-ssa/pr61839_3.c: Likewise. > * gcc.dg/tree-ssa/ssa-pre-4.c: Likewise. > > * gcc.dg/tree-ssa/pr61743-1a.c: New test. > * gcc.dg/tree-ssa/pr61839_1a.c: Likewise. > * gcc.dg/tree-ssa/pr61839_3a.c: Likewise. > * gcc.dg/tree-ssa/ssa-pre-4a.c: New test. > > diff --git a/gcc/tree-ssa-phiprop.c b/gcc/tree-ssa-phiprop.c > index 494158b..f2ec2b3 100644 > --- a/gcc/tree-ssa-phiprop.c > +++ b/gcc/tree-ssa-phiprop.c > @@ -259,8 +259,8 @@ chk_uses (tree, tree *idx, void *data) > with aliasing issues as we are moving memory reads. */ > > static bool > -propagate_with_phi (basic_block bb, gphi *phi, struct phiprop_d *phivn, > - size_t n) > +propagate_with_phi_1 (basic_block bb, gphi *phi, struct phiprop_d *phivn, > + size_t n) > { > tree ptr = PHI_RESULT (phi); > gimple *use_stmt; > @@ -440,6 +440,207 @@ next:; > return phi_inserted; > } > > +/* Return TRUE if USE_STMT, which uses PHI_RESULT will likely produce > + a constant result if PHI_RESULT is itself a constant. Else return > + FALSE. > + > + It is safe for this routine to always return TRUE. If the result > + is not a constant it will be detected and properly handled later. > + > + This is overly conservative. If USE_STMT were to produce an SSA_NAME, > + then that would still work for our purposes. */ > + > +static bool > +stmt_likely_produces_constant_result (gimple *use_stmt, tree phi_result) > +{ > + enum tree_code code = gimple_assign_rhs_code (use_stmt); > + switch (TREE_CODE_CLASS (code)) > + { > + /* With few exceptions, a unary operator applied to a constant > + will result in a constant. So they are OK without digging > + deeper into the operator/operands. */ > + case tcc_unary: > + return true; > + > + /* For binary and comparisons we'll generally be able to generate > + a constant if only one operand is an SSA_NAME. */ > + case tcc_binary: > + case tcc_comparison: > + { > + tree rhs1 = gimple_assign_rhs1 (use_stmt); > + tree rhs2 = gimple_assign_rhs2 (use_stmt); > + > + /* We need to check for RES op CONST and CONST op RES. Consider > + MINUS_EXPR or MIN/MAX where RES could be the first or the second > + argument. */ > + if (rhs1 == phi_result > + && TREE_CODE_CLASS (TREE_CODE (rhs2)) == tcc_constant) > + return true; > + > + if (rhs2 == phi_result > + && TREE_CODE_CLASS (TREE_CODE (rhs1)) == tcc_constant) > + return true; > + > + /* We do not try to handle two SSA_NAME operands. */ > + return false; > + } > + > + /* There might be some tcc_reference or tcc_exceptional types we > + could handle with deeper investigation. COND_EXPR comes to mind. */ > + default: > + return false; > + } > +} > + > +/* PHI in block BB produces RESULT. PHI_RESULT is used in USE_STMT. > + > + All of PHIs arguments are simple constants. 
See if propagating > + the PHI arguments into USE_STMT produces a constant. If that is > + the case for all the PHI arguments, then STMT is replaced with a > + new PHI and we return TRUE. Else make no changes to the IL and > + return FALSE. > + > + When applied this transformation reduces runtime computations and > + replaces them with either constant initializations. This also enables > + jump threading to occur earlier in the optimization pipeline. */ > + > +static bool > +fold_use_into_phi (gphi *phi, gimple *use_stmt, > + basic_block bb, tree phi_result) > +{ > + /* Now verify the use is an assigment to an SSA_NAME. */ > + if (!is_gimple_assign (use_stmt) > + || TREE_CODE (gimple_assign_lhs (use_stmt)) != SSA_NAME) > + return false; > + > + /* If USE_STMT does not likely produce a constant result, then > + do not try to fold this use. Note this is overly conservative > + as we could handle USE_STMT simplifying to an SSA_NAME rather than > + just constants. */ > + if (!stmt_likely_produces_constant_result (use_stmt, phi_result)) > + return false; > + > + /* Build a new PHI node to replace the definition of the USE_STMT's LHS. > */ > + gphi *new_phi = create_phi_node (NULL_TREE, bb); > + > + /* Now fill in its arguments by applying the operator to each > + argument of the original PHI. */ So rather than creating the PHI immediately please do sth like auto_vec<tree, 8> new_args; new_args.reserve_exact (gimple_phi_num_args (phi)); FOR_EACH_EDGE (...) push-to-new-args-vec-or bail out alloc PHI and add args in second sweep over edges. > + edge e; > + edge_iterator ei; > + tree use_result = gimple_assign_lhs (use_stmt); > + tree type = TREE_TYPE (use_result); > + enum tree_code code = gimple_assign_rhs_code (use_stmt); > + FOR_EACH_EDGE (e, ei, bb->preds) > + { > + tree old_arg = PHI_ARG_DEF_FROM_EDGE (phi, e); > + tree new_arg; > + > + switch (TREE_CODE_CLASS (code)) > + { > + case tcc_unary: > + new_arg = fold_unary_to_constant (code, type, old_arg); > + break; > + > + case tcc_binary: > + case tcc_comparison: > + { > + tree rhs1 = gimple_assign_rhs1 (use_stmt); > + tree rhs2 = gimple_assign_rhs2 (use_stmt); > + if (rhs1 == phi_result) > + new_arg = fold_binary (code, type, > + old_arg, gimple_assign_rhs2 (use_stmt)); > + else if (rhs2 == phi_result) > + new_arg = fold_binary (code, type, > + gimple_assign_rhs1 (use_stmt), old_arg); > + else > + gcc_unreachable (); > + break; > + } So I wonder if we don't want to be more aggressive here and not care about PHI arg constant-ness or other operator state of the operation. You should be able to do sth like (with the single-use constraint!) use_operand_p use_p; gimple *use_stmt; single_imm_use (gimple_phi_result (phi), &use_p, &use_stmt); tree orig_arg = USE_FROM_PTR (use_p); in the loop then do *(use_p->use) = phi-arg; // do not use SET_USE to keep immediate uses intact tree res = gimple_fold_stmt_to_constant (USE_STMT (use_p), NULL); if (is_gimple_min_invariant (res)) OK, record it; else { *(use_p->use) = orig_arg; return false; } I suspect the cases where we do have the first N edges to optimize to a constant but a later one will fail the optimization isn't common. As a early bail-out we probably want to avoid this transform on loop headers (thus when backedges are involved). Note that for the case where we have n > 1 edges with constant propagation result it might be profitable to create a forwarder for the constant or non-constant parts thus end up with cst_result_1 = PHI< .... >; goto merge; orig_phi_2 = PHI< edges with non-cst-result... 
>; noncst_result_3 = OP (orig_phi_2); res_4 = PHI <cst_result_1, noncst_result_3> merge: where of course whether we may hoist OP () needs to be checked. > + default: > + gcc_unreachable (); > + } > + > + /* We did not get a constant. This can happen (imagine division > + by zero). We have to remove the PHI from the IL, as the PHI > + is only partially constructed. */ > + if (!new_arg) > + { > + /* Set the PHI_RESULT to a new, throw-away SSA_NAME. This avoids > + problems unlinking the "uses" of a currently NULL PHI_RESULT. */ > + gimple_phi_set_result (new_phi, make_ssa_name (type)); > + > + /* Actually remove the PHI from the IL. It is safe to remove the > + PHI if some of its PHI arguments are still uninitialized. */ > + gimple_stmt_iterator gsi = gsi_for_stmt (new_phi); > + remove_phi_node (&gsi, true); > + return false; > + } > + > + source_location locus = gimple_phi_arg_location_from_edge (phi, e); > + add_phi_arg (new_phi, new_arg, e, locus); > + } > + > + gimple_phi_set_result (new_phi, use_result); > + > + /* At this point we have our new PHI installed where we want it. > + Delete the old PHI and USE_STMT. */ > + gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt); > + gsi_remove (&gsi, true); > + return true; > +} > + > +/* Look for sequences like this: > + > + # iftmp.4_16 = PHI <0(7), 1(8), 0(9)> > + _30 = (unsigned char) iftmp.4_16; > + > + We can "hoist" the conversion into the PHI and get it for free, > generating: > + > + # iftmp.4_16 = PHI <0(7), 1(8), 0(9)> > + # _30 = PHI <0(7), 1(8), 0(9)> > + > + DCE will eliminate the first PHI. In addition to getting the conversion > + for free, removal of the conversion improves backwards threading. */ > + > +static bool > +propagate_with_phi_2 (basic_block bb, gphi *phi) > +{ > + /* First verify that all the arguments of the PHI are constants. > + > + This is overly conservative. Consider: > + > + y = PHI (x, -1, 0); > + z = y & x; > + > + We could transform that into: > + > + y = PHI (x, -1, 0); > + z = PHI (x, x, 0); > + > + It's not clear how important those cases are and we punt them > + for now. */ > + use_operand_p arg_p; > + ssa_op_iter i; > + FOR_EACH_PHI_ARG (arg_p, phi, i, SSA_OP_USE) > + { > + tree arg = USE_FROM_PTR (arg_p); > + if (TREE_CODE_CLASS (TREE_CODE (arg)) != tcc_constant) > + return false; > + } > + > + /* Not iterate over each immediate use to see if the statement can > + be implemented as a PHI rather than a real statement. */ > + gimple *use_stmt; > + imm_use_iterator ui; > + bool retval = false; > + tree phi_result = PHI_RESULT (phi); > + FOR_EACH_IMM_USE_STMT (use_stmt, ui, phi_result) > + retval |= fold_use_into_phi (phi, use_stmt, bb, phi_result); So you say DCE will remove the PHI but then you handle multiple uses where some of them might fail to transform. Thus should we restrict this to single-uses? That way we also avoid to end up with too many PHIs and we can remove the original PHI immediately. > + return retval; > +} > + > /* Main entry for phiprop pass. 
*/ > > namespace { > @@ -492,7 +693,10 @@ pass_phiprop::execute (function *fun) > single_succ (ENTRY_BLOCK_PTR_FOR_FN (fun))); > FOR_EACH_VEC_ELT (bbs, i, bb) > for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi)) > - did_something |= propagate_with_phi (bb, gsi.phi (), phivn, n); > + { > + did_something |= propagate_with_phi_1 (bb, gsi.phi (), phivn, n); > + did_something |= propagate_with_phi_2 (bb, gsi.phi ()); > + } > > if (did_something) > gsi_commit_edge_inserts (); > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr61743-1.c > b/gcc/testsuite/gcc.dg/tree-ssa/pr61743-1.c > index a5c83cf..7771044 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/pr61743-1.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr61743-1.c > @@ -1,5 +1,5 @@ > /* { dg-do compile } */ > -/* { dg-options "-O3 -funroll-loops -fno-tree-vectorize > -fdump-tree-cunroll-details -fdump-tree-cunrolli-details -fno-peel-loops" } */ > +/* { dg-options "-O3 -fno-tree-phiprop -funroll-loops -fno-tree-vectorize > -fdump-tree-cunroll-details -fdump-tree-cunrolli-details -fno-peel-loops" } */ > > #define N 8 > #define M 14 > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr61743-1a.c > b/gcc/testsuite/gcc.dg/tree-ssa/pr61743-1a.c > new file mode 100644 > index 0000000..dd64472 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr61743-1a.c > @@ -0,0 +1,10 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O3 -funroll-loops -fno-tree-vectorize > -fdump-tree-cunroll-details -fdump-tree-cunrolli-details -fdump-tree-phiprop > -fno-peel-loops" } */ > + > +#include "pr61743-1.c" > + > +/* { dg-final { scan-tree-dump-times "PHI <(\?:320|384|448)\\(\[0-9\]+\\), > (\?:320|384|448)\\(\[0-9\]+\\), (\?:320|384|448)\\(\[0-9\]+\\)>" 1 "phiprop"} > } */ > + > +/* The code with PHI back propagation is simpler and we unswitch more. */ > +/* { dg-final { scan-tree-dump-times "loop with 4 iterations completely > unrolled" 10 "cunroll" } } */ > +/* { dg-final { scan-tree-dump-times "loop with 9 iterations completely > unrolled" 2 "cunrolli" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_1.c > b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_1.c > index 9f8168a..5bf1982 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_1.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_1.c > @@ -1,6 +1,6 @@ > /* PR tree-optimization/61839. */ > /* { dg-do run } */ > -/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-optimized" } */ > +/* { dg-options "-O2 -fdump-tree-vrp1 -fno-tree-phiprop > -fdump-tree-optimized" } */ > /* { dg-require-effective-target int32plus } */ > > __attribute__ ((noinline)) > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_1a.c > b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_1a.c > new file mode 100644 > index 0000000..99a9016 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_1a.c > @@ -0,0 +1,11 @@ > +/* PR tree-optimization/61839. */ > +/* { dg-do run } */ > +/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-phiprop" } */ > +/* { dg-require-effective-target int32plus } */ > + > +#include "pr61839_1.c" > + > +/* Scan for c = 972195717) >> [0, 1] in function foo. */ > +/* { dg-final { scan-tree-dump-times "486097858 : 972195717" 1 "vrp1" } } */ > +/* Scan for c = 972195717) >> [2, 3] in function bar. 
*/ > +/* { dg-final { scan-tree-dump-times "PHI > <(?:243048929|121524464)\\(\[0-9\]+\\), > (?:243048929|121524464)\\(\[0-9\]+\\)>" 1 "phiprop"} } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3.c > b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3.c > index 5ceb073..ad908c0 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3.c > @@ -1,6 +1,6 @@ > /* PR tree-optimization/61839. */ > /* { dg-do run } */ > -/* { dg-options "-O2 -fdump-tree-vrp1 -fdump-tree-optimized" } */ > +/* { dg-options "-O2 -fno-tree-phiprop -fdump-tree-vrp1 > -fdump-tree-optimized" } */ > > __attribute__ ((noinline)) > int foo (int a, unsigned b) > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3a.c > b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3a.c > new file mode 100644 > index 0000000..3a4d800 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3a.c > @@ -0,0 +1,9 @@ > +/* PR tree-optimization/61839. */ > +/* { dg-do run } */ > +/* { dg-options "-O2 -fdump-tree-phiprop -fdump-tree-optimized" } */ > +#include "pr61839_3.c" > + > +/* Scan for a PHI with (12, 13) << 8 in function foo. */ > +/* { dg-final { scan-tree-dump-times "PHI <\(3072|3328\)\\(\[0-9\]+\\), > \(3072|3328\)\\(\[0-9\ > +]+\\)>" 1 "phiprop"} } */ > +/* { dg-final { scan-tree-dump-times "3072" 0 "optimized" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-4.c > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-4.c > index 5b4fd54..0bc5f73 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-4.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-4.c > @@ -1,5 +1,5 @@ > /* { dg-do compile } */ > -/* { dg-options "-O2 -fdump-tree-pre-stats" } */ > +/* { dg-options "-O2 -fno-tree-phiprop -fdump-tree-pre-stats" } */ > int foo(void) > { > int x, c, y; > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-4a.c > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-4a.c > new file mode 100644 > index 0000000..70e8b16 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-4a.c > @@ -0,0 +1,6 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-phiprop" } */ > +#include "ssa-pre-4.c" > +/* We should eliminate the x+1 computation from this routine, replacing > + it with a phi of 3, 4 */ > +/* { dg-final { scan-tree-dump-times "PHI <\[34\]\\(\[0-9\]+\\), > \[34\]\\(\[0-9\]+\\)>" 1 "phiprop"} } */ >
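To make the single_imm_use suggestion above a bit more concrete, here is an untested sketch of the shape I have in mind (the function name is made up, and the loop-header/backedge bail-out as well as the forwarder-block idea are left out):

static bool
fold_single_use_into_phi (basic_block bb, gphi *phi)
{
  use_operand_p use_p;
  gimple *use_stmt;

  /* Restrict the transform to a single immediate use so the original
     PHI can be removed right away and we do not pile up PHI nodes.  */
  if (!single_imm_use (gimple_phi_result (phi), &use_p, &use_stmt)
      || !is_gimple_assign (use_stmt)
      || TREE_CODE (gimple_assign_lhs (use_stmt)) != SSA_NAME)
    return false;

  tree orig_arg = USE_FROM_PTR (use_p);
  auto_vec<tree, 8> new_args;
  new_args.reserve_exact (gimple_phi_num_args (phi));

  edge e;
  edge_iterator ei;
  FOR_EACH_EDGE (e, ei, bb->preds)
    {
      /* Temporarily plug the PHI argument into the use and try to fold
	 the statement.  Do not use SET_USE so immediate uses stay intact.  */
      *(use_p->use) = PHI_ARG_DEF_FROM_EDGE (phi, e);
      tree res = gimple_fold_stmt_to_constant (use_stmt, NULL);
      *(use_p->use) = orig_arg;
      if (!res || !is_gimple_min_invariant (res))
	return false;
      new_args.quick_push (res);
    }

  /* Every incoming edge folded to a constant, so only now create the
     replacement PHI and retire the use statement.  */
  gphi *new_phi = create_phi_node (gimple_assign_lhs (use_stmt), bb);
  unsigned i = 0;
  FOR_EACH_EDGE (e, ei, bb->preds)
    add_phi_arg (new_phi, new_args[i++], e,
		 gimple_phi_arg_location_from_edge (phi, e));

  gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
  gsi_remove (&gsi, true);
  return true;
}

That way no partially constructed PHI ever has to be removed again, and the single-use restriction also avoids creating more PHIs than statements we remove.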