-----Original Message----- From: Jeff Law [mailto:l...@redhat.com] Sent: Friday, May 29, 2015 9:24 PM To: Ajit Kumar Agarwal; Richard Biener; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE
On 05/16/2015 06:49 AM, Ajit Kumar Agarwal wrote: > I have Designed and implemented with the following design for the path > splitting of the loops with conditional IF-THEN-ELSE. > The implementation has gone through the bootstrap for Microblaze > target along DEJA GNU regressions tests and running the MIBench/EEMBC > benchmarks. There is no regression seen in Deja GNU tests and Deja C++ > tests results are better with the path splitting changes(lesser > failures). The Mibench/EEMBC benchmarks are run and no performance > degradation is seen and the performance improvement in telcom_gsm( > Mibench benchmarks) of 9.15% and rgbcmy01_lite(EEMBC > benchmarks) the performance improvement of 9.4% is observed. > > Proposal on the below were made earlier on path splitting and the reference > is attached. >>The first thing I notice is you haven't included any tests for the new >>optimization. We're trying really hard these days to have tests in the suite >>for this kind of >>new optimization/feature work. >>I don't offhand know if any of the benchmarks you cite above are free-enough >>to derive a testcase from. But one trick many of us use is to instrument the >>>>pass and compile some known free software (often gcc >>itself) to find triggering code and use that to generate tests for the new >>transformation. I will add tests in the suite. I could see many existing tests in the suite also get triggered with this optimization. > > Design changes. > > 1. The dominators of the block with conditional IF statements say BB1 > are found and the join node of the IF-THEN-ELSE inside the loops is found on > the blocks dominated by the BB1 and are not successor of BB1 are the join > node. This isn't necessarily my area of strength, but it doesn't seem right to me. I guess it would work if there aren't any more nodes in the loop after the join block, except for the loop latch, but it doesn't see right in the general case. It might work if you also verify that the IF block is the immediate dominator of potential join node. > > 2. The Join node is same as the source of the loops latch basic blocks. >>OK. This is why your initial algorithm in #1 works. Thanks. > > 3. If the above conditional in (1) and (2) are found the Join node > same as the source of the Loop latch node is moved into predecessors > and the Join node ( Source of the Loop latch node) is made empty statement > block with only the phi nodes. >>Hopefully you do this by duplicating the join node and manipulating the CFG >>so that the original predecessors of the join each go to their copy of the >>join >>node. The copied join nodes unconditionally then pass control to the >>original join node. Yes. > > 4. In the process of moving the Join node into its predecessors the > result of the phi node in the Join node propagated with the corresponding phi > arguments based on which predecessor it came from in the Join blocks and > move into its predecessors. >>Right. Note there's a phi-only cprop pass which is designed to do this >>propagation and in theory should be very fast. Thanks. > 5. The Dominator INFO is updated after performing the steps of (1) (2) (3) > and (4). >>Right. Getting the sequencing of the various updates can be tricky. >>Even more so if you try to do it as a part of jump threading because you have >>to (for example) deal with unreachable blocks in the CFG. > > 6. The Join which is same as the source of the Loop latch node is made empty > with only the phi node in the Join node. >>Right. Thanks. > > 7. The implementation is done in Jump threading phase of the machine > independent optimization on tree based > representation. The path splitting is called after the Jump threading > optimization is performed. > The following files are modifed. > > a) tree-vrp.c > b) tree-ssa-threadedge.c > c) tree-cfg.c > d) tree-ssa-threadedge.h > e) tree-cfg.h > f) cfghooks.c > > The diff is attached along with this mail and pasted below. >>I would recommend this as a separate pass. One of the goals for GCC 6 >>is to break out the threader into its own pass, if you're piggybacking >>on the threader it just makes this task harder. >>Also by implementing this as a distinct pass you have more freedom to >>put it where it really belongs in the optimization pipeline. Given the >>duplication may end up exposing partial redundancies, dead code & >>propagation opportunities, it's probably best placed before DOM. That >>also eliminates the need for running phi-only cprop because DOM will >>take care of it for you. >>When jump threading is broken out into its own distinct pass, it'll >>probably end up just before or just after your path splitting pass. Thanks for valuable suggestion. I will make the Path Splitting optimization as a separate pass. > > Please share your thoughts and feedbacks on the above optimization and the > design and the coding and implementation > done for the given above design. >>Some general notes. Every function should have a block comment which >>describes what the function does, its arguments & return value. There >>are many examples throughout the code you can use to understand what >>we're looking for in those function comments. >>Those comments are particularly useful during initial review and later >>maintenance, so in the future, please include them, even on RFCs. >>You'll need to write a ChangeLog entry. It's not particularly exciting >>or fun to do, but I regularly find that writing the ChangeLog forces me >>to look at each hunk in the patch and re-justify its purpose. It's >>common that in doing so I rework the patch because I see something wrong >>or a cleaner/better way to solve a particular problem. I will make sure now onwards new function should be commented and the ChangeLog Entry will be provided. > > diff --git a/gcc/cfghooks.c b/gcc/cfghooks.c > index 9faa339..559ca96 100644 > --- a/gcc/cfghooks.c > +++ b/gcc/cfghooks.c > @@ -581,7 +581,7 @@ delete_basic_block (basic_block bb) > > /* If we remove the header or the latch of a loop, mark the loop for > removal. */ > - if (loop->latch == bb > + if (loop && loop->latch == bb > || loop->header == bb) > mark_loop_for_removal (loop); >>So what caused you to add this additional test? In general loop >>structures are supposed to always be available. The change here implies >>that the loop structures were gone at some point. That seems at first >>glance a mistake. I was using the gimple_duplicate_bb which will not add the duplicate basic block inside the current_loops. That's why the above Condition is required. I am using duplicate_block instead of gimple_duplicate_bb. With this change the above check with loop Is not required as it adds the duplicate basic block inside the loops. > > diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c > index aed5254..b25e409 100644 > --- a/gcc/tree-cfg.c > +++ b/gcc/tree-cfg.c > @@ -1838,6 +1838,64 @@ replace_uses_by (tree name, tree val) > } > } > > +void > +gimple_threaded_merge_blocks (basic_block a, basic_block b) If we keep this function it will need a block comment. >>I say "if" for a couple reasons. First we already have support routines >>that know how to merge blocks. If you really need to merge blocks you >>should try to use them. >>Second, I'm not sure that you really need to worry about block merging >>in this pass. Just create the duplicates, wire them into the CFG and >>let the existing block merging support handle this problem. The above routine is not merging but duplicates the join nodes into its predecessors. If I change the name of the above Function to the gimple_threaded_duplicating_join_node it should be fine. > > static void > @@ -2061,7 +2119,7 @@ remove_bb (basic_block bb) > > /* If a loop gets removed, clean up the information associated > with it. */ > - if (loop->latch == bb > + if (loop && loop->latch == bb >>Same concern here as earlier, why don't you have the loop structures >>available when this routine gets called? Same reasons as earlier. > || loop->header == bb) > free_numbers_of_iterations_estimates_loop (loop); > } > @@ -5779,7 +5837,7 @@ gimple_can_duplicate_bb_p (const_basic_block bb > ATTRIBUTE_UNUSED) > /* Create a duplicate of the basic block BB. NOTE: This does not > preserve SSA form. */ > > -static basic_block > +basic_block > gimple_duplicate_bb (basic_block bb) >>There's already interfaces for duplicating blocks that you can/should be >>using. In particular "duplicate_block" or "gimple_duplicate_sese_region". Thanks. > { > basic_block new_bb; > @@ -5858,7 +5916,7 @@ gimple_duplicate_bb (basic_block bb) > > /* Adds phi node arguments for edge E_COPY after basic block duplication. > */ > > -static void > +void > add_phi_args_after_copy_edge (edge e_copy) >>I don't think you'll need this if you use the existing interfaces for >>block duplication. Yes this is not required if we use the duplicate_block, the existing interface. > diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c > index 4303a18..2c7d36d 100644 > --- a/gcc/tree-ssa-threadedge.c > +++ b/gcc/tree-ssa-threadedge.c > @@ -1359,6 +1359,322 @@ thread_through_normal_block (edge e, > return 0; > } > > +static void > +replace_threaded_uses (basic_block a,basic_block b) >>If you keep this function, then it'll need a function comment. >>It looks like this is just doing const/copy propagation. I think a >>better structure is to implement your optimization as a distinct pass, >>then rely on existing passes such as update_ssa, DOM, CCP to handle >>updating the SSA graph and propagation opportunities exposed by your >>transformation. >>Similarly for the other replace_ functions. I think these replace_ functions are required as existing Dom, CCP and propagation opportunities doesn't transform these Propagation given below. <bb 29>: xk_124 = MIN_EXPR <xy_123, xc_121>; xc_126 = xc_121 - xk_6; xm_127 = xm_122 - xk_6; xy_128 = xy_123 - xk_6; *EritePtr_14 = xc_126; MEM[(Byte *)EritePtr_14 + 1B] = xm_127; MEM[(Byte *)EritePtr_14 + 2B] = xy_128; EritePtr_135 = &MEM[(void *)EritePtr_14 + 4B]; MEM[(Byte *)EritePtr_14 + 3B] = xk_6; i_137 = i_4 + 1; goto <bb 31>; <bb 30>: xk_125 = MIN_EXPR <xy_123, xm_122>; xc_165 = xc_121 - xk_6; xm_166 = xm_122 - xk_6; xy_167 = xy_123 - xk_6; *EritePtr_14 = xc_126; MEM[(Byte *)EritePtr_14 + 1B] = xm_127; MEM[(Byte *)EritePtr_14 + 2B] = xy_128; EritePtr_171 = &MEM[(void *)EritePtr_14 + 4B]; MEM[(Byte *)EritePtr_14 + 3B] = xk_6; i_173 = i_4 + 1; <bb 29> and <bb 30 > are the predecessors of the join node. After the duplication of join node and the duplicating the Join node into the predecessors < bb 29> and <bb 30>, the above is the gimple representations. } But the < bb 30 > should be the following after the transformation. <bb 30>: xk_125 = MIN_EXPR <xy_123, xm_122>; xc_165 = xc_121 - xk_125; xm_166 = xm_122 - xk_125; xy_167 = xy_123 - xk_125; *EritePtr_14 = xc_165; MEM[(Byte *)EritePtr_14 + 1B] = xm_166; MEM[(Byte *)EritePtr_14 + 2B] = xy_167; EritePtr_171 = &MEM[(void *)EritePtr_14 + 4B]; MEM[(Byte *)EritePtr_14 + 3B] = xk_125; i_173 = i_4 + 1; The phi-only cprop will replace the x_6 with xk_125 but the other replacements like xk_126, xk_127, xk_128 with Xk_165, xk_166, xk_167 will not be done by the existing routine like phi-only-cprop as this not only copy propagation because the duplicate block defines New results but the copy is still old with the join node. The replace function does the above transformation and I think it is required. > + > +static bool > +is_def_stmt_assert_expr (basic_block src1, basic_block bb, > + int pred_index) >>As a distinct pass I don't think you'd need to handle ASSERT_EXPRs as >>they're removed after VRP has completed. THe only reason you're running >>into them is because the VRP runs the jump threader as a subroutine. >>And it's odd that you also check for MIN_EXPR, I don't see how that's >>useful. And yet you don't check for MAX_EXPR, so something definitely >>seems odd here. + > +void > +process_path_splitting (basic_block bb) Function comment. This seems to be the meat of the transformation. > +{ > + vec<basic_block> bbs; > + basic_block bb1; > + unsigned int i; > + edge_iterator ei; > + edge e1; > + bool found = false ,found1; > + > + bbs = get_all_dominated_blocks (CDI_DOMINATORS, > + bb ); > + FOR_EACH_VEC_ELT (bbs, i, bb1) > + { > + found1 = false; > + FOR_EACH_EDGE (e1, ei, bb->succs) > + { > + if (bb1 == e1->dest) > + { > + found = true; > + found1 = true; > + } > + } > + if (!found1 && found) > + { > + unsigned int num_preds = 0; > + found = false; > + > + FOR_EACH_EDGE (e1, ei, bb1->succs) > + { > + if (e1->flags & (EDGE_DFS_BACK)) > + found = true; > + } > + > + if (found && EDGE_COUNT(bb1->preds) == 2) > + { > + basic_block temp_bb = bb1; > + basic_block src1,src2; > + unsigned int k = 0, num_stmt = 0, num_phis = 0; > + bool is_src1 = false, is_src2 = false; > + gimple_stmt_iterator psi,gsi; > + > + FOR_EACH_EDGE (e1, ei, bb1->preds) > + { > + if ((e1->flags & (EDGE_DFS_BACK))) > + continue; > + if (k == 1) > + { > + if (single_succ_p(e1->src) && > + single_succ_edge (e1->src)->flags & EDGE_FALLTHRU) > + { > + src2 = e1->src; > + is_src2 = true; > + } > + } > + else > + { > + if (single_succ_p(e1->src) && > + single_succ_edge (e1->src)->flags & EDGE_FALLTHRU) > + { > + is_src1 = true; > + src1 = e1->src; > + } > + } > + k++; > + } > + for (gsi = gsi_start_bb (bb1); !gsi_end_p (gsi); gsi_next (&gsi)) > + num_stmt++; > + > + if ((num_stmt > 1) && is_src1 && is_src2) > + { > + bool found_non_virtual_result = false; > + > + for (psi = gsi_start_phis (bb1); !gsi_end_p (psi); ) > + { > + gimple stmt = gsi_stmt (psi); > + > + if (!virtual_operand_p (gimple_phi_result (stmt))) > + num_phis++; > + else > + found_non_virtual_result = true; > + > + gsi_next(&psi); > + } > + > + if ((num_phis >1) || found_non_virtual_result) > + return; > + > + if (!(is_def_stmt_assert_expr (src1, bb1, 1) && > + is_def_stmt_assert_expr (src2, bb1, 2))) > + return; >>So up to this point is just identification of candidates. Seems like it >>ought to factor into its own function. >>And from here to the end of the function is the actual transformation, >>which feels like it should factor into its own function as well. Thanks. I will refactor the above code. > + > + temp_bb = gimple_duplicate_bb (bb1); > + > + replace_threaded_uses (bb1, temp_bb); > + > + make_edge (src1, temp_bb, EDGE_FALLTHRU); > + > + make_edge (src2,temp_bb,EDGE_FALLTHRU); > + > + FOR_EACH_EDGE (e1, ei, temp_bb->preds) > + add_phi_args_after_copy_edge (e1); > + > + replace_phis(temp_bb, bb1); > + > + replace_uses_phi(bb1, 1); > + > + gimple_threaded_merge_blocks (src1, bb1); > + > + replace_uses_phi(temp_bb, 2); > + > + gimple_threaded_merge_blocks (src2, temp_bb); > + > + while (EDGE_COUNT (temp_bb->preds) != 0) > + remove_edge (EDGE_PRED (temp_bb, 0)); > + > + while (EDGE_COUNT (temp_bb->succs) != 0) > + remove_edge (EDGE_SUCC (temp_bb, 0)); > + > + if (dom_info_available_p (CDI_DOMINATORS)) > + delete_from_dominance_info (CDI_DOMINATORS, temp_bb); > + > + if (dom_info_available_p (CDI_POST_DOMINATORS)) > + delete_from_dominance_info (CDI_POST_DOMINATORS, temp_bb); > + > + remove_phi_nodes (temp_bb); > + expunge_block (temp_bb); > + > + return; > + } > + } > + } > + } > +} > + > diff --git a/gcc/tree-vrp.c b/gcc/tree-vrp.c > index fbecdf7..c0db184 100644 > --- a/gcc/tree-vrp.c > +++ b/gcc/tree-vrp.c > @@ -10141,6 +10141,10 @@ identify_jump_threads (void) > thread_across_edge (dummy, e, true, &equiv_stack, > simplify_stmt_for_jump_threading); > } > + if (gimple_code(last) == GIMPLE_COND) > + { > + process_path_splitting(bb); >>So instead of running this as a subroutine of vrp, make it a first class >>pass. Then find the right place to put the pass. I think that'll >>ultimately be cleaner and simpler as you won't be trying to duplicate >>things like const/copy propagation. >>Overall I like the the transformation you're trying to make, but would >>like to see this as a separate pass, if it can't be a separate pass, >>please indicate why. Second, you should be looking to re-use existing >>block duplication routines, SSA updating, etc as much as possible. Thanks for valuable suggestion and feedback. I have already made as a separate pass and place this Optimization before the DOM OPT. The changes are working fine with a separate pass. Thanks & Regards Ajit jeff